[bug] incorrect calculation of allocated disk resources

Version

I’ve tested and reproduced this on two older versions of Nomad (0.5.6 and 0.7.1), and it still seems to be an issue in 0.9.4 as well.

Issue

Nomad refuses to place new allocations under the false presumption that the node has run out of disk resources.

Reproduction steps

  1. Start a Nomad agent.
  2. Compare the Allocated Resources section of nomad node-status -self with the available space reported by df -h.
  3. Try to schedule a job with an ephemeral disk size close to what is available to Nomad. This should succeed.
  4. Fill up the disk with some random data: dd if=/dev/zero of=/root/test.img bs=1M count=4096.
  5. Try a nomad plan again; this should still succeed.
  6. Restart the Nomad agent and try a nomad plan once more; it now fails, stating the node ran out of disk resources.

The behaviour for disk resources differs from that for, say, CPU and memory:

  • Nomad's allocated resources report the disk "free" space fingerprinted during agent startup, instead of the "total" disk space available.
  • Nomad seems to deduct the "allocated" GiBs from the fingerprinted "free" GiBs. This is incorrect, because the reported "free" value changes over time: restarting the Nomad agent while running jobs occupy disk space shrinks it, so their usage is effectively counted twice (a sketch of the flawed arithmetic follows this list).
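To make the flawed bookkeeping concrete, here is a minimal sketch in Go (Nomad itself is written in Go, but all names and the exact formula here are my illustration of the observed behaviour, not Nomad's actual fingerprinting code), using the numbers from the real-world example below:

```go
package main

import "fmt"

// All values in GiB, taken from the worker059 example below.
const (
	diskTotal       = 984 // df: total size of /
	freeAtStartup   = 566 // fingerprinted "free" space when the agent last restarted
	allocatedByJobs = 440 // sum of ephemeral_disk asks of running allocations
)

func main() {
	// What Nomad appears to do: treat the fingerprinted "free" space as the
	// schedulable capacity and subtract allocations from it. Running jobs
	// already occupy part of the disk, so their usage is counted twice.
	buggyAvailable := freeAtStartup - allocatedByJobs
	fmt.Printf("buggy:    %d GiB schedulable\n", buggyAvailable) // 126 GiB

	// What one would expect (analogous to CPU and memory): subtract
	// allocations from the node's total capacity, minus any reservation.
	expectedAvailable := diskTotal - allocatedByJobs
	fmt.Printf("expected: %d GiB schedulable\n", expectedAvailable) // 544 GiB
}
```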

A real-world example which clearly shows that this node should have plenty of disk resources available, but due to the deduction, free (566 GiB) − allocated (440 GiB) = 126 GiB, Nomad thinks we have nearly run out:

ygersie@worker059:~$ nomad node-status -self
<snip>...
Allocated Resources
CPU              Memory         Disk             IOPS
12000/26388 MHz  24 GiB/62 GiB  440 GiB/566 GiB  0/0

Allocation Resource Utilization
CPU             Memory
1427/26388 MHz  20 GiB/62 GiB

Host Resource Utilization
CPU             Memory         Disk
1970/26388 MHz  28 GiB/63 GiB  354 GiB/984 GiB
<snip>....
ygersie@worker059:~$ df -h /
Filesystem      Size  Used Avail Use% Mounted on
/dev/vda1       985G  355G  591G  38% /

and here is an example job which asks to schedule Redis on worker059 with a 150000 MB ephemeral disk, more than the ~126 GiB Nomad believes remains:

ygersie@nomad001:~$ cat example.nomad
job "example" {
  region = "ams1"
  datacenters = ["zone2"]
  type = "service"
  constraint {
    attribute = "${node.unique.name}"
    value     = "worker059"
  }

  group "cache" {
    count = 1
    restart {
      attempts = 10
      interval = "5m"
      delay = "25s"
      mode = "delay"
    }
    ephemeral_disk {
      size = 150000 # ephemeral disk size in MB (~150 GB)
    }
    task "redis" {
      driver = "docker"
      config {
        image = "redis:3.2"
        port_map {
          db = 6379
        }
      }
      resources {
        cpu    = 500 # 500 MHz
        memory = 256 # 256MB
        network {
          mbits = 1
          port "db" {}
        }
      }
    }
  }
}
ygersie@nomad001:~$ nomad plan example.nomad
+ Job: "example"
+ Task Group: "cache" (1 create)
  + Task: "redis" (forces create)

Scheduler dry-run:
- WARNING: Failed to place all allocations.
  Task Group "cache" (failed to place 1 allocation):
    * Constraint "${node.unique.name} = worker059" filtered 63 nodes
    * Resources exhausted on 1 nodes
    * Dimension "disk exhausted" exhausted on 1 nodes

Job Modify Index: 0
To submit the job with version verification run:

nomad run -check-index 0 example.nomad

When running the job with the check-index flag, the job will only be run if the
server side version matches the job modify index returned. If the index has
changed, another user has modified the job and the plan's results are
potentially invalid.

Impact

We cannot schedule any more jobs even though we have plenty of disk resources available. This is a pretty significant bug, and I know of no workaround other than lowering the resource ask dramatically. For CPU and memory we can override what Nomad thinks it has available, but not for disk.

************************************************************************************

You are most welcome, and thanks for the follow-up. As I see it, there are a couple of options:

  • Calculate the disk usage of existing allocations during fingerprinting and add it back to the "available/free" total. This can be a costly operation, depending on the number of files in Nomad's alloc storage dir.
  • For Unix fingerprinting, use (total) − (filesystem-configured reserved root space) − (space reserved for external sources).
  • Use the filesystem total with a sane default reserved percentage, and leave it to users to adjust if necessary.

I would opt for the simple solution (option #3), as it would also be consistent with how memory and CPU are handled, but I do not have the context you guys have.
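A minimal sketch of option #3 in Go, assuming Linux and the stdlib syscall package (the reserved percentage and all names here are made up for illustration; Nomad's real fingerprinter is more portable and would expose the percentage as a client config option):

```go
package main

import (
	"fmt"
	"syscall"
)

// Hypothetical default; a real implementation would let users tune this.
const defaultReservedPercent = 10

func main() {
	var fs syscall.Statfs_t
	// Linux-specific statfs(2) call on the filesystem backing Nomad's data dir.
	if err := syscall.Statfs("/", &fs); err != nil {
		panic(err)
	}

	total := fs.Blocks * uint64(fs.Bsize) // total bytes on the filesystem

	// Option #3: schedulable disk is a fixed fraction of the *total*, not of
	// the momentary free space, so it stays stable across agent restarts.
	schedulable := total * (100 - defaultReservedPercent) / 100

	fmt.Printf("total: %d GiB, schedulable: %d GiB\n",
		total>>30, schedulable>>30)
}
```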
