Posted to user@aurora.apache.org by Miguel Moreno <mi...@piksel.com> on 2018/02/13 07:50:32 UTC

Thermos Observer is showing 0.0 For CPU and Memory

Hi everybody.


We are running Aurora on AWS machines with Mesos agents.


The jobs are running fine and there are no errors in the logs, but when we look at resource usage in the Thermos Observer endpoint (hostname:1338), it always shows 0.0 for CPU and memory.


Scrubbed __main__.log:


Log file created at: 2018/02/07 09:45:08
Running on machine: -
[DIWEF]mmdd hh:mm:ss.uuuuuu pid file:line] msg
Command line: /srv/apps/mesos/var/sandbox/thermos_executor.pex --announcer-ensemble (zookeepers) --mesos-containerizer-path /usr/libexec/mesos/mesos-containerizer --runner-logger-destination file --runner-logger-mode rotate --runner-rotate-log-backups 10 --runner-rotate-log-size-mb 5 --sandbox-mount-point /srv/apps/mesos/var/sandbox
I0207 09:45:08.067967 12 executor_base.py:45] Executor [None]: registered() called with:
I0207 09:45:08.068506 12 executor_base.py:45] Executor [None]:    ExecutorInfo:  executor_id {
  value: "executor_id"
}
resources {
  name: "cpus"
  type: SCALAR
  scalar {
    value: 0.25
  }
  role: "*"
}
resources {
  name: "mem"
  type: SCALAR
  scalar {
    value: 128.0
  }
  role: "*"
}
command {
  uris {
    value: "/usr/share/aurora/bin/thermos_executor.pex"
    executable: true
  }
  uris {
    value: "/etc/mesos-files/.docker/config.json"
    executable: true
  }
  uris {
    value: "/usr/share/aurora/bin/thermos_executor.pex"
    executable: true
  }
  value: "${MESOS_SANDBOX=.}/thermos_executor.pex --announcer-ensemble (zookeepers) --mesos-containerizer-path /usr/libexec/mesos/mesos-containerizer --runner-logger-destination file --runner-logger-mode rotate --runner-rotate-log-backups 10 --runner-rotate-log-size-mb 5 --sandbox-mount-point /srv/apps/mesos/var/sandbox"
}
framework_id {
  value: "6ff6b40c-3dfe-419f-b0ba-df52eb354be5-0000"
}
name: "AuroraExecutor"
source: "source"
container {
  type: DOCKER
  docker {
    image: "docker"
  }
}
labels {
  labels {
    key: "source"
    value: "source"
  }
}
I0207 09:45:08.068696 12 executor_base.py:45] Executor [None]:    FrameworkInfo: user: "root"
name: "Aurora"
id {
  value: "6ff6b40c-3dfe-419f-b0ba-df52eb354be5-0000"
}
failover_timeout: 1814400.0
checkpoint: true
role: "batch"
hostname: "host"
principal: "aurora"
capabilities {
  type: REVOCABLE_RESOURCES
}
capabilities {
  type: GPU_RESOURCES
}
I0207 09:45:08.069024 12 executor_base.py:45] Executor [None]:    SlaveInfo:     hostname: "hostnamel"
resources {
  name: "cpus"
  type: SCALAR
  scalar {
    value: 7.5
  }
  role: "*"
}
resources {
  name: "mem"
  type: SCALAR
  scalar {
    value: 14526.83
  }
  role: "*"
}
resources {
  name: "ports"
  type: RANGES
  ranges {
    range {
      begin: 20000
      end: 30000
    }
  }
  role: "*"
}
resources {
  name: "disk"
  type: SCALAR
  scalar {
    value: 95541.0
  }
  role: "*"
}
attributes {
  name: "az"
  type: TEXT
  text {
    value: "eu-west-1b"
  }
}
attributes {
  name: "stage"
  type: TEXT
  text {
    value: "sandbox"
  }
}
id {
  value: "ce246e68-974b-4af5-9f23-39ab6f4be186-S179"
}
checkpoint: true
port: 5051
I0207 09:45:08.070421 12 executor_base.py:45] Executor [None]: TaskInfo: name: "name"
task_id {
  value: "task_id"
}
slave_id {
  value: "ce246e68-974b-4af5-9f23-39ab6f4be186-S179"
}
resources {
  name: "cpus"
  type: SCALAR
  scalar {
    value: 1.0
  }
  role: "*"
}
resources {
  name: "disk"
  type: SCALAR
  scalar {
    value: 2048.0
  }
  role: "*"
}
resources {
  name: "mem"
  type: SCALAR
  scalar {
    value: 1024.0
  }
  role: "*"
}
executor {
  executor_id {
    value: "executor_id"
  }
  resources {
    name: "cpus"
    type: SCALAR
    scalar {
      value: 0.25
    }
    role: "*"
  }
  resources {
    name: "mem"
    type: SCALAR
    scalar {
      value: 128.0
    }
    role: "*"
  }
  command {
    uris {
      value: "/usr/share/aurora/bin/thermos_executor.pex"
      executable: true
    }
    uris {
      value: "/etc/mesos-files/.docker/config.json"
      executable: true
    }
    uris {
      value: "/usr/share/aurora/bin/thermos_executor.pex"
      executable: true
    }
    value: "${MESOS_SANDBOX=.}/thermos_executor.pex --announcer-ensemble (zookeepers) --mesos-containerizer-path /usr/libexec/mesos/mesos-containerizer --runner-logger-destination file --runner-logger-mode rotate --runner-rotate-log-backups 10 --runner-rotate-log-size-mb 5 --sandbox-mount-point /srv/apps/mesos/var/sandbox"
  }
  framework_id {
    value: "6ff6b40c-3dfe-419f-b0ba-df52eb354be5-0000"
  }
  name: "AuroraExecutor"
  source: "source"
  container {
    type: DOCKER
    docker {
      image: "docker"
    }
  }
  labels {
    labels {
      key: "source"
      value: "source"
    }
  }
}
data: "data"
labels {
  labels {
    key: "org.apache.aurora.tier"
    value: "preemptible"
  }
}
discovery {
  visibility: CLUSTER
  name: "name"
  environment: "devel"
  location: "aurora-eu-west-1"
}
I0207 09:45:08.070538 12 executor_base.py:45] Executor [None]: launchTask got task: task 12 executor_base.py:45] Executor [ce246e68-974b-4af5-9f23-39ab6f4be186-S179]: Assigned task: AssignedTask(task=TaskConfig(isService=False, contactEmail=None, taskLinks={}, tier='preemptible', mesosFetcherUris=set([]), executorConfig=ExecutorConfig(data='{"environment": "devel", "health_check_config": {"health_checker": {"http": {"expected_response_code": 0, "endpoint": "/health", "expected_response": "ok"}}, "min_consecutive_successes": 1, "initial_interval_secs": 15.0, "max_consecutive_failures": 0, "timeout_secs": 1.0, "interval_secs": 10.0}, "container": {"image": "docker", "parameters": []}, "name": "name", "service": false, "max_task_failures": 1, "cron_collision_policy": "KILL_EXISTING", "enable_hooks": false, "cluster": "devcluster", "task": {"processes": [{processes, requestedPorts=set([]), maxTaskFailures=1, priority=1, ramMb=1024, job=JobKey(environment='devel', role='role', name='name'), production=False, diskMb=2048, resources=set([Resource(ramMb=None, numGpus=None, namedPort=None, diskMb=2048, numCpus=None), Resource(ramMb=None, numGpus=None, namedPort=None, diskMb=None, numCpus=1.0), Resource(ramMb=1024, numGpus=None, namedPort=None, diskMb=None, numCpus=None)]), owner=Identity(user='user'), container=Container(docker=DockerContainer(image='ddocker', parameters=[]), mesos=None), metadata=set([]), numCpus=1.0, constraints=set([Constraint(name='stage', constraint=TaskConstraint(limit=None, value=ValueConstraint(negated=False, values=set(['sandbox']))))])), taskId='t_id', instanceId=0, assignedPorts={}, slaveHost='host', slaveId='id’)
I0207 09:45:08.071496 12 executor_base.py:45] Executor [ce246e68-974b-4af5-9f23-39ab6f4be186-S179]: Updating task_id=> STARTING
I0207 09:45:08.071634 12 executor_base.py:45] Executor [ce246e68-974b-4af5-9f23-39ab6f4be186-S179]:    Reason: Initializing sandbox.
D0207 09:45:08.072333 12 sandbox.py:129] DirectorySandbox: mkdir /srv/apps/mesos/var/workdir/slaves/ce246e68-974b-4af5-9f23-39ab6f4be186-S179/frameworks/6ff6b40c-3dfe-419f-b0ba-df52eb354be5-0000/executors/task_id/runs/300ff840-b4f8-4b25-8283-4b13cc7fab2d/sandbox
D0207 09:45:09.482642 12 ckpt.py:379] Running state machine for process=setup_environment/seq=0
D0207 09:45:09.482762 12 ckpt.py:379] Running state machine for process=setup_environment/seq=1
D0207 09:45:09.482841 12 recordio.py:137] /srv/apps/mesos/var/sandbox/checkpoints/checkpoints/task_id/runner has no data (current offset = 981)
D0207 09:45:09.482932 12 resource.py:136] Initialising resource collection for task task_id
D0207 09:45:09.483011 12 resource.py:147] Initialising ResourceHistory of length 180
D0207 09:45:09.620017 12 resource.py:187] Commencing resource monitoring for task "task_id"
I0207 09:45:09.620552 12 aurora_executor.py:188] Send TASK_RUNNING status update. status: StatusResult('No health-check defined, task is assumed healthy.', status='TASK_RUNNING')
I0207 09:45:09.620950 12 executor_base.py:45] Executor [ce246e68-974b-4af5-9f23-39ab6f4be186-S179]: Updating task_id => RUNNING
I0207 09:45:09.621148 12 executor_base.py:45] Executor [ce246e68-974b-4af5-9f23-39ab6f4be186-S179]:    Reason: No health-check defined, task is assumed healthy.
D0207 09:45:09.622956 12 disk.py:44] DiskCollectorThread: finished collection of /srv/apps/mesos/var/workdir/slaves/ce246e68-974b-4af5-9f23-39ab6f4be186-S179/frameworks/6ff6b40c-3dfe-419f-b0ba-df52eb354be5-0000/executors/task_id/runs/300ff840-b4f8-4b25-8283-4b13cc7fab2d/sandbox in 0.8ms
D0207 09:45:29.654671 12 ckpt.py:379] Running state machine for process=setup_environment/seq=2
D0207 09:45:29.654998 12 ckpt.py:379] Running state machine for process=setup_environment/seq=3
D0207 09:45:29.655204 12 ckpt.py:379] Running state machine for process=s3_download/seq=0
D0207 09:45:29.655424 12 ckpt.py:379] Running state machine for process=s3_download/seq=1
D0207 09:45:29.655628 12 ckpt.py:379] Running state machine for process=s3_download/seq=2
D0207 09:45:29.655834 12 ckpt.py:379] Running state machine for process=s3_download/seq=3
D0207 09:45:29.656034 12 ckpt.py:379] Running state machine for process=single_h265_transcode_file/seq=0
D0207 09:45:29.656261 12 ckpt.py:379] Running state machine for process=single_h265_transcode_file/seq=1
D0207 09:45:29.656470 12 ckpt.py:379] Running state machine for process=single_h265_transcode_file/seq=2
D0207 09:45:29.656608 12 recordio.py:137] /srv/apps/mesos/var/sandbox/checkpoints/checkpoints/task_id/runner has no data (current offset = 1569)
D0207 09:45:49.657543 12 process_collector_psutil.py:93] Calculated rate for pid=63 and children: 1.24659524141
D0207 09:46:09.659521 12 disk.py:44] DiskCollectorThread: finished collection of /srv/apps/mesos/var/workdir/slaves/ce246e68-974b-4af5-9f23-39ab6f4be186-S179/frameworks/6ff6b40c-3dfe-419f-b0ba-df52eb354be5-0000/executors/task_id/runs/300ff840-b4f8-4b25-8283-4b13cc7fab2d/sandbox in 1.3ms
D0207 09:46:09.665703 12 process_collector_psutil.py:93] Calculated rate for pid=63 and children: 1.24998856702
D0207 09:46:29.758064 12 process_collector_psutil.py:93] Calculated rate for pid=63 and children: 1.24475260671
D0207 09:46:49.759083 12 process_collector_psutil.py:93] Calculated rate for pid=63 and children: 1.24843625817
D0207 09:47:09.756055 12 disk.py:44] DiskCollectorThread: finished collection of /srv/apps/mesos/var/workdir/slaves/ce246e68-974b-4af5-9f23-39ab6f4be186-S179/frameworks/6ff6b40c-3dfe-419f-b0ba-df52eb354be5-0000/executors/ttask_id/runs/300ff840-b4f8-4b25-8283-4b13cc7fab2d/sandbox in 1.3ms
D0207 09:47:09.759665 12 process_collector_psutil.py:93] Calculated rate for pid=63 and children: 1.24596081776
D0207 09:47:29.761161 12 process_collector_psutil.py:93] Calculated rate for pid=63 and children: 1.24940960757
D0207 09:47:49.761781 12 process_collector_psutil.py:93] Calculated rate for pid=63 and children: 1.24846117073
D0207 09:48:09.855884 12 process_collector_psutil.py:93] Calculated rate for pid=63 and children: 1.24563907521
D0207 09:48:09.862927 12 disk.py:44] DiskCollectorThread: finished collection of /srv/apps/mesos/var/workdir/slaves/ce246e68-974b-4af5-9f23-39ab6f4be186-S179/frameworks/6ff6b40c-3dfe-419f-b0ba-df52eb354be5-0000/executors/task_id/runs/300ff840-b4f8-4b25-8283-4b13cc7fab2d/sandbox in 0.9ms
D0207 09:48:29.957253 12 process_collector_psutil.py:93] Calculated rate for pid=63 and children: 1.24767839681
D0207 09:48:49.960608 12 process_collector_psutil.py:93] Calculated rate for pid=63 and children: 1.24828815758
W0207 09:45:08.216031 12 monitor.py:92] Could not read from checkpoint /srv/apps/mesos/var/sandbox/checkpoints/checkpoints/task_id/runner
I0207 09:45:08.216214 12 thermos_task_runner.py:278] Forking off runner with cmdline: /usr/bin/python2.7 /srv/apps/mesos/var/sandbox/thermos_runner.pex --process_logger_mode=rotate --task_id=task_id --log_to_disk=DEBUG --rotate_log_size_mb=5 --hostname=host --rotate_log_backups=10 --thermos_json=/srv/apps/mesos/var/sandbox/task.json --sandbox=/srv/apps/mesos/var/workdir/slaves/ce246e68-974b-4af5-9f23-39ab6f4be186-S179/frameworks/6ff6b40c-3dfe-419f-b0ba-df52eb354be5-0000/executors/task_id/runs/300ff840-b4f8-4b25-8283-4b13cc7fab2d/sandbox --log_dir=/srv/apps/mesos/var/sandbox --checkpoint_root=/srv/apps/mesos/var/sandbox/checkpoints --container_sandbox=/srv/apps/mesos/var/workdir/slaves/ce246e68-974b-4af5-9f23-39ab6f4be186-S179/frameworks/6ff6b40c-3dfe-419f-b0ba-df52eb354be5-0000/executors/ttask_id/runs/300ff840-b4f8-4b25-8283-4b13cc7fab2d/sandbox --process_logger_destination=file
D0207 09:45:08.218667 12 thermos_task_runner.py:291] Waiting for task to start.
D0207 09:45:08.218888 12 thermos_task_runner.py:300]   - sleeping...
D0207 09:45:08.719671 12 thermos_task_runner.py:300]   - sleeping...
D0207 09:45:09.232539 12 aurora_executor.py:159] Task started.
W0207 09:45:09.355899 12 health_checker.py:389] No health-checks defined, will use a no-op health-checker.
D0207 09:45:09.482341 12 ckpt.py:348] Flipping task state from (undefined) to ACTIVE
D0207 09:45:09.482521 12 ckpt.py:328] Initializing TaskRunner header to RunnerHeader(launch_time_ms=1517996708593, uid=0, task_id='task_id', hostname='host', sandbox='/srv/apps/mesos/var/workdir/slaves/ce246e68-974b-4af5-9f23-39ab6f4be186-S179/frameworks/6ff6b40c-3dfe-419f-b0ba-df52eb354be5-0000/executors/task_id/runs/300ff840-b4f8-4b25-8283-4b13cc7fab2d/sandbox', log_dir='/srv/apps/mesos/var/workdir/slaves/ce246e68-974b-4af5-9f23-39ab6f4be186-S179/frameworks/6ff6b40c-3dfe-419f-b0ba-df52eb354be5-0000/executors/ttask_id/runs/300ff840-b4f8-4b25-8283-4b13cc7fab2d/sandbox/.logs', ports={}, user='root')




From what I can see in: D0207 09:47:29.761161 12 process_collector_psutil.py:93] Calculated rate for pid=63 and children: 1.24940960757


it is actually calculating the CPU usage, but it is not showing it.


The AWS agents have Ubuntu 14.04.5 LTS installed, and the Docker containers where the jobs run use Debian 9.


Thank you.

Miguel.


Re: Thermos Observer is showing 0.0 For CPU and Memory

Posted by Miguel Moreno <mi...@piksel.com>.
Hi Stephan.


Thanks for your answer. We finally fixed it by adding the --pid=host option to the default Docker parameters in the aurora-scheduler configuration.
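For reference, the change looks roughly like this. The flag below is the scheduler's option for default Docker parameters as we understand it; the exact name and value format may differ between Aurora versions, so treat it as a sketch rather than a verified command line:

# Illustrative scheduler invocation: pass pid=host to every Docker container
# the scheduler launches, so task processes share the host PID namespace and
# the Thermos observer can see them. Note this also removes PID-namespace
# isolation for those containers.
aurora-scheduler ... -default_docker_parameters=pid=host ...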


Thanks again and best regards.

Miguel.

________________________________
From: Stephan Erb <st...@blue-yonder.com>
Sent: Sunday, 25 February 2018 11:36:44
To: user@aurora.apache.org
Subject: Re: Thermos Observer is showing 0.0 For CPU and Memory


Hi Miguel,



I believe you are running into a current limitation of the observer: The executor checkpoints its process PIDs to disk and the observer reads them. The observer then tries to get process statistics for each PID. If those live within a PID namespace, then the observer won’t be able to see the correct processes.



A quick workaround would be to ensure your tasks are not running within a PID namespace. For the Mesos containerizer you have to stop using the `namespaces/pid` isolator. If you are using the Docker containerizer, you will have to tweak the corresponding Docker option.



The correct fix would be for the observer to enter the task namespace before reading the stats.



Best regards,

Stephan








Re: Thermos Observer is showing 0.0 For CPU and Memory

Posted by Stephan Erb <st...@blue-yonder.com>.
Hi Miguel,

I believe you are running into a current limitation of the observer: The executor checkpoints its process PIDs to disk and the observer reads them. The observer then tries to get process statistics for each PID. If those live within a PID namespace, then the observer won’t be able to see the correct processes.
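As a minimal sketch of the kind of lookup that breaks here (this is not the observer's actual code; psutil is what the executor's process_collector_psutil.py uses, but the function and names below are purely illustrative):

import psutil

def sample_task_usage(checkpointed_pid):
    """Return (cpu_percent, rss_bytes) for a PID read from a checkpoint."""
    # A PID recorded from inside a PID namespace either does not exist on the
    # host, or refers to an unrelated host process; in the latter case the
    # numbers below would describe the wrong process.
    try:
        proc = psutil.Process(checkpointed_pid)
        with proc.oneshot():                      # batch the /proc reads
            cpu = proc.cpu_percent(interval=0.1)  # percent of one core
            rss = proc.memory_info().rss          # resident set size, bytes
        return cpu, rss
    except psutil.NoSuchProcess:
        # The checkpointed PID (e.g. 63 inside the container) is not visible
        # from the host, so nothing can be sampled and the observer ends up
        # displaying 0.0.
        return 0.0, 0.0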

A quick workaround would be to ensure your tasks are not running within a PID namespace. For the Mesos containerizer you have to stop using the `namespaces/pid` isolator. If you are using the Docker containerizer, you will have to tweak the corresponding Docker option.
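For example (illustrative settings, adjust to your own agent and containerizer setup):

# Mesos containerizer: start the agent without the namespaces/pid isolator,
# so task processes stay visible in the host PID namespace.
mesos-agent --isolation=cgroups/cpu,cgroups/mem,filesystem/linux ...

# Docker containerizer: the equivalent is running the container in the host
# PID namespace, i.e. the docker run option --pid=host.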

The correct fix would be for the observer to enter the task namespace before reading the stats.
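The idea can be illustrated from a shell with nsenter, given the host-side PID of a task process (12345 is just a placeholder); the observer would have to do the equivalent programmatically:

# Read process stats from inside the task's PID (and mount) namespace.
nsenter --target 12345 --pid --mount ps -eo pid,pcpu,rss,args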

Best regards,
Stephan

