Posted to issues@mesos.apache.org by "Chun-Hung Hsiao (JIRA)" <ji...@apache.org> on 2017/09/06 00:42:00 UTC

[jira] [Created] (MESOS-7939) Garbage collection does not kick in early enough during agent startup

Chun-Hung Hsiao created MESOS-7939:
--------------------------------------

             Summary: Garbage collection does not kick in early enough during agent startup
                 Key: MESOS-7939
                 URL: https://issues.apache.org/jira/browse/MESOS-7939
             Project: Mesos
          Issue Type: Bug
          Components: agent
            Reporter: Chun-Hung Hsiao
            Assignee: Chun-Hung Hsiao
            Priority: Critical
             Fix For: 1.4.1


The sandboxes of a couple of completed tasks used up the agent's disk space. After that, a task failed because there was not enough disk space left to download its artifacts:
{noformat}
I0901 17:54:47.402745 31039 fetcher.cpp:222] Fetching URI 'https://downloads.mesosphere.com/java/jre-8u131-linux-x64.tar.gz'
I0901 17:54:47.402756 31039 fetcher.cpp:165] Downloading resource from 'https://downloads.mesosphere.com/java/jre-8u131-linux-x64.tar.gz' to '/var/lib/mesos/slave/slaves/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-S5/frameworks/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-0005/executors/node__4ae69c7c-e32e-41d2-a485-88145a3e385c/runs/602befac-3ff5-44d7-acac-aeebdc0e4666/containers/218b84a5-0301-4c95-89c5-c863145f86d8/jre-8u131-linux-x64.tar.gz'
E0901 17:54:47.796036 31039 fetcher.cpp:579] EXIT with status 1: Failed to fetch 'https://downloads.mesosphere.com/java/jre-8u131-linux-x64.tar.gz': Error downloading resource: Failed writing received data to disk/application
{noformat}
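For context: the agent normally reclaims sandbox space through its periodic disk watch, which compares usage under the work directory against a headroom threshold (the {{--disk_watch_interval}} and {{--gc_disk_headroom}} agent flags) and then prunes sandboxes that have already been scheduled for garbage collection. Below is a minimal, standalone C++ sketch of that kind of headroom check, purely to illustrate the mechanism that did not fire in time here; it is not the Mesos implementation, and the path and threshold are made up:
{noformat}
// Illustrative only: a simplified disk headroom check in the spirit of the
// agent's periodic disk watch. This is NOT the Mesos implementation; the
// path and the 10% threshold below are example values.
#include <sys/statvfs.h>

#include <iostream>

// Returns true if the fraction of available space on the filesystem holding
// `path` has dropped below `headroom` (e.g. 0.1 == 10%).
bool belowHeadroom(const char* path, double headroom) {
  struct statvfs fs;
  if (statvfs(path, &fs) != 0) {
    return false;  // On error, conservatively report "enough space".
  }
  double available = static_cast<double>(fs.f_bavail) / fs.f_blocks;
  return available < headroom;
}

int main() {
  // Hypothetical work directory; the agent in this ticket used
  // /var/lib/mesos/slave.
  if (belowHeadroom("/var/lib/mesos/slave", 0.1)) {
    std::cout << "Below headroom; sandbox GC should be scheduled now\n";
  }
  return 0;
}
{noformat}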
As a result, the container was destroyed:
{noformat}
I0901 17:54:48.000000 25508 containerizer.cpp:2599] Container 602befac-3ff5-44d7-acac-aeebdc0e4666 has exited
I0901 17:54:48.000000 25508 containerizer.cpp:2158] Destroying container 602befac-3ff5-44d7-acac-aeebdc0e4666 in RUNNING state
I0901 17:54:48.000000 25508 containerizer.cpp:2699] Transitioning the state of container 602befac-3ff5-44d7-acac-aeebdc0e4666 from RUNNING to DESTROYING
I0901 17:54:48.000000 25508 linux_launcher.cpp:505] Asked to destroy container 602befac-3ff5-44d7-acac-aeebdc0e4666
I0901 17:54:48.000000 25508 linux_launcher.cpp:548] Using freezer to destroy cgroup mesos/602befac-3ff5-44d7-acac-aeebdc0e4666
I0901 17:54:48.000000 25510 cgroups.cpp:3055] Freezing cgroup /sys/fs/cgroup/freezer/mesos/602befac-3ff5-44d7-acac-aeebdc0e4666/mesos
I0901 17:54:48.000000 25508 cgroups.cpp:3055] Freezing cgroup /sys/fs/cgroup/freezer/mesos/602befac-3ff5-44d7-acac-aeebdc0e4666
I0901 17:54:48.000000 25510 cgroups.cpp:1413] Successfully froze cgroup /sys/fs/cgroup/freezer/mesos/602befac-3ff5-44d7-acac-aeebdc0e4666/mesos after 3.090176ms
I0901 17:54:48.000000 25508 cgroups.cpp:1413] Successfully froze cgroup /sys/fs/cgroup/freezer/mesos/602befac-3ff5-44d7-acac-aeebdc0e4666 after 5.518848ms
I0901 17:54:48.000000 25510 cgroups.cpp:3073] Thawing cgroup /sys/fs/cgroup/freezer/mesos/602befac-3ff5-44d7-acac-aeebdc0e4666/mesos
I0901 17:54:48.000000 25512 cgroups.cpp:3073] Thawing cgroup /sys/fs/cgroup/freezer/mesos/602befac-3ff5-44d7-acac-aeebdc0e4666
I0901 17:54:48.000000 25512 cgroups.cpp:1442] Successfully thawed cgroup /sys/fs/cgroup/freezer/mesos/602befac-3ff5-44d7-acac-aeebdc0e4666 after 2.351104ms
I0901 17:54:49.000000 25509 cgroups.cpp:1442] Successfully thawed cgroup /sys/fs/cgroup/freezer/mesos/602befac-3ff5-44d7-acac-aeebdc0e4666/mesos after 106.335232ms
I0901 17:54:49.000000 25512 disk.cpp:320] Checking disk usage at '/var/lib/mesos/slave/slaves/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-S5/frameworks/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-0005/executors/node__4ae69c7c-e32e-41d2-a485-88145a3e385c/runs/602befac-3ff5-44d7-acac-aeebdc0e4666/container-path' for container 602befac-3ff5-44d7-acac-aeebdc0e4666 has been cancelled
I0901 17:54:49.000000 25511 disk.cpp:320] Checking disk usage at '/var/lib/mesos/slave/slaves/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-S5/frameworks/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-0005/executors/node__4ae69c7c-e32e-41d2-a485-88145a3e385c/runs/602befac-3ff5-44d7-acac-aeebdc0e4666' for container 602befac-3ff5-44d7-acac-aeebdc0e4666 has been cancelled
I0901 17:54:49.000000 25511 container_assigner.cpp:101] Unregistering container_id[value: "602befac-3ff5-44d7-acac-aeebdc0e4666"].
{noformat}
However, due to insufficient disk space, the agent was also unable to persist the TASK_FAILED status update, and this led to an agent crash:
{noformat}
I0901 17:54:49.000000 25513 status_update_manager.cpp:323] Received status update TASK_FAILED (UUID: bf24c3da-db23-4c82-a09f-a3b859e8cad4) for task node-0-server__e5e468a3-b2ee-42ee-80e8-edc19a3aef84 of framework 9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-0005
F0901 17:54:49.000000 25513 slave.cpp:4748] CHECK_READY(future): is FAILED: Failed to open '/var/lib/mesos/slave/meta/slaves/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-S5/frameworks/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-0005/executors/node__4ae69c7c-e32e-41d2-a485-88145a3e385c/runs/602befac-3ff5-44d7-acac-aeebdc0e4666/tasks/node-0-server__e5e468a3-b2ee-42ee-80e8-edc19a3aef84/task.updates' for status updates: No space left on device Failed to handle status update TASK_FAILED (UUID: bf24c3da-db23-4c82-a09f-a3b859e8cad4) for task node-0-server__e5e468a3-b2ee-42ee-80e8-edc19a3aef84 of framework 9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-0005
{noformat}
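The crash itself is the generic failure mode of an unconditional fatal assertion guarding an I/O operation that can legitimately fail: appending the update to {{task.updates}} fails with "No space left on device", the resulting failure reaches the {{CHECK_READY}} in slave.cpp, and the whole agent aborts. A minimal standalone C++ sketch of that pattern follows; it is not the Mesos/libprocess code, and the macro and file path are stand-ins:
{noformat}
// Illustrative only: how a fatal check on a write that can fail with ENOSPC
// takes the whole process down. This is NOT Mesos/libprocess code; the macro
// and the file path are stand-ins.
#include <cerrno>
#include <cstdio>
#include <cstdlib>
#include <cstring>

// Stand-in for a CHECK-style macro: abort the process if `ok` is false.
#define FATAL_CHECK(ok, msg)                                        \
  do {                                                              \
    if (!(ok)) {                                                    \
      std::fprintf(stderr, "CHECK failed: %s: %s\n", (msg),         \
                   std::strerror(errno));                           \
      std::abort();                                                 \
    }                                                               \
  } while (0)

int main() {
  // Stand-in for .../tasks/<task>/task.updates.
  std::FILE* f = std::fopen("task.updates", "a");
  FATAL_CHECK(f != nullptr, "Failed to open 'task.updates' for status updates");

  // On a full disk this write fails with ENOSPC, and the fatal check turns a
  // recoverable I/O error into a process abort, which is exactly the crash above.
  FATAL_CHECK(std::fputs("TASK_FAILED\n", f) != EOF && std::fflush(f) == 0,
              "Failed to handle status update TASK_FAILED");

  std::fclose(f);
  return 0;
}
{noformat}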
When the agent restarted, it tried to destroy this orphan container but hit the same failure and crashed again, resulting in a restart crash loop:
{noformat}
F0901 17:55:06.000000 31114 slave.cpp:4748] CHECK_READY(future): is FAILED: Failed to open '/var/lib/mesos/slave/meta/slaves/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-S5/frameworks/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-0005/executors/node__4ae69c7c-e32e-41d2-a485-88145a3e385c/runs/602befac-3ff5-44d7-acac-aeebdc0e4666/tasks/node-0-server__e5e468a3-b2ee-42ee-80e8-edc19a3aef84/task.updates' for status updates: No space left on device Failed to handle status update TASK_FAILED (UUID: fb9c3951-9a93-4925-a7f0-9ba7e38d2398) for task node-0-server__e5e468a3-b2ee-42ee-80e8-edc19a3aef84 of framework 9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-0005
{noformat}
To prevent the above from happening, garbage collection should kick in early enough to clean up the container sandboxes of failed tasks.
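One possible shape of that behavior, purely as a sketch under my own assumptions (this is not the actual patch for this ticket): during agent startup, before anything tries to write to disk, prune the sandboxes of already-terminated runs until enough headroom is available. A simplified standalone C++ illustration, using a hypothetical sandbox directory:
{noformat}
// Purely an illustration of "garbage-collect early during startup": free
// space by pruning the oldest terminated-run sandboxes before anything tries
// to write to disk. NOT the actual MESOS-7939 fix; path and policy are
// assumptions for the sketch.
#include <algorithm>
#include <cstddef>
#include <filesystem>
#include <iostream>
#include <vector>

namespace fs = std::filesystem;

// Remove the `count` oldest directories under `sandboxes` (a stand-in for
// "prune until disk headroom is restored").
void pruneOldestSandboxes(const fs::path& sandboxes, std::size_t count) {
  std::vector<fs::directory_entry> runs;
  std::error_code ec;

  for (const auto& entry : fs::directory_iterator(sandboxes, ec)) {
    if (entry.is_directory()) {
      runs.push_back(entry);
    }
  }

  // Oldest (least recently written) sandboxes first.
  std::sort(runs.begin(), runs.end(), [](const auto& a, const auto& b) {
    return a.last_write_time() < b.last_write_time();
  });

  for (std::size_t i = 0; i < runs.size() && i < count; ++i) {
    std::cout << "Pruning sandbox " << runs[i].path() << std::endl;
    fs::remove_all(runs[i].path(), ec);  // Ignore errors in the sketch.
  }
}

int main() {
  // Hypothetical directory of terminated-run sandboxes.
  pruneOldestSandboxes("/tmp/example-terminated-sandboxes", 2);
  return 0;
}
{noformat}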

Related ticket: MESOS-7031


