Posted to issues@mesos.apache.org by "Chun-Hung Hsiao (JIRA)" <ji...@apache.org> on 2017/09/07 03:07:07 UTC

[jira] [Updated] (MESOS-7939) Disk usage check for garbage collection

     [ https://issues.apache.org/jira/browse/MESOS-7939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chun-Hung Hsiao updated MESOS-7939:
-----------------------------------
    Summary: Disk usage check for garbage collection   (was: Garbage collection does not kick in early enough during agent startup)

> Disk usage check for garbage collection 
> ----------------------------------------
>
>                 Key: MESOS-7939
>                 URL: https://issues.apache.org/jira/browse/MESOS-7939
>             Project: Mesos
>          Issue Type: Bug
>          Components: agent
>            Reporter: Chun-Hung Hsiao
>            Assignee: Chun-Hung Hsiao
>            Priority: Critical
>             Fix For: 1.4.1
>
>
> A couple of completed tasks used up the disk space reserved for sandboxes. After that, a subsequent task failed because there was insufficient disk space left to download its artifacts:
> {noformat}
> I0901 17:54:47.402745 31039 fetcher.cpp:222] Fetching URI 'https://downloads.mesosphere.com/java/jre-8u131-linux-x64.tar.gz'
> I0901 17:54:47.402756 31039 fetcher.cpp:165] Downloading resource from 'https://downloads.mesosphere.com/java/jre-8u131-linux-x64.tar.gz' to '/var/lib/mesos/slave/slaves/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-S5/frameworks/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-0005/executors/node__4ae69c7c-e32e-41d2-a485-88145a3e385c/runs/602befac-3ff5-44d7-acac-aeebdc0e4666/containers/218b84a5-0301-4c95-89c5-c863145f86d8/jre-8u131-linux-x64.tar.gz'
> E0901 17:54:47.796036 31039 fetcher.cpp:579] EXIT with status 1: Failed to fetch 'https://downloads.mesosphere.com/java/jre-8u131-linux-x64.tar.gz': Error downloading resource: Failed writing received data to disk/application
> {noformat}
> As a result, the container is destroyed:
> {noformat}
> I0901 17:54:48.000000 25508 containerizer.cpp:2599] Container 602befac-3ff5-44d7-acac-aeebdc0e4666 has exited
> I0901 17:54:48.000000 25508 containerizer.cpp:2158] Destroying container 602befac-3ff5-44d7-acac-aeebdc0e4666 in RUNNING state
> I0901 17:54:48.000000 25508 containerizer.cpp:2699] Transitioning the state of container 602befac-3ff5-44d7-acac-aeebdc0e4666 from RUNNING to DESTROYING
> I0901 17:54:48.000000 25508 linux_launcher.cpp:505] Asked to destroy container 602befac-3ff5-44d7-acac-aeebdc0e4666
> I0901 17:54:48.000000 25508 linux_launcher.cpp:548] Using freezer to destroy cgroup mesos/602befac-3ff5-44d7-acac-aeebdc0e4666
> I0901 17:54:48.000000 25510 cgroups.cpp:3055] Freezing cgroup /sys/fs/cgroup/freezer/mesos/602befac-3ff5-44d7-acac-aeebdc0e4666/mesos
> I0901 17:54:48.000000 25508 cgroups.cpp:3055] Freezing cgroup /sys/fs/cgroup/freezer/mesos/602befac-3ff5-44d7-acac-aeebdc0e4666
> I0901 17:54:48.000000 25510 cgroups.cpp:1413] Successfully froze cgroup /sys/fs/cgroup/freezer/mesos/602befac-3ff5-44d7-acac-aeebdc0e4666/mesos after 3.090176ms
> I0901 17:54:48.000000 25508 cgroups.cpp:1413] Successfully froze cgroup /sys/fs/cgroup/freezer/mesos/602befac-3ff5-44d7-acac-aeebdc0e4666 after 5.518848ms
> I0901 17:54:48.000000 25510 cgroups.cpp:3073] Thawing cgroup /sys/fs/cgroup/freezer/mesos/602befac-3ff5-44d7-acac-aeebdc0e4666/mesos
> I0901 17:54:48.000000 25512 cgroups.cpp:3073] Thawing cgroup /sys/fs/cgroup/freezer/mesos/602befac-3ff5-44d7-acac-aeebdc0e4666
> I0901 17:54:48.000000 25512 cgroups.cpp:1442] Successfully thawed cgroup /sys/fs/cgroup/freezer/mesos/602befac-3ff5-44d7-acac-aeebdc0e4666 after 2.351104ms
> I0901 17:54:49.000000 25509 cgroups.cpp:1442] Successfully thawed cgroup /sys/fs/cgroup/freezer/mesos/602befac-3ff5-44d7-acac-aeebdc0e4666/mesos after 106.335232ms
> I0901 17:54:49.000000 25512 disk.cpp:320] Checking disk usage at '/var/lib/mesos/slave/slaves/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-S5/frameworks/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-0005/executors/node__4ae69c7c-e32e-41d2-a485-88145a3e385c/runs/602befac-3ff5-44d7-acac-aeebdc0e4666/container-path' for container 602befac-3ff5-44d7-acac-aeebdc0e4666 has been cancelled
> I0901 17:54:49.000000 25511 disk.cpp:320] Checking disk usage at '/var/lib/mesos/slave/slaves/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-S5/frameworks/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-0005/executors/node__4ae69c7c-e32e-41d2-a485-88145a3e385c/runs/602befac-3ff5-44d7-acac-aeebdc0e4666' for container 602befac-3ff5-44d7-acac-aeebdc0e4666 has been cancelled
> I0901 17:54:49.000000 25511 container_assigner.cpp:101] Unregistering container_id[value: "602befac-3ff5-44d7-acac-aeebdc0e4666"].
> {noformat}
> However, due to insufficient disk space, the agent was unable to checkpoint the TASK_FAILED status update, and this led to an agent crash:
> {noformat}
> I0901 17:54:49.000000 25513 status_update_manager.cpp:323] Received status update TASK_FAILED (UUID: bf24c3da-db23-4c82-a09f-a3b859e8cad4) for task node-0-server__e5e468a3-b2ee-42ee-80e8-edc19a3aef84 of framework 9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-0005
> F0901 17:54:49.000000 25513 slave.cpp:4748] CHECK_READY(future): is FAILED: Failed to open '/var/lib/mesos/slave/meta/slaves/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-S5/frameworks/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-0005/executors/node__4ae69c7c-e32e-41d2-a485-88145a3e385c/runs/602befac-3ff5-44d7-acac-aeebdc0e4666/tasks/node-0-server__e5e468a3-b2ee-42ee-80e8-edc19a3aef84/task.updates' for status updates: No space left on device Failed to handle status update TASK_FAILED (UUID: bf24c3da-db23-4c82-a09f-a3b859e8cad4) for task node-0-server__e5e468a3-b2ee-42ee-80e8-edc19a3aef84 of framework 9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-0005
> {noformat}
> When the agent restarted, it tried to destroy this orphaned container but failed, then crashed again, resulting in a restart-crash loop:
> {noformat}
> F0901 17:55:06.000000 31114 slave.cpp:4748] CHECK_READY(future): is FAILED: Failed to open '/var/lib/mesos/slave/meta/slaves/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-S5/frameworks/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-0005/executors/node__4ae69c7c-e32e-41d2-a485-88145a3e385c/runs/602befac-3ff5-44d7-acac-aeebdc0e4666/tasks/node-0-server__e5e468a3-b2ee-42ee-80e8-edc19a3aef84/task.updates' for status updates: No space left on device Failed to handle status update TASK_FAILED (UUID: fb9c3951-9a93-4925-a7f0-9ba7e38d2398) for task node-0-server__e5e468a3-b2ee-42ee-80e8-edc19a3aef84 of framework 9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-0005
> {noformat}
> To prevent the above from happening, garbage collection should kick in early enough to clean up the container sandboxes of failed tasks before the disk fills up, as sketched below.
> Related ticket: MESOS-7031
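> The intended behavior is roughly: the agent periodically checks disk usage under its sandbox root and, once usage crosses a configurable threshold, prunes the oldest terminated sandboxes until enough headroom is reclaimed. Below is a minimal, self-contained sketch of that idea, not the actual agent implementation; the sandbox root path, the 90% threshold, and the oldest-first pruning policy are illustrative assumptions only:
> {noformat}
> // Sketch: disk-usage-triggered sandbox garbage collection (illustrative only).
> #include <sys/statvfs.h>
> 
> #include <algorithm>
> #include <filesystem>
> #include <iostream>
> #include <string>
> #include <vector>
> 
> namespace fs = std::filesystem;
> 
> // Fraction of the filesystem currently in use, e.g. 0.95 == 95% full.
> double diskUsage(const std::string& path) {
>   struct statvfs s;
>   if (statvfs(path.c_str(), &s) != 0) {
>     return 0.0;  // Treat an unreadable filesystem as "no pressure".
>   }
>   const double total = static_cast<double>(s.f_blocks);
>   const double avail = static_cast<double>(s.f_bavail);
>   return total > 0.0 ? 1.0 - avail / total : 0.0;
> }
> 
> // Remove the oldest sandbox directories until usage drops below the target.
> void pruneSandboxes(const std::string& sandboxRoot, double target) {
>   if (!fs::exists(sandboxRoot)) {
>     return;
>   }
> 
>   std::vector<fs::path> sandboxes;
>   for (const auto& entry : fs::directory_iterator(sandboxRoot)) {
>     if (entry.is_directory()) {
>       sandboxes.push_back(entry.path());
>     }
>   }
> 
>   // Oldest (least recently modified) sandboxes first.
>   std::sort(sandboxes.begin(), sandboxes.end(),
>             [](const fs::path& a, const fs::path& b) {
>               return fs::last_write_time(a) < fs::last_write_time(b);
>             });
> 
>   for (const auto& sandbox : sandboxes) {
>     if (diskUsage(sandboxRoot) < target) {
>       break;  // Enough headroom has been reclaimed.
>     }
>     std::cout << "GC: removing sandbox " << sandbox << std::endl;
>     std::error_code ec;
>     fs::remove_all(sandbox, ec);  // Ignore errors on individual sandboxes.
>   }
> }
> 
> int main() {
>   // Hypothetical values; a real agent would derive these from its flags
>   // (e.g. a configurable disk headroom) and would only prune sandboxes of
>   // terminated executors, based on its checkpointed state.
>   const std::string sandboxRoot = "/var/lib/mesos/slave/slaves";
>   const double threshold = 0.90;  // Trigger GC at 90% disk usage.
> 
>   if (diskUsage(sandboxRoot) >= threshold) {
>     pruneSandboxes(sandboxRoot, threshold);
>   }
>   return 0;
> }
> {noformat}
> Running such a check on agent startup, and again on a timer while tasks are running, would reclaim space before a fetcher failure or a failed checkpoint (as in the logs above) can wedge the agent.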



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)