Posted to issues@mesos.apache.org by "Jie Yu (JIRA)" <ji...@apache.org> on 2017/11/14 22:53:00 UTC

[jira] [Commented] (MESOS-8161) Potentially dangerous dangling mount when stopping task with persistent volume

    [ https://issues.apache.org/jira/browse/MESOS-8161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16252635#comment-16252635 ] 

Jie Yu commented on MESOS-8161:
-------------------------------

First, why are you sending `TASK_ERROR`? Can you make sure to kill the process associated with the task before the task transitions into a terminal state?

{noformat}
E1101 20:21:14.140836 102231 slave.cpp:4520] Termination of executor 'node-1_executor__cbf96a3e-1b67-44c6-830d-ec1655d7ce05' of framework db61f6d4-fd0f-48be-927d-14282c12301f-0014 failed: Failed to kill all processes in the container: Timed out after 1mins
{noformat}

Looks like the launcher cannot kill all the processes in the container and timed out after 1min. In that case, we won't proceed with the isolator cleanups (which would remove the persistent volume mounts). Do you guys know why the processes cannot be killed within 1min? Can you repro the issue and show the kernel stack for each remaining process in that cgroup (i.e., cat /proc/<pid>/stack for each pid in the cgroup's tasks file)?
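
If it helps, here is a minimal diagnostic sketch for collecting those stacks (this is not part of Mesos; the cgroup path is copied from the log above and is an assumption about your agent's layout, run it on the agent host as root):

{code:python}
#!/usr/bin/env python
# Minimal sketch: print the kernel stack of every process still listed in
# the container's freezer cgroup tasks file.
# Assumption: the cgroup path below is taken from the agent log above.
CGROUP_TASKS = ("/sys/fs/cgroup/freezer/mesos/"
                "da3ecaac-35dc-4464-8204-76577dcde2a8/tasks")

def dump_stacks(tasks_file):
    with open(tasks_file) as f:
        pids = [line.strip() for line in f if line.strip()]
    for pid in pids:
        print("=== pid %s ===" % pid)
        try:
            with open("/proc/%s/stack" % pid) as stack:
                print(stack.read())
        except IOError as e:
            print("could not read stack: %s" % e)

if __name__ == "__main__":
    dump_stacks(CGROUP_TASKS)
{code}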

There's nothing Mesos can do there. We cannot call isolator cleanup if the processes are not killed.
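
For reference, you can also verify from the host mount namespace whether the persistent volume mount shown in the log is still dangling after the container is gone. A minimal sketch (the mount path is copied from the agent log below and assumed to be visible in the host namespace):

{code:python}
#!/usr/bin/env python
# Minimal sketch: check /proc/mounts in the host mount namespace for the
# persistent volume mount that the agent log shows failing to unmount.
# Assumption: the path below is copied from the log and matches your layout.
VOLUME_MOUNT = ("/var/lib/mesos/slaves/db61f6d4-fd0f-48be-927d-14282c12301f-S193"
                "/frameworks/db61f6d4-fd0f-48be-927d-14282c12301f-0014"
                "/executors/node-1_executor__cbf96a3e-1b67-44c6-830d-ec1655d7ce05"
                "/runs/da3ecaac-35dc-4464-8204-76577dcde2a8/volume")

def is_still_mounted(target):
    # The second whitespace-separated field in /proc/mounts is the mount point.
    with open("/proc/mounts") as f:
        return any(line.split()[1] == target for line in f)

if __name__ == "__main__":
    print("dangling" if is_still_mounted(VOLUME_MOUNT) else "not mounted")
{code}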



> Potentially dangerous dangling mount when stopping task with persistent volume
> ------------------------------------------------------------------------------
>
>                 Key: MESOS-8161
>                 URL: https://issues.apache.org/jira/browse/MESOS-8161
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Zhitao Li
>            Priority: Critical
>
> While we fixed a case in MESOS-7366 for when an executor terminates, it seems like a very similar case can still happen if a task with a persistent volume terminates while the executor is still active, and [this unmount call|https://github.com/apache/mesos/blob/6f98b8d6d149c5497d16f588c683a68fccba4fc9/src/slave/containerizer/mesos/isolators/filesystem/linux.cpp#L489] fails due to "device busy".
> I believe that if agent gc or other operations run in the host mount namespace, it is possible to lose persistent volume data because of this.
> Agent log:
> {code:none}
> I1101 20:19:44.137109 102240 slave.cpp:3961] Sending acknowledgement for status update TASK_RUNNING (UUID: ecdd32b8-8eba-40c5-92c8-3398310f142b) for task node-1__23fa9624-4608-404f-8d6f-02355595888f of framework db61f6d4-fd0f-48be-927d-14282c12301f-0014 to executor(1)@10.70.142.140:36929
> I1101 20:19:44.235196 102233 status_update_manager.cpp:395] Received status update acknowledgement (UUID: ecdd32b8-8eba-40c5-92c8-3398310f142b) for task node-1__23fa9624-4608-404f-8d6f-02355595888f of framework db61f6d4-fd0f-48be-927d-14282c12301f-0014
> I1101 20:19:44.235302 102233 status_update_manager.cpp:832] Checkpointing ACK for status update TASK_RUNNING (UUID: ecdd32b8-8eba-40c5-92c8-3398310f142b) for task node-1__23fa9624-4608-404f-8d6f-02355595888f of framework db61f6d4-fd0f-48be-927d-14282c12301f-0014
> I1101 20:19:59.135591 102213 slave.cpp:3634] Handling status update TASK_RUNNING (UUID: c1667f59-b404-43ab-b096-b12397fb00f0) for task node-1__23fa9624-4608-404f-8d6f-02355595888f of framework db61f6d4-fd0f-48be-927d-14282c12301f-0014 from executor(1)@10.70.142.140:36929
> I1101 20:19:59.136494 102216 status_update_manager.cpp:323] Received status update TASK_RUNNING (UUID: c1667f59-b404-43ab-b096-b12397fb00f0) for task node-1__23fa9624-4608-404f-8d6f-02355595888f of framework db61f6d4-fd0f-48be-927d-14282c12301f-0014
> I1101 20:19:59.136540 102216 status_update_manager.cpp:832] Checkpointing UPDATE for status update TASK_RUNNING (UUID: c1667f59-b404-43ab-b096-b12397fb00f0) for task node-1__23fa9624-4608-404f-8d6f-02355595888f of framework db61f6d4-fd0f-48be-927d-14282c12301f-0014
> I1101 20:19:59.136724 102234 slave.cpp:4051] Forwarding the update TASK_RUNNING (UUID: c1667f59-b404-43ab-b096-b12397fb00f0) for task node-1__23fa9624-4608-404f-8d6f-02355595888f of framework db61f6d4-fd0f-48be-927d-14282c12301f-0014 to master@10.162.12.31:5050
> I1101 20:19:59.136867 102234 slave.cpp:3961] Sending acknowledgement for status update TASK_RUNNING (UUID: c1667f59-b404-43ab-b096-b12397fb00f0) for task node-1__23fa9624-4608-404f-8d6f-02355595888f of framework db61f6d4-fd0f-48be-927d-14282c12301f-0014 to executor(1)@10.70.142.140:36929
> I1101 20:20:02.010108 102223 http.cpp:277] HTTP GET for /slave(1)/flags from 10.70.142.140:43046 with User-Agent='Python-urllib/2.7'
> I1101 20:20:02.038574 102238 http.cpp:277] HTTP GET for /slave(1)/flags from 10.70.142.140:43144 with User-Agent='Python-urllib/2.7'
> I1101 20:20:02.246388 102237 slave.cpp:5044] Current disk usage 0.23%. Max allowed age: 6.283560425078715days
> I1101 20:20:02.445312 102235 http.cpp:277] HTTP GET for /slave(1)/state.json from 10.70.142.140:44716 with User-Agent='Python-urllib/2.7'
> I1101 20:20:02.448276 102215 http.cpp:277] HTTP GET for /slave(1)/flags from 10.70.142.140:44732 with User-Agent='Python-urllib/2.7'
> I1101 20:20:07.789482 102231 http.cpp:277] HTTP GET for /slave(1)/state.json from 10.70.142.140:56414 with User-Agent='filebundle-agent'
> I1101 20:20:07.913359 102216 status_update_manager.cpp:395] Received status update acknowledgement (UUID: c1667f59-b404-43ab-b096-b12397fb00f0) for task node-1__23fa9624-4608-404f-8d6f-02355595888f of framework db61f6d4-fd0f-48be-927d-14282c12301f-0014
> I1101 20:20:07.913455 102216 status_update_manager.cpp:832] Checkpointing ACK for status update TASK_RUNNING (UUID: c1667f59-b404-43ab-b096-b12397fb00f0) for task node-1__23fa9624-4608-404f-8d6f-02355595888f of framework db61f6d4-fd0f-48be-927d-14282c12301f-0014
> I1101 20:20:14.135632 102231 slave.cpp:3634] Handling status update TASK_ERROR (UUID: 913c25be-dfb6-4ad8-874f-d8e1c789ccc0) for task node-1__23fa9624-4608-404f-8d6f-02355595888f of framework db61f6d4-fd0f-48be-927d-14282c12301f-0014 from executor(1)@10.70.142.140:36929
> E1101 20:20:14.136687 102211 slave.cpp:6736] Unexpected terminal task state TASK_ERROR
> I1101 20:20:14.137081 102230 linux.cpp:627] Removing mount '/var/lib/mesos/slaves/db61f6d4-fd0f-48be-927d-14282c12301f-S193/frameworks/db61f6d4-fd0f-48be-927d-14282c12301f-0014/executors/node-1_executor__cbf96a3e-1b67-44c6-830d-ec1655d7ce05/runs/da3ecaac-35dc-4464-8204-76577dcde2a8/volume' for persistent volume disk(cassandra-test-varung5-framework, cassandra, {resource_id: 411a63af-0fea-4d2d-b850-a77039756e99})[a24515db-5b6b-4538-b1da-5c1acf0fe286:volume]:2000000 of container da3ecaac-35dc-4464-8204-76577dcde2a8
> I1101 20:20:14.137557 102241 disk.cpp:207] Updating the disk resources for container da3ecaac-35dc-4464-8204-76577dcde2a8 to cpus(cassandra-test-varung5-framework, cassandra, {resource_id: 58e9c94c-7512-45d2-bf7e-453ec42b55bf}):0.1; mem(cassandra-test-varung5-framework, cassandra, {resource_id: 57c52b95-5816-43cd-b8c4-8a46d4271788}):768; ports(cassandra-test-varung5-framework, cassandra, {resource_id: 6ed4bec0-b7bf-4513-8c65-197f6e3e1a44}):[31001-31001]
> I1101 20:20:14.137765 102241 disk.cpp:312] Checking disk usage at '/var/lib/mesos/slaves/db61f6d4-fd0f-48be-927d-14282c12301f-S193/frameworks/db61f6d4-fd0f-48be-927d-14282c12301f-0014/executors/node-1_executor__cbf96a3e-1b67-44c6-830d-ec1655d7ce05/runs/da3ecaac-35dc-4464-8204-76577dcde2a8/volume' for container da3ecaac-35dc-4464-8204-76577dcde2a8 has been cancelled
> I1101 20:20:14.137763 102227 memory.cpp:199] Updated 'memory.soft_limit_in_bytes' to 768MB for container da3ecaac-35dc-4464-8204-76577dcde2a8
> E1101 20:20:14.138339 102212 slave.cpp:3903] Failed to update resources for container da3ecaac-35dc-4464-8204-76577dcde2a8 of executor 'node-1_executor__cbf96a3e-1b67-44c6-830d-ec1655d7ce05' running task node-1__23fa9624-4608-404f-8d6f-02355595888f on status update for terminal task, destroying container: Collect failed: Failed to unmount unneeded persistent volume at '/var/lib/mesos/slaves/db61f6d4-fd0f-48be-927d-14282c12301f-S193/frameworks/db61f6d4-fd0f-48be-927d-14282c12301f-0014/executors/node-1_executor__cbf96a3e-1b67-44c6-830d-ec1655d7ce05/runs/da3ecaac-35dc-4464-8204-76577dcde2a8/volume': Failed to unmount '/var/lib/mesos/slaves/db61f6d4-fd0f-48be-927d-14282c12301f-S193/frameworks/db61f6d4-fd0f-48be-927d-14282c12301f-0014/executors/node-1_executor__cbf96a3e-1b67-44c6-830d-ec1655d7ce05/runs/da3ecaac-35dc-4464-8204-76577dcde2a8/volume': Device or resource busy
> I1101 20:20:14.138587 102220 status_update_manager.cpp:323] Received status update TASK_ERROR (UUID: 913c25be-dfb6-4ad8-874f-d8e1c789ccc0) for task node-1__23fa9624-4608-404f-8d6f-02355595888f of framework db61f6d4-fd0f-48be-927d-14282c12301f-0014
> I1101 20:20:14.138608 102233 containerizer.cpp:1955] Destroying container da3ecaac-35dc-4464-8204-76577dcde2a8 in RUNNING state
> I1101 20:20:14.138664 102220 status_update_manager.cpp:832] Checkpointing UPDATE for status update TASK_ERROR (UUID: 913c25be-dfb6-4ad8-874f-d8e1c789ccc0) for task node-1__23fa9624-4608-404f-8d6f-02355595888f of framework db61f6d4-fd0f-48be-927d-14282c12301f-0014
> I1101 20:20:14.138756 102232 linux_launcher.cpp:498] Asked to destroy container da3ecaac-35dc-4464-8204-76577dcde2a8
> I1101 20:20:14.138847 102214 slave.cpp:4051] Forwarding the update TASK_ERROR (UUID: 913c25be-dfb6-4ad8-874f-d8e1c789ccc0) for task node-1__23fa9624-4608-404f-8d6f-02355595888f of framework db61f6d4-fd0f-48be-927d-14282c12301f-0014 to master@10.162.12.31:5050
> I1101 20:20:14.138978 102214 slave.cpp:3961] Sending acknowledgement for status update TASK_ERROR (UUID: 913c25be-dfb6-4ad8-874f-d8e1c789ccc0) for task node-1__23fa9624-4608-404f-8d6f-02355595888f of framework db61f6d4-fd0f-48be-927d-14282c12301f-0014 to executor(1)@10.70.142.140:36929
> I1101 20:20:14.139230 102232 linux_launcher.cpp:541] Using freezer to destroy cgroup mesos/da3ecaac-35dc-4464-8204-76577dcde2a8
> I1101 20:20:14.139691 102227 cpu.cpp:101] Updated 'cpu.shares' to 102 (cpus 0.1) for container da3ecaac-35dc-4464-8204-76577dcde2a8
> I1101 20:20:14.140110 102235 cgroups.cpp:2705] Freezing cgroup /sys/fs/cgroup/freezer/mesos/da3ecaac-35dc-4464-8204-76577dcde2a8
> I1101 20:20:14.141026 102227 cpu.cpp:121] Updated 'cpu.cfs_period_us' to 100ms and 'cpu.cfs_quota_us' to 10ms (cpus 0.1) for container da3ecaac-35dc-4464-8204-76577dcde2a8
> I1101 20:20:14.243934 102236 cgroups.cpp:1439] Successfully froze cgroup /sys/fs/cgroup/freezer/mesos/da3ecaac-35dc-4464-8204-76577dcde2a8 after 103.768064ms
> I1101 20:20:14.246510 102236 cgroups.cpp:2723] Thawing cgroup /sys/fs/cgroup/freezer/mesos/da3ecaac-35dc-4464-8204-76577dcde2a8
> I1101 20:20:14.249119 102218 cgroups.cpp:1468] Successfully thawed cgroup /sys/fs/cgroup/freezer/mesos/da3ecaac-35dc-4464-8204-76577dcde2a8 after 2.564096ms
> I1101 20:20:14.945940 102213 slave.cpp:4179] Got exited event for executor(1)@10.70.142.140:36929
> I1101 20:20:18.942217 102236 http.cpp:277] HTTP GET for /slave(1)/state from 10.70.142.140:52050 with User-Agent='python-requests/2.4.3 CPython/2.7.10 Linux/4.4.82'
> I1101 20:20:20.412132 102235 status_update_manager.cpp:395] Received status update acknowledgement (UUID: 913c25be-dfb6-4ad8-874f-d8e1c789ccc0) for task node-1__23fa9624-4608-404f-8d6f-02355595888f of framework db61f6d4-fd0f-48be-927d-14282c12301f-0014
> I1101 20:20:20.412247 102235 status_update_manager.cpp:832] Checkpointing ACK for status update TASK_ERROR (UUID: 913c25be-dfb6-4ad8-874f-d8e1c789ccc0) for task node-1__23fa9624-4608-404f-8d6f-02355595888f of framework db61f6d4-fd0f-48be-927d-14282c12301f-0014
> W1101 20:20:20.665359 102213 slave.cpp:2669] Ignoring updating pid for framework db61f6d4-fd0f-48be-927d-14282c12301f-0005 because it does not exist
> I1101 20:20:28.572569 102237 http.cpp:277] HTTP GET for /slave(1)/state.json from 10.70.142.140:48290 with User-Agent='Go-http-client/1.1'
> W1101 20:20:28.573552 102215 containerizer.cpp:1876] Skipping status for container da3ecaac-35dc-4464-8204-76577dcde2a8 because: Container does not exist
> I1101 20:20:28.619093 102233 http.cpp:277] HTTP GET for /slave(1)/state.json from 10.70.142.140:48292 with User-Agent='Go-http-client/1.1'
> I1101 20:20:37.789496 102218 http.cpp:277] HTTP GET for /slave(1)/state.json from 10.70.142.140:58214 with User-Agent='filebundle-agent'
> I1101 20:21:01.325791 102215 http.cpp:277] HTTP GET for /slave(1)/flags from 10.70.142.140:55670 with User-Agent='Python-urllib/2.7'
> I1101 20:21:01.333392 102225 http.cpp:277] HTTP GET for /slave(1)/state.json from 10.70.142.140:55691 with User-Agent='Python-urllib/2.7'
> I1101 20:21:01.335985 102215 http.cpp:277] HTTP GET for /slave(1)/flags from 10.70.142.140:55704 with User-Agent='Python-urllib/2.7'
> I1101 20:21:01.428691 102241 http.cpp:277] HTTP GET for /slave(1)/flags from 10.70.142.140:55980 with User-Agent='Python-urllib/2.7'
> I1101 20:21:02.247719 102214 slave.cpp:5044] Current disk usage 0.23%. Max allowed age: 6.283563174420139days
> I1101 20:21:07.789693 102233 http.cpp:277] HTTP GET for /slave(1)/state.json from 10.70.142.140:44932 with User-Agent='filebundle-agent'
> I1101 20:21:09.948168 102236 http.cpp:277] HTTP GET for /slave(1)/state from 10.70.142.140:51804 with User-Agent='python-requests/2.4.3 CPython/2.7.10 Linux/4.4.82'
> E1101 20:21:14.140836 102231 slave.cpp:4520] Termination of executor 'node-1_executor__cbf96a3e-1b67-44c6-830d-ec1655d7ce05' of framework db61f6d4-fd0f-48be-927d-14282c12301f-0014 failed: Failed to kill all processes in the container: Timed out after 1mins
> I1101 20:21:14.140986 102231 slave.cpp:4646] Cleaning up executor 'node-1_executor__cbf96a3e-1b67-44c6-830d-ec1655d7ce05' of framework db61f6d4-fd0f-48be-927d-14282c12301f-0014 at executor(1)@10.70.142.140:36929
> {code}


