You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@mesos.apache.org by "Gilbert Song (JIRA)" <ji...@apache.org> on 2018/01/20 01:32:00 UTC

[jira] [Commented] (MESOS-8161) Potentially dangerous dangling mount when stopping task with persistent volume

    [ https://issues.apache.org/jira/browse/MESOS-8161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16333107#comment-16333107 ] 

Gilbert Song commented on MESOS-8161:
-------------------------------------

[~zhitao] , any update on this issue? Could we close it if it is not a Mesos issue?

I am updating the priority to `Major` for now. Please change it back if necessary.

> Potentially dangerous dangling mount when stopping task with persistent volume
> ------------------------------------------------------------------------------
>
>                 Key: MESOS-8161
>                 URL: https://issues.apache.org/jira/browse/MESOS-8161
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Zhitao Li
>            Priority: Critical
>
> While we fixed a case in MESOS-7366 when an executor terminates, it seems like a very similar case can still happen if a task with a persistent volume terminates, executor still active, and [this unmount call|https://github.com/apache/mesos/blob/6f98b8d6d149c5497d16f588c683a68fccba4fc9/src/slave/containerizer/mesos/isolators/filesystem/linux.cpp#L489] fails due to "device busy".
> I believe if agent gc or something other things run on the host mount namespace, it is possible to lose persistent volume data because of this.
> Agent log:
> {code:none}
> I1101 20:19:44.137109 102240 slave.cpp:3961] Sending acknowledgement for status update TASK_RUNNING (UUID: ecdd32b8-8eba-40c5-92c8-3398310f142b) for task node-1__23fa9624-4608-404f-8d6f-0235559588
> 8f of framework db61f6d4-fd0f-48be-927d-14282c12301f-0014 to executor(1)@10.70.142.140:36929
> I1101 20:19:44.235196 102233 status_update_manager.cpp:395] Received status update acknowledgement (UUID: ecdd32b8-8eba-40c5-92c8-3398310f142b) for task node-1__23fa9624-4608-404f-8d6f-02355595888
> f of framework db61f6d4-fd0f-48be-927d-14282c12301f-0014
> I1101 20:19:44.235302 102233 status_update_manager.cpp:832] Checkpointing ACK for status update TASK_RUNNING (UUID: ecdd32b8-8eba-40c5-92c8-3398310f142b) for task node-1__23fa9624-4608-404f-8d6f-0
> 2355595888f of framework db61f6d4-fd0f-48be-927d-14282c12301f-0014
> I1101 20:19:59.135591 102213 slave.cpp:3634] Handling status update TASK_RUNNING (UUID: c1667f59-b404-43ab-b096-b12397fb00f0) for task node-1__23fa9624-4608-404f-8d6f-02355595888f of framework db6
> 1f6d4-fd0f-48be-927d-14282c12301f-0014 from executor(1)@10.70.142.140:36929
> I1101 20:19:59.136494 102216 status_update_manager.cpp:323] Received status update TASK_RUNNING (UUID: c1667f59-b404-43ab-b096-b12397fb00f0) for task node-1__23fa9624-4608-404f-8d6f-02355595888f o
> f framework db61f6d4-fd0f-48be-927d-14282c12301f-0014
> I1101 20:19:59.136540 102216 status_update_manager.cpp:832] Checkpointing UPDATE for status update TASK_RUNNING (UUID: c1667f59-b404-43ab-b096-b12397fb00f0) for task node-1__23fa9624-4608-404f-8d6
> f-02355595888f of framework db61f6d4-fd0f-48be-927d-14282c12301f-0014
> I1101 20:19:59.136724 102234 slave.cpp:4051] Forwarding the update TASK_RUNNING (UUID: c1667f59-b404-43ab-b096-b12397fb00f0) for task node-1__23fa9624-4608-404f-8d6f-02355595888f of framework db61
> f6d4-fd0f-48be-927d-14282c12301f-0014 to master@10.162.12.31:5050
> I1101 20:19:59.136867 102234 slave.cpp:3961] Sending acknowledgement for status update TASK_RUNNING (UUID: c1667f59-b404-43ab-b096-b12397fb00f0) for task node-1__23fa9624-4608-404f-8d6f-0235559588
> 8f of framework db61f6d4-fd0f-48be-927d-14282c12301f-0014 to executor(1)@10.70.142.140:36929
> I1101 20:20:02.010108 102223 http.cpp:277] HTTP GET for /slave(1)/flags from 10.70.142.140:43046 with User-Agent='Python-urllib/2.7'
> I1101 20:20:02.038574 102238 http.cpp:277] HTTP GET for /slave(1)/flags from 10.70.142.140:43144 with User-Agent='Python-urllib/2.7'
> I1101 20:20:02.246388 102237 slave.cpp:5044] Current disk usage 0.23%. Max allowed age: 6.283560425078715days
> I1101 20:20:02.445312 102235 http.cpp:277] HTTP GET for /slave(1)/state.json from 10.70.142.140:44716 with User-Agent='Python-urllib/2.7'
> I1101 20:20:02.448276 102215 http.cpp:277] HTTP GET for /slave(1)/flags from 10.70.142.140:44732 with User-Agent='Python-urllib/2.7'
> I1101 20:20:07.789482 102231 http.cpp:277] HTTP GET for /slave(1)/state.json from 10.70.142.140:56414 with User-Agent='filebundle-agent'
> I1101 20:20:07.913359 102216 status_update_manager.cpp:395] Received status update acknowledgement (UUID: c1667f59-b404-43ab-b096-b12397fb00f0) for task node-1__23fa9624-4608-404f-8d6f-02355595888
> f of framework db61f6d4-fd0f-48be-927d-14282c12301f-0014
> I1101 20:20:07.913455 102216 status_update_manager.cpp:832] Checkpointing ACK for status update TASK_RUNNING (UUID: c1667f59-b404-43ab-b096-b12397fb00f0) for task node-1__23fa9624-4608-404f-8d6f-0
> 2355595888f of framework db61f6d4-fd0f-48be-927d-14282c12301f-0014
> I1101 20:20:14.135632 102231 slave.cpp:3634] Handling status update TASK_ERROR (UUID: 913c25be-dfb6-4ad8-874f-d8e1c789ccc0) for task node-1__23fa9624-4608-404f-8d6f-02355595888f of framework db61f
> 6d4-fd0f-48be-927d-14282c12301f-0014 from executor(1)@10.70.142.140:36929
> E1101 20:20:14.136687 102211 slave.cpp:6736] Unexpected terminal task state TASK_ERROR
> I1101 20:20:14.137081 102230 linux.cpp:627] Removing mount '/var/lib/mesos/slaves/db61f6d4-fd0f-48be-927d-14282c12301f-S193/frameworks/db61f6d4-fd0f-48be-927d-14282c12301f-0014/executors/node-1_ex
> ecutor__cbf96a3e-1b67-44c6-830d-ec1655d7ce05/runs/da3ecaac-35dc-4464-8204-76577dcde2a8/volume' for persistent volume disk(cassandra-test-varung5-framework, cassandra, {resource_id: 411a63af-0fea-4
> d2d-b850-a77039756e99})[a24515db-5b6b-4538-b1da-5c1acf0fe286:volume]:2000000 of container da3ecaac-35dc-4464-8204-76577dcde2a8
> I1101 20:20:14.137557 102241 disk.cpp:207] Updating the disk resources for container da3ecaac-35dc-4464-8204-76577dcde2a8 to cpus(cassandra-test-varung5-framework, cassandra, {resource_id: 58e9c94
> c-7512-45d2-bf7e-453ec42b55bf}):0.1; mem(cassandra-test-varung5-framework, cassandra, {resource_id: 57c52b95-5816-43cd-b8c4-8a46d4271788}):768; ports(cassandra-test-varung5-framework, cassandra, {
> resource_id: 6ed4bec0-b7bf-4513-8c65-197f6e3e1a44}):[31001-31001]
> I1101 20:20:14.137765 102241 disk.cpp:312] Checking disk usage at '/var/lib/mesos/slaves/db61f6d4-fd0f-48be-927d-14282c12301f-S193/frameworks/db61f6d4-fd0f-48be-927d-14282c12301f-0014/executors/no
> de-1_executor__cbf96a3e-1b67-44c6-830d-ec1655d7ce05/runs/da3ecaac-35dc-4464-8204-76577dcde2a8/volume' for container da3ecaac-35dc-4464-8204-76577dcde2a8 has been cancelled
> I1101 20:20:14.137763 102227 memory.cpp:199] Updated 'memory.soft_limit_in_bytes' to 768MB for container da3ecaac-35dc-4464-8204-76577dcde2a8
> E1101 20:20:14.138339 102212 slave.cpp:3903] Failed to update resources for container da3ecaac-35dc-4464-8204-76577dcde2a8 of executor 'node-1_executor__cbf96a3e-1b67-44c6-830d-ec1655d7ce05' running task node-1__23fa9624-4608-404f-8d6f-02355595888f on status update for terminal task, destroying container: Collect failed: Failed to unmount unneeded persistent volume at '/var/lib/mesos/slaves/db61f6d4-fd0f-48be-927d-14282c12301f-S193/frameworks/db61f6d4-fd0f-48be-927d-14282c12301f-0014/executors/node-1_executor__cbf96a3e-1b67-44c6-830d-ec1655d7ce05/runs/da3ecaac-35dc-4464-8204-76577dcde2a8/volume': Failed to unmount '/var/lib/mesos/slaves/db61f6d4-fd0f-48be-927d-14282c12301f-S193/frameworks/db61f6d4-fd0f-48be-927d-14282c12301f-0014/executors/node-1_executor__cbf96a3e-1b67-44c6-830d-ec1655d7ce05/runs/da3ecaac-35dc-4464-8204-76577dcde2a8/volume': Device or resource busy
> I1101 20:20:14.138587 102220 status_update_manager.cpp:323] Received status update TASK_ERROR (UUID: 913c25be-dfb6-4ad8-874f-d8e1c789ccc0) for task node-1__23fa9624-4608-404f-8d6f-02355595888f of framework db61f6d4-fd0f-48be-927d-14282c12301f-0014
> I1101 20:20:14.138608 102233 containerizer.cpp:1955] Destroying container da3ecaac-35dc-4464-8204-76577dcde2a8 in RUNNING state
> I1101 20:20:14.138664 102220 status_update_manager.cpp:832] Checkpointing UPDATE for status update TASK_ERROR (UUID: 913c25be-dfb6-4ad8-874f-d8e1c789ccc0) for task node-1__23fa9624-4608-404f-8d6f-02355595888f of framework db61f6d4-fd0f-48be-927d-14282c12301f-0014
> I1101 20:20:14.138756 102232 linux_launcher.cpp:498] Asked to destroy container da3ecaac-35dc-4464-8204-76577dcde2a8
> I1101 20:20:14.138847 102214 slave.cpp:4051] Forwarding the update TASK_ERROR (UUID: 913c25be-dfb6-4ad8-874f-d8e1c789ccc0) for task node-1__23fa9624-4608-404f-8d6f-02355595888f of framework db61f6d4-fd0f-48be-927d-14282c12301f-0014 to master@10.162.12.31:5050
> I1101 20:20:14.138978 102214 slave.cpp:3961] Sending acknowledgement for status update TASK_ERROR (UUID: 913c25be-dfb6-4ad8-874f-d8e1c789ccc0) for task node-1__23fa9624-4608-404f-8d6f-02355595888f of framework db61f6d4-fd0f-48be-927d-14282c12301f-0014 to executor(1)@10.70.142.140:36929
> I1101 20:20:14.139230 102232 linux_launcher.cpp:541] Using freezer to destroy cgroup mesos/da3ecaac-35dc-4464-8204-76577dcde2a8
> I1101 20:20:14.139691 102227 cpu.cpp:101] Updated 'cpu.shares' to 102 (cpus 0.1) for container da3ecaac-35dc-4464-8204-76577dcde2a8
> I1101 20:20:14.140110 102235 cgroups.cpp:2705] Freezing cgroup /sys/fs/cgroup/freezer/mesos/da3ecaac-35dc-4464-8204-76577dcde2a8
> I1101 20:20:14.141026 102227 cpu.cpp:121] Updated 'cpu.cfs_period_us' to 100ms and 'cpu.cfs_quota_us' to 10ms (cpus 0.1) for container da3ecaac-35dc-4464-8204-76577dcde2a8
> I1101 20:20:14.243934 102236 cgroups.cpp:1439] Successfully froze cgroup /sys/fs/cgroup/freezer/mesos/da3ecaac-35dc-4464-8204-76577dcde2a8 after 103.768064ms
> I1101 20:20:14.246510 102236 cgroups.cpp:2723] Thawing cgroup /sys/fs/cgroup/freezer/mesos/da3ecaac-35dc-4464-8204-76577dcde2a8
> I1101 20:20:14.249119 102218 cgroups.cpp:1468] Successfully thawed cgroup /sys/fs/cgroup/freezer/mesos/da3ecaac-35dc-4464-8204-76577dcde2a8 after 2.564096ms
> I1101 20:20:14.945940 102213 slave.cpp:4179] Got exited event for executor(1)@10.70.142.140:36929
> I1101 20:20:18.942217 102236 http.cpp:277] HTTP GET for /slave(1)/state from 10.70.142.140:52050 with User-Agent='python-requests/2.4.3 CPython/2.7.10 Linux/4.4.82'
> I1101 20:20:20.412132 102235 status_update_manager.cpp:395] Received status update acknowledgement (UUID: 913c25be-dfb6-4ad8-874f-d8e1c789ccc0) for task node-1__23fa9624-4608-404f-8d6f-02355595888f of framework db61f6d4-fd0f-48be-927d-14282c12301f-0014
> I1101 20:20:20.412247 102235 status_update_manager.cpp:832] Checkpointing ACK for status update TASK_ERROR (UUID: 913c25be-dfb6-4ad8-874f-d8e1c789ccc0) for task node-1__23fa9624-4608-404f-8d6f-02355595888f of framework db61f6d4-fd0f-48be-927d-14282c12301f-0014
> W1101 20:20:20.665359 102213 slave.cpp:2669] Ignoring updating pid for framework db61f6d4-fd0f-48be-927d-14282c12301f-0005 because it does not exist
> I1101 20:20:28.572569 102237 http.cpp:277] HTTP GET for /slave(1)/state.json from 10.70.142.140:48290 with User-Agent='Go-http-client/1.1'
> W1101 20:20:28.573552 102215 containerizer.cpp:1876] Skipping status for container da3ecaac-35dc-4464-8204-76577dcde2a8 because: Container does not exist
> I1101 20:20:28.619093 102233 http.cpp:277] HTTP GET for /slave(1)/state.json from 10.70.142.140:48292 with User-Agent='Go-http-client/1.1'
> I1101 20:20:37.789496 102218 http.cpp:277] HTTP GET for /slave(1)/state.json from 10.70.142.140:58214 with User-Agent='filebundle-agent'
> I1101 20:21:01.325791 102215 http.cpp:277] HTTP GET for /slave(1)/flags from 10.70.142.140:55670 with User-Agent='Python-urllib/2.7'
> I1101 20:21:01.333392 102225 http.cpp:277] HTTP GET for /slave(1)/state.json from 10.70.142.140:55691 with User-Agent='Python-urllib/2.7'
> I1101 20:21:01.335985 102215 http.cpp:277] HTTP GET for /slave(1)/flags from 10.70.142.140:55704 with User-Agent='Python-urllib/2.7'
> I1101 20:21:01.428691 102241 http.cpp:277] HTTP GET for /slave(1)/flags from 10.70.142.140:55980 with User-Agent='Python-urllib/2.7'
> I1101 20:21:02.247719 102214 slave.cpp:5044] Current disk usage 0.23%. Max allowed age: 6.283563174420139days
> I1101 20:21:07.789693 102233 http.cpp:277] HTTP GET for /slave(1)/state.json from 10.70.142.140:44932 with User-Agent='filebundle-agent'
> I1101 20:21:09.948168 102236 http.cpp:277] HTTP GET for /slave(1)/state from 10.70.142.140:51804 with User-Agent='python-requests/2.4.3 CPython/2.7.10 Linux/4.4.82'
> E1101 20:21:14.140836 102231 slave.cpp:4520] Termination of executor 'node-1_executor__cbf96a3e-1b67-44c6-830d-ec1655d7ce05' of framework db61f6d4-fd0f-48be-927d-14282c12301f-0014 failed: Failed to kill all processes in the container: Timed out after 1mins
> I1101 20:21:14.140986 102231 slave.cpp:4646] Cleaning up executor 'node-1_executor__cbf96a3e-1b67-44c6-830d-ec1655d7ce05' of framework db61f6d4-fd0f-48be-927d-14282c12301f-0014 at executor(1)@10.70.142.140:36929
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)