Posted to issues@mesos.apache.org by "Zhitao Li (JIRA)" <ji...@apache.org> on 2017/09/20 23:39:01 UTC

[jira] [Commented] (MESOS-7366) Agent sandbox gc could accidentally delete the entire persistent volume content

    [ https://issues.apache.org/jira/browse/MESOS-7366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16174009#comment-16174009 ] 

Zhitao Li commented on MESOS-7366:
----------------------------------

[~jieyu], sorry for reviving this task, but we might have missed a case for {{unmount}} in linux.cpp. [This unmount call|https://github.com/apache/mesos/blob/6f98b8d6d149c5497d16f588c683a68fccba4fc9/src/slave/containerizer/mesos/isolators/filesystem/linux.cpp#L489] can still fail if the device is busy.
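
To make the risk concrete: if an unmount fails with "Device or resource busy" and the sandbox is later removed recursively, the removal can descend into the still-mounted persistent volume. Below is a minimal sketch of a guard that gc could consult before deleting a directory. It is not the actual Mesos fix; the helper names {{isUnder}} and {{hasMountUnder}} are hypothetical. It assumes the caller passes in the contents of /proc/self/mountinfo, where the fifth whitespace-separated field of each line is the mount point, and it does not decode octal escapes such as \040 in paths.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: returns true if `path` equals `dir` or lies beneath it.
bool isUnder(const std::string& path, const std::string& dir)
{
  if (dir.empty()) {
    return false;
  }

  // Normalize away trailing slashes so "/a/b/" and "/a/b" compare equal.
  std::string base = dir;
  while (base.size() > 1 && base.back() == '/') {
    base.pop_back();
  }

  if (path == base) {
    return true;
  }

  if (base == "/") {
    return !path.empty() && path[0] == '/';
  }

  // Require a '/' right after the prefix so "/a/bc" is not treated as
  // being under "/a/b".
  return path.size() > base.size() &&
         path.compare(0, base.size(), base) == 0 &&
         path[base.size()] == '/';
}

// Hypothetical guard: scans mountinfo-formatted text and reports whether
// any mount point is at or below `dir`. A gc routine could skip (or retry
// later) any directory for which this returns true.
bool hasMountUnder(const std::string& mountinfo, const std::string& dir)
{
  std::istringstream lines(mountinfo);
  std::string line;

  while (std::getline(lines, line)) {
    std::istringstream fields(line);
    std::string id, parent, device, root, mountPoint;

    // Fields 1-5: mount ID, parent ID, major:minor, root, mount point.
    if (!(fields >> id >> parent >> device >> root >> mountPoint)) {
      continue; // Skip malformed or empty lines.
    }

    if (isUnder(mountPoint, dir)) {
      return true;
    }
  }

  return false;
}
```

A caller would read /proc/self/mountinfo into a string and call {{hasMountUnder(contents, sandboxDir)}} right before the recursive delete. There is still a window between the check and the delete, so this only narrows the race rather than closing it.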


> Agent sandbox gc could accidentally delete the entire persistent volume content
> -------------------------------------------------------------------------------
>
>                 Key: MESOS-7366
>                 URL: https://issues.apache.org/jira/browse/MESOS-7366
>             Project: Mesos
>          Issue Type: Bug
>    Affects Versions: 1.0.2, 1.1.1, 1.2.0
>            Reporter: Zhitao Li
>            Assignee: Jie Yu
>            Priority: Blocker
>             Fix For: 1.0.4, 1.1.2, 1.2.1
>
>
> When 1) a persistent volume is mounted, 2) the unmount fails or hangs, and 3) executor directory gc is invoked, the agent seems to emit a log like:
> ```
>  Failed to delete directory  <executor_dir>/runs/<uuid>/volume: Device or resource busy
> ```
> After this, the persistent volume directory is empty.
> This could cause data loss for critical workloads, so we should fix this ASAP.
> The triggering environment is a custom executor w/o rootfs image.
> Please let me know if you need more signal.
> {noformat}
> I0407 15:18:22.752624 22758 paths.cpp:536] Trying to chown '/var/lib/mesos/slaves/91ec544d-ac98-4958-bd7f-85d1f7822421-S3296/frameworks/5d030fd5-0fb6-4366-9dee-706261fa0749-0014/executors/node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7/runs/d5a56564-3e24-4c60-9919-746710b78377' to user 'uber'
> I0407 15:18:22.763229 22758 slave.cpp:6179] Launching executor 'node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7' of framework 5d030fd5-0fb6-4366-9dee-706261fa0749-0014 with resources cpus(cassandra-cstar-location-store, cassandra, {resource_id: 29e2ac63-d605-4982-a463-fa311be94e0a}):0.1; mem(cassandra-cstar-location-store, cassandra, {resource_id: 2e1223f3-41a2-419f-85cc-cbc839c19c70}):768; ports(cassandra-cstar-location-store, cassandra, {resource_id: fdd6598f-f32b-4c90-a622-226684528139}):[31001-31001] in work directory '/var/lib/mesos/slaves/91ec544d-ac98-4958-bd7f-85d1f7822421-S3296/frameworks/5d030fd5-0fb6-4366-9dee-706261fa0749-0014/executors/node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7/runs/d5a56564-3e24-4c60-9919-746710b78377'
> I0407 15:18:22.764103 22758 slave.cpp:1987] Queued task 'node-29__c6fdf823-e31a-4b78-a34f-e47e749c07f4' for executor 'node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7' of framework 5d030fd5-0fb6-4366-9dee-706261fa0749-0014
> I0407 15:18:22.766253 22764 containerizer.cpp:943] Starting container d5a56564-3e24-4c60-9919-746710b78377 for executor 'node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7' of framework 5d030fd5-0fb6-4366-9dee-706261fa0749-0014
> I0407 15:18:22.767514 22766 linux.cpp:730] Mounting '/var/lib/mesos/volumes/roles/cassandra-cstar-location-store/d6290423-2ba4-4975-86f4-ffd84ad138ff' to '/var/lib/mesos/slaves/91ec544d-ac98-4958-bd7f-85d1f7822421-S3296/frameworks/5d030fd5-0fb6-4366-9dee-706261fa0749-0014/executors/node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7/runs/d5a56564-3e24-4c60-9919-746710b78377/volume' for persistent volume disk(cassandra-cstar-location-store, cassandra, {resource_id: fefc15d6-0c6f-4eac-a3f8-c34d0335c5ec})[d6290423-2ba4-4975-86f4-ffd84ad138ff:volume]:6466445 of container d5a56564-3e24-4c60-9919-746710b78377
> I0407 15:18:22.894340 22768 containerizer.cpp:1494] Checkpointing container's forked pid 6892 to '/var/lib/mesos/meta/slaves/91ec544d-ac98-4958-bd7f-85d1f7822421-S3296/frameworks/5d030fd5-0fb6-4366-9dee-706261fa0749-0014/executors/node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7/runs/d5a56564-3e24-4c60-9919-746710b78377/pids/forked.pid'
> I0407 15:19:01.011916 22749 slave.cpp:3231] Got registration for executor 'node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7' of framework 5d030fd5-0fb6-4366-9dee-706261fa0749-0014 from executor(1)@10.14.6.132:36837
> I0407 15:19:01.031939 22770 slave.cpp:2191] Sending queued task 'node-29__c6fdf823-e31a-4b78-a34f-e47e749c07f4' to executor 'node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7' of framework 5d030fd5-0fb6-4366-9dee-706261fa0749-0014 at executor(1)@10.14.6.132:36837
> I0407 15:26:14.012861 22749 linux.cpp:627] Removing mount '/var/lib/mesos/slaves/91ec544d-ac98-4958-bd7f-85d1f7822421-S3296/frameworks/5d030fd5-0fb6-4366-9dee-706261fa0749-0014/executors/node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7/runs/d5a56564-3e24-4c60-9919-746710b78377/volume' for persistent volume disk(cassandra-cstar-location-store, cassandra, {resource_id: fefc15d6-0c6f-4eac-a3f8-c34d0335c5ec})[d6290423-2ba4-4975-86f4-ffd84ad138ff:volume]:6466445 of container d5a56564-3e24-4c60-9919-746710b78377
> E0407 15:26:14.013828 22756 slave.cpp:3903] Failed to update resources for container d5a56564-3e24-4c60-9919-746710b78377 of executor 'node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7' running task node-29__c6fdf823-e31a-4b78-a34f-e47e749c07f4 on status update for terminal task, destroying container: Collect failed: Failed to unmount unneeded persistent volume at '/var/lib/mesos/slaves/91ec544d-ac98-4958-bd7f-85d1f7822421-S3296/frameworks/5d030fd5-0fb6-4366-9dee-706261fa0749-0014/executors/node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7/runs/d5a56564-3e24-4c60-9919-746710b78377/volume': Failed to unmount '/var/lib/mesos/slaves/91ec544d-ac98-4958-bd7f-85d1f7822421-S3296/frameworks/5d030fd5-0fb6-4366-9dee-706261fa0749-0014/executors/node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7/runs/d5a56564-3e24-4c60-9919-746710b78377/volume': Device or resource busy
> I0407 15:26:14.545647 22747 linux.cpp:810] Unmounting volume '/var/lib/mesos/slaves/91ec544d-ac98-4958-bd7f-85d1f7822421-S3296/frameworks/5d030fd5-0fb6-4366-9dee-706261fa0749-0014/executors/node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7/runs/d5a56564-3e24-4c60-9919-746710b78377/volume' for container d5a56564-3e24-4c60-9919-746710b78377
> E0407 15:26:14.546123 22753 slave.cpp:4520] Termination of executor 'node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7' of framework 5d030fd5-0fb6-4366-9dee-706261fa0749-0014 failed: Failed to clean up an isolator when destroying container: Failed to unmount volume '/var/lib/mesos/slaves/91ec544d-ac98-4958-bd7f-85d1f7822421-S3296/frameworks/5d030fd5-0fb6-4366-9dee-706261fa0749-0014/executors/node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7/runs/d5a56564-3e24-4c60-9919-746710b78377/volume': Failed to unmount '/var/lib/mesos/slaves/91ec544d-ac98-4958-bd7f-85d1f7822421-S3296/frameworks/5d030fd5-0fb6-4366-9dee-706261fa0749-0014/executors/node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7/runs/d5a56564-3e24-4c60-9919-746710b78377/volume': Device or resource busy
> I0407 15:26:14.566028 22744 slave.cpp:4646] Cleaning up executor 'node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7' of framework 5d030fd5-0fb6-4366-9dee-706261fa0749-0014 at executor(1)@10.14.6.132:36837
> I0407 15:26:14.566186 22768 gc.cpp:55] Scheduling '/var/lib/mesos/slaves/91ec544d-ac98-4958-bd7f-85d1f7822421-S3296/frameworks/5d030fd5-0fb6-4366-9dee-706261fa0749-0014/executors/node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7/runs/d5a56564-3e24-4c60-9919-746710b78377' for gc 6.99999344714074days in the future
> I0407 15:26:14.566299 22768 gc.cpp:55] Scheduling '/var/lib/mesos/slaves/91ec544d-ac98-4958-bd7f-85d1f7822421-S3296/frameworks/5d030fd5-0fb6-4366-9dee-706261fa0749-0014/executors/node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7' for gc 6.99999344665481days in the future
> I0407 15:26:14.566337 22768 gc.cpp:55] Scheduling '/var/lib/mesos/meta/slaves/91ec544d-ac98-4958-bd7f-85d1f7822421-S3296/frameworks/5d030fd5-0fb6-4366-9dee-706261fa0749-0014/executors/node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7/runs/d5a56564-3e24-4c60-9919-746710b78377' for gc 6.99999344637926days in the future
> I0407 15:26:14.566368 22768 gc.cpp:55] Scheduling '/var/lib/mesos/meta/slaves/91ec544d-ac98-4958-bd7f-85d1f7822421-S3296/frameworks/5d030fd5-0fb6-4366-9dee-706261fa0749-0014/executors/node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7' for gc 6.99999344597333days in the future
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)