Posted to issues@mesos.apache.org by "Jie Yu (JIRA)" <ji...@apache.org> on 2017/04/11 23:40:41 UTC

[jira] [Commented] (MESOS-7366) Agent sandbox gc could accidentally delete the entire persistent volume content

    [ https://issues.apache.org/jira/browse/MESOS-7366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15965142#comment-15965142 ] 

Jie Yu commented on MESOS-7366:
-------------------------------

commit 5590666384bcaab457a4d727de9a39818b84e2a3
Author: Jie Yu <yu...@gmail.com>
Date:   Fri Apr 7 16:41:45 2017 -0700

    Added a test to verify persistent volume mount points removal.

    This test is used to catch regression related to MESOS-7366.

    Review: https://reviews.apache.org/r/58280

commit cf7d1ae9849d654e7e7eafefdff6824146a102b4
Author: Jie Yu <yu...@gmail.com>
Date:   Fri Apr 7 16:38:00 2017 -0700

    Lazily unmount persistent volumes in DockerContainerizer.

    Use MNT_DETACH to unmount persistent volumes in the DockerContainerizer
    to work around incorrect handling of container destroy failures.
    Currently, if the unmount fails there, the containerizer still treats
    the container as terminated, and the agent schedules the cleanup of
    the container's sandbox. Since the mount has not been removed from the
    sandbox (e.g., due to EBUSY), the data in the persistent volume would
    be incorrectly deleted. With MNT_DETACH, the mount point in the
    sandbox is removed immediately. See MESOS-7366 for more details.

    Review: https://reviews.apache.org/r/58279

commit f96f5b6b25a444309fe021fa229b3b8286093b5e
Author: Jie Yu <yu...@gmail.com>
Date:   Fri Apr 7 16:33:53 2017 -0700

    Lazily unmount persistent volumes in MesosContainerizer.

    Use MNT_DETACH when unmounting persistent volumes in the Linux
    filesystem isolator to work around incorrect handling of container
    destroy failures. Currently, if isolator cleanup returns a failure,
    the slave still treats the container as terminated and schedules the
    cleanup of the container's sandbox. Since the mount has not been
    removed from the sandbox (e.g., due to EBUSY), the data in the
    persistent volume would be incorrectly deleted. With MNT_DETACH, the
    mount point in the sandbox is removed immediately. See MESOS-7366 for
    more details.

    Review: https://reviews.apache.org/r/58278
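Both fixes rely on the same Linux primitive: umount2(2) with the MNT_DETACH flag performs a lazy unmount, detaching the mount point from the filesystem tree immediately while the kernel completes the unmount once the filesystem is no longer busy. A minimal sketch of the pattern (the helper name is hypothetical, not Mesos code):

```cpp
#include <sys/mount.h>

#include <cerrno>
#include <cstring>
#include <iostream>
#include <string>

// Hypothetical helper illustrating the lazy-unmount pattern from the
// commits above: MNT_DETACH detaches the mount point from the tree
// immediately, even if the filesystem is still busy (EBUSY), so a later
// recursive sandbox deletion cannot descend into the persistent
// volume's contents. Returns false if the detach itself fails.
bool lazyUnmount(const std::string& target) {
  if (::umount2(target.c_str(), MNT_DETACH) != 0) {
    std::cerr << "Failed to detach '" << target << "': "
              << std::strerror(errno) << std::endl;
    return false;
  }
  return true;
}
```

Note that a successful MNT_DETACH only removes the mount point from the namespace; the data on the volume is untouched, which is exactly the property the fix needs.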

> Agent sandbox gc could accidentally delete the entire persistent volume content
> -------------------------------------------------------------------------------
>
>                 Key: MESOS-7366
>                 URL: https://issues.apache.org/jira/browse/MESOS-7366
>             Project: Mesos
>          Issue Type: Bug
>    Affects Versions: 1.0.2, 1.1.1, 1.2.0
>            Reporter: Zhitao Li
>            Assignee: Jie Yu
>            Priority: Blocker
>
> When 1) a persistent volume is mounted, 2) the unmount is stuck (e.g., the device is busy), and 3) the executor directory gc is invoked, the agent emits a log line like:
> {noformat}
>  Failed to delete directory  <executor_dir>/runs/<uuid>/volume: Device or resource busy
> {noformat}
> After this, the persistent volume directory is empty.
> This can trigger data loss on critical workloads, so we should fix it ASAP.
> The triggering environment is a custom executor without a rootfs image.
> Please let me know if you need more signal.
> {noformat}
> I0407 15:18:22.752624 22758 paths.cpp:536] Trying to chown '/var/lib/mesos/slaves/91ec544d-ac98-4958-bd7f-85d1f7822421-S3296/frameworks/5d030fd5-0fb6-4366-9dee-706261fa0749-0014/executors/node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7/runs/d5a56564-3e24-4c60-9919-746710b78377' to user 'uber'
> I0407 15:18:22.763229 22758 slave.cpp:6179] Launching executor 'node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7' of framework 5d030fd5-0fb6-4366-9dee-706261fa0749-0014 with resources cpus(cassandra-cstar-location-store, cassandra, {resource_id: 29e2ac63-d605-4982-a463-fa311be94e0a}):0.1; mem(cassandra-cstar-location-store, cassandra, {resource_id: 2e1223f3-41a2-419f-85cc-cbc839c19c70}):768; ports(cassandra-cstar-location-store, cassandra, {resource_id: fdd6598f-f32b-4c90-a622-226684528139}):[31001-31001] in work directory '/var/lib/mesos/slaves/91ec544d-ac98-4958-bd7f-85d1f7822421-S3296/frameworks/5d030fd5-0fb6-4366-9dee-706261fa0749-0014/executors/node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7/runs/d5a56564-3e24-4c60-9919-746710b78377'
> I0407 15:18:22.764103 22758 slave.cpp:1987] Queued task 'node-29__c6fdf823-e31a-4b78-a34f-e47e749c07f4' for executor 'node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7' of framework 5d030fd5-0fb6-4366-9dee-706261fa0749-0014
> I0407 15:18:22.766253 22764 containerizer.cpp:943] Starting container d5a56564-3e24-4c60-9919-746710b78377 for executor 'node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7' of framework 5d030fd5-0fb6-4366-9dee-706261fa0749-0014
> I0407 15:18:22.767514 22766 linux.cpp:730] Mounting '/var/lib/mesos/volumes/roles/cassandra-cstar-location-store/d6290423-2ba4-4975-86f4-ffd84ad138ff' to '/var/lib/mesos/slaves/91ec544d-ac98-4958-bd7f-85d1f7822421-S3296/frameworks/5d030fd5-0fb6-4366-9dee-706261fa0749-0014/executors/node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7/runs/d5a56564-3e24-4c60-9919-746710b78377/volume' for persistent volume disk(cassandra-cstar-location-store, cassandra, {resource_id: fefc15d6-0c6f-4eac-a3f8-c34d0335c5ec})[d6290423-2ba4-4975-86f4-ffd84ad138ff:volume]:6466445 of container d5a56564-3e24-4c60-9919-746710b78377
> I0407 15:18:22.894340 22768 containerizer.cpp:1494] Checkpointing container's forked pid 6892 to '/var/lib/mesos/meta/slaves/91ec544d-ac98-4958-bd7f-85d1f7822421-S3296/frameworks/5d030fd5-0fb6-4366-9dee-706261fa0749-0014/executors/node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7/runs/d5a56564-3e24-4c60-9919-746710b78377/pids/forked.pid'
> I0407 15:19:01.011916 22749 slave.cpp:3231] Got registration for executor 'node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7' of framework 5d030fd5-0fb6-4366-9dee-706261fa0749-0014 from executor(1)@10.14.6.132:36837
> I0407 15:19:01.031939 22770 slave.cpp:2191] Sending queued task 'node-29__c6fdf823-e31a-4b78-a34f-e47e749c07f4' to executor 'node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7' of framework 5d030fd5-0fb6-4366-9dee-706261fa0749-0014 at executor(1)@10.14.6.132:36837
> I0407 15:26:14.012861 22749 linux.cpp:627] Removing mount '/var/lib/mesos/slaves/91ec544d-ac98-4958-bd7f-85d1f7822421-S3296/frameworks/5d030fd5-0fb6-4366-9dee-706261fa0749-0014/executors/node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7/runs/d5a56564-3e24-4c60-9919-746710b78377/volume' for persistent volume disk(cassandra-cstar-location-store, cassandra, {resource_id: fefc15d6-0c6f-4eac-a3f8-c34d0335c5ec})[d6290423-2ba4-4975-86f4-ffd84ad138ff:volume]:6466445 of container d5a56564-3e24-4c60-9919-746710b78377
> E0407 15:26:14.013828 22756 slave.cpp:3903] Failed to update resources for container d5a56564-3e24-4c60-9919-746710b78377 of executor 'node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7' running task node-29__c6fdf823-e31a-4b78-a34f-e47e749c07f4 on status update for terminal task, destroying container: Collect failed: Failed to unmount unneeded persistent volume at '/var/lib/mesos/slaves/91ec544d-ac98-4958-bd7f-85d1f7822421-S3296/frameworks/5d030fd5-0fb6-4366-9dee-706261fa0749-0014/executors/node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7/runs/d5a56564-3e24-4c60-9919-746710b78377/volume': Failed to unmount '/var/lib/mesos/slaves/91ec544d-ac98-4958-bd7f-85d1f7822421-S3296/frameworks/5d030fd5-0fb6-4366-9dee-706261fa0749-0014/executors/node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7/runs/d5a56564-3e24-4c60-9919-746710b78377/volume': Device or resource busy
> I0407 15:26:14.545647 22747 linux.cpp:810] Unmounting volume '/var/lib/mesos/slaves/91ec544d-ac98-4958-bd7f-85d1f7822421-S3296/frameworks/5d030fd5-0fb6-4366-9dee-706261fa0749-0014/executors/node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7/runs/d5a56564-3e24-4c60-9919-746710b78377/volume' for container d5a56564-3e24-4c60-9919-746710b78377
> E0407 15:26:14.546123 22753 slave.cpp:4520] Termination of executor 'node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7' of framework 5d030fd5-0fb6-4366-9dee-706261fa0749-0014 failed: Failed to clean up an isolator when destroying container: Failed to unmount volume '/var/lib/mesos/slaves/91ec544d-ac98-4958-bd7f-85d1f7822421-S3296/frameworks/5d030fd5-0fb6-4366-9dee-706261fa0749-0014/executors/node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7/runs/d5a56564-3e24-4c60-9919-746710b78377/volume': Failed to unmount '/var/lib/mesos/slaves/91ec544d-ac98-4958-bd7f-85d1f7822421-S3296/frameworks/5d030fd5-0fb6-4366-9dee-706261fa0749-0014/executors/node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7/runs/d5a56564-3e24-4c60-9919-746710b78377/volume': Device or resource busy
> I0407 15:26:14.566028 22744 slave.cpp:4646] Cleaning up executor 'node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7' of framework 5d030fd5-0fb6-4366-9dee-706261fa0749-0014 at executor(1)@10.14.6.132:36837
> I0407 15:26:14.566186 22768 gc.cpp:55] Scheduling '/var/lib/mesos/slaves/91ec544d-ac98-4958-bd7f-85d1f7822421-S3296/frameworks/5d030fd5-0fb6-4366-9dee-706261fa0749-0014/executors/node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7/runs/d5a56564-3e24-4c60-9919-746710b78377' for gc 6.99999344714074days in the future
> I0407 15:26:14.566299 22768 gc.cpp:55] Scheduling '/var/lib/mesos/slaves/91ec544d-ac98-4958-bd7f-85d1f7822421-S3296/frameworks/5d030fd5-0fb6-4366-9dee-706261fa0749-0014/executors/node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7' for gc 6.99999344665481days in the future
> I0407 15:26:14.566337 22768 gc.cpp:55] Scheduling '/var/lib/mesos/meta/slaves/91ec544d-ac98-4958-bd7f-85d1f7822421-S3296/frameworks/5d030fd5-0fb6-4366-9dee-706261fa0749-0014/executors/node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7/runs/d5a56564-3e24-4c60-9919-746710b78377' for gc 6.99999344637926days in the future
> I0407 15:26:14.566368 22768 gc.cpp:55] Scheduling '/var/lib/mesos/meta/slaves/91ec544d-ac98-4958-bd7f-85d1f7822421-S3296/frameworks/5d030fd5-0fb6-4366-9dee-706261fa0749-0014/executors/node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7' for gc 6.99999344597333days in the future
> {noformat}
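For context on the data-loss mechanism in the log above: the scheduled gc performs a recursive delete of the sandbox, and recursive deletion does not stop at mount boundaries, so a still-mounted 'volume' directory is traversed and its real contents are unlinked. A small illustration (hypothetical helper, not Mesos code), using a plain directory to stand in for the bind-mounted volume:

```cpp
#include <cstdint>
#include <filesystem>
#include <fstream>

namespace fs = std::filesystem;

// Illustration of the failure mode, not Mesos code: a recursive delete
// such as std::filesystem::remove_all does not distinguish a live bind
// mount from an ordinary directory. If sandbox/volume were still a
// mounted persistent volume, its real contents would be unlinked here,
// which is the data loss described in MESOS-7366.
// Returns the number of filesystem entries removed.
std::uintmax_t wipeSandbox(const fs::path& sandbox) {
  return fs::remove_all(sandbox);
}
```

With MNT_DETACH applied before the gc runs, the `volume` entry in the sandbox is just an empty directory, so the recursive delete removes only the mount point, not the volume's data.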



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)