Posted to issues@mesos.apache.org by "haosdent (JIRA)" <ji...@apache.org> on 2015/09/11 11:30:45 UTC

[jira] [Comment Edited] (MESOS-3349) PersistentVolumeTest.AccessPersistentVolume fails when run as root.

    [ https://issues.apache.org/jira/browse/MESOS-3349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14738557#comment-14738557 ] 

haosdent edited comment on MESOS-3349 at 9/11/15 9:30 AM:
----------------------------------------------------------

According to [CLONE_NEWNS|http://stackoverflow.com/questions/22889241/linux-understanding-the-mount-namespace-clone-clone-newns-flag] and [bind_mount|https://lwn.net/Articles/159092/], I think this could explain the behaviours observed so far.

In LinuxFilesystemIsolatorProcess, we mount the persistent volume (with the default make-private behaviour) before launching the executor. After LinuxLauncher forks with CLONE_NEWNS, we can unmount the persistent volume in LinuxFilesystemIsolatorProcess, but that does not stop the executor from continuing to hold the mount point in its own mount namespace. So when the slave receives TASK_FINISHED and LinuxFilesystemIsolatorProcess tries to rmdir that mount point, the rmdir fails because the executor is still running and still holds the mount point (I confirmed this by adding some trace code that shows when the executor exits). A possible fix is therefore to use make-shared or make-slave when mounting the persistent volume, roughly as in the sketch below, but my attempts at this have failed so far, as the mountinfo output that follows shows.
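
For reference, this is roughly what bind-mounting the volume and then marking the mount point shared looks like at the mount(2) level. This is only an illustrative sketch with hypothetical paths, not the actual isolator code; MS_SLAVE would be the make-slave variant.

{code}
// Illustrative sketch only (not Mesos code). Must run as root, and both
// directories must already exist. Paths are hypothetical placeholders.
#include <sys/mount.h>

#include <cerrno>
#include <cstring>
#include <iostream>

int main()
{
  const char* volume = "/var/lib/mesos/volumes/roles/role1/id1";  // hypothetical
  const char* target = "/tmp/sandbox/path1";                      // hypothetical

  // Step 1: bind-mount the persistent volume into the executor's sandbox.
  if (mount(volume, target, nullptr, MS_BIND, nullptr) != 0) {
    std::cerr << "bind mount failed: " << strerror(errno) << std::endl;
    return 1;
  }

  // Step 2: change the propagation type of the new mount point.
  // MS_SHARED is the syscall equivalent of 'mount --make-shared';
  // MS_SLAVE would correspond to 'mount --make-slave'.
  if (mount(nullptr, target, nullptr, MS_SHARED, nullptr) != 0) {
    std::cerr << "make-shared failed: " << strerror(errno) << std::endl;
    return 1;
  }

  return 0;
}
{code}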

{code}
45 22 8:3 /tmp/PersistentVolumeTest_AccessPersistentVolume_L6fY1a/volumes/roles/role1/id1 /tmp/PersistentVolumeTest_AccessPersistentVolume_L6fY1a/slaves/20150911-170559-162297291-49192-21628-S0/frameworks/20150911-170559-162297291-49192-21628-0000/executors/c6bcf76f-7cf5-42e6-8eb8-2d21e393ba3d/runs/454bbfa3-0305-4900-b05d-389f6b215c32/path1 rw,relatime shared:1 - ext4 /dev/disk/by-uuid/98708f21-a59d-4b80-a85c-27b78c22e316 rw,errors=remount-ro,data=ordered
{code}

{code}
78 48 8:3 /tmp/PersistentVolumeTest_AccessPersistentVolume_L6fY1a/volumes/roles/role1/id1 /tmp/PersistentVolumeTest_AccessPersistentVolume_L6fY1a/slaves/20150911-170559-162297291-49192-21628-S0/frameworks/20150911-170559-162297291-49192-21628-0000/executors/c6bcf76f-7cf5-42e6-8eb8-2d21e393ba3d/runs/454bbfa3-0305-4900-b05d-389f6b215c32/path1 rw,relatime shared:1 - ext4 /dev/disk/by-uuid/98708f21-a59d-4b80-a85c-27b78c22e316 rw,errors=remount-ro,data=ordered
{code}

As the mountinfo output shows, the persistent volume is already mounted as shared, but the test still fails.


was (Author: haosdent@gmail.com):
After reading this http://stackoverflow.com/questions/22889241/linux-understanding-the-mount-namespace-clone-clone-newns-flag about CLONE_NEWNS, I think it could explain the behaviours observed so far.

In LinuxFilesystemIsolatorProcess, we mount in the parent (pid 24073).
{code}
I0910 18:07:42.768034 24073 linux.cpp:598] Mounting '/tmp/PersistentVolumeTest_AccessPersistentVolume_PTx7g0/volumes/roles/role1/id1' to '/tmp/PersistentVolumeTest_AccessPersistentVolume_PTx7g0/slaves/20150910-180742-162297291-42795-24055-S0/frameworks/20150910-180742-162297291-42795-24055-0000/executors/72989615-cc6e-449c-a561-264fcee7edc3/runs/0cdc0d01-4c59-48e8-925a-7a6c06feb2ae/path1' for persistent volume disk(role1)[id1:path1]:64 of container 0cdc0d01-4c59-48e8-925a-7a6c06feb2ae
{code}

After LinuxLauncher forks with CLONE_NEWNS, the child (pid 24071) can unmount it, but it still cannot rmdir it, because the parent still holds another mount on that mount point.
{code}
I0910 18:07:44.868654 24071 linux.cpp:493] Removing mount '/tmp/PersistentVolumeTest_AccessPersistentVolume_PTx7g0/slaves/20150910-180742-162297291-42795-24055-S0/frameworks/20150910-180742-162297291-42795-24055-0000/executors/72989615-cc6e-449c-a561-264fcee7edc3/runs/0cdc0d01-4c59-48e8-925a-7a6c06feb2ae/path1' for persistent volume disk(role1)[id1:path1]:64 of container 0cdc0d01-4c59-48e8-925a-7a6c06feb2ae
E0910 18:07:44.876619 24076 slave.cpp:2870] Failed to update resources for container 0cdc0d01-4c59-48e8-925a-7a6c06feb2ae of executor 72989615-cc6e-449c-a561-264fcee7edc3 running task 72989615-cc6e-449c-a561-264fcee7edc3 on status update for terminal task, destroying container: Collect failed: Failed to remove persistent volume mount point at '/tmp/PersistentVolumeTest_AccessPersistentVolume_PTx7g0/slaves/20150910-180742-162297291-42795-24055-S0/frameworks/20150910-180742-162297291-42795-24055-0000/executors/72989615-cc6e-449c-a561-264fcee7edc3/runs/0cdc0d01-4c59-48e8-925a-7a6c06feb2ae/path1': Device or resource busy
{code}
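
For illustration, the rmdir failure itself can be reproduced outside Mesos with a small standalone program. This is only a sketch with made-up directories, not Mesos code; it relies on the observation that, on the kernels where this test fails, rmdir refuses to remove a directory that is still a mount point in another mount namespace and returns EBUSY.

{code}
// Illustrative sketch only (not Mesos code). Must run as root.
// The parent bind-mounts a directory; the child enters its own mount
// namespace, unmounts its copy, and then tries to rmdir the directory.
// On the kernels where this test fails, rmdir returns EBUSY because the
// directory is still a mount point in the parent's namespace.
#include <sched.h>       // unshare, CLONE_NEWNS
#include <sys/mount.h>   // mount, umount2
#include <sys/stat.h>    // mkdir
#include <sys/wait.h>    // waitpid
#include <unistd.h>      // fork, rmdir, _exit

#include <cerrno>
#include <cstring>
#include <iostream>

int main()
{
  // Hypothetical scratch directories standing in for the volume and sandbox.
  const char* volume = "/tmp/demo_volume";
  const char* target = "/tmp/demo_path1";
  mkdir(volume, 0755);
  mkdir(target, 0755);

  // Parent: bind-mount the "volume" onto the "sandbox path" and make the
  // mount point private (matching the make-private default described above).
  if (mount(volume, target, nullptr, MS_BIND, nullptr) != 0 ||
      mount(nullptr, target, nullptr, MS_PRIVATE, nullptr) != 0) {
    std::cerr << "mount failed: " << strerror(errno) << std::endl;
    return 1;
  }

  pid_t pid = fork();
  if (pid == 0) {
    // Child: get a private copy of the mount table, drop the child's copy
    // of the mount, then try to remove the directory.
    unshare(CLONE_NEWNS);
    umount2(target, 0);
    if (rmdir(target) != 0) {
      // Expected on the affected kernels: EBUSY, because the parent's
      // namespace still holds a mount on this directory.
      std::cerr << "rmdir failed: " << strerror(errno) << std::endl;
    }
    _exit(0);
  }

  waitpid(pid, nullptr, 0);

  // Parent cleanup.
  umount2(target, 0);
  rmdir(target);
  rmdir(volume);
  return 0;
}
{code}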


> PersistentVolumeTest.AccessPersistentVolume fails when run as root.
> -------------------------------------------------------------------
>
>                 Key: MESOS-3349
>                 URL: https://issues.apache.org/jira/browse/MESOS-3349
>             Project: Mesos
>          Issue Type: Bug
>          Components: test
>         Environment: Ubuntu 14.04, CentOS 5
>            Reporter: Benjamin Mahler
>            Assignee: haosdent
>              Labels: flaky-test
>
> When running the tests as root:
> {noformat}
> [ RUN      ] PersistentVolumeTest.AccessPersistentVolume
> I0901 02:17:26.435140 39432 exec.cpp:133] Version: 0.25.0
> I0901 02:17:26.442129 39461 exec.cpp:207] Executor registered on slave 20150901-021726-1828659978-52102-32604-S0
> Registered executor on hostname
> Starting task d8ff1f00-e720-4a61-b440-e111009dfdc3
> sh -c 'echo abc > path1/file'
> Forked command at 39484
> Command exited with status 0 (pid: 39484)
> ../../src/tests/persistent_volume_tests.cpp:579: Failure
> Value of: os::exists(path::join(directory, "path1"))
>   Actual: true
> Expected: false
> [  FAILED  ] PersistentVolumeTest.AccessPersistentVolume (777 ms)
> {noformat}
> FYI [~jieyu] [~mcypark]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)