Posted to issues@mesos.apache.org by "Qian Zhang (JIRA)" <ji...@apache.org> on 2016/04/18 03:48:25 UTC

[jira] [Comment Edited] (MESOS-5225) Command executor can not start when joining a CNI network

    [ https://issues.apache.org/jira/browse/MESOS-5225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15245044#comment-15245044 ] 

Qian Zhang edited comment on MESOS-5225 at 4/18/16 1:47 AM:
------------------------------------------------------------

The root cause of this bug is as follows. Before mesos-containerizer starts the command executor (mesos-executor), we bind mount {{/etc/hosts}}, {{/etc/hostname}}, and {{/etc/resolv.conf}} into the container's rootfs (see {{NetworkCniIsolatorSetup::execute()}} for details). For the command executor, however, we do NOT {{chroot}} before launching it: {{LinuxFilesystemIsolatorProcess::prepare()}} only sets the rootfs in {{ContainerLaunchInfo}} if the task is not a command task. Instead, the command executor does the {{chroot}} itself when it launches the task (https://github.com/apache/mesos/blob/0.28.0/src/launcher/executor.cpp#L369).

So when the command executor is launched, it is still using the agent host FS, which means the bind mounts we made have no effect for it. The {{/etc/hosts}} in the agent host FS does not contain the container's hostname and IP pair, so the hostname lookup in libprocess fails.
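
To make the failure mode concrete, here is a minimal sketch (not Mesos code) of a hosts-file lookup as seen before and after the {{chroot}}. The file contents, the {{lookup_hosts}} and {{visible_hosts}} helpers, and the exact layout of the container's {{/etc/hosts}} entry are assumptions for illustration; only the container ID and the IP {{192.168.1.2}} come from the logs in this issue.

```python
from typing import Optional


def lookup_hosts(hosts_text: str, hostname: str) -> Optional[str]:
    """Return the first IP mapped to `hostname` in /etc/hosts-style text, or None."""
    for line in hosts_text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        entry = line.split("#", 1)[0].strip()
        if not entry:
            continue
        ip, *names = entry.split()
        if hostname in names:
            return ip
    return None


# Hypothetical file contents, for illustration only.
CONTAINER_ID = "2b29d6d6-b314-477f-b734-7771d07d41e3"
AGENT_HOST_HOSTS = "127.0.0.1 localhost\n"
CONTAINER_HOSTS = "127.0.0.1 localhost\n192.168.1.2 " + CONTAINER_ID + "\n"


def visible_hosts(chrooted: bool) -> str:
    """Pick the /etc/hosts a process sees: the rootfs copy only after chroot."""
    return CONTAINER_HOSTS if chrooted else AGENT_HOST_HOSTS


if __name__ == "__main__":
    # Command executor at startup: not yet chrooted, so it reads the agent host file.
    print(lookup_hosts(visible_hosts(chrooted=False), CONTAINER_ID))
    # After chroot into the container rootfs, the bind-mounted file is visible.
    print(lookup_hosts(visible_hosts(chrooted=True), CONTAINER_ID))
```

In this model the first lookup finds nothing, which corresponds to the state the command executor is in when libprocess initializes; the second lookup succeeds only because the process view has switched to the container rootfs.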


was (Author: qianzhang):
The root cause of this bug is, before the command executor (mesos-executor) is started by mesos-containerizer, we bind mount {{/etc/hosts}}, {{/etc/hostname}}, {{/etc/resolv.conf}} in the container's rootfs (see {{NetworkCniIsolatorSetup::execute()}} for details), but for command executor, we will NOT do the {{chroot}} before launching it (see {{LinuxFilesystemIsolatorProcess::prepare()}}, we will only set rootfs in {{ContainerLaunchInfo}} for if it is not a command task), instead the command executor will do the {{chroot}} itself when launching the task (https://github.com/apache/mesos/blob/0.28.0/src/launcher/executor.cpp#L369).

So when the command executor is launched, it is still using agent host FS, that means the bind mounts that we do will not take effect for it. Obviously in agent host FS, the {{/etc/hosts}} does not have the pair of container's hostname and IP, so the hostname lookup in libprocess will fail.

> Command executor can not start when joining a CNI network
> ---------------------------------------------------------
>
>                 Key: MESOS-5225
>                 URL: https://issues.apache.org/jira/browse/MESOS-5225
>             Project: Mesos
>          Issue Type: Bug
>          Components: isolation
>            Reporter: Qian Zhang
>            Assignee: Qian Zhang
>
> Steps to reproduce:
> 1. Start master
> {code}
> sudo ./bin/mesos-master.sh --work_dir=/tmp
> {code}
>  
> 2. Start agent
> {code}
> sudo ./bin/mesos-slave.sh --master=192.168.122.171:5050 --containerizers=mesos --image_providers=docker --isolation=filesystem/linux,docker/runtime,network/cni --network_cni_config_dir=/opt/cni/net_configs --network_cni_plugins_dir=/opt/cni/plugins
> {code}
>  
> 3. Launch a command task with mesos-execute, and it will join a CNI network {{net1}}.
> {code}
> sudo src/mesos-execute --master=192.168.122.171:5050 --name=test --docker_image=library/busybox --networks=net1 --command="sleep 10" --shell=true
> I0418 08:25:35.746758 24923 scheduler.cpp:177] Version: 0.29.0
> Subscribed with ID '3c4796f0-eee7-4939-a036-7c6387c370eb-0000'
> Submitted task 'test' to agent 'b74535d8-276f-4e09-ab47-53e3721ab271-S0'
> Received status update TASK_FAILED for task 'test'
>   message: 'Executor terminated'
>   source: SOURCE_AGENT
>   reason: REASON_EXECUTOR_TERMINATED
> {code}
> So the task failed with the reason "executor terminated". Here is the agent log:
> {code}
> I0418 08:25:35.804873 24911 slave.cpp:1514] Got assigned task test for framework 3c4796f0-eee7-4939-a036-7c6387c370eb-0000
> I0418 08:25:35.807937 24911 slave.cpp:1633] Launching task test for framework 3c4796f0-eee7-4939-a036-7c6387c370eb-0000
> I0418 08:25:35.812503 24911 paths.cpp:528] Trying to chown '/tmp/mesos/slaves/b74535d8-276f-4e09-ab47-53e3721ab271-S0/frameworks/3c4796f0-eee7-4939-a036-7c6387c370eb-0000/executors/test/runs/2b29d6d6-b314-477f-b734-7771d07d41e3' to user 'root'
> I0418 08:25:35.820339 24911 slave.cpp:5620] Launching executor test of framework 3c4796f0-eee7-4939-a036-7c6387c370eb-0000 with resources cpus(*):0.1; mem(*):32 in work directory '/tmp/mesos/slaves/b74535d8-276f-4e09-ab47-53e3721ab271-S0/frameworks/3c4796f0-eee7-4939-a036-7c6387c370eb-0000/executors/test/runs/2b29d6d6-b314-477f-b734-7771d07d41e3'
> I0418 08:25:35.822576 24914 containerizer.cpp:698] Starting container '2b29d6d6-b314-477f-b734-7771d07d41e3' for executor 'test' of framework '3c4796f0-eee7-4939-a036-7c6387c370eb-0000'
> I0418 08:25:35.825996 24911 slave.cpp:1851] Queuing task 'test' for executor 'test' of framework 3c4796f0-eee7-4939-a036-7c6387c370eb-0000
> I0418 08:25:35.832348 24911 provisioner.cpp:285] Provisioning image rootfs '/tmp/mesos/provisioner/containers/2b29d6d6-b314-477f-b734-7771d07d41e3/backends/copy/rootfses/d219ec3a-ea31-45f6-b578-a62cd02392e7' for container 2b29d6d6-b314-477f-b734-7771d07d41e3
> I0418 08:25:36.061249 24913 linux_launcher.cpp:281] Cloning child process with flags = CLONE_NEWNET | CLONE_NEWUTS | CLONE_NEWNS
> I0418 08:25:36.071208 24915 cni.cpp:643] Bind mounted '/proc/24950/ns/net' to '/run/mesos/isolators/network/cni/2b29d6d6-b314-477f-b734-7771d07d41e3/ns' for container 2b29d6d6-b314-477f-b734-7771d07d41e3
> I0418 08:25:36.250573 24916 cni.cpp:962] Got assigned IPv4 address '192.168.1.2/24' from CNI network 'net1' for container 2b29d6d6-b314-477f-b734-7771d07d41e3
> I0418 08:25:36.252002 24917 cni.cpp:765] Unable to find DNS nameservers for container 2b29d6d6-b314-477f-b734-7771d07d41e3. Using host '/etc/resolv.conf'
> I0418 08:25:37.663487 24916 containerizer.cpp:1696] Executor for container '2b29d6d6-b314-477f-b734-7771d07d41e3' has exited
> I0418 08:25:37.663745 24916 containerizer.cpp:1461] Destroying container '2b29d6d6-b314-477f-b734-7771d07d41e3'
> I0418 08:25:37.670574 24915 cgroups.cpp:2676] Freezing cgroup /sys/fs/cgroup/freezer/mesos/2b29d6d6-b314-477f-b734-7771d07d41e3
> I0418 08:25:37.676864 24912 cgroups.cpp:1409] Successfully froze cgroup /sys/fs/cgroup/freezer/mesos/2b29d6d6-b314-477f-b734-7771d07d41e3 after 6.061056ms
> I0418 08:25:37.680552 24913 cgroups.cpp:2694] Thawing cgroup /sys/fs/cgroup/freezer/mesos/2b29d6d6-b314-477f-b734-7771d07d41e3
> I0418 08:25:37.683346 24913 cgroups.cpp:1438] Successfully thawed cgroup /sys/fs/cgroup/freezer/mesos/2b29d6d6-b314-477f-b734-7771d07d41e3 after 2.46016ms
> I0418 08:25:37.874023 24914 cni.cpp:1121] Unmounted the network namespace handle '/run/mesos/isolators/network/cni/2b29d6d6-b314-477f-b734-7771d07d41e3/ns' for container 2b29d6d6-b314-477f-b734-7771d07d41e3
> I0418 08:25:37.874194 24914 cni.cpp:1132] Removed the container directory '/run/mesos/isolators/network/cni/2b29d6d6-b314-477f-b734-7771d07d41e3'
> I0418 08:25:37.877306 24912 linux.cpp:814] Ignoring unmounting sandbox/work directory for container 2b29d6d6-b314-477f-b734-7771d07d41e3
> I0418 08:25:37.879295 24912 provisioner.cpp:338] Destroying container rootfs at '/tmp/mesos/provisioner/containers/2b29d6d6-b314-477f-b734-7771d07d41e3/backends/copy/rootfses/d219ec3a-ea31-45f6-b578-a62cd02392e7' for container 2b29d6d6-b314-477f-b734-7771d07d41e3
> I0418 08:25:37.970871 24914 slave.cpp:4113] Executor 'test' of framework 3c4796f0-eee7-4939-a036-7c6387c370eb-0000 exited with status 1
> I0418 08:25:37.975452 24914 slave.cpp:3201] Handling status update TASK_FAILED (UUID: a5e19b2d-b234-4adc-8791-9046af4c1395) for task test of framework 3c4796f0-eee7-4939-a036-7c6387c370eb-0000 from @0.0.0.0:0
> W0418 08:25:37.978974 24911 containerizer.cpp:1303] Ignoring update for unknown container: 2b29d6d6-b314-477f-b734-7771d07d41e3
> I0418 08:25:37.980370 24917 status_update_manager.cpp:320] Received status update TASK_FAILED (UUID: a5e19b2d-b234-4adc-8791-9046af4c1395) for task test of framework 3c4796f0-eee7-4939-a036-7c6387c370eb-0000
> I0418 08:25:37.983105 24913 slave.cpp:3599] Forwarding the update TASK_FAILED (UUID: a5e19b2d-b234-4adc-8791-9046af4c1395) for task test of framework 3c4796f0-eee7-4939-a036-7c6387c370eb-0000 to master@192.168.122.171:5050
> I0418 08:25:38.017352 24917 slave.cpp:2232] Asked to shut down framework 3c4796f0-eee7-4939-a036-7c6387c370eb-0000 by master@192.168.122.171:5050
> I0418 08:25:38.018487 24917 slave.cpp:2257] Shutting down framework 3c4796f0-eee7-4939-a036-7c6387c370eb-0000
> I0418 08:25:38.019630 24917 slave.cpp:4217] Cleaning up executor 'test' of framework 3c4796f0-eee7-4939-a036-7c6387c370eb-0000
> I0418 08:25:38.020967 24911 gc.cpp:55] Scheduling '/tmp/mesos/slaves/b74535d8-276f-4e09-ab47-53e3721ab271-S0/frameworks/3c4796f0-eee7-4939-a036-7c6387c370eb-0000/executors/test/runs/2b29d6d6-b314-477f-b734-7771d07d41e3' for gc 6.99999975983704days in the future
> I0418 08:25:38.022328 24917 slave.cpp:4305] Cleaning up framework 3c4796f0-eee7-4939-a036-7c6387c370eb-0000
> I0418 08:25:38.022847 24915 status_update_manager.cpp:282] Closing status update streams for framework 3c4796f0-eee7-4939-a036-7c6387c370eb-0000
> I0418 08:25:38.022459 24912 gc.cpp:55] Scheduling '/tmp/mesos/slaves/b74535d8-276f-4e09-ab47-53e3721ab271-S0/frameworks/3c4796f0-eee7-4939-a036-7c6387c370eb-0000/executors/test' for gc 6.99999974402963days in the future
> I0418 08:25:38.023483 24916 gc.cpp:55] Scheduling '/tmp/mesos/slaves/b74535d8-276f-4e09-ab47-53e3721ab271-S0/frameworks/3c4796f0-eee7-4939-a036-7c6387c370eb-0000' for gc 6.99999973582222days in the future
> ...
> {code}
> And this is the stderr of the executor:
> {code}
> cat /tmp/mesos/slaves/b74535d8-276f-4e09-ab47-53e3721ab271-S0/frameworks/3c4796f0-eee7-4939-a036-7c6387c370eb-0000/executors/test/runs/2b29d6d6-b314-477f-b734-7771d07d41e3/stderr 
> + /home/stack/workspace/mesos/build/src/mesos-containerizer mount --help=false --operation=make-rslave --path=/
> + grep -E /tmp/mesos/.+ /proc/self/mountinfo
> + grep -v 2b29d6d6-b314-477f-b734-7771d07d41e3
> + cut -d  -f5
> + xargs --no-run-if-empty umount -l
> + mount -n --rbind /tmp/mesos/provisioner/containers/2b29d6d6-b314-477f-b734-7771d07d41e3/backends/copy/rootfses/d219ec3a-ea31-45f6-b578-a62cd02392e7 /tmp/mesos/slaves/b74535d8-276f-4e09-ab47-53e3721ab271-S0/frameworks/3c4796f0-eee7-4939-a036-7c6387c370eb-0000/executors/test/runs/2b29d6d6-b314-477f-b734-7771d07d41e3/.rootfs
> Failed to obtain the IP address for '2b29d6d6-b314-477f-b734-7771d07d41e3'; the DNS service may not be able to resolve it: Name or service not known
> {code}
> So the executor terminated because the libprocess in it failed to resolve its hostname {{2b29d6d6-b314-477f-b734-7771d07d41e3}}; see https://github.com/apache/mesos/blob/0.28.0/3rdparty/libprocess/src/process.cpp#L929:L935 for details.


