Posted to dev@aurora.apache.org by Zameer Manji <zm...@uber.com> on 2016/11/01 01:42:14 UTC
Aurora, Thermos, PID 1, and You
Hey,
Recently I have experienced a number of issues in a production environment
with the DockerContainerizer, Aurora and Thermos. Although my experience is
specific to Docker, I believe this applies to anyone using the Mesos
Containerizer with PID isolation. The root cause of these issues is the
interaction between how we launch the executor and the role of PID 1.
The CommandInfo for the ExecutorInfo uses the default `shell` value which
is `true`[1]. This means that in any PID isolated container the `sh`
process that launches the executor will become PID 1. Here is an example
`ps` output from vagrant showing this:
````
root@aurora:/# ps auxf
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 250 0.0 0.0 21928 2124 ? Ss 01:19 0:00 /bin/bash
root 469 0.0 0.0 19176 1240 ? R+ 01:28 0:00 \_ ps auxf
root 1 0.0 0.0 4328 636 ? Ss 01:10 0:00 /bin/sh -c ${MESOS_SANDBOX=.}/thermos_executor.pex --announcer-ensemble localhost:2181 --announcer-zookeeper-auth-config /home/vagrant/aurora/examples/vagrant/config/announcer-auth.json --mesos-containerizer
root 5 0.7 1.4 1201128 45604 ? Sl 01:10 0:08 python2.7 /mnt/mesos/sandbox/thermos_executor.pex --announcer-ensemble localhost:2181 --announcer-zookeeper-auth-config /home/vagrant/aurora/examples/vagrant/config/announcer-auth.json --mesos-containerizer-
root 23 0.1 0.6 115668 20764 ? S 01:10 0:01 \_ /usr/local/bin/python2.7 /mnt/mesos/sandbox/thermos_runner.pex --task_id=www-data-devel-hello_docker_engine-0-5f443832-a13e-4cde-97e3-89aa905f2487 --log_to_disk=DEBUG --hostname=192.168.33.7 --thermos_js
root 29 0.0 0.5 113476 17936 ? Ss 01:10 0:00 \_ /usr/local/bin/python2.7 /mnt/mesos/sandbox/thermos_runner.pex --task_id=www-data-devel-hello_docker_engine-0-5f443832-a13e-4cde-97e3-89aa905f2487 --log_to_disk=DEBUG --hostname=192.168.33.7 --thermo
root 34 0.0 0.0 20040 1476 ? S 01:10 0:00 | \_ /bin/bash -c while true; do echo hello world sleep 10 done
root 468 0.0 0.0 4228 348 ? S 01:28 0:00 | \_ sleep 10
root 31 0.0 0.5 113476 17936 ? Ss 01:10 0:00 \_ /usr/local/bin/python2.7 /mnt/mesos/sandbox/thermos_runner.pex --task_id=www-data-devel-hello_docker_engine-0-5f443832-a13e-4cde-97e3-89aa905f2487 --log_to_disk=DEBUG --hostname=192.168.33.7 --thermo
root 32 0.0 0.0 20040 1476 ? S 01:10 0:00 \_ /bin/bash -c while true; do echo hello world sleep 10 done
root 467 0.0 0.0 4228 352 ? S 01:28 0:00 \_ sleep 10
root 47 0.0 0.0 24116 3052 ? S 01:10 0:00 python ./daemon.py
````
This means processes that double fork/daemonize will be reparented to `sh`
and not to our executor. You can see that the `python ./daemon.py` process
has been reparented to `sh` rather than the executor, so it is outside the
scope of the runners. This has a number of undesirable implications. Perhaps
the most concerning is that processes that end up reparented to PID 1 will
not receive SIGTERM or SIGKILL from Thermos; instead they will be killed by
the kernel when Thermos exits. If anyone here runs published images that use
popular software that double forks (like nginx), you will not be able to
ensure that those processes die cleanly.
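To make the reparenting concrete, here is a minimal sketch (Python 3, not
Thermos code) of a double fork. The helper name `double_fork_parent` is made
up for this example. It reports which process the "daemon" ends up under
once the intermediate child has exited.

```python
import os
import time


def double_fork_parent():
    """Double fork; return (intermediate_pid, daemon's parent after orphaning)."""
    read_fd, write_fd = os.pipe()
    middle = os.fork()
    if middle == 0:
        # Intermediate child: fork the "daemon", then exit immediately.
        os.close(read_fd)
        me = os.getpid()
        if os.fork() == 0:
            # Grandchild: wait until the kernel has reparented us.
            deadline = time.monotonic() + 5
            while os.getppid() == me and time.monotonic() < deadline:
                time.sleep(0.01)
            os.write(write_fd, str(os.getppid()).encode())
        os._exit(0)
    os.close(write_fd)
    os.waitpid(middle, 0)  # reap the intermediate child
    new_parent = int(os.read(read_fd, 32).decode())
    os.close(read_fd)
    return middle, new_parent
```

Calling `double_fork_parent()` from a process that is not a subreaper should
return a parent PID that is neither the caller nor the intermediate child. In
a PID isolated container that parent is the container's PID 1, which is the
`sh` process in the listing above.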
I've been thinking about this problem for a while and upon advice from
others and my own research I believe the best solution is as follows:
1. We have good reasons for setting `shell=True` when launching the
executor. I'm not comfortable changing this because I'm not sure of all of
the implications if we choose another method.
2. The thermos runners end up forking off the target processes. I think the
runners should be responsible for all of the processes that are created by
the children.
3. We can make the runners responsible for their grandchildren by using
`prctl(2)`[2] and setting the `PR_SET_CHILD_SUBREAPER` bit for each runner.
This means double forked processes will be reparented to the runner and not
to PID 1.
4. On task teardown, we make the runners send SIGTERM and SIGKILL to the
PIDs they recorded and any other children they have.
5. Each runner would need to have a SIGCHLD handler to handle zombie
processes that are reparented to it.
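Steps 3 to 5 could look roughly like the sketch below. This is Python 3 and
Linux only, and it is an illustration rather than the actual Thermos change.
The helper names (`become_subreaper`, `child_pids`, `reap`, `tear_down`) and
the timeouts are made up for this example. It assumes `/proc` is mounted and
calls `prctl` through `ctypes`, with the constant value taken from
`<linux/prctl.h>`.

```python
import ctypes
import os
import signal
import time

PR_SET_CHILD_SUBREAPER = 36  # value from <linux/prctl.h>; needs Linux 3.4+


def become_subreaper():
    """Ask the kernel to reparent orphaned descendants to this process."""
    libc = ctypes.CDLL(None, use_errno=True)
    zero = ctypes.c_ulong(0)
    if libc.prctl(PR_SET_CHILD_SUBREAPER, ctypes.c_ulong(1), zero, zero, zero) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


def child_pids():
    """Return PIDs whose parent is this process, found by scanning /proc."""
    me = os.getpid()
    found = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open("/proc/%s/stat" % entry, "rb") as f:
                # Format is "pid (comm) state ppid ..."; comm may contain ')'.
                fields = f.read().rsplit(b")", 1)[1].split()
        except (OSError, IndexError):
            continue  # process vanished while we were looking
        if int(fields[1]) == me:
            found.append(int(entry))
    return found


def reap():
    """Collect exit statuses of children that have already exited."""
    reaped = []
    while True:
        try:
            pid, _status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # no children left at all
        if pid == 0:
            break  # children exist, but none has exited yet
        reaped.append(pid)
    return reaped


def tear_down(grace_seconds=5.0, kill_seconds=1.0):
    """SIGTERM all children, then SIGKILL survivors. True if none remain."""
    for sig, wait in ((signal.SIGTERM, grace_seconds), (signal.SIGKILL, kill_seconds)):
        signalled = set()
        deadline = time.monotonic() + wait
        while True:
            reap()
            children = child_pids()
            if not children:
                return True
            for pid in children:
                if pid not in signalled:  # includes orphans adopted mid-phase
                    try:
                        os.kill(pid, sig)
                    except ProcessLookupError:
                        pass
                    signalled.add(pid)
            if time.monotonic() >= deadline:
                break
            time.sleep(0.05)
    reap()
    return not child_pids()
```

A runner would call `become_subreaper()` once at startup and `tear_down()`
when the task is killed. For step 5 it could also install
`signal.signal(signal.SIGCHLD, lambda signum, frame: reap())`. One caveat is
that `waitpid(-1, ...)` collects the status of any child, so it can take exit
codes away from code that waits on a specific PID. A real change would have
to coordinate with however the runner already tracks its processes.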
[1]: https://github.com/apache/aurora/blob/783baaefb9a814ca01fad78181fe3df3de5b34af/src/main/java/org/apache/aurora/scheduler/configuration/executor/ExecutorModule.java#L109-L135
[2]: http://man7.org/linux/man-pages/man2/prctl.2.html
--
Zameer Manji
Re: Aurora, Thermos, PID 1, and You
Posted by Zameer Manji <zm...@apache.org>.
Filed a task https://issues.apache.org/jira/browse/AURORA-1808 to track
this work since there are no objections.
--
Zameer Manji
Re: Aurora, Thermos, PID 1, and You
Posted by Zameer Manji <zm...@apache.org>.
Resending this from my @apache.org email in case my previous email got
caught in spam.