Posted to dev@aurora.apache.org by Zameer Manji <zm...@uber.com> on 2016/11/01 01:42:14 UTC

Aurora, Thermos, PID 1, and You

Hey,

Recently I have experienced a number of issues in a production environment
with the DockerContainerizer, Aurora and Thermos. Although my experience is
specific to Docker, I believe this applies to anyone using the Mesos
Containerizer with PID isolation. The root cause of these issues is the
interaction between how we launch the executor and the role of PID
1.

The CommandInfo for the ExecutorInfo uses the default `shell` value which
is `true`[1]. This means that in any PID isolated container the `sh`
process that launches the executor will become PID 1. Here is an example
`ps` output from vagrant showing this:
````
root@aurora:/# ps auxf
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root       250  0.0  0.0  21928  2124 ?        Ss   01:19   0:00 /bin/bash
root       469  0.0  0.0  19176  1240 ?        R+   01:28   0:00  \_ ps auxf
root         1  0.0  0.0   4328   636 ?        Ss   01:10   0:00 /bin/sh -c ${MESOS_SANDBOX=.}/thermos_executor.pex --announcer-ensemble localhost:2181 --announcer-zookeeper-auth-config /home/vagrant/aurora/examples/vagrant/config/announcer-auth.json --mesos-containerizer
root         5  0.7  1.4 1201128 45604 ?       Sl   01:10   0:08 python2.7 /mnt/mesos/sandbox/thermos_executor.pex --announcer-ensemble localhost:2181 --announcer-zookeeper-auth-config /home/vagrant/aurora/examples/vagrant/config/announcer-auth.json --mesos-containerizer-
root        23  0.1  0.6 115668 20764 ?        S    01:10   0:01  \_ /usr/local/bin/python2.7 /mnt/mesos/sandbox/thermos_runner.pex --task_id=www-data-devel-hello_docker_engine-0-5f443832-a13e-4cde-97e3-89aa905f2487 --log_to_disk=DEBUG --hostname=192.168.33.7 --thermos_js
root        29  0.0  0.5 113476 17936 ?        Ss   01:10   0:00      \_ /usr/local/bin/python2.7 /mnt/mesos/sandbox/thermos_runner.pex --task_id=www-data-devel-hello_docker_engine-0-5f443832-a13e-4cde-97e3-89aa905f2487 --log_to_disk=DEBUG --hostname=192.168.33.7 --thermo
root        34  0.0  0.0  20040  1476 ?        S    01:10   0:00      |   \_ /bin/bash -c      while true; do       echo hello world       sleep 10     done
root       468  0.0  0.0   4228   348 ?        S    01:28   0:00      |       \_ sleep 10
root        31  0.0  0.5 113476 17936 ?        Ss   01:10   0:00      \_ /usr/local/bin/python2.7 /mnt/mesos/sandbox/thermos_runner.pex --task_id=www-data-devel-hello_docker_engine-0-5f443832-a13e-4cde-97e3-89aa905f2487 --log_to_disk=DEBUG --hostname=192.168.33.7 --thermo
root        32  0.0  0.0  20040  1476 ?        S    01:10   0:00          \_ /bin/bash -c      while true; do       echo hello world       sleep 10     done
root       467  0.0  0.0   4228   352 ?        S    01:28   0:00              \_ sleep 10
root        47  0.0  0.0  24116  3052 ?        S    01:10   0:00 python ./daemon.py
````

This means processes that double fork/daemonize will be reparented to `sh`
and not our executor. You can see that the `python daemon.py` process has
been reparented to `sh` rather than the executor, and is outside of the scope
of the runners. This has a number of undesirable implications. Perhaps the
most concerning is that processes that end up reparented to PID 1 will not
receive SIGTERM or SIGKILL from thermos; instead they will be killed by the
kernel when thermos decides to exit. If anyone here decides to run
published images that use popular software that double forks (like nginx),
you will never be able to ensure the processes die cleanly.
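
To make the reparenting concrete, here is a minimal sketch of a double fork
in Python 3. It is not Thermos code, and the function name
`double_fork_and_report` is made up for illustration. The grandchild waits
until the intermediate process has exited, then reports its new parent PID
back over a pipe. Without a subreaper in between, that new parent is PID 1
of the PID namespace, not the process that started the whole thing.

```python
import os
import time


def double_fork_and_report():
    """Daemonize a grandchild; return (intermediate_pid, adopting_pid)."""
    read_fd, write_fd = os.pipe()
    child = os.fork()
    if child == 0:
        # Intermediate process: start a new session, fork the daemon, exit.
        os.close(read_fd)
        middle = os.getpid()
        os.setsid()
        if os.fork() == 0:
            # Grandchild: wait until the kernel has reparented us away
            # from the (now exited) intermediate process.
            while os.getppid() == middle:
                time.sleep(0.01)
            os.write(write_fd, str(os.getppid()).encode())
            os._exit(0)
        os._exit(0)
    # Original process: reap the intermediate, which orphans the grandchild.
    os.close(write_fd)
    os.waitpid(child, 0)
    adopted_by = int(os.read(read_fd, 32))
    os.close(read_fd)
    return child, adopted_by


if __name__ == "__main__":
    middle, adopter = double_fork_and_report()
    print("caller=%d intermediate=%d grandchild adopted by=%d"
          % (os.getpid(), middle, adopter))
```

The caller never learns the grandchild's PID through `fork()` and cannot
`waitpid()` on it, which is why a runner that only tracks its direct children
loses sight of daemonized processes.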

I've been thinking about this problem for a while and upon advice from
others and my own research I believe the best solution is as follows:
1. We have good reasons for setting `shell=True` when launching the
executor. I'm not comfortable changing this because I'm not sure of all of
the implications if we choose another method.
2. The thermos runners end up forking off the target processes. I think the
runners should be responsible for all of the processes that are created by
the children.
3. We can make the runners responsible for their grandchildren by using
`prctl(2)`[2] and setting the `PR_SET_CHILD_SUBREAPER` bit for each runner.
This means double forked processes will be reparented to the runner and not
PID 1.
4. On task teardown, we make the runners send SIGTERM and SIGKILL to the
PIDs they recorded and any other children they have.
5. Each runner would need to have a SIGCHLD handler to handle zombie
processes that are reparented to it.
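
Steps 3 to 5 could be sketched roughly as follows. This is an illustration
under assumptions, not the actual runner change: it is written for Python 3
on Linux 3.4 or newer, calls `prctl` through `ctypes`, takes the option
values 36 and 37 from `<linux/prctl.h>`, and finds "any other children" by
scanning `/proc`. All helper names (`become_subreaper`, `child_pids`,
`reap_children`, `terminate_children`, `install_sigchld_reaper`) are
hypothetical.

```python
import ctypes
import os
import signal
import time

# Option values from <linux/prctl.h>.
PR_SET_CHILD_SUBREAPER = 36
PR_GET_CHILD_SUBREAPER = 37

_libc = ctypes.CDLL(None, use_errno=True)


def become_subreaper():
    """Ask the kernel to reparent orphaned descendants to this process."""
    args = [ctypes.c_ulong(v) for v in (1, 0, 0, 0)]
    if _libc.prctl(PR_SET_CHILD_SUBREAPER, *args) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


def is_subreaper():
    """Read the flag back via PR_GET_CHILD_SUBREAPER."""
    flag, zero = ctypes.c_int(0), ctypes.c_ulong(0)
    if _libc.prctl(PR_GET_CHILD_SUBREAPER, ctypes.byref(flag),
                   zero, zero, zero) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    return flag.value != 0


def child_pids():
    """Return PIDs whose parent is this process, found by scanning /proc."""
    me, found = os.getpid(), []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open("/proc/%s/stat" % entry, "rb") as f:
                # Format: "pid (comm) state ppid ..."; comm may hold spaces.
                fields = f.read().rsplit(b")", 1)[1].split()
        except (OSError, IndexError):
            continue  # the process exited while we were scanning
        if int(fields[1]) == me:
            found.append(int(entry))
    return found


def reap_children():
    """Collect every child that has already exited; return their PIDs."""
    reaped = []
    while True:
        try:
            pid, _status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # no children left at all
        if pid == 0:
            break  # children exist, but none have exited yet
        reaped.append(pid)
    return reaped


def install_sigchld_reaper():
    """Reap zombies, including adopted ones, whenever SIGCHLD arrives."""
    signal.signal(signal.SIGCHLD, lambda signum, frame: reap_children())


def _alive(pid):
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    return True


def terminate_children(pids, grace_secs=2.0):
    """SIGTERM, then SIGKILL, the given PIDs; return the set that survived."""
    remaining = set(pids)
    for sig in (signal.SIGTERM, signal.SIGKILL):
        for pid in remaining:
            try:
                os.kill(pid, sig)
            except ProcessLookupError:
                pass  # already gone
        deadline = time.monotonic() + grace_secs
        while remaining and time.monotonic() < deadline:
            reap_children()
            remaining = {p for p in remaining if _alive(p)}
            if remaining:
                time.sleep(0.05)
    return remaining
```

A runner would call `become_subreaper()` before forking its target process,
and on teardown call something like
`terminate_children(set(recorded_pids) | set(child_pids()))`. One caveat with
this sketch: a `waitpid(-1, ...)` reaper collects every child, so it can
race with code that waits on a specific PID and would need to be reconciled
with however the runner already tracks the processes it forks.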

[1]: https://github.com/apache/aurora/blob/783baaefb9a814ca01fad78181fe3df3de5b34af/src/main/java/org/apache/aurora/scheduler/configuration/executor/ExecutorModule.java#L109-L135
[2]: http://man7.org/linux/man-pages/man2/prctl.2.html

-- 
Zameer Manji

Re: Aurora, Thermos, PID 1, and You

Posted by Zameer Manji <zm...@apache.org>.

Filed a task https://issues.apache.org/jira/browse/AURORA-1808 to track
this work since there are no objections.

-- 
Zameer Manji

Re: Aurora, Thermos, PID 1, and You

Posted by Zameer Manji <zm...@apache.org>.

Resending this from my @apache.org email in case my previous email got
caught in spam.
