You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@mesos.apache.org by Niklas Nielsen <ni...@qni.dk> on 2014/03/01 21:52:13 UTC

Re: Review Request 18597: Added signal escalation to command executor.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/18597/
-----------------------------------------------------------

(Updated March 1, 2014, 12:52 p.m.)


Review request for mesos.


Bugs: MESOS-1031
    https://issues.apache.org/jira/browse/MESOS-1031


Repository: mesos-git


Description
-------

This patch makes command executor shutdown a bit more graceful
(than sending SIGKILL upfront) by adding signal escalation.
Signal escalation checks for liveness of process trees after SIGTERM
has been sent. If all processes are dead before the given gracePeriod,
shutdown() returns. If not, SIGKILL will be sent to all pids in
the process trees. gracePeriod is set to 3 seconds, in order to issue
SIGKILL before shutdownTimeout in the slave is triggered
(EXECUTOR_SHUTDOWN_GRACE_PERIOD is 5 seconds).


Diffs
-----

  src/launcher/executor.cpp e30d77a 

Diff: https://reviews.apache.org/r/18597/diff/


Testing
-------

Functional testing and make check.


Thanks,

Niklas Nielsen


Re: Review Request 18597: Added signal escalation to command executor.

Posted by Niklas Nielsen <ni...@qni.dk>.

> On March 6, 2014, 8:08 p.m., Ben Mahler wrote:
> > src/launcher/executor.cpp, line 254
> > <https://reviews.apache.org/r/18597/diff/1/?file=506548#file506548line254>
> >
> >     We really try to avoid blocking in a libprocess. Instead of polling and sleeping, can you use a combination of the 'reaped' signal and a timeout delay()?
> >     
> >     If you get the reaped signal, you're either done, or you need to cleanup the remainder of the tree.
> >     
> >     If your delay() expires you need to clean up the tree.
> >     
> >     Also, we need to be careful about our assumptions here, we may not be inside a 'pid' namespace and we may have privileges to signal pids outside our tree. In the face of stale pids, we may accidentally signal other processes after the signal escalation delay, no?
> 
> Niklas Nielsen wrote:
>     Great points - I somehow convinced myself that I needed to keep the executor from returning; must have been late.
>     
>     As to the other approach; Cleaning up the process tree should be possible by picking up the process group id up front and send SIGKILL after the delay or reap. We could do an inline killpg() or change the killtree() extension to work on process group alone (without a known root). What do you think?

FYI; Per our discussion in https://issues.apache.org/jira/browse/MESOS-1031 - I will redo escalation based on pid namespaces and delay().
The fall-back (without pid namespaces: Mac OS X and old Linux's (prior to Linux 2.6.3X)) can potentially leave orphan processes behind, but seems to be the only safe method to use.

Will close this RR for now and open a new for the other signal escalation mechanism.


- Niklas


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/18597/#review36490
-----------------------------------------------------------


On March 3, 2014, 11:13 a.m., Niklas Nielsen wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/18597/
> -----------------------------------------------------------
> 
> (Updated March 3, 2014, 11:13 a.m.)
> 
> 
> Review request for mesos and Ben Mahler.
> 
> 
> Bugs: MESOS-1031
>     https://issues.apache.org/jira/browse/MESOS-1031
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> This patch makes command executor shutdown a bit more graceful
> (than sending SIGKILL upfront) by adding signal escalation.
> Signal escalation checks for liveness of process trees after SIGTERM
> has been sent. If all processes are dead before the given gracePeriod,
> shutdown() returns. If not, SIGKILL will be sent to all pids in
> the process trees. gracePeriod is set to 3 seconds, in order to issue
> SIGKILL before shutdownTimeout in the slave is triggered
> (EXECUTOR_SHUTDOWN_GRACE_PERIOD is 5 seconds).
> 
> 
> Diffs
> -----
> 
>   src/launcher/executor.cpp e30d77a 
> 
> Diff: https://reviews.apache.org/r/18597/diff/
> 
> 
> Testing
> -------
> 
> Functional testing and make check.
> 
> 
> Thanks,
> 
> Niklas Nielsen
> 
>


Re: Review Request 18597: Added signal escalation to command executor.

Posted by Niklas Nielsen <ni...@qni.dk>.

> On March 6, 2014, 8:08 p.m., Ben Mahler wrote:
> > src/launcher/executor.cpp, line 254
> > <https://reviews.apache.org/r/18597/diff/1/?file=506548#file506548line254>
> >
> >     We really try to avoid blocking in a libprocess. Instead of polling and sleeping, can you use a combination of the 'reaped' signal and a timeout delay()?
> >     
> >     If you get the reaped signal, you're either done, or you need to cleanup the remainder of the tree.
> >     
> >     If your delay() expires you need to clean up the tree.
> >     
> >     Also, we need to be careful about our assumptions here, we may not be inside a 'pid' namespace and we may have privileges to signal pids outside our tree. In the face of stale pids, we may accidentally signal other processes after the signal escalation delay, no?

Great points - I somehow convinced myself that I needed to keep the executor from returning; must have been late.

As to the other approach; Cleaning up the process tree should be possible by picking up the process group id up front and send SIGKILL after the delay or reap. We could do an inline killpg() or change the killtree() extension to work on process group alone (without a known root). What do you think?


- Niklas


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/18597/#review36490
-----------------------------------------------------------


On March 3, 2014, 11:13 a.m., Niklas Nielsen wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/18597/
> -----------------------------------------------------------
> 
> (Updated March 3, 2014, 11:13 a.m.)
> 
> 
> Review request for mesos and Ben Mahler.
> 
> 
> Bugs: MESOS-1031
>     https://issues.apache.org/jira/browse/MESOS-1031
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> This patch makes command executor shutdown a bit more graceful
> (than sending SIGKILL upfront) by adding signal escalation.
> Signal escalation checks for liveness of process trees after SIGTERM
> has been sent. If all processes are dead before the given gracePeriod,
> shutdown() returns. If not, SIGKILL will be sent to all pids in
> the process trees. gracePeriod is set to 3 seconds, in order to issue
> SIGKILL before shutdownTimeout in the slave is triggered
> (EXECUTOR_SHUTDOWN_GRACE_PERIOD is 5 seconds).
> 
> 
> Diffs
> -----
> 
>   src/launcher/executor.cpp e30d77a 
> 
> Diff: https://reviews.apache.org/r/18597/diff/
> 
> 
> Testing
> -------
> 
> Functional testing and make check.
> 
> 
> Thanks,
> 
> Niklas Nielsen
> 
>


Re: Review Request 18597: Added signal escalation to command executor.

Posted by Ben Mahler <be...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/18597/#review36490
-----------------------------------------------------------



src/launcher/executor.cpp
<https://reviews.apache.org/r/18597/#comment67475>

    We really try to avoid blocking in a libprocess. Instead of polling and sleeping, can you use a combination of the 'reaped' signal and a timeout delay()?
    
    If you get the reaped signal, you're either done, or you need to cleanup the remainder of the tree.
    
    If your delay() expires you need to clean up the tree.
    
    Also, we need to be careful about our assumptions here, we may not be inside a 'pid' namespace and we may have privileges to signal pids outside our tree. In the face of stale pids, we may accidentally signal other processes after the signal escalation delay, no?


- Ben Mahler


On March 3, 2014, 7:13 p.m., Niklas Nielsen wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/18597/
> -----------------------------------------------------------
> 
> (Updated March 3, 2014, 7:13 p.m.)
> 
> 
> Review request for mesos and Ben Mahler.
> 
> 
> Bugs: MESOS-1031
>     https://issues.apache.org/jira/browse/MESOS-1031
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> This patch makes command executor shutdown a bit more graceful
> (than sending SIGKILL upfront) by adding signal escalation.
> Signal escalation checks for liveness of process trees after SIGTERM
> has been sent. If all processes are dead before the given gracePeriod,
> shutdown() returns. If not, SIGKILL will be sent to all pids in
> the process trees. gracePeriod is set to 3 seconds, in order to issue
> SIGKILL before shutdownTimeout in the slave is triggered
> (EXECUTOR_SHUTDOWN_GRACE_PERIOD is 5 seconds).
> 
> 
> Diffs
> -----
> 
>   src/launcher/executor.cpp e30d77a 
> 
> Diff: https://reviews.apache.org/r/18597/diff/
> 
> 
> Testing
> -------
> 
> Functional testing and make check.
> 
> 
> Thanks,
> 
> Niklas Nielsen
> 
>


Re: Review Request 18597: Added signal escalation to command executor.

Posted by Mesos ReviewBot <de...@mesos.apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/18597/#review36501
-----------------------------------------------------------


Patch looks great!

Reviews applied: [18595, 18594, 18597]

All tests passed.

- Mesos ReviewBot


On March 3, 2014, 7:13 p.m., Niklas Nielsen wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/18597/
> -----------------------------------------------------------
> 
> (Updated March 3, 2014, 7:13 p.m.)
> 
> 
> Review request for mesos and Ben Mahler.
> 
> 
> Bugs: MESOS-1031
>     https://issues.apache.org/jira/browse/MESOS-1031
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> This patch makes command executor shutdown a bit more graceful
> (than sending SIGKILL upfront) by adding signal escalation.
> Signal escalation checks for liveness of process trees after SIGTERM
> has been sent. If all processes are dead before the given gracePeriod,
> shutdown() returns. If not, SIGKILL will be sent to all pids in
> the process trees. gracePeriod is set to 3 seconds, in order to issue
> SIGKILL before shutdownTimeout in the slave is triggered
> (EXECUTOR_SHUTDOWN_GRACE_PERIOD is 5 seconds).
> 
> 
> Diffs
> -----
> 
>   src/launcher/executor.cpp e30d77a 
> 
> Diff: https://reviews.apache.org/r/18597/diff/
> 
> 
> Testing
> -------
> 
> Functional testing and make check.
> 
> 
> Thanks,
> 
> Niklas Nielsen
> 
>


Re: Review Request 18597: Added signal escalation to command executor.

Posted by Niklas Nielsen <ni...@qni.dk>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/18597/
-----------------------------------------------------------

(Updated March 3, 2014, 11:13 a.m.)


Review request for mesos and Ben Mahler.


Bugs: MESOS-1031
    https://issues.apache.org/jira/browse/MESOS-1031


Repository: mesos-git


Description
-------

This patch makes command executor shutdown a bit more graceful
(than sending SIGKILL upfront) by adding signal escalation.
Signal escalation checks for liveness of process trees after SIGTERM
has been sent. If all processes are dead before the given gracePeriod,
shutdown() returns. If not, SIGKILL will be sent to all pids in
the process trees. gracePeriod is set to 3 seconds, in order to issue
SIGKILL before shutdownTimeout in the slave is triggered
(EXECUTOR_SHUTDOWN_GRACE_PERIOD is 5 seconds).


Diffs
-----

  src/launcher/executor.cpp e30d77a 

Diff: https://reviews.apache.org/r/18597/diff/


Testing
-------

Functional testing and make check.


Thanks,

Niklas Nielsen