You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@mesos.apache.org by Kevin Klues <kl...@gmail.com> on 2016/01/19 23:58:05 UTC

Review Request 42519: Fixed race between SchedDriver.{stop(), abort()} and SchedDriver.join().

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/42519/
-----------------------------------------------------------

Review request for mesos, Ben Mahler and Greg Mann.


Bugs: MESOS-4409
    https://issues.apache.org/jira/browse/MESOS-4409


Repository: mesos


Description
-------

Previously, it was possible for join() to return before a schedDriver
was actually fully stopped or aborted (breaking the semantics of the
join() call). The race came from a short circuit in join(), which
simply checked for status != DRIVER_RUNNING before returning. It appears
this short circuit was introduced to handle cases where initialize() or
start() ended up aborting before ever starting the driver to begin with.
However, it unintentionally covers cases where stop() or abort() were
called *after* the driver started running as well.

The problem is that stop() and abort() will change the status
to DRIVER_STOPPED or DRIVER_ABORTED before actually processing
dispatched stop or abort events (which happen asynchronously in a
libprocess thread). Under normal operation, join() would wait for these
events to trigger a latch that allowed the join() call to return.
However, with the short circuit, join() exits immediately even if the
libprocess thread hasn't yet processed the stop() or abort() events.

This commit fixes the semantics of the join() call to avoid this race.
We considered removing the latch completely and replacing it with
process.wait(), but, unlike the latch, this wouldn't ensure that stop()
or abort() was ever called in the first place.


Diffs
-----

  src/sched/sched.cpp 38940b7e2563a2156be2f8c228afdc27c45b6e83 

Diff: https://reviews.apache.org/r/42519/diff/


Testing
-------

Ran the entire 'make check' suite with no failures on both Mac OS X and ubuntu 14.04.


Thanks,

Kevin Klues


Re: Review Request 42519: Fixed race between SchedDriver.{stop(), abort()} and SchedDriver.join().

Posted by Kevin Klues <kl...@gmail.com>.

> On Jan. 22, 2016, 6:58 a.m., Ben Mahler wrote:
> > Logically looks good, just a couple of trivial comments and we can get this landed!

I was going to start fixing things up, but it looks like you took care of it for me.  Thanks!


- Kevin


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/42519/#review115796
-----------------------------------------------------------


On Jan. 19, 2016, 10:58 p.m., Kevin Klues wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/42519/
> -----------------------------------------------------------
> 
> (Updated Jan. 19, 2016, 10:58 p.m.)
> 
> 
> Review request for mesos, Ben Mahler and Greg Mann.
> 
> 
> Bugs: MESOS-4409
>     https://issues.apache.org/jira/browse/MESOS-4409
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Previously, it was possible for join() to return before a schedDriver
> was actually fully stopped or aborted (breaking the semantics of the
> join() call). The race came from a short circuit in join(), which
> simply checked for status != DRIVER_RUNNING before returning. It appears
> this short circuit was introduced to handle cases where initialize() or
> start() ended up aborting before ever starting the driver to begin with.
> However, it unintentionally covers cases where stop() or abort() were
> called *after* the driver started running as well.
> 
> The problem is that stop() and abort() will change the status
> to DRIVER_STOPPED or DRIVER_ABORTED before actually processing
> dispatched stop or abort events (which happen asynchronously in a
> libprocess thread). Under normal operation, join() would wait for these
> events to trigger a latch that allowed the join() call to return.
> However, with the short circuit, join() exits immediately even if the
> libprocess thread hasn't yet processed the stop() or abort() events.
> 
> This commit fixes the semantics of the join() call to avoid this race.
> We considered removing the latch completely and replacing it with
> process.wait(), but, unlike the latch, this wouldn't ensure that stop()
> or abort() was ever called in the first place.
> 
> 
> Diffs
> -----
> 
>   src/sched/sched.cpp 38940b7e2563a2156be2f8c228afdc27c45b6e83 
> 
> Diff: https://reviews.apache.org/r/42519/diff/
> 
> 
> Testing
> -------
> 
> Ran the entire 'make check' suite with no failures on both Mac OS X and ubuntu 14.04.
> 
> 
> Thanks,
> 
> Kevin Klues
> 
>


Re: Review Request 42519: Fixed race between SchedDriver.{stop(), abort()} and SchedDriver.join().

Posted by Ben Mahler <be...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/42519/#review115796
-----------------------------------------------------------


Logically looks good, just a couple of trivial comments and we can get this landed!


src/sched/sched.cpp (lines 1912 - 1913)
<https://reviews.apache.org/r/42519/#comment176853>

    How about a return of the status inside this 'if' block, and no 'else' block? That would be a little more consistent with our coding style (and the original structure of join() before this patch).



src/sched/sched.cpp (line 1918)
<https://reviews.apache.org/r/42519/#comment176852>

    We tend to use newlines between logical blocks of code, so there should be one after the closing brace here.



src/sched/sched.cpp (lines 1919 - 1922)
<https://reviews.apache.org/r/42519/#comment176851>

    Ok, so we don't need to, but how about we lock it just to keep all access to `status` synchronized? This seems pretty subtle to explain?


- Ben Mahler


On Jan. 19, 2016, 10:58 p.m., Kevin Klues wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/42519/
> -----------------------------------------------------------
> 
> (Updated Jan. 19, 2016, 10:58 p.m.)
> 
> 
> Review request for mesos, Ben Mahler and Greg Mann.
> 
> 
> Bugs: MESOS-4409
>     https://issues.apache.org/jira/browse/MESOS-4409
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Previously, it was possible for join() to return before a schedDriver
> was actually fully stopped or aborted (breaking the semantics of the
> join() call). The race came from a short circuit in join(), which
> simply checked for status != DRIVER_RUNNING before returning. It appears
> this short circuit was introduced to handle cases where initialize() or
> start() ended up aborting before ever starting the driver to begin with.
> However, it unintentionally covers cases where stop() or abort() were
> called *after* the driver started running as well.
> 
> The problem is that stop() and abort() will change the status
> to DRIVER_STOPPED or DRIVER_ABORTED before actually processing
> dispatched stop or abort events (which happen asynchronously in a
> libprocess thread). Under normal operation, join() would wait for these
> events to trigger a latch that allowed the join() call to return.
> However, with the short circuit, join() exits immediately even if the
> libprocess thread hasn't yet processed the stop() or abort() events.
> 
> This commit fixes the semantics of the join() call to avoid this race.
> We considered removing the latch completely and replacing it with
> process.wait(), but, unlike the latch, this wouldn't ensure that stop()
> or abort() was ever called in the first place.
> 
> 
> Diffs
> -----
> 
>   src/sched/sched.cpp 38940b7e2563a2156be2f8c228afdc27c45b6e83 
> 
> Diff: https://reviews.apache.org/r/42519/diff/
> 
> 
> Testing
> -------
> 
> Ran the entire 'make check' suite with no failures on both Mac OS X and ubuntu 14.04.
> 
> 
> Thanks,
> 
> Kevin Klues
> 
>


Re: Review Request 42519: Fixed race between SchedDriver.{stop(), abort()} and SchedDriver.join().

Posted by Greg Mann <gr...@mesosphere.io>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/42519/#review115462
-----------------------------------------------------------

Ship it!


Ship It!

- Greg Mann


On Jan. 19, 2016, 10:58 p.m., Kevin Klues wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/42519/
> -----------------------------------------------------------
> 
> (Updated Jan. 19, 2016, 10:58 p.m.)
> 
> 
> Review request for mesos, Ben Mahler and Greg Mann.
> 
> 
> Bugs: MESOS-4409
>     https://issues.apache.org/jira/browse/MESOS-4409
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Previously, it was possible for join() to return before a schedDriver
> was actually fully stopped or aborted (breaking the semantics of the
> join() call). The race came from a short circuit in join(), which
> simply checked for status != DRIVER_RUNNING before returning. It appears
> this short circuit was introduced to handle cases where initialize() or
> start() ended up aborting before ever starting the driver to begin with.
> However, it unintentionally covers cases where stop() or abort() were
> called *after* the driver started running as well.
> 
> The problem is that stop() and abort() will change the status
> to DRIVER_STOPPED or DRIVER_ABORTED before actually processing
> dispatched stop or abort events (which happen asynchronously in a
> libprocess thread). Under normal operation, join() would wait for these
> events to trigger a latch that allowed the join() call to return.
> However, with the short circuit, join() exits immediately even if the
> libprocess thread hasn't yet processed the stop() or abort() events.
> 
> This commit fixes the semantics of the join() call to avoid this race.
> We considered removing the latch completely and replacing it with
> process.wait(), but, unlike the latch, this wouldn't ensure that stop()
> or abort() was ever called in the first place.
> 
> 
> Diffs
> -----
> 
>   src/sched/sched.cpp 38940b7e2563a2156be2f8c228afdc27c45b6e83 
> 
> Diff: https://reviews.apache.org/r/42519/diff/
> 
> 
> Testing
> -------
> 
> Ran the entire 'make check' suite with no failures on both Mac OS X and ubuntu 14.04.
> 
> 
> Thanks,
> 
> Kevin Klues
> 
>