You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@mesos.apache.org by vi...@apache.org on 2016/04/25 23:32:32 UTC

[39/50] mesos git commit: Fixed race between {stop(), abort()} and join() in the scheduler driver.

Fixed race between {stop(), abort()} and join() in the scheduler driver.

Previously, it was possible for join() to return before a schedDriver
was actually fully stopped or aborted (breaking the semantics of the
join() call). The race came from a short circuit in join(), which
simply checked for status != DRIVER_RUNNING before returning. It appears
this short circuit was introduced to handle cases where initialize() or
start() ended up aborting before ever starting the driver to begin with.
However, it unintentionally covers cases where stop() or abort() were
called *after* the driver started running as well.

The problem is that stop() and abort() will change the status
to DRIVER_STOPPED or DRIVER_ABORTED before actually processing
dispatched stop or abort events (which happen asynchronously in a
libprocess thread). Under normal operation, join() would wait for these
events to trigger a latch that allowed the join() call to return.
However, with the short circuit, join() exits immediately even if the
libprocess thread hasn't yet processed the stop() or abort() events.

This commit fixes the semantics of the join() call to avoid this race.
We considered removing the latch completely and replacing it with
process.wait(), but, unlike the latch, this wouldn't ensure that stop()
or abort() was ever called in the first place.

Review: https://reviews.apache.org/r/42519/


Project: http://git-wip-us.apache.org/repos/asf/mesos/repo
Commit: http://git-wip-us.apache.org/repos/asf/mesos/commit/daada4f8
Tree: http://git-wip-us.apache.org/repos/asf/mesos/tree/daada4f8
Diff: http://git-wip-us.apache.org/repos/asf/mesos/diff/daada4f8

Branch: refs/heads/0.26.x
Commit: daada4f8fbd1e12809b002609284d0736c2d187f
Parents: 38398e8
Author: Kevin Klues <kl...@gmail.com>
Authored: Thu Jan 21 23:12:00 2016 -0800
Committer: Michael Park <mp...@apache.org>
Committed: Fri Feb 26 20:59:07 2016 -0800

----------------------------------------------------------------------
 src/sched/sched.cpp | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/mesos/blob/daada4f8/src/sched/sched.cpp
----------------------------------------------------------------------
diff --git a/src/sched/sched.cpp b/src/sched/sched.cpp
index a6faf92..aaeb02f 100644
--- a/src/sched/sched.cpp
+++ b/src/sched/sched.cpp
@@ -1859,19 +1859,20 @@ Status MesosSchedulerDriver::abort()
 
 Status MesosSchedulerDriver::join()
 {
-  // Exit early if the driver is not running.
+  // We can use the process pointer to detect if the driver was ever
+  // started properly. If it wasn't, we return the current status
+  // (which should either be DRIVER_NOT_STARTED or DRIVER_ABORTED).
   synchronized (mutex) {
-    if (status != DRIVER_RUNNING) {
+    if (process == NULL) {
+      CHECK(status == DRIVER_NOT_STARTED || status == DRIVER_ABORTED);
+
       return status;
     }
   }
 
-  // If the driver was running, the latch will be triggered regardless
-  // of the current `status`. Wait for this to happen to signify
-  // termination.
+  // Otherwise, wait for stop() or abort() to trigger the latch.
   CHECK_NOTNULL(latch)->await();
 
-  // Now return the current `status` of the driver.
   synchronized (mutex) {
     CHECK(status == DRIVER_ABORTED || status == DRIVER_STOPPED);