You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@mesos.apache.org by bm...@apache.org on 2017/10/22 19:18:58 UTC

mesos git commit: Fixed a crash in ProcessManager::resume due to race.

Repository: mesos
Updated Branches:
  refs/heads/master a85a22baa -> 64c9b5caf


Fixed a crash in ProcessManager::resume due to race.

When ProcessManager::resume moves the process into a BLOCKED
state, it's possible that a TerminateEvent is enqueued and the
process is placed back in the run queue. Another worker thread
can pick it off the run queue, process the TerminateEvent, and
delete the process! At this point, when the original thread
in ProcessManager::resume tries to check if any events were
enqueued before transitioning to BLOCKED, it will access a
deleted process and crash.

Some example crash paths involving process::Latch below.

Path 1:

T1 creates latch, spawns latch process, and puts it in run queue
T1 waits on latch
T1 ProcessManager::wait on latch see it in BOTTOM

T2 worker thread dequeues the latch process
T2 ProcessManager::resume runs initialize, moves it to READY
T2 ProcessManager::resume sees empty queue
T2 ProcessManager::resume sets to BLOCKED

T3 triggers the latch, terminates the latch process
T3 enqueue TerminateEvent
T3 enqueue sees state BLOCKED
T3 swaps from BLOCKED->READY & enqueues latch process into run queue

T1 extracts latch process from run queue
T1 ProcessManager::resume sees READY
T1 ProcessManager::resume dequeues terminate event
T1 ProcessManager::resume calls ProcessManager::cleanup
T1 ProcessManager::cleanup sets to TERMINATING
T1 ProcessManager::cleanup decommissions event queue
T1 ProcessManager::cleanup waits for latch refs to go away
T1 ProcessManager::cleanup calls SocketManager::exited
T1 ProcessManager::cleanup opens gate
T1 ProcessManager::cleanup deletes the latch process

T2 ProcessManager::resume checks if event queue is empty again (crash)

Path 2:

T1 creates latch, spawns latch process, and puts it in run queue
T1 waits on latch
T1 ProcessManager::wait on latch see it in BOTTOM
T1 ProcessManager::wait extracts latch process from run queue
T1 ProcessManager::resume runs initialize, moves it to READY
T1 ProcessManager::resume sees empty queue
T1 ProcessManager::resume sets to BLOCKED

T3 triggers the latch, terminates the latch process
T3 enqueue TerminateEvent
T3 enqueue sees state BLOCKED
T3 swaps from BLOCKED->READY & enqueues latch process into run queue

T2 worker thread dequeues the latch process
T2 ProcessManager::resume sees READY
T2 ProcessManager::resume dequeues terminate event
T2 ProcessManager::resume calls ProcessManager::cleanup
T2 ProcessManager::cleanup sets to TERMINATING
T2 ProcessManager::cleanup decommissions event queue
T2 ProcessManager::cleanup waits for latch refs to go away
T2 ProcessManager::cleanup calls SocketManager::exited
T2 ProcessManager::cleanup opens gate
T2 ProcessManager::resume deletes the latch process

T1 ProcessManager::resume checks if event queue is empty again (crash)

Review: https://reviews.apache.org/r/63208


Project: http://git-wip-us.apache.org/repos/asf/mesos/repo
Commit: http://git-wip-us.apache.org/repos/asf/mesos/commit/64c9b5ca
Tree: http://git-wip-us.apache.org/repos/asf/mesos/tree/64c9b5ca
Diff: http://git-wip-us.apache.org/repos/asf/mesos/diff/64c9b5ca

Branch: refs/heads/master
Commit: 64c9b5cafced6da5e5c209387ba23f182e4f1a7a
Parents: a85a22b
Author: Benjamin Mahler <bm...@apache.org>
Authored: Sat Oct 21 19:20:00 2017 -0700
Committer: Benjamin Mahler <bm...@apache.org>
Committed: Sun Oct 22 12:09:51 2017 -0700

----------------------------------------------------------------------
 3rdparty/libprocess/src/process.cpp | 45 ++++++++++++++++++++------------
 1 file changed, 28 insertions(+), 17 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/mesos/blob/64c9b5ca/3rdparty/libprocess/src/process.cpp
----------------------------------------------------------------------
diff --git a/3rdparty/libprocess/src/process.cpp b/3rdparty/libprocess/src/process.cpp
index 4d8d483..71ae712 100644
--- a/3rdparty/libprocess/src/process.cpp
+++ b/3rdparty/libprocess/src/process.cpp
@@ -3243,6 +3243,7 @@ void ProcessManager::resume(ProcessBase* process)
 
   VLOG(3) << "Resuming " << process->pid << " at " << Clock::now();
 
+  bool manage = process->manage;
   bool terminate = false;
   bool blocked = false;
 
@@ -3259,6 +3260,11 @@ void ProcessManager::resume(ProcessBase* process)
     process->state.store(state);
   }
 
+  // We must hold a reference to the process because it's possible
+  // that another worker races ahead and deletes the process after
+  // we set the state to BLOCKED (see the comment below).
+  ProcessReference reference = process->reference;
+
   while (!terminate && !blocked) {
     Event* event = nullptr;
 
@@ -3269,23 +3275,31 @@ void ProcessManager::resume(ProcessBase* process)
     if (!process->events->consumer.empty()) {
       event = process->events->consumer.dequeue();
     } else {
+      // We now transition the process to BLOCKED. It's possible that
+      // events get enqueued while we're still in the READY state.
+      // If this happens, the process would not have been enqueued
+      // into the run queue and we need to continue processing the
+      // events!
+      //
+      // However, when checking for such events, we need to be
+      // careful not to process the events if they were enqueued
+      // *after* we transitioned to BLOCKED. In this case, the
+      // process was put in the run queue and transitioned back
+      // to READY by `ProcessBase::enqueue`. So, we check for this
+      // case by seeing if we can atomically swap the state from
+      // BLOCKED to READY.
+      //
+      // We also need to make sure we hold a reference to the
+      // process when we check the queue again (see the reference
+      // held above the loop), as it's possible that another worker
+      // thread dequeued the process off the run queue and raced
+      // ahead processing a termination event and deleted the
+      // process!
       state = ProcessBase::State::BLOCKED;
       process->state.store(state);
       blocked = true;
 
-      // Now check that we didn't miss any events that got added
-      // before we set ourselves to BLOCKED since we won't have been
-      // added to the run queue in those circumstances so we need to
-      // serve those events!
       if (!process->events->consumer.empty()) {
-        // Make sure the state is in READY! Either we need to
-        // explicitly do this because `ProcessBase::enqueue` saw us as
-        // READY (or BOTTOM) and didn't change the state or we're
-        // racing with `ProcessBase::enqueue` because they saw us at
-        // BLOCKED and are trying to change the state. If they change
-        // the state then they'll also enqueue this process, which
-        // means we need to bail because another thread might resume
-        // (and the reason we'll bail is because `blocked` is true)!
         if (process->state.compare_exchange_strong(
                 state,
                 ProcessBase::State::READY)) {
@@ -3355,11 +3369,8 @@ void ProcessManager::resume(ProcessBase* process)
     }
   }
 
-  // Must read and store if we are managing `process` because in the
-  // event we are not managing `process` it might get deallocated
-  // after we open the gate in `ProcessManager::cleanup()` and thus
-  // can't be dereferenced.
-  bool manage = process->manage;
+  // Clear the reference before we cleanup!
+  reference = ProcessReference();
 
   if (terminate) {
     cleanup(process);