You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@mesos.apache.org by Benjamin Mahler <bm...@apache.org> on 2017/10/22 03:00:15 UTC

Review Request 63208: Fixed a crash in ProcessManager::resume due to race.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/63208/
-----------------------------------------------------------

Review request for mesos and Benjamin Hindman.


Bugs: MESOS-7921
    https://issues.apache.org/jira/browse/MESOS-7921


Repository: mesos


Description
-------

When ProcessManager::resume moves the process into a BLOCKED
state, it's possible that a TerminateEvent is enqueued and the
process is placed back in the run queue. Another worker thread
can pick it off the run queue, process the TerminateEvent, and
delete the process! At this point, when the original thread
in ProcessManager::resume tries to check if any events were
enqueued before transitioning to BLOCKED, it will access a
deleted process and crash.

Some example crash paths involving process::Latch below.

Path 1:

T1 creates latch, spawns latch process, and puts it in run queue
T1 waits on latch
T1 ProcessManager::wait on latch see it in BOTTOM

T2 worker thread dequeues the latch process
T2 ProcessManager::resume runs initialize, moves it to READY
T2 ProcessManager::resume sees empty queue
T2 ProcessManager::resume sets to BLOCKED

T3 triggers the latch, terminates the latch process
T3 enqueue TerminateEvent
T3 enqueue sees state BLOCKED
T3 swaps from BLOCKED->READY & enqueues latch process into run queue

T1 extracts latch process from run queue
T1 ProcessManager::resume sees READY
T1 ProcessManager::resume dequeues terminate event
T1 ProcessManager::resume calls ProcessManager::cleanup
T1 ProcessManager::cleanup sets to TERMINATING
T1 ProcessManager::cleanup decommissions event queue
T1 ProcessManager::cleanup waits for latch refs to go away
T1 ProcessManager::cleanup calls SocketManager::exited
T1 ProcessManager::cleanup opens gate
T1 ProcessManager::cleanup deletes the latch process

T2 ProcessManager::resume checks if event queue is empty again (crash)

Path 2:

T1 creates latch, spawns latch process, and puts it in run queue
T1 waits on latch
T1 ProcessManager::wait on latch see it in BOTTOM
T1 ProcessManager::wait extracts latch process from run queue
T1 ProcessManager::resume runs initialize, moves it to READY
T1 ProcessManager::resume sees empty queue
T1 ProcessManager::resume sets to BLOCKED

T3 triggers the latch, terminates the latch process
T3 enqueue TerminateEvent
T3 enqueue sees state BLOCKED
T3 swaps from BLOCKED->READY & enqueues latch process into run queue

T2 worker thread dequeues the latch process
T2 ProcessManager::resume sees READY
T2 ProcessManager::resume dequeues terminate event
T2 ProcessManager::resume calls ProcessManager::cleanup
T2 ProcessManager::cleanup sets to TERMINATING
T2 ProcessManager::cleanup decommissions event queue
T2 ProcessManager::cleanup waits for latch refs to go away
T2 ProcessManager::cleanup calls SocketManager::exited
T2 ProcessManager::cleanup opens gate
T2 ProcessManager::resume deletes the latch process

T1 ProcessManager::resume checks if event queue is empty again (crash)


Diffs
-----

  3rdparty/libprocess/src/process.cpp 4d8d483cfa5c3bac8bbe6a985f1cc2b737ae691e 


Diff: https://reviews.apache.org/r/63208/diff/1/


Testing
-------

* Tested by injecting a sleep [here](https://github.com/apache/mesos/blob/1.4.0/3rdparty/libprocess/src/process.cpp#L3278).
* Ran throughput benchmark, results are highly variable across runs, see results [here](https://docs.google.com/spreadsheets/d/1yoQtvAWiojLWBV0TlqjCxdc1RMtNERGSEZzgCxkWoxI/edit?usp=sharing).


Thanks,

Benjamin Mahler


Re: Review Request 63208: Fixed a crash in ProcessManager::resume due to race.

Posted by Benjamin Mahler <bm...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/63208/
-----------------------------------------------------------

(Updated Oct. 22, 2017, 7:11 p.m.)


Review request for mesos and Benjamin Hindman.


Changes
-------

Updated the fix to also avoid accessing `process->manage` unsafely by taking a reference outside the loop.
Updated the benchmark graphs.


Bugs: MESOS-7921
    https://issues.apache.org/jira/browse/MESOS-7921


Repository: mesos


Description
-------

When ProcessManager::resume moves the process into a BLOCKED
state, it's possible that a TerminateEvent is enqueued and the
process is placed back in the run queue. Another worker thread
can pick it off the run queue, process the TerminateEvent, and
delete the process! At this point, when the original thread
in ProcessManager::resume tries to check if any events were
enqueued before transitioning to BLOCKED, it will access a
deleted process and crash.

Some example crash paths involving process::Latch below.

Path 1:

T1 creates latch, spawns latch process, and puts it in run queue
T1 waits on latch
T1 ProcessManager::wait on latch see it in BOTTOM

T2 worker thread dequeues the latch process
T2 ProcessManager::resume runs initialize, moves it to READY
T2 ProcessManager::resume sees empty queue
T2 ProcessManager::resume sets to BLOCKED

T3 triggers the latch, terminates the latch process
T3 enqueue TerminateEvent
T3 enqueue sees state BLOCKED
T3 swaps from BLOCKED->READY & enqueues latch process into run queue

T1 extracts latch process from run queue
T1 ProcessManager::resume sees READY
T1 ProcessManager::resume dequeues terminate event
T1 ProcessManager::resume calls ProcessManager::cleanup
T1 ProcessManager::cleanup sets to TERMINATING
T1 ProcessManager::cleanup decommissions event queue
T1 ProcessManager::cleanup waits for latch refs to go away
T1 ProcessManager::cleanup calls SocketManager::exited
T1 ProcessManager::cleanup opens gate
T1 ProcessManager::cleanup deletes the latch process

T2 ProcessManager::resume checks if event queue is empty again (crash)

Path 2:

T1 creates latch, spawns latch process, and puts it in run queue
T1 waits on latch
T1 ProcessManager::wait on latch see it in BOTTOM
T1 ProcessManager::wait extracts latch process from run queue
T1 ProcessManager::resume runs initialize, moves it to READY
T1 ProcessManager::resume sees empty queue
T1 ProcessManager::resume sets to BLOCKED

T3 triggers the latch, terminates the latch process
T3 enqueue TerminateEvent
T3 enqueue sees state BLOCKED
T3 swaps from BLOCKED->READY & enqueues latch process into run queue

T2 worker thread dequeues the latch process
T2 ProcessManager::resume sees READY
T2 ProcessManager::resume dequeues terminate event
T2 ProcessManager::resume calls ProcessManager::cleanup
T2 ProcessManager::cleanup sets to TERMINATING
T2 ProcessManager::cleanup decommissions event queue
T2 ProcessManager::cleanup waits for latch refs to go away
T2 ProcessManager::cleanup calls SocketManager::exited
T2 ProcessManager::cleanup opens gate
T2 ProcessManager::resume deletes the latch process

T1 ProcessManager::resume checks if event queue is empty again (crash)


Diffs (updated)
-----

  3rdparty/libprocess/src/process.cpp 4d8d483cfa5c3bac8bbe6a985f1cc2b737ae691e 


Diff: https://reviews.apache.org/r/63208/diff/2/

Changes: https://reviews.apache.org/r/63208/diff/1-2/


Testing
-------

* Tested by injecting a sleep [here](https://github.com/apache/mesos/blob/1.4.0/3rdparty/libprocess/src/process.cpp#L3278).
* Ran throughput benchmark, results are highly variable across runs, see results [here](https://docs.google.com/spreadsheets/d/1yoQtvAWiojLWBV0TlqjCxdc1RMtNERGSEZzgCxkWoxI/edit?usp=sharing).


Thanks,

Benjamin Mahler


Re: Review Request 63208: Fixed a crash in ProcessManager::resume due to race.

Posted by Mesos Reviewbot <re...@mesos.apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/63208/#review188905
-----------------------------------------------------------



Patch looks great!

Reviews applied: [63208]

Passed command: export OS='ubuntu:14.04' BUILDTOOL='autotools' COMPILER='gcc' CONFIGURATION='--verbose --disable-libtool-wrappers' ENVIRONMENT='GLOG_v=1 MESOS_VERBOSE=1'; ./support/docker-build.sh

- Mesos Reviewbot


On Oct. 22, 2017, 3 a.m., Benjamin Mahler wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/63208/
> -----------------------------------------------------------
> 
> (Updated Oct. 22, 2017, 3 a.m.)
> 
> 
> Review request for mesos and Benjamin Hindman.
> 
> 
> Bugs: MESOS-7921
>     https://issues.apache.org/jira/browse/MESOS-7921
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> When ProcessManager::resume moves the process into a BLOCKED
> state, it's possible that a TerminateEvent is enqueued and the
> process is placed back in the run queue. Another worker thread
> can pick it off the run queue, process the TerminateEvent, and
> delete the process! At this point, when the original thread
> in ProcessManager::resume tries to check if any events were
> enqueued before transitioning to BLOCKED, it will access a
> deleted process and crash.
> 
> Some example crash paths involving process::Latch below.
> 
> Path 1:
> 
> T1 creates latch, spawns latch process, and puts it in run queue
> T1 waits on latch
> T1 ProcessManager::wait on latch see it in BOTTOM
> 
> T2 worker thread dequeues the latch process
> T2 ProcessManager::resume runs initialize, moves it to READY
> T2 ProcessManager::resume sees empty queue
> T2 ProcessManager::resume sets to BLOCKED
> 
> T3 triggers the latch, terminates the latch process
> T3 enqueue TerminateEvent
> T3 enqueue sees state BLOCKED
> T3 swaps from BLOCKED->READY & enqueues latch process into run queue
> 
> T1 extracts latch process from run queue
> T1 ProcessManager::resume sees READY
> T1 ProcessManager::resume dequeues terminate event
> T1 ProcessManager::resume calls ProcessManager::cleanup
> T1 ProcessManager::cleanup sets to TERMINATING
> T1 ProcessManager::cleanup decommissions event queue
> T1 ProcessManager::cleanup waits for latch refs to go away
> T1 ProcessManager::cleanup calls SocketManager::exited
> T1 ProcessManager::cleanup opens gate
> T1 ProcessManager::cleanup deletes the latch process
> 
> T2 ProcessManager::resume checks if event queue is empty again (crash)
> 
> Path 2:
> 
> T1 creates latch, spawns latch process, and puts it in run queue
> T1 waits on latch
> T1 ProcessManager::wait on latch see it in BOTTOM
> T1 ProcessManager::wait extracts latch process from run queue
> T1 ProcessManager::resume runs initialize, moves it to READY
> T1 ProcessManager::resume sees empty queue
> T1 ProcessManager::resume sets to BLOCKED
> 
> T3 triggers the latch, terminates the latch process
> T3 enqueue TerminateEvent
> T3 enqueue sees state BLOCKED
> T3 swaps from BLOCKED->READY & enqueues latch process into run queue
> 
> T2 worker thread dequeues the latch process
> T2 ProcessManager::resume sees READY
> T2 ProcessManager::resume dequeues terminate event
> T2 ProcessManager::resume calls ProcessManager::cleanup
> T2 ProcessManager::cleanup sets to TERMINATING
> T2 ProcessManager::cleanup decommissions event queue
> T2 ProcessManager::cleanup waits for latch refs to go away
> T2 ProcessManager::cleanup calls SocketManager::exited
> T2 ProcessManager::cleanup opens gate
> T2 ProcessManager::resume deletes the latch process
> 
> T1 ProcessManager::resume checks if event queue is empty again (crash)
> 
> 
> Diffs
> -----
> 
>   3rdparty/libprocess/src/process.cpp 4d8d483cfa5c3bac8bbe6a985f1cc2b737ae691e 
> 
> 
> Diff: https://reviews.apache.org/r/63208/diff/1/
> 
> 
> Testing
> -------
> 
> * Tested by injecting a sleep [here](https://github.com/apache/mesos/blob/1.4.0/3rdparty/libprocess/src/process.cpp#L3278).
> * Ran throughput benchmark, results are highly variable across runs, see results [here](https://docs.google.com/spreadsheets/d/1yoQtvAWiojLWBV0TlqjCxdc1RMtNERGSEZzgCxkWoxI/edit?usp=sharing).
> 
> 
> Thanks,
> 
> Benjamin Mahler
> 
>


Re: Review Request 63208: Fixed a crash in ProcessManager::resume due to race.

Posted by Benjamin Mahler <bm...@apache.org>.

> On Oct. 22, 2017, 9:09 a.m., Benjamin Hindman wrote:
> > 3rdparty/libprocess/src/process.cpp
> > Lines 3291 (patched)
> > <https://reviews.apache.org/r/63208/diff/1/?file=1865231#file1865231line3291>
> >
> >     Any way we could not have to grab this reference each time we loop? This is in the hot path.

Yeah, I'll do that, it's also not a complete fix I realized because there's an access to `process->manage` below the loop!


- Benjamin


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/63208/#review188906
-----------------------------------------------------------


On Oct. 22, 2017, 3 a.m., Benjamin Mahler wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/63208/
> -----------------------------------------------------------
> 
> (Updated Oct. 22, 2017, 3 a.m.)
> 
> 
> Review request for mesos and Benjamin Hindman.
> 
> 
> Bugs: MESOS-7921
>     https://issues.apache.org/jira/browse/MESOS-7921
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> When ProcessManager::resume moves the process into a BLOCKED
> state, it's possible that a TerminateEvent is enqueued and the
> process is placed back in the run queue. Another worker thread
> can pick it off the run queue, process the TerminateEvent, and
> delete the process! At this point, when the original thread
> in ProcessManager::resume tries to check if any events were
> enqueued before transitioning to BLOCKED, it will access a
> deleted process and crash.
> 
> Some example crash paths involving process::Latch below.
> 
> Path 1:
> 
> T1 creates latch, spawns latch process, and puts it in run queue
> T1 waits on latch
> T1 ProcessManager::wait on latch see it in BOTTOM
> 
> T2 worker thread dequeues the latch process
> T2 ProcessManager::resume runs initialize, moves it to READY
> T2 ProcessManager::resume sees empty queue
> T2 ProcessManager::resume sets to BLOCKED
> 
> T3 triggers the latch, terminates the latch process
> T3 enqueue TerminateEvent
> T3 enqueue sees state BLOCKED
> T3 swaps from BLOCKED->READY & enqueues latch process into run queue
> 
> T1 extracts latch process from run queue
> T1 ProcessManager::resume sees READY
> T1 ProcessManager::resume dequeues terminate event
> T1 ProcessManager::resume calls ProcessManager::cleanup
> T1 ProcessManager::cleanup sets to TERMINATING
> T1 ProcessManager::cleanup decommissions event queue
> T1 ProcessManager::cleanup waits for latch refs to go away
> T1 ProcessManager::cleanup calls SocketManager::exited
> T1 ProcessManager::cleanup opens gate
> T1 ProcessManager::cleanup deletes the latch process
> 
> T2 ProcessManager::resume checks if event queue is empty again (crash)
> 
> Path 2:
> 
> T1 creates latch, spawns latch process, and puts it in run queue
> T1 waits on latch
> T1 ProcessManager::wait on latch see it in BOTTOM
> T1 ProcessManager::wait extracts latch process from run queue
> T1 ProcessManager::resume runs initialize, moves it to READY
> T1 ProcessManager::resume sees empty queue
> T1 ProcessManager::resume sets to BLOCKED
> 
> T3 triggers the latch, terminates the latch process
> T3 enqueue TerminateEvent
> T3 enqueue sees state BLOCKED
> T3 swaps from BLOCKED->READY & enqueues latch process into run queue
> 
> T2 worker thread dequeues the latch process
> T2 ProcessManager::resume sees READY
> T2 ProcessManager::resume dequeues terminate event
> T2 ProcessManager::resume calls ProcessManager::cleanup
> T2 ProcessManager::cleanup sets to TERMINATING
> T2 ProcessManager::cleanup decommissions event queue
> T2 ProcessManager::cleanup waits for latch refs to go away
> T2 ProcessManager::cleanup calls SocketManager::exited
> T2 ProcessManager::cleanup opens gate
> T2 ProcessManager::resume deletes the latch process
> 
> T1 ProcessManager::resume checks if event queue is empty again (crash)
> 
> 
> Diffs
> -----
> 
>   3rdparty/libprocess/src/process.cpp 4d8d483cfa5c3bac8bbe6a985f1cc2b737ae691e 
> 
> 
> Diff: https://reviews.apache.org/r/63208/diff/1/
> 
> 
> Testing
> -------
> 
> * Tested by injecting a sleep [here](https://github.com/apache/mesos/blob/1.4.0/3rdparty/libprocess/src/process.cpp#L3278).
> * Ran throughput benchmark, results are highly variable across runs, see results [here](https://docs.google.com/spreadsheets/d/1yoQtvAWiojLWBV0TlqjCxdc1RMtNERGSEZzgCxkWoxI/edit?usp=sharing).
> 
> 
> Thanks,
> 
> Benjamin Mahler
> 
>


Re: Review Request 63208: Fixed a crash in ProcessManager::resume due to race.

Posted by Benjamin Hindman <be...@berkeley.edu>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/63208/#review188906
-----------------------------------------------------------


Fix it, then Ship it!




Good debugging! Thanks Ben!


3rdparty/libprocess/src/process.cpp
Lines 3291 (patched)
<https://reviews.apache.org/r/63208/#comment265878>

    Any way we could not have to grab this reference each time we loop? This is in the hot path.


- Benjamin Hindman


On Oct. 22, 2017, 3 a.m., Benjamin Mahler wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/63208/
> -----------------------------------------------------------
> 
> (Updated Oct. 22, 2017, 3 a.m.)
> 
> 
> Review request for mesos and Benjamin Hindman.
> 
> 
> Bugs: MESOS-7921
>     https://issues.apache.org/jira/browse/MESOS-7921
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> When ProcessManager::resume moves the process into a BLOCKED
> state, it's possible that a TerminateEvent is enqueued and the
> process is placed back in the run queue. Another worker thread
> can pick it off the run queue, process the TerminateEvent, and
> delete the process! At this point, when the original thread
> in ProcessManager::resume tries to check if any events were
> enqueued before transitioning to BLOCKED, it will access a
> deleted process and crash.
> 
> Some example crash paths involving process::Latch below.
> 
> Path 1:
> 
> T1 creates latch, spawns latch process, and puts it in run queue
> T1 waits on latch
> T1 ProcessManager::wait on latch see it in BOTTOM
> 
> T2 worker thread dequeues the latch process
> T2 ProcessManager::resume runs initialize, moves it to READY
> T2 ProcessManager::resume sees empty queue
> T2 ProcessManager::resume sets to BLOCKED
> 
> T3 triggers the latch, terminates the latch process
> T3 enqueue TerminateEvent
> T3 enqueue sees state BLOCKED
> T3 swaps from BLOCKED->READY & enqueues latch process into run queue
> 
> T1 extracts latch process from run queue
> T1 ProcessManager::resume sees READY
> T1 ProcessManager::resume dequeues terminate event
> T1 ProcessManager::resume calls ProcessManager::cleanup
> T1 ProcessManager::cleanup sets to TERMINATING
> T1 ProcessManager::cleanup decommissions event queue
> T1 ProcessManager::cleanup waits for latch refs to go away
> T1 ProcessManager::cleanup calls SocketManager::exited
> T1 ProcessManager::cleanup opens gate
> T1 ProcessManager::cleanup deletes the latch process
> 
> T2 ProcessManager::resume checks if event queue is empty again (crash)
> 
> Path 2:
> 
> T1 creates latch, spawns latch process, and puts it in run queue
> T1 waits on latch
> T1 ProcessManager::wait on latch see it in BOTTOM
> T1 ProcessManager::wait extracts latch process from run queue
> T1 ProcessManager::resume runs initialize, moves it to READY
> T1 ProcessManager::resume sees empty queue
> T1 ProcessManager::resume sets to BLOCKED
> 
> T3 triggers the latch, terminates the latch process
> T3 enqueue TerminateEvent
> T3 enqueue sees state BLOCKED
> T3 swaps from BLOCKED->READY & enqueues latch process into run queue
> 
> T2 worker thread dequeues the latch process
> T2 ProcessManager::resume sees READY
> T2 ProcessManager::resume dequeues terminate event
> T2 ProcessManager::resume calls ProcessManager::cleanup
> T2 ProcessManager::cleanup sets to TERMINATING
> T2 ProcessManager::cleanup decommissions event queue
> T2 ProcessManager::cleanup waits for latch refs to go away
> T2 ProcessManager::cleanup calls SocketManager::exited
> T2 ProcessManager::cleanup opens gate
> T2 ProcessManager::resume deletes the latch process
> 
> T1 ProcessManager::resume checks if event queue is empty again (crash)
> 
> 
> Diffs
> -----
> 
>   3rdparty/libprocess/src/process.cpp 4d8d483cfa5c3bac8bbe6a985f1cc2b737ae691e 
> 
> 
> Diff: https://reviews.apache.org/r/63208/diff/1/
> 
> 
> Testing
> -------
> 
> * Tested by injecting a sleep [here](https://github.com/apache/mesos/blob/1.4.0/3rdparty/libprocess/src/process.cpp#L3278).
> * Ran throughput benchmark, results are highly variable across runs, see results [here](https://docs.google.com/spreadsheets/d/1yoQtvAWiojLWBV0TlqjCxdc1RMtNERGSEZzgCxkWoxI/edit?usp=sharing).
> 
> 
> Thanks,
> 
> Benjamin Mahler
> 
>