You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@mesos.apache.org by Jie Yu <yu...@gmail.com> on 2012/07/16 23:40:12 UTC
Review Request: Fixed a bug in ProcessManager::cleanup() that will cause
non-deterministic assertion failure (e.g. CoodinatorTest.Elect).
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/5992/
-----------------------------------------------------------
Review request for mesos, Benjamin Hindman and Vinod Kone.
Description
-------
The bug description can be found here:
https://issues.apache.org/jira/browse/MESOS-232
The cause of this bug is that the pointer "process" in ProcessManager::cleanup() is referenced in the CHECK statement after it has been cleaned up. The same memory chunk (originally pointed by "process") maybe reused by the memory allocator to store another process instance, and that process is put into the runq somehow.
To prove the above theory, I did a simple experiment. I changed the code of ProcessManager::cleanup() as follows:
void ProcessManager::cleanup(ProcessBase* process)
{
...
// Possible gate non-libprocess threads are waiting at.
Gate* gate = NULL;
// ADDED BY JIE YU FOR EXPERIMENT.
// Save the process id to comparison.
UPID savedPid = process->pid;
...
// Confirm process not in runq.
synchronized (runq) {
// ADDED BY JIE YU FOR EXPERIMENT.
// Print saved and current pids if the assertion is going to fail.
if (find(runq.begin(), runq.end(), process) != runq.end()) {
std::cerr << savedPid << " " << process->pid << std::endl;
}
CHECK(find(runq.begin(), runq.end(), process) == runq.end());
}
...
}
Now, here is the output when the assertion fails:
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from CoordinatorTest
[ RUN ] CoordinatorTest.Elect
__latch__(3258)@127.0.1.1:59292 __latch__(3262)@127.0.1.1:59292
F0716 14:03:54.896371 24575 process.cpp:2348] Check failed: find(runq.begin(), runq.end(), process) == runq.end()
*** Check failure stack trace: ***
@ 0x7ffff79c944d google::LogMessage::Fail()
@ 0x7ffff79cbbdf google::LogMessage::SendToLog()
@ 0x7ffff79c904b google::LogMessage::Flush()
@ 0x7ffff79cc46d google::LogMessageFatal::~LogMessageFatal()
@ 0x7ffff78cbd00 process::ProcessManager::cleanup()
@ 0x7ffff78cb527 process::ProcessManager::resume()
@ 0x7ffff78c3e9c process::schedule()
@ 0x7ffff69aae9a start_thread
@ 0x7ffff66d84bd (unknown)
As can be seen, the savedPid is different from process->pid, which proved my theory.
Diffs
-----
third_party/libprocess/src/process.cpp b560b7c
Diff: https://reviews.apache.org/r/5992/diff/
Testing
-------
make check.
Thanks,
Jie Yu
Re: Review Request: Fixed a bug in ProcessManager::cleanup() that will cause
non-deterministic assertion failure (e.g. CoodinatorTest.Elect).
Posted by Benjamin Hindman <be...@berkeley.edu>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/5992/#review9308
-----------------------------------------------------------
Ship it!
Ship It!
- Benjamin Hindman
On July 16, 2012, 9:40 p.m., Jie Yu wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/5992/
> -----------------------------------------------------------
>
> (Updated July 16, 2012, 9:40 p.m.)
>
>
> Review request for mesos, Benjamin Hindman and Vinod Kone.
>
>
> Description
> -------
>
> The bug description can be found here:
> https://issues.apache.org/jira/browse/MESOS-232
>
> The cause of this bug is that the pointer "process" in ProcessManager::cleanup() is referenced in the CHECK statement after it has been cleaned up. The same memory chunk (originally pointed by "process") maybe reused by the memory allocator to store another process instance, and that process is put into the runq somehow.
>
> To prove the above theory, I did a simple experiment. I changed the code of ProcessManager::cleanup() as follows:
>
> void ProcessManager::cleanup(ProcessBase* process)
> {
> ...
> // Possible gate non-libprocess threads are waiting at.
> Gate* gate = NULL;
>
> // ADDED BY JIE YU FOR EXPERIMENT.
> // Save the process id to comparison.
> UPID savedPid = process->pid;
> ...
>
> // Confirm process not in runq.
> synchronized (runq) {
> // ADDED BY JIE YU FOR EXPERIMENT.
> // Print saved and current pids if the assertion is going to fail.
> if (find(runq.begin(), runq.end(), process) != runq.end()) {
> std::cerr << savedPid << " " << process->pid << std::endl;
> }
>
> CHECK(find(runq.begin(), runq.end(), process) == runq.end());
> }
> ...
> }
>
> Now, here is the output when the assertion fails:
>
> [==========] Running 1 test from 1 test case.
> [----------] Global test environment set-up.
> [----------] 1 test from CoordinatorTest
> [ RUN ] CoordinatorTest.Elect
> __latch__(3258)@127.0.1.1:59292 __latch__(3262)@127.0.1.1:59292
> F0716 14:03:54.896371 24575 process.cpp:2348] Check failed: find(runq.begin(), runq.end(), process) == runq.end()
> *** Check failure stack trace: ***
> @ 0x7ffff79c944d google::LogMessage::Fail()
> @ 0x7ffff79cbbdf google::LogMessage::SendToLog()
> @ 0x7ffff79c904b google::LogMessage::Flush()
> @ 0x7ffff79cc46d google::LogMessageFatal::~LogMessageFatal()
> @ 0x7ffff78cbd00 process::ProcessManager::cleanup()
> @ 0x7ffff78cb527 process::ProcessManager::resume()
> @ 0x7ffff78c3e9c process::schedule()
> @ 0x7ffff69aae9a start_thread
> @ 0x7ffff66d84bd (unknown)
>
> As can be seen, the savedPid is different from process->pid, which proved my theory.
>
>
> Diffs
> -----
>
> third_party/libprocess/src/process.cpp b560b7c
>
> Diff: https://reviews.apache.org/r/5992/diff/
>
>
> Testing
> -------
>
> make check.
>
>
> Thanks,
>
> Jie Yu
>
>