You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@mesos.apache.org by Jie Yu <yu...@gmail.com> on 2012/07/16 23:40:12 UTC

Review Request: Fixed a bug in ProcessManager::cleanup() that will cause non-deterministic assertion failure (e.g. CoodinatorTest.Elect).

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/5992/
-----------------------------------------------------------

Review request for mesos, Benjamin Hindman and Vinod Kone.


Description
-------

The bug description can be found here:
https://issues.apache.org/jira/browse/MESOS-232

The cause of this bug is that the pointer "process" in ProcessManager::cleanup() is referenced in the CHECK statement after it has been cleaned up. The same memory chunk (originally pointed by "process") maybe reused by the memory allocator to store another process instance, and that process is put into the runq somehow.

To prove the above theory, I did a simple experiment. I changed the code of ProcessManager::cleanup() as follows:

void ProcessManager::cleanup(ProcessBase* process)
{
  ...
  // Possible gate non-libprocess threads are waiting at.
  Gate* gate = NULL;

  // ADDED BY JIE YU FOR EXPERIMENT.
  // Save the process id to comparison.
  UPID savedPid = process->pid;
  ...
  
  // Confirm process not in runq.
  synchronized (runq) {
    // ADDED BY JIE YU FOR EXPERIMENT.
    // Print saved and current pids if the assertion is going to fail.
    if (find(runq.begin(), runq.end(), process) != runq.end()) {
      std::cerr << savedPid << " " << process->pid << std::endl;
    }

    CHECK(find(runq.begin(), runq.end(), process) == runq.end());
  }
  ...
}

Now, here is the output when the assertion fails:

[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from CoordinatorTest
[ RUN      ] CoordinatorTest.Elect
__latch__(3258)@127.0.1.1:59292 __latch__(3262)@127.0.1.1:59292
F0716 14:03:54.896371 24575 process.cpp:2348] Check failed: find(runq.begin(), runq.end(), process) == runq.end()
*** Check failure stack trace: ***
    @     0x7ffff79c944d  google::LogMessage::Fail()
    @     0x7ffff79cbbdf  google::LogMessage::SendToLog()
    @     0x7ffff79c904b  google::LogMessage::Flush()
    @     0x7ffff79cc46d  google::LogMessageFatal::~LogMessageFatal()
    @     0x7ffff78cbd00  process::ProcessManager::cleanup()
    @     0x7ffff78cb527  process::ProcessManager::resume()
    @     0x7ffff78c3e9c  process::schedule()
    @     0x7ffff69aae9a  start_thread
    @     0x7ffff66d84bd  (unknown)

As can be seen, the savedPid is different from process->pid, which proved my theory.


Diffs
-----

  third_party/libprocess/src/process.cpp b560b7c 

Diff: https://reviews.apache.org/r/5992/diff/


Testing
-------

make check.


Thanks,

Jie Yu


Re: Review Request: Fixed a bug in ProcessManager::cleanup() that will cause non-deterministic assertion failure (e.g. CoodinatorTest.Elect).

Posted by Benjamin Hindman <be...@berkeley.edu>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/5992/#review9308
-----------------------------------------------------------

Ship it!


Ship It!

- Benjamin Hindman


On July 16, 2012, 9:40 p.m., Jie Yu wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/5992/
> -----------------------------------------------------------
> 
> (Updated July 16, 2012, 9:40 p.m.)
> 
> 
> Review request for mesos, Benjamin Hindman and Vinod Kone.
> 
> 
> Description
> -------
> 
> The bug description can be found here:
> https://issues.apache.org/jira/browse/MESOS-232
> 
> The cause of this bug is that the pointer "process" in ProcessManager::cleanup() is referenced in the CHECK statement after it has been cleaned up. The same memory chunk (originally pointed by "process") maybe reused by the memory allocator to store another process instance, and that process is put into the runq somehow.
> 
> To prove the above theory, I did a simple experiment. I changed the code of ProcessManager::cleanup() as follows:
> 
> void ProcessManager::cleanup(ProcessBase* process)
> {
>   ...
>   // Possible gate non-libprocess threads are waiting at.
>   Gate* gate = NULL;
> 
>   // ADDED BY JIE YU FOR EXPERIMENT.
>   // Save the process id to comparison.
>   UPID savedPid = process->pid;
>   ...
>   
>   // Confirm process not in runq.
>   synchronized (runq) {
>     // ADDED BY JIE YU FOR EXPERIMENT.
>     // Print saved and current pids if the assertion is going to fail.
>     if (find(runq.begin(), runq.end(), process) != runq.end()) {
>       std::cerr << savedPid << " " << process->pid << std::endl;
>     }
> 
>     CHECK(find(runq.begin(), runq.end(), process) == runq.end());
>   }
>   ...
> }
> 
> Now, here is the output when the assertion fails:
> 
> [==========] Running 1 test from 1 test case.
> [----------] Global test environment set-up.
> [----------] 1 test from CoordinatorTest
> [ RUN      ] CoordinatorTest.Elect
> __latch__(3258)@127.0.1.1:59292 __latch__(3262)@127.0.1.1:59292
> F0716 14:03:54.896371 24575 process.cpp:2348] Check failed: find(runq.begin(), runq.end(), process) == runq.end()
> *** Check failure stack trace: ***
>     @     0x7ffff79c944d  google::LogMessage::Fail()
>     @     0x7ffff79cbbdf  google::LogMessage::SendToLog()
>     @     0x7ffff79c904b  google::LogMessage::Flush()
>     @     0x7ffff79cc46d  google::LogMessageFatal::~LogMessageFatal()
>     @     0x7ffff78cbd00  process::ProcessManager::cleanup()
>     @     0x7ffff78cb527  process::ProcessManager::resume()
>     @     0x7ffff78c3e9c  process::schedule()
>     @     0x7ffff69aae9a  start_thread
>     @     0x7ffff66d84bd  (unknown)
> 
> As can be seen, the savedPid is different from process->pid, which proved my theory.
> 
> 
> Diffs
> -----
> 
>   third_party/libprocess/src/process.cpp b560b7c 
> 
> Diff: https://reviews.apache.org/r/5992/diff/
> 
> 
> Testing
> -------
> 
> make check.
> 
> 
> Thanks,
> 
> Jie Yu
> 
>