You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@mesos.apache.org by Jie Yu <yu...@gmail.com> on 2013/10/16 23:49:24 UTC

Review Request 14686: Catch-up Replicated Log 2: Fixed a Race Condition Caused by Passing Process Wrapper Pointers.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/14686/
-----------------------------------------------------------

Review request for mesos, Benjamin Hindman, Ben Mahler, and Vinod Kone.


Repository: mesos-git


Description
-------

This is the second patch of a series of patches that implement a catch-up mechanism for replicated log. This patch fixed a race condition caused by passing process wrapper pointers (e.g. Network, Replica, etc.).

In our code base, one commonly used routine is to use a wrapper class to wrap a libprocess so that we don't need to worry about whether we should directly call the function or should use a dispatch. For example:

class FooProcess : public Process<FooProcess> {
  int foo();
};

class Foo {
  Future<int> foo() { return dispatch(process, &FooProcess::foo); }
  ...
  FooProcess* process;
};

If multiple entities want to share a libprocess (e.g. FooProcess), currently, we pass a pointer of its wrapper (e.g. Foo*) to the users. However, this is problematic in some cases. The main reason is that we don't know when it is safe to destroy the libprocess (FooProcess) and delete the wrapper (Foo) if a pointer of this wrapper has been passed to some other entities in the system and we don't know whether they will use it or not later.

There is a race condition in our first patch. Here is the buggy interleaving:

T1                                   T2

appending.discard();
                                     ExplicitPromiseProcess::initialize() {
onDiscarded() {
  terminate ExplicitPromiseProcess
}

delete network;
                      
                                       ...
                                       network->broadcast(protocol::promise, request)
                                     }

(We added a new test CoordinatorTest.FillTimeout which can reproduce this bug and will segfault without this patch.)


To fix this problem, we introduce a concept called ProcessHandle to generalize process wrappers (see process.hpp). It internally maintains a shared pointer to the underlying libprocess. It is copyable. Multiple copies of the same handle share the same underlying libprocess. When the last copy is deleted, we will terminate the underlying process and delete the libprocess. If a user wants to early terminate the libprocess, he can explicitly call the terminate() function.

Now, creating a wrapper for a libprocess is simply as that:

class Foo : public ProcessHandle<FooProcess> { ... };

This change won't break the existing code (i.e., if you use the old way of passing wrapper pointers, this change is like a noop).

This is a joint work with Yan Xu.


Diffs
-----

  3rdparty/libprocess/include/process/process.hpp b502324 
  src/Makefile.am a2d8242 
  src/log/catchup.hpp PRE-CREATION 
  src/log/catchup.cpp PRE-CREATION 
  src/log/consensus.hpp PRE-CREATION 
  src/log/consensus.cpp PRE-CREATION 
  src/log/coordinator.hpp 3f6fb7c 
  src/log/coordinator.cpp 6e6466f 
  src/log/log.hpp 77edc7a 
  src/log/network.hpp d34cf78 
  src/log/replica.hpp d1f5ead 
  src/log/replica.cpp 59a6ff3 
  src/log/zookeeper.hpp PRE-CREATION 
  src/log/zookeeper.cpp PRE-CREATION 
  src/tests/log_tests.cpp ff5f86c 

Diff: https://reviews.apache.org/r/14686/diff/


Testing
-------

bin/mesos-tests.sh --gtest_filter=CoordinatorTest*:ReplicaTest*:LogTest* --gtest_repeat=100


Thanks,

Jie Yu


Re: Review Request 14686: Catch-up Replicated Log 2: Fixed a Race Condition Caused by Passing Process Wrapper Pointers.

Posted by Benjamin Hindman <be...@berkeley.edu>.

> On Oct. 19, 2013, 11:35 p.m., Benjamin Hindman wrote:
> > 3rdparty/libprocess/include/process/process.hpp, line 379
> > <https://reviews.apache.org/r/14686/diff/2/?file=365570#file365570line379>
> >
> >     This is great! A wonderful step in the evolution of libprocess!

Ignore this issue, my bad.


- Benjamin


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/14686/#review27222
-----------------------------------------------------------


On Oct. 17, 2013, 12:11 a.m., Jie Yu wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/14686/
> -----------------------------------------------------------
> 
> (Updated Oct. 17, 2013, 12:11 a.m.)
> 
> 
> Review request for mesos, Benjamin Hindman, Ben Mahler, and Vinod Kone.
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> This is the second patch of a series of patches that implement a catch-up mechanism for replicated log. This patch fixed a race condition caused by passing process wrapper pointers (e.g. Network, Replica, etc.).
> 
> In our code base, one commonly used routine is to use a wrapper class to wrap a libprocess so that we don't need to worry about whether we should directly call the function or should use a dispatch. For example:
> 
> class FooProcess : public Process<FooProcess> {
>   int foo();
> };
> 
> class Foo {
>   Future<int> foo() { return dispatch(process, &FooProcess::foo); }
>   ...
>   FooProcess* process;
> };
> 
> If multiple entities want to share a libprocess (e.g. FooProcess), currently, we pass a pointer of its wrapper (e.g. Foo*) to the users. However, this is problematic in some cases. The main reason is that we don't know when it is safe to destroy the libprocess (FooProcess) and delete the wrapper (Foo) if a pointer of this wrapper has been passed to some other entities in the system and we don't know whether they will use it or not later.
> 
> There is a race condition in our first patch. Here is the buggy interleaving:
> 
> T1                                   T2
> 
> appending.discard();
>                                      ExplicitPromiseProcess::initialize() {
> onDiscarded() {
>   terminate ExplicitPromiseProcess
> }
> 
> delete network;
>                       
>                                        ...
>                                        network->broadcast(protocol::promise, request)
>                                      }
> 
> (We added a new test CoordinatorTest.FillTimeout which can reproduce this bug and will segfault without this patch.)
> 
> 
> To fix this problem, we introduce a concept called ProcessHandle to generalize process wrappers (see process.hpp). It internally maintains a shared pointer to the underlying libprocess. It is copyable. Multiple copies of the same handle share the same underlying libprocess. When the last copy is deleted, we will terminate the underlying process and delete the libprocess. If a user wants to early terminate the libprocess, he can explicitly call the terminate() function.
> 
> Now, creating a wrapper for a libprocess is simply as that:
> 
> class Foo : public ProcessHandle<FooProcess> { ... };
> 
> This change won't break the existing code (i.e., if you use the old way of passing wrapper pointers, this change is like a noop).
> 
> This is a joint work with Yan Xu.
> 
> 
> Diffs
> -----
> 
>   3rdparty/libprocess/include/process/process.hpp b502324 
>   src/Makefile.am a2d8242 
>   src/log/catchup.hpp PRE-CREATION 
>   src/log/catchup.cpp PRE-CREATION 
>   src/log/consensus.hpp PRE-CREATION 
>   src/log/consensus.cpp PRE-CREATION 
>   src/log/coordinator.hpp 3f6fb7c 
>   src/log/coordinator.cpp 6e6466f 
>   src/log/log.hpp 77edc7a 
>   src/log/network.hpp d34cf78 
>   src/log/replica.hpp d1f5ead 
>   src/log/replica.cpp 59a6ff3 
>   src/log/zookeeper.hpp PRE-CREATION 
>   src/log/zookeeper.cpp PRE-CREATION 
>   src/tests/log_tests.cpp ff5f86c 
> 
> Diff: https://reviews.apache.org/r/14686/diff/
> 
> 
> Testing
> -------
> 
> bin/mesos-tests.sh --gtest_filter=CoordinatorTest*:ReplicaTest*:LogTest* --gtest_repeat=100
> 
> 
> Thanks,
> 
> Jie Yu
> 
>


Re: Review Request 14686: Catch-up Replicated Log 2: Fixed a Race Condition Caused by Passing Process Wrapper Pointers.

Posted by Benjamin Hindman <be...@berkeley.edu>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/14686/#review27222
-----------------------------------------------------------


Can you pull the libprocess stuff out of this review and make it it's own review that also includes tests please!? ;)


3rdparty/libprocess/include/process/process.hpp
<https://reviews.apache.org/r/14686/#comment53007>

    This is great! A wonderful step in the evolution of libprocess!



3rdparty/libprocess/include/process/process.hpp
<https://reviews.apache.org/r/14686/#comment53009>

    Can't callers just do 'ProcessHandleBase(new MyProcess(...))' rather than having to have the 'Base' typedef? This is more inline with how we did it with Process<T> and ProcessBase (e.g., for passing a specific ID).



3rdparty/libprocess/include/process/process.hpp
<https://reviews.apache.org/r/14686/#comment53008>

    Self seems a bit weird, when I looked at how it was used below I think I'd prefer people just use the explicit process type.


- Benjamin Hindman


On Oct. 17, 2013, 12:11 a.m., Jie Yu wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/14686/
> -----------------------------------------------------------
> 
> (Updated Oct. 17, 2013, 12:11 a.m.)
> 
> 
> Review request for mesos, Benjamin Hindman, Ben Mahler, and Vinod Kone.
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> This is the second patch of a series of patches that implement a catch-up mechanism for replicated log. This patch fixed a race condition caused by passing process wrapper pointers (e.g. Network, Replica, etc.).
> 
> In our code base, one commonly used routine is to use a wrapper class to wrap a libprocess so that we don't need to worry about whether we should directly call the function or should use a dispatch. For example:
> 
> class FooProcess : public Process<FooProcess> {
>   int foo();
> };
> 
> class Foo {
>   Future<int> foo() { return dispatch(process, &FooProcess::foo); }
>   ...
>   FooProcess* process;
> };
> 
> If multiple entities want to share a libprocess (e.g. FooProcess), currently, we pass a pointer of its wrapper (e.g. Foo*) to the users. However, this is problematic in some cases. The main reason is that we don't know when it is safe to destroy the libprocess (FooProcess) and delete the wrapper (Foo) if a pointer of this wrapper has been passed to some other entities in the system and we don't know whether they will use it or not later.
> 
> There is a race condition in our first patch. Here is the buggy interleaving:
> 
> T1                                   T2
> 
> appending.discard();
>                                      ExplicitPromiseProcess::initialize() {
> onDiscarded() {
>   terminate ExplicitPromiseProcess
> }
> 
> delete network;
>                       
>                                        ...
>                                        network->broadcast(protocol::promise, request)
>                                      }
> 
> (We added a new test CoordinatorTest.FillTimeout which can reproduce this bug and will segfault without this patch.)
> 
> 
> To fix this problem, we introduce a concept called ProcessHandle to generalize process wrappers (see process.hpp). It internally maintains a shared pointer to the underlying libprocess. It is copyable. Multiple copies of the same handle share the same underlying libprocess. When the last copy is deleted, we will terminate the underlying process and delete the libprocess. If a user wants to early terminate the libprocess, he can explicitly call the terminate() function.
> 
> Now, creating a wrapper for a libprocess is simply as that:
> 
> class Foo : public ProcessHandle<FooProcess> { ... };
> 
> This change won't break the existing code (i.e., if you use the old way of passing wrapper pointers, this change is like a noop).
> 
> This is a joint work with Yan Xu.
> 
> 
> Diffs
> -----
> 
>   3rdparty/libprocess/include/process/process.hpp b502324 
>   src/Makefile.am a2d8242 
>   src/log/catchup.hpp PRE-CREATION 
>   src/log/catchup.cpp PRE-CREATION 
>   src/log/consensus.hpp PRE-CREATION 
>   src/log/consensus.cpp PRE-CREATION 
>   src/log/coordinator.hpp 3f6fb7c 
>   src/log/coordinator.cpp 6e6466f 
>   src/log/log.hpp 77edc7a 
>   src/log/network.hpp d34cf78 
>   src/log/replica.hpp d1f5ead 
>   src/log/replica.cpp 59a6ff3 
>   src/log/zookeeper.hpp PRE-CREATION 
>   src/log/zookeeper.cpp PRE-CREATION 
>   src/tests/log_tests.cpp ff5f86c 
> 
> Diff: https://reviews.apache.org/r/14686/diff/
> 
> 
> Testing
> -------
> 
> bin/mesos-tests.sh --gtest_filter=CoordinatorTest*:ReplicaTest*:LogTest* --gtest_repeat=100
> 
> 
> Thanks,
> 
> Jie Yu
> 
>


Re: Review Request 14686: Catch-up Replicated Log 2: Fixed a Race Condition Caused by Passing Process Wrapper Pointers.

Posted by Jie Yu <yu...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/14686/
-----------------------------------------------------------

(Updated Oct. 17, 2013, 12:11 a.m.)


Review request for mesos, Benjamin Hindman, Ben Mahler, and Vinod Kone.


Changes
-------

Removed unneeded destructors.


Repository: mesos-git


Description
-------

This is the second patch of a series of patches that implement a catch-up mechanism for replicated log. This patch fixed a race condition caused by passing process wrapper pointers (e.g. Network, Replica, etc.).

In our code base, one commonly used routine is to use a wrapper class to wrap a libprocess so that we don't need to worry about whether we should directly call the function or should use a dispatch. For example:

class FooProcess : public Process<FooProcess> {
  int foo();
};

class Foo {
  Future<int> foo() { return dispatch(process, &FooProcess::foo); }
  ...
  FooProcess* process;
};

If multiple entities want to share a libprocess (e.g. FooProcess), currently, we pass a pointer of its wrapper (e.g. Foo*) to the users. However, this is problematic in some cases. The main reason is that we don't know when it is safe to destroy the libprocess (FooProcess) and delete the wrapper (Foo) if a pointer of this wrapper has been passed to some other entities in the system and we don't know whether they will use it or not later.

There is a race condition in our first patch. Here is the buggy interleaving:

T1                                   T2

appending.discard();
                                     ExplicitPromiseProcess::initialize() {
onDiscarded() {
  terminate ExplicitPromiseProcess
}

delete network;
                      
                                       ...
                                       network->broadcast(protocol::promise, request)
                                     }

(We added a new test CoordinatorTest.FillTimeout which can reproduce this bug and will segfault without this patch.)


To fix this problem, we introduce a concept called ProcessHandle to generalize process wrappers (see process.hpp). It internally maintains a shared pointer to the underlying libprocess. It is copyable. Multiple copies of the same handle share the same underlying libprocess. When the last copy is deleted, we will terminate the underlying process and delete the libprocess. If a user wants to early terminate the libprocess, he can explicitly call the terminate() function.

Now, creating a wrapper for a libprocess is simply as that:

class Foo : public ProcessHandle<FooProcess> { ... };

This change won't break the existing code (i.e., if you use the old way of passing wrapper pointers, this change is like a noop).

This is a joint work with Yan Xu.


Diffs (updated)
-----

  3rdparty/libprocess/include/process/process.hpp b502324 
  src/Makefile.am a2d8242 
  src/log/catchup.hpp PRE-CREATION 
  src/log/catchup.cpp PRE-CREATION 
  src/log/consensus.hpp PRE-CREATION 
  src/log/consensus.cpp PRE-CREATION 
  src/log/coordinator.hpp 3f6fb7c 
  src/log/coordinator.cpp 6e6466f 
  src/log/log.hpp 77edc7a 
  src/log/network.hpp d34cf78 
  src/log/replica.hpp d1f5ead 
  src/log/replica.cpp 59a6ff3 
  src/log/zookeeper.hpp PRE-CREATION 
  src/log/zookeeper.cpp PRE-CREATION 
  src/tests/log_tests.cpp ff5f86c 

Diff: https://reviews.apache.org/r/14686/diff/


Testing
-------

bin/mesos-tests.sh --gtest_filter=CoordinatorTest*:ReplicaTest*:LogTest* --gtest_repeat=100


Thanks,

Jie Yu