You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@mesos.apache.org by Jiang Yan Xu <ya...@jxu.me> on 2013/11/20 01:05:44 UTC

Review Request 15706: Fixed Group to retry when authentication failed due to retryable errors.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/15706/
-----------------------------------------------------------

Review request for mesos, Benjamin Hindman, Ben Mahler, Ian Downes, Jie Yu, and Vinod Kone.


Bugs: MESOS-814
    https://issues.apache.org/jira/browse/MESOS-814


Repository: mesos-git


Description
-------

See summary.


Diffs
-----

  src/zookeeper/group.hpp 04068e357cec95457d1f24c166d0b60f86d997d2 
  src/zookeeper/group.cpp 12c781b29f4300ca8a29660adc3f1e55e03d5d04 

Diff: https://reviews.apache.org/r/15706/diff/


Testing
-------

make check & mesos-tests.sh --gtest_filter=GroupTest*:ZooKeeperTest*:ZooKeeperMasterContenderDetectorTest* with high iterations.

This fix is for a problem not easy to expose through unit tests so no new tests were written.


Thanks,

Jiang Yan Xu


Re: Review Request 15706: Fixed Group to retry when authentication failed due to retryable errors.

Posted by Ben Mahler <be...@gmail.com>.

> On Nov. 26, 2013, 3:58 a.m., Ben Mahler wrote:
> > src/tests/group_tests.cpp, lines 334-336
> > <https://reviews.apache.org/r/15706/diff/3/?file=390857#file390857line334>
> >
> >     Is this guaranteed to cause retries?
> 
> Jiang Yan Xu wrote:
>     Likely... and I have read logs to see the retries of authenticate() and create() *always* get executed.
>     
>     This test is testing that code's robustness can tolerate many expirations (one type of retryable errors that we can control) but with authenticate() and create() not being called through dispatch() it is hard to control the exact timing. I think high iteration without flakiness should be good proof.

If it's likely and not guaranteed, let's explicitly call out that this test is just attempting to probabilistically trigger the retries.


- Ben


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/15706/#review29414
-----------------------------------------------------------


On Nov. 28, 2013, 8:52 a.m., Jiang Yan Xu wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/15706/
> -----------------------------------------------------------
> 
> (Updated Nov. 28, 2013, 8:52 a.m.)
> 
> 
> Review request for mesos, Benjamin Hindman, Ben Mahler, and Vinod Kone.
> 
> 
> Bugs: MESOS-814
>     https://issues.apache.org/jira/browse/MESOS-814
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> See summary.
> 
> 
> Diffs
> -----
> 
>   src/tests/group_tests.cpp fb4c2f103ab7436b0fa2d6276c1ec1011a9405f0 
>   src/zookeeper/group.hpp 04068e357cec95457d1f24c166d0b60f86d997d2 
>   src/zookeeper/group.cpp 12c781b29f4300ca8a29660adc3f1e55e03d5d04 
> 
> Diff: https://reviews.apache.org/r/15706/diff/
> 
> 
> Testing
> -------
> 
> make check & mesos-tests.sh --gtest_filter=GroupTest*:ZooKeeperTest*:ZooKeeperMasterContenderDetectorTest* with 100 iterations.
> 
> Added a test to exercise authenticate() and create() retries.
> 
> 
> Thanks,
> 
> Jiang Yan Xu
> 
>


Re: Review Request 15706: Fixed Group to retry when authentication failed due to retryable errors.

Posted by Jiang Yan Xu <ya...@jxu.me>.

> On Nov. 26, 2013, 3:58 a.m., Ben Mahler wrote:
> > src/zookeeper/group.cpp, line 350
> > <https://reviews.apache.org/r/15706/diff/3/?file=390859#file390859line350>
> >
> >     Ditto about adding braces.
> >     
> >     What if we're in connecting? Should this instead be a CHECK that we're in AUTHENTICATED?

Ditto. Added CHECK.


> On Nov. 26, 2013, 3:58 a.m., Ben Mahler wrote:
> > src/tests/group_tests.cpp, lines 334-336
> > <https://reviews.apache.org/r/15706/diff/3/?file=390857#file390857line334>
> >
> >     Is this guaranteed to cause retries?

Likely... and I have read logs to see the retries of authenticate() and create() *always* get executed.

This test is testing that code's robustness can tolerate many expirations (one type of retryable errors that we can control) but with authenticate() and create() not being called through dispatch() it is hard to control the exact timing. I think high iteration without flakiness should be good proof.


> On Nov. 26, 2013, 3:58 a.m., Ben Mahler wrote:
> > src/zookeeper/group.cpp, line 321
> > <https://reviews.apache.org/r/15706/diff/3/?file=390859#file390859line321>
> >
> >     Can you CHECK that state is ready here before you call sync()?
> >     
> >     Since this ignores the result of sync(), the assumption here is that if sync returns false, we must be imminently getting a callback from ZK, does this assumption hold? Can you add a note?

About the CHECK:
It's actually not true. The status might still be CONNECTING because during the execution of the initial connected() the network might go down and it might come back soon enough that connected(reconnect=true) is called instead of expired(). In this case it's reconnected but the group setup is not finished yet thus not READY.

About sync():
OK I actually think I overlooked this one. I should need to check the return value because the the errors are retryable (e.g. operation timed out) we might not be notified by ZK. However if authenticate, create or cache fails due to non-retryable error should have aborted and will be notified. (If joins, datas, cancels fail in such a way no retry/group abort will follow).

I rewrote GroupProcess::connected.


> On Nov. 26, 2013, 3:58 a.m., Ben Mahler wrote:
> > src/zookeeper/group.cpp, line 327
> > <https://reviews.apache.org/r/15706/diff/3/?file=390859#file390859line327>
> >
> >     Can you add braces?
> >     
> >     if (...) {
> >       return true;
> >     }
> >     
> >     Should this instead be a CHECK that we're in CONNECTING?

In this review, if a retry() is already timed out after the delay and enqueued when expire() happens, it cannot be cancelled. Thus it can be executed before connected() is called and in connected() when authenticate() and create() are invoked they MIGHT already unnecessary depending on the result of retry().

But now that I have refactored the code so that connected() and authenticate() are only called from one place and within if branches, I can add CHECKs now.


- Jiang Yan


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/15706/#review29414
-----------------------------------------------------------


On Nov. 28, 2013, 8:52 a.m., Jiang Yan Xu wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/15706/
> -----------------------------------------------------------
> 
> (Updated Nov. 28, 2013, 8:52 a.m.)
> 
> 
> Review request for mesos, Benjamin Hindman, Ben Mahler, and Vinod Kone.
> 
> 
> Bugs: MESOS-814
>     https://issues.apache.org/jira/browse/MESOS-814
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> See summary.
> 
> 
> Diffs
> -----
> 
>   src/tests/group_tests.cpp fb4c2f103ab7436b0fa2d6276c1ec1011a9405f0 
>   src/zookeeper/group.hpp 04068e357cec95457d1f24c166d0b60f86d997d2 
>   src/zookeeper/group.cpp 12c781b29f4300ca8a29660adc3f1e55e03d5d04 
> 
> Diff: https://reviews.apache.org/r/15706/diff/
> 
> 
> Testing
> -------
> 
> make check & mesos-tests.sh --gtest_filter=GroupTest*:ZooKeeperTest*:ZooKeeperMasterContenderDetectorTest* with 100 iterations.
> 
> Added a test to exercise authenticate() and create() retries.
> 
> 
> Thanks,
> 
> Jiang Yan Xu
> 
>


Re: Review Request 15706: Fixed Group to retry when authentication failed due to retryable errors.

Posted by Ben Mahler <be...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/15706/#review29414
-----------------------------------------------------------



src/tests/group_tests.cpp
<https://reviews.apache.org/r/15706/#comment56645>

    Is this guaranteed to cause retries?



src/zookeeper/group.cpp
<https://reviews.apache.org/r/15706/#comment56646>

    Can you CHECK that state is ready here before you call sync()?
    
    Since this ignores the result of sync(), the assumption here is that if sync returns false, we must be imminently getting a callback from ZK, does this assumption hold? Can you add a note?



src/zookeeper/group.cpp
<https://reviews.apache.org/r/15706/#comment56641>

    Can you add braces?
    
    if (...) {
      return true;
    }
    
    Should this instead be a CHECK that we're in CONNECTING?



src/zookeeper/group.cpp
<https://reviews.apache.org/r/15706/#comment56642>

    Ditto about adding braces.
    
    What if we're in connecting? Should this instead be a CHECK that we're in AUTHENTICATED?



src/zookeeper/group.cpp
<https://reviews.apache.org/r/15706/#comment56647>

    We may want CHECK_EQ for these kinds of CHECKs, so that a failure prints the actual state.



src/zookeeper/group.cpp
<https://reviews.apache.org/r/15706/#comment56643>

    "if needed" or "if not already authenticated"?



src/zookeeper/group.cpp
<https://reviews.apache.org/r/15706/#comment56644>

    if not already created?


- Ben Mahler


On Nov. 25, 2013, 7:59 p.m., Jiang Yan Xu wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/15706/
> -----------------------------------------------------------
> 
> (Updated Nov. 25, 2013, 7:59 p.m.)
> 
> 
> Review request for mesos, Benjamin Hindman, Ben Mahler, and Vinod Kone.
> 
> 
> Bugs: MESOS-814
>     https://issues.apache.org/jira/browse/MESOS-814
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> See summary.
> 
> 
> Diffs
> -----
> 
>   src/tests/group_tests.cpp fb4c2f103ab7436b0fa2d6276c1ec1011a9405f0 
>   src/zookeeper/group.hpp 04068e357cec95457d1f24c166d0b60f86d997d2 
>   src/zookeeper/group.cpp 12c781b29f4300ca8a29660adc3f1e55e03d5d04 
> 
> Diff: https://reviews.apache.org/r/15706/diff/
> 
> 
> Testing
> -------
> 
> make check & mesos-tests.sh --gtest_filter=GroupTest*:ZooKeeperTest*:ZooKeeperMasterContenderDetectorTest* with high iterations.
> 
> Added a test to exercise authenticate() and create() retries.
> 
> 
> Thanks,
> 
> Jiang Yan Xu
> 
>


Re: Review Request 15706: Fixed Group to retry when authentication failed due to retryable errors.

Posted by Ben Mahler <be...@gmail.com>.

> On Dec. 3, 2013, 7:08 p.m., Ben Mahler wrote:
> > src/zookeeper/group.cpp, line 472
> > <https://reviews.apache.org/r/15706/diff/6/?file=392957#file392957line472>
> >
> >     Maybe you'll want this CHECK above where you call cache as well?
> 
> Jiang Yan Xu wrote:
>     updated() is answering a ZK event at which time memberships can be Some() or None(). It is set to None() at the first line in cache().

I'm referring to the call to cache() in watch(), do you want the same consistency check (CHECK(memberships.isNone())) there when the cache operation returns false?


- Ben


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/15706/#review29666
-----------------------------------------------------------


On Dec. 3, 2013, 7:50 p.m., Jiang Yan Xu wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/15706/
> -----------------------------------------------------------
> 
> (Updated Dec. 3, 2013, 7:50 p.m.)
> 
> 
> Review request for mesos, Benjamin Hindman, Ben Mahler, and Vinod Kone.
> 
> 
> Bugs: MESOS-814
>     https://issues.apache.org/jira/browse/MESOS-814
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> See summary.
> 
> 
> Diffs
> -----
> 
>   src/tests/group_tests.cpp fb4c2f103ab7436b0fa2d6276c1ec1011a9405f0 
>   src/zookeeper/group.hpp 04068e357cec95457d1f24c166d0b60f86d997d2 
>   src/zookeeper/group.cpp 5218286bb6ab989ba16cead1483851cc8adba5ca 
> 
> Diff: https://reviews.apache.org/r/15706/diff/
> 
> 
> Testing
> -------
> 
> make check & mesos-tests.sh --gtest_filter=GroupTest*:ZooKeeperTest*:ZooKeeperMasterContenderDetectorTest* with 100 iterations.
> 
> Added a test to exercise authenticate() and create() retries.
> 
> 
> Thanks,
> 
> Jiang Yan Xu
> 
>


Re: Review Request 15706: Fixed Group to retry when authentication failed due to retryable errors.

Posted by Jiang Yan Xu <ya...@jxu.me>.

> On Dec. 3, 2013, 7:08 p.m., Ben Mahler wrote:
> > src/zookeeper/group.cpp, line 472
> > <https://reviews.apache.org/r/15706/diff/6/?file=392957#file392957line472>
> >
> >     Maybe you'll want this CHECK above where you call cache as well?

updated() is answering a ZK event at which time memberships can be Some() or None(). It is set to None() at the first line in cache().


- Jiang Yan


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/15706/#review29666
-----------------------------------------------------------


On Dec. 3, 2013, 7:50 p.m., Jiang Yan Xu wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/15706/
> -----------------------------------------------------------
> 
> (Updated Dec. 3, 2013, 7:50 p.m.)
> 
> 
> Review request for mesos, Benjamin Hindman, Ben Mahler, and Vinod Kone.
> 
> 
> Bugs: MESOS-814
>     https://issues.apache.org/jira/browse/MESOS-814
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> See summary.
> 
> 
> Diffs
> -----
> 
>   src/tests/group_tests.cpp fb4c2f103ab7436b0fa2d6276c1ec1011a9405f0 
>   src/zookeeper/group.hpp 04068e357cec95457d1f24c166d0b60f86d997d2 
>   src/zookeeper/group.cpp 5218286bb6ab989ba16cead1483851cc8adba5ca 
> 
> Diff: https://reviews.apache.org/r/15706/diff/
> 
> 
> Testing
> -------
> 
> make check & mesos-tests.sh --gtest_filter=GroupTest*:ZooKeeperTest*:ZooKeeperMasterContenderDetectorTest* with 100 iterations.
> 
> Added a test to exercise authenticate() and create() retries.
> 
> 
> Thanks,
> 
> Jiang Yan Xu
> 
>


Re: Review Request 15706: Fixed Group to retry when authentication failed due to retryable errors.

Posted by Ben Mahler <be...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/15706/#review29666
-----------------------------------------------------------

Ship it!


Looks good to me! Just a few cleanup bits below.


src/zookeeper/group.cpp
<https://reviews.apache.org/r/15706/#comment57093>

    Shouldn't we be using the cache return value instead like you did below?
    
    } else if (!cached.get()) {
    
    This is the explicit indication of retry and we have a nice CHECK_SOME(memberships) below.



src/zookeeper/group.cpp
<https://reviews.apache.org/r/15706/#comment57094>

    I think you want this comment above the sync() call.



src/zookeeper/group.cpp
<https://reviews.apache.org/r/15706/#comment57095>

    Maybe you'll want this CHECK above where you call cache as well?



src/zookeeper/group.cpp
<https://reviews.apache.org/r/15706/#comment57096>

    This one can likely stay as is since count on a map is only ever 1 or 0, we're essentially just doing a contains() operation. (We don't have a hashmap so we have to use count() instead).


- Ben Mahler


On Dec. 3, 2013, 5:01 a.m., Jiang Yan Xu wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/15706/
> -----------------------------------------------------------
> 
> (Updated Dec. 3, 2013, 5:01 a.m.)
> 
> 
> Review request for mesos, Benjamin Hindman, Ben Mahler, and Vinod Kone.
> 
> 
> Bugs: MESOS-814
>     https://issues.apache.org/jira/browse/MESOS-814
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> See summary.
> 
> 
> Diffs
> -----
> 
>   src/tests/group_tests.cpp fb4c2f103ab7436b0fa2d6276c1ec1011a9405f0 
>   src/zookeeper/group.hpp 04068e357cec95457d1f24c166d0b60f86d997d2 
>   src/zookeeper/group.cpp 5218286bb6ab989ba16cead1483851cc8adba5ca 
> 
> Diff: https://reviews.apache.org/r/15706/diff/
> 
> 
> Testing
> -------
> 
> make check & mesos-tests.sh --gtest_filter=GroupTest*:ZooKeeperTest*:ZooKeeperMasterContenderDetectorTest* with 100 iterations.
> 
> Added a test to exercise authenticate() and create() retries.
> 
> 
> Thanks,
> 
> Jiang Yan Xu
> 
>


Re: Review Request 15706: Fixed Group to retry when authentication failed due to retryable errors.

Posted by Ben Mahler <be...@gmail.com>.

> On Dec. 3, 2013, 8:28 p.m., Ben Mahler wrote:
> > Ship It!

Committed.


- Ben


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/15706/#review29675
-----------------------------------------------------------


On Dec. 3, 2013, 8 p.m., Jiang Yan Xu wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/15706/
> -----------------------------------------------------------
> 
> (Updated Dec. 3, 2013, 8 p.m.)
> 
> 
> Review request for mesos, Benjamin Hindman, Ben Mahler, and Vinod Kone.
> 
> 
> Bugs: MESOS-814
>     https://issues.apache.org/jira/browse/MESOS-814
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> See summary.
> 
> 
> Diffs
> -----
> 
>   src/tests/group_tests.cpp fb4c2f103ab7436b0fa2d6276c1ec1011a9405f0 
>   src/zookeeper/group.hpp 04068e357cec95457d1f24c166d0b60f86d997d2 
>   src/zookeeper/group.cpp 5218286bb6ab989ba16cead1483851cc8adba5ca 
> 
> Diff: https://reviews.apache.org/r/15706/diff/
> 
> 
> Testing
> -------
> 
> make check & mesos-tests.sh --gtest_filter=GroupTest*:ZooKeeperTest*:ZooKeeperMasterContenderDetectorTest* with 100 iterations.
> 
> Added a test to exercise authenticate() and create() retries.
> 
> 
> Thanks,
> 
> Jiang Yan Xu
> 
>


Re: Review Request 15706: Fixed Group to retry when authentication failed due to retryable errors.

Posted by Ben Mahler <be...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/15706/#review29675
-----------------------------------------------------------

Ship it!


Ship It!

- Ben Mahler


On Dec. 3, 2013, 8 p.m., Jiang Yan Xu wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/15706/
> -----------------------------------------------------------
> 
> (Updated Dec. 3, 2013, 8 p.m.)
> 
> 
> Review request for mesos, Benjamin Hindman, Ben Mahler, and Vinod Kone.
> 
> 
> Bugs: MESOS-814
>     https://issues.apache.org/jira/browse/MESOS-814
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> See summary.
> 
> 
> Diffs
> -----
> 
>   src/tests/group_tests.cpp fb4c2f103ab7436b0fa2d6276c1ec1011a9405f0 
>   src/zookeeper/group.hpp 04068e357cec95457d1f24c166d0b60f86d997d2 
>   src/zookeeper/group.cpp 5218286bb6ab989ba16cead1483851cc8adba5ca 
> 
> Diff: https://reviews.apache.org/r/15706/diff/
> 
> 
> Testing
> -------
> 
> make check & mesos-tests.sh --gtest_filter=GroupTest*:ZooKeeperTest*:ZooKeeperMasterContenderDetectorTest* with 100 iterations.
> 
> Added a test to exercise authenticate() and create() retries.
> 
> 
> Thanks,
> 
> Jiang Yan Xu
> 
>


Re: Review Request 15706: Fixed Group to retry when authentication failed due to retryable errors.

Posted by Jiang Yan Xu <ya...@jxu.me>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/15706/
-----------------------------------------------------------

(Updated Dec. 3, 2013, 8 p.m.)


Review request for mesos, Benjamin Hindman, Ben Mahler, and Vinod Kone.


Bugs: MESOS-814
    https://issues.apache.org/jira/browse/MESOS-814


Repository: mesos-git


Description
-------

See summary.


Diffs (updated)
-----

  src/tests/group_tests.cpp fb4c2f103ab7436b0fa2d6276c1ec1011a9405f0 
  src/zookeeper/group.hpp 04068e357cec95457d1f24c166d0b60f86d997d2 
  src/zookeeper/group.cpp 5218286bb6ab989ba16cead1483851cc8adba5ca 

Diff: https://reviews.apache.org/r/15706/diff/


Testing
-------

make check & mesos-tests.sh --gtest_filter=GroupTest*:ZooKeeperTest*:ZooKeeperMasterContenderDetectorTest* with 100 iterations.

Added a test to exercise authenticate() and create() retries.


Thanks,

Jiang Yan Xu


Re: Review Request 15706: Fixed Group to retry when authentication failed due to retryable errors.

Posted by Jiang Yan Xu <ya...@jxu.me>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/15706/
-----------------------------------------------------------

(Updated Dec. 3, 2013, 7:50 p.m.)


Review request for mesos, Benjamin Hindman, Ben Mahler, and Vinod Kone.


Changes
-------

BenM's


Bugs: MESOS-814
    https://issues.apache.org/jira/browse/MESOS-814


Repository: mesos-git


Description
-------

See summary.


Diffs (updated)
-----

  src/tests/group_tests.cpp fb4c2f103ab7436b0fa2d6276c1ec1011a9405f0 
  src/zookeeper/group.hpp 04068e357cec95457d1f24c166d0b60f86d997d2 
  src/zookeeper/group.cpp 5218286bb6ab989ba16cead1483851cc8adba5ca 

Diff: https://reviews.apache.org/r/15706/diff/


Testing
-------

make check & mesos-tests.sh --gtest_filter=GroupTest*:ZooKeeperTest*:ZooKeeperMasterContenderDetectorTest* with 100 iterations.

Added a test to exercise authenticate() and create() retries.


Thanks,

Jiang Yan Xu


Re: Review Request 15706: Fixed Group to retry when authentication failed due to retryable errors.

Posted by Jiang Yan Xu <ya...@jxu.me>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/15706/
-----------------------------------------------------------

(Updated Dec. 3, 2013, 5:01 a.m.)


Review request for mesos, Benjamin Hindman, Ben Mahler, and Vinod Kone.


Changes
-------

BenM's


Bugs: MESOS-814
    https://issues.apache.org/jira/browse/MESOS-814


Repository: mesos-git


Description
-------

See summary.


Diffs (updated)
-----

  src/tests/group_tests.cpp fb4c2f103ab7436b0fa2d6276c1ec1011a9405f0 
  src/zookeeper/group.hpp 04068e357cec95457d1f24c166d0b60f86d997d2 
  src/zookeeper/group.cpp 5218286bb6ab989ba16cead1483851cc8adba5ca 

Diff: https://reviews.apache.org/r/15706/diff/


Testing
-------

make check & mesos-tests.sh --gtest_filter=GroupTest*:ZooKeeperTest*:ZooKeeperMasterContenderDetectorTest* with 100 iterations.

Added a test to exercise authenticate() and create() retries.


Thanks,

Jiang Yan Xu


Re: Review Request 15706: Fixed Group to retry when authentication failed due to retryable errors.

Posted by Jiang Yan Xu <ya...@jxu.me>.

> On Dec. 2, 2013, 11:36 p.m., Ben Mahler wrote:
> > src/zookeeper/group.cpp, lines 279-280
> > <https://reviews.apache.org/r/15706/diff/4/?file=392431#file392431line279>
> >
> >     Is printing the pid useful?
> >     
> >     s/)/) /
> >     s/to/ to/

It's originally there and I think now that detector & contender use different groups, this might help differentiate them.


- Jiang Yan


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/15706/#review29615
-----------------------------------------------------------


On Dec. 3, 2013, 5:01 a.m., Jiang Yan Xu wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/15706/
> -----------------------------------------------------------
> 
> (Updated Dec. 3, 2013, 5:01 a.m.)
> 
> 
> Review request for mesos, Benjamin Hindman, Ben Mahler, and Vinod Kone.
> 
> 
> Bugs: MESOS-814
>     https://issues.apache.org/jira/browse/MESOS-814
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> See summary.
> 
> 
> Diffs
> -----
> 
>   src/tests/group_tests.cpp fb4c2f103ab7436b0fa2d6276c1ec1011a9405f0 
>   src/zookeeper/group.hpp 04068e357cec95457d1f24c166d0b60f86d997d2 
>   src/zookeeper/group.cpp 5218286bb6ab989ba16cead1483851cc8adba5ca 
> 
> Diff: https://reviews.apache.org/r/15706/diff/
> 
> 
> Testing
> -------
> 
> make check & mesos-tests.sh --gtest_filter=GroupTest*:ZooKeeperTest*:ZooKeeperMasterContenderDetectorTest* with 100 iterations.
> 
> Added a test to exercise authenticate() and create() retries.
> 
> 
> Thanks,
> 
> Jiang Yan Xu
> 
>


Re: Review Request 15706: Fixed Group to retry when authentication failed due to retryable errors.

Posted by Ben Mahler <be...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/15706/#review29615
-----------------------------------------------------------



src/zookeeper/group.cpp
<https://reviews.apache.org/r/15706/#comment57028>

    Is printing the pid useful?
    
    s/)/) /
    s/to/ to/



src/zookeeper/group.cpp
<https://reviews.apache.org/r/15706/#comment57019>

    GroupProcess::retry calls sync without ensuring we're in a CONNECTING state, is that ok?



src/zookeeper/group.cpp
<https://reviews.apache.org/r/15706/#comment57015>

    Ah, so now this is possible because we set retrying to false?
    
    Is there any harm to eliminating setting it to false in expired() and reconnecting()?
    
    Before retrying was used as a guard against duplicate retries, and only set false inside retry(), which is really easy to think about and allows you to have a CHECK(retrying) here, right?
    
    Now retrying is used as a cancellation mechanism, but an imperfect one because we could trigger a second retry (setting retrying=true) before the first retry executes and realizes retrying is false. In this case, the first one would win if successful. If unsuccessful, both will continue to sync() and backoff.



src/zookeeper/group.cpp
<https://reviews.apache.org/r/15706/#comment57018>

    Would it be more clear if sync() returned a Try to differentiate retry-able errors?
    
    This would be consistent with the rest of the code here and would avoid having us look at state to figure this out. Checking state != CONNECTING seems to be a very implicit way to check for non-recoverable error?


- Ben Mahler


On Dec. 2, 2013, 10:39 p.m., Jiang Yan Xu wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/15706/
> -----------------------------------------------------------
> 
> (Updated Dec. 2, 2013, 10:39 p.m.)
> 
> 
> Review request for mesos, Benjamin Hindman, Ben Mahler, and Vinod Kone.
> 
> 
> Bugs: MESOS-814
>     https://issues.apache.org/jira/browse/MESOS-814
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> See summary.
> 
> 
> Diffs
> -----
> 
>   src/tests/group_tests.cpp fb4c2f103ab7436b0fa2d6276c1ec1011a9405f0 
>   src/zookeeper/group.hpp 04068e357cec95457d1f24c166d0b60f86d997d2 
>   src/zookeeper/group.cpp 5218286bb6ab989ba16cead1483851cc8adba5ca 
> 
> Diff: https://reviews.apache.org/r/15706/diff/
> 
> 
> Testing
> -------
> 
> make check & mesos-tests.sh --gtest_filter=GroupTest*:ZooKeeperTest*:ZooKeeperMasterContenderDetectorTest* with 100 iterations.
> 
> Added a test to exercise authenticate() and create() retries.
> 
> 
> Thanks,
> 
> Jiang Yan Xu
> 
>


Re: Review Request 15706: Fixed Group to retry when authentication failed due to retryable errors.

Posted by Jiang Yan Xu <ya...@jxu.me>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/15706/
-----------------------------------------------------------

(Updated Dec. 2, 2013, 10:39 p.m.)


Review request for mesos, Benjamin Hindman, Ben Mahler, and Vinod Kone.


Changes
-------

Rebased. Added comments to test.


Bugs: MESOS-814
    https://issues.apache.org/jira/browse/MESOS-814


Repository: mesos-git


Description
-------

See summary.


Diffs (updated)
-----

  src/tests/group_tests.cpp fb4c2f103ab7436b0fa2d6276c1ec1011a9405f0 
  src/zookeeper/group.hpp 04068e357cec95457d1f24c166d0b60f86d997d2 
  src/zookeeper/group.cpp 5218286bb6ab989ba16cead1483851cc8adba5ca 

Diff: https://reviews.apache.org/r/15706/diff/


Testing
-------

make check & mesos-tests.sh --gtest_filter=GroupTest*:ZooKeeperTest*:ZooKeeperMasterContenderDetectorTest* with 100 iterations.

Added a test to exercise authenticate() and create() retries.


Thanks,

Jiang Yan Xu


Re: Review Request 15706: Fixed Group to retry when authentication failed due to retryable errors.

Posted by Jiang Yan Xu <ya...@jxu.me>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/15706/
-----------------------------------------------------------

(Updated Nov. 28, 2013, 8:52 a.m.)


Review request for mesos, Benjamin Hindman, Ben Mahler, and Vinod Kone.


Changes
-------

- BenM's comments
- Killed GroupProcess::error because it's actually mostly useless since each time error is set, it is followed by abort() which clears it...
- Added CONNECTED state to help determine whether non-retryable error/abort() has happened (state transitions to CONNECTING during abort())

With these changes I think the Group logic is simpler now.


Bugs: MESOS-814
    https://issues.apache.org/jira/browse/MESOS-814


Repository: mesos-git


Description
-------

See summary.


Diffs (updated)
-----

  src/tests/group_tests.cpp fb4c2f103ab7436b0fa2d6276c1ec1011a9405f0 
  src/zookeeper/group.hpp 04068e357cec95457d1f24c166d0b60f86d997d2 
  src/zookeeper/group.cpp 12c781b29f4300ca8a29660adc3f1e55e03d5d04 

Diff: https://reviews.apache.org/r/15706/diff/


Testing (updated)
-------

make check & mesos-tests.sh --gtest_filter=GroupTest*:ZooKeeperTest*:ZooKeeperMasterContenderDetectorTest* with 100 iterations.

Added a test to exercise authenticate() and create() retries.


Thanks,

Jiang Yan Xu


Re: Review Request 15706: Fixed Group to retry when authentication failed due to retryable errors.

Posted by Jiang Yan Xu <ya...@jxu.me>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/15706/
-----------------------------------------------------------

(Updated Nov. 25, 2013, 7:59 p.m.)


Review request for mesos, Benjamin Hindman, Ben Mahler, and Vinod Kone.


Changes
-------

Vinod's comments.
- Split setup() into authenticate() and create()


Bugs: MESOS-814
    https://issues.apache.org/jira/browse/MESOS-814


Repository: mesos-git


Description
-------

See summary.


Diffs (updated)
-----

  src/tests/group_tests.cpp fb4c2f103ab7436b0fa2d6276c1ec1011a9405f0 
  src/zookeeper/group.hpp 04068e357cec95457d1f24c166d0b60f86d997d2 
  src/zookeeper/group.cpp 12c781b29f4300ca8a29660adc3f1e55e03d5d04 

Diff: https://reviews.apache.org/r/15706/diff/


Testing (updated)
-------

make check & mesos-tests.sh --gtest_filter=GroupTest*:ZooKeeperTest*:ZooKeeperMasterContenderDetectorTest* with high iterations.

Added a test to exercise authenticate() and create() retries.


Thanks,

Jiang Yan Xu


Re: Review Request 15706: Fixed Group to retry when authentication failed due to retryable errors.

Posted by Vinod Kone <vi...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/15706/#review29274
-----------------------------------------------------------



src/zookeeper/group.cpp
<https://reviews.apache.org/r/15706/#comment56428>

    I'm confused about the logic here.
    
    Is the expectation that if there is a retryable error when setup() is called you would get another connected() event?
    
    I was expecting the authentication step to go into a queue of its own like join,cancel etc. But that doesn't seem to be the case.
    
    Also, is it possible to write a test for this?


- Vinod Kone


On Nov. 22, 2013, 12:30 a.m., Jiang Yan Xu wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/15706/
> -----------------------------------------------------------
> 
> (Updated Nov. 22, 2013, 12:30 a.m.)
> 
> 
> Review request for mesos, Benjamin Hindman, Ben Mahler, and Vinod Kone.
> 
> 
> Bugs: MESOS-814
>     https://issues.apache.org/jira/browse/MESOS-814
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> See summary.
> 
> 
> Diffs
> -----
> 
>   src/zookeeper/group.hpp 04068e357cec95457d1f24c166d0b60f86d997d2 
>   src/zookeeper/group.cpp 12c781b29f4300ca8a29660adc3f1e55e03d5d04 
> 
> Diff: https://reviews.apache.org/r/15706/diff/
> 
> 
> Testing
> -------
> 
> make check & mesos-tests.sh --gtest_filter=GroupTest*:ZooKeeperTest*:ZooKeeperMasterContenderDetectorTest* with high iterations.
> 
> This fix is for a problem not easy to expose through unit tests so no new tests were written.
> 
> 
> Thanks,
> 
> Jiang Yan Xu
> 
>


Re: Review Request 15706: Fixed Group to retry when authentication failed due to retryable errors.

Posted by Jiang Yan Xu <ya...@jxu.me>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/15706/
-----------------------------------------------------------

(Updated Nov. 22, 2013, 12:30 a.m.)


Review request for mesos, Benjamin Hindman, Ben Mahler, and Vinod Kone.


Changes
-------

BenM's comments.


Bugs: MESOS-814
    https://issues.apache.org/jira/browse/MESOS-814


Repository: mesos-git


Description
-------

See summary.


Diffs (updated)
-----

  src/zookeeper/group.hpp 04068e357cec95457d1f24c166d0b60f86d997d2 
  src/zookeeper/group.cpp 12c781b29f4300ca8a29660adc3f1e55e03d5d04 

Diff: https://reviews.apache.org/r/15706/diff/


Testing
-------

make check & mesos-tests.sh --gtest_filter=GroupTest*:ZooKeeperTest*:ZooKeeperMasterContenderDetectorTest* with high iterations.

This fix is for a problem not easy to expose through unit tests so no new tests were written.


Thanks,

Jiang Yan Xu


Re: Review Request 15706: Fixed Group to retry when authentication failed due to retryable errors.

Posted by Jiang Yan Xu <ya...@jxu.me>.

> On Nov. 20, 2013, 11:23 p.m., Ben Mahler wrote:
> > Looks good, I was looking at how retry() can be initiated from several locations in the code but it appears relatively harmless: we will just do redundant sync() calls on occasion AFAICT.

Each call to retry() is guarded by a "if (!retrying) { ... }" and 'retrying' is immediately set to true after it is dispatched by a delay so even the method on the object executed next won't be doing it again. Seems fine to me. 


- Jiang Yan


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/15706/#review29198
-----------------------------------------------------------


On Nov. 22, 2013, 12:30 a.m., Jiang Yan Xu wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/15706/
> -----------------------------------------------------------
> 
> (Updated Nov. 22, 2013, 12:30 a.m.)
> 
> 
> Review request for mesos, Benjamin Hindman, Ben Mahler, and Vinod Kone.
> 
> 
> Bugs: MESOS-814
>     https://issues.apache.org/jira/browse/MESOS-814
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> See summary.
> 
> 
> Diffs
> -----
> 
>   src/zookeeper/group.hpp 04068e357cec95457d1f24c166d0b60f86d997d2 
>   src/zookeeper/group.cpp 12c781b29f4300ca8a29660adc3f1e55e03d5d04 
> 
> Diff: https://reviews.apache.org/r/15706/diff/
> 
> 
> Testing
> -------
> 
> make check & mesos-tests.sh --gtest_filter=GroupTest*:ZooKeeperTest*:ZooKeeperMasterContenderDetectorTest* with high iterations.
> 
> This fix is for a problem not easy to expose through unit tests so no new tests were written.
> 
> 
> Thanks,
> 
> Jiang Yan Xu
> 
>


Re: Review Request 15706: Fixed Group to retry when authentication failed due to retryable errors.

Posted by Ben Mahler <be...@gmail.com>.

> On Nov. 20, 2013, 11:23 p.m., Ben Mahler wrote:
> > Looks good, I was looking at how retry() can be initiated from several locations in the code but it appears relatively harmless: we will just do redundant sync() calls on occasion AFAICT.
> 
> Jiang Yan Xu wrote:
>     Each call to retry() is guarded by a "if (!retrying) { ... }" and 'retrying' is immediately set to true after it is dispatched by a delay so even the method on the object executed next won't be doing it again. Seems fine to me.

Yes but these are not "calls", rather these are "delays" and so these are asynchronous. Meaning you need to consider what happens when retrying becomes true after the delay but before the delayed call takes place. AFAICT this just results in redundant sync() operations. +1 to testing this :)


- Ben


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/15706/#review29198
-----------------------------------------------------------


On Nov. 22, 2013, 12:30 a.m., Jiang Yan Xu wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/15706/
> -----------------------------------------------------------
> 
> (Updated Nov. 22, 2013, 12:30 a.m.)
> 
> 
> Review request for mesos, Benjamin Hindman, Ben Mahler, and Vinod Kone.
> 
> 
> Bugs: MESOS-814
>     https://issues.apache.org/jira/browse/MESOS-814
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> See summary.
> 
> 
> Diffs
> -----
> 
>   src/zookeeper/group.hpp 04068e357cec95457d1f24c166d0b60f86d997d2 
>   src/zookeeper/group.cpp 12c781b29f4300ca8a29660adc3f1e55e03d5d04 
> 
> Diff: https://reviews.apache.org/r/15706/diff/
> 
> 
> Testing
> -------
> 
> make check & mesos-tests.sh --gtest_filter=GroupTest*:ZooKeeperTest*:ZooKeeperMasterContenderDetectorTest* with high iterations.
> 
> This fix is for a problem not easy to expose through unit tests so no new tests were written.
> 
> 
> Thanks,
> 
> Jiang Yan Xu
> 
>


Re: Review Request 15706: Fixed Group to retry when authentication failed due to retryable errors.

Posted by Ben Mahler <be...@gmail.com>.

> On Nov. 20, 2013, 11:23 p.m., Ben Mahler wrote:
> > Looks good, I was looking at how retry() can be initiated from several locations in the code but it appears relatively harmless: we will just do redundant sync() calls on occasion AFAICT.
> 
> Jiang Yan Xu wrote:
>     Each call to retry() is guarded by a "if (!retrying) { ... }" and 'retrying' is immediately set to true after it is dispatched by a delay so even the method on the object executed next won't be doing it again. Seems fine to me.
> 
> Ben Mahler wrote:
>     Yes but these are not "calls", rather these are "delays" and so these are asynchronous. Meaning you need to consider what happens when retrying becomes true after the delay but before the delayed call takes place. AFAICT this just results in redundant sync() operations. +1 to testing this :)

Oh I see, yeah this seems ok! :)


- Ben


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/15706/#review29198
-----------------------------------------------------------


On Nov. 25, 2013, 7:59 p.m., Jiang Yan Xu wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/15706/
> -----------------------------------------------------------
> 
> (Updated Nov. 25, 2013, 7:59 p.m.)
> 
> 
> Review request for mesos, Benjamin Hindman, Ben Mahler, and Vinod Kone.
> 
> 
> Bugs: MESOS-814
>     https://issues.apache.org/jira/browse/MESOS-814
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> See summary.
> 
> 
> Diffs
> -----
> 
>   src/tests/group_tests.cpp fb4c2f103ab7436b0fa2d6276c1ec1011a9405f0 
>   src/zookeeper/group.hpp 04068e357cec95457d1f24c166d0b60f86d997d2 
>   src/zookeeper/group.cpp 12c781b29f4300ca8a29660adc3f1e55e03d5d04 
> 
> Diff: https://reviews.apache.org/r/15706/diff/
> 
> 
> Testing
> -------
> 
> make check & mesos-tests.sh --gtest_filter=GroupTest*:ZooKeeperTest*:ZooKeeperMasterContenderDetectorTest* with high iterations.
> 
> Added a test to exercise authenticate() and create() retries.
> 
> 
> Thanks,
> 
> Jiang Yan Xu
> 
>


Re: Review Request 15706: Fixed Group to retry when authentication failed due to retryable errors.

Posted by Ben Mahler <be...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/15706/#review29198
-----------------------------------------------------------


Looks good, I was looking at how retry() can be initiated from several locations in the code but it appears relatively harmless: we will just do redundant sync() calls on occasion AFAICT.


src/zookeeper/group.cpp
<https://reviews.apache.org/r/15706/#comment56341>

    Can these be on the same line now?



src/zookeeper/group.cpp
<https://reviews.apache.org/r/15706/#comment56340>

    Result<Nothing> seems a little strange, what is the difference between None() and Some(Nothing())?
    
    Can we make this Try<bool>, returning whether setup was completed? This looks similar to how sync() is used as well.


- Ben Mahler


On Nov. 20, 2013, 12:05 a.m., Jiang Yan Xu wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/15706/
> -----------------------------------------------------------
> 
> (Updated Nov. 20, 2013, 12:05 a.m.)
> 
> 
> Review request for mesos, Benjamin Hindman, Ben Mahler, Ian Downes, Jie Yu, and Vinod Kone.
> 
> 
> Bugs: MESOS-814
>     https://issues.apache.org/jira/browse/MESOS-814
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> See summary.
> 
> 
> Diffs
> -----
> 
>   src/zookeeper/group.hpp 04068e357cec95457d1f24c166d0b60f86d997d2 
>   src/zookeeper/group.cpp 12c781b29f4300ca8a29660adc3f1e55e03d5d04 
> 
> Diff: https://reviews.apache.org/r/15706/diff/
> 
> 
> Testing
> -------
> 
> make check & mesos-tests.sh --gtest_filter=GroupTest*:ZooKeeperTest*:ZooKeeperMasterContenderDetectorTest* with high iterations.
> 
> This fix is for a problem not easy to expose through unit tests so no new tests were written.
> 
> 
> Thanks,
> 
> Jiang Yan Xu
> 
>