You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@mesos.apache.org by Ben Mahler <be...@gmail.com> on 2013/03/28 20:22:08 UTC

Review Request: Send NoMasterDetectedMessage to non-contending detectors. Added a disconnected slave map to the master to track disconnected slaves, in order to disallow slave re-registration after a network partition.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/10172/
-----------------------------------------------------------

Review request for mesos, Benjamin Hindman and Vinod Kone.


Description
-------

See above. This is a fix of MESOS-305.

This also fixes MESOS-362.


This addresses bugs MESOS-305 and MESOS-362.
    https://issues.apache.org/jira/browse/MESOS-305
    https://issues.apache.org/jira/browse/MESOS-362


Diffs
-----

  src/detector/detector.cpp 7a8355162d543e017505dd58efd2d7bf96f99623 
  src/master/http.cpp 71b04f01f45ee73d9c246f469e1368223903abed 
  src/master/master.hpp 9776a7cb8448e41e5d52288e3c637737cee15a08 
  src/master/master.cpp 5b0e8c03c516f9fc8bb729c21e876bdde89baf9c 
  src/tests/fault_tolerance_tests.cpp 9d3f8b1bfb58d459b1719d2ba1dbb2e93858fc92 
  src/tests/master_detector_tests.cpp fe3b91fb375e0b09f8f2de3e69e736cd5f5b94ba 

Diff: https://reviews.apache.org/r/10172/diff/


Testing
-------

make check

Added tests for the partitioned slave re-registration.
./bin/mesos-tests.sh --gtest_filter="FaultToleranceTest.PartitionedSlaveReregistration" --verbose --gtest_break_on_failure --gtest_repeat=3000

Ran into MESOS-406, but otherwise no issues.

Would love to add zookeeper master detector tests to ensure these new NoMasterDetected messages make it through, but there seems to be no easy way to induce the timedout() call at the moment. Suggestions?


Thanks,

Ben Mahler


Re: Review Request: Send NoMasterDetectedMessage on session timeout to non-contending detectors. Added a disconnected slave map to the master to track disconnected slaves, in order to disallow slave re-registration after a network partition.

Posted by Ben Mahler <be...@gmail.com>.

> On April 16, 2013, 7:02 p.m., Vinod Kone wrote:
> > src/master/http.cpp, line 270
> > <https://reviews.apache.org/r/10172/diff/2/?file=281465#file281465line270>
> >
> >     did you forget to kill this?

I will be renaming 'connected' to 'activated' in the next review: https://reviews.apache.org/r/10534/


> On April 16, 2013, 7:02 p.m., Vinod Kone wrote:
> > src/tests/fault_tolerance_tests.cpp, line 231
> > <https://reviews.apache.org/r/10172/diff/2/?file=281468#file281468line231>
> >
> >     s/Process/process/ ?

Well.. I want to be clear that I'm referring to a libprocess "Process".


> On April 16, 2013, 7:02 p.m., Vinod Kone wrote:
> > src/tests/fault_tolerance_tests.cpp, line 305
> > <https://reviews.apache.org/r/10172/diff/2/?file=281468#file281468line305>
> >
> >     You could simplify this by while(lostStatus.isPending()).
> >     
> >     But I will leave it up to you.

Punting for now, I want to be explicit about the pings, I also borrowed this from other tests (I know, no excuse ;)).


- Ben


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/10172/#review19267
-----------------------------------------------------------


On April 16, 2013, 12:58 a.m., Ben Mahler wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/10172/
> -----------------------------------------------------------
> 
> (Updated April 16, 2013, 12:58 a.m.)
> 
> 
> Review request for mesos, Benjamin Hindman and Vinod Kone.
> 
> 
> Description
> -------
> 
> See above. This is a fix of MESOS-305.
> 
> This also fixes MESOS-362.
> 
> 
> This addresses bugs MESOS-305 and MESOS-362.
>     https://issues.apache.org/jira/browse/MESOS-305
>     https://issues.apache.org/jira/browse/MESOS-362
> 
> 
> Diffs
> -----
> 
>   src/detector/detector.cpp 7a8355162d543e017505dd58efd2d7bf96f99623 
>   src/master/http.cpp 71b04f01f45ee73d9c246f469e1368223903abed 
>   src/master/master.hpp 9776a7cb8448e41e5d52288e3c637737cee15a08 
>   src/master/master.cpp 5b0e8c03c516f9fc8bb729c21e876bdde89baf9c 
>   src/tests/fault_tolerance_tests.cpp bfb30344ca02cd42c442a373d44d6a3fa287c1e3 
>   src/tests/master_detector_tests.cpp 980f3c720301b83af668e10f479adb9cce4f0c9f 
> 
> Diff: https://reviews.apache.org/r/10172/diff/
> 
> 
> Testing
> -------
> 
> make check
> 
> Added tests for the partitioned slave re-registration.
> ./bin/mesos-tests.sh --gtest_filter="FaultToleranceTest.PartitionedSlaveReregistration" --verbose --gtest_break_on_failure --gtest_repeat=3000
> 
> Ran into MESOS-406, but otherwise no issues.
> 
> Will be adding ZK master detector tests shortly to test that the NoMasterDetectedMessages are being sent.
> 
> 
> Thanks,
> 
> Ben Mahler
> 
>


Re: Review Request: Send NoMasterDetectedMessage on session timeout to non-contending detectors. Added a disconnected slave map to the master to track disconnected slaves, in order to disallow slave re-registration after a network partition.

Posted by Vinod Kone <vi...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/10172/#review19267
-----------------------------------------------------------



src/tests/fault_tolerance_tests.cpp
<https://reviews.apache.org/r/10172/#comment39880>

    thank you



src/master/http.cpp
<https://reviews.apache.org/r/10172/#comment39882>

    did you forget to kill this?



src/tests/fault_tolerance_tests.cpp
<https://reviews.apache.org/r/10172/#comment39893>

    Great comment!



src/tests/fault_tolerance_tests.cpp
<https://reviews.apache.org/r/10172/#comment39894>

    s/Process/process/ ?



src/tests/fault_tolerance_tests.cpp
<https://reviews.apache.org/r/10172/#comment39895>

    Pull this down after driver.start() but before driver.launchTasks()



src/tests/fault_tolerance_tests.cpp
<https://reviews.apache.org/r/10172/#comment39896>

    The lost status happens much later in the test. Pull this down to where it is expected.



src/tests/fault_tolerance_tests.cpp
<https://reviews.apache.org/r/10172/#comment39897>

    s/received/handled/



src/tests/fault_tolerance_tests.cpp
<https://reviews.apache.org/r/10172/#comment39898>

    You could use Future<Protobuf> and DROP_PROTOBUF().
    
    We typically use 'Message', when we are interested in the pids (message.from and message.to).



src/tests/fault_tolerance_tests.cpp
<https://reviews.apache.org/r/10172/#comment39899>

    You could simplify this by while(lostStatus.isPending()).
    
    But I will leave it up to you.



src/tests/fault_tolerance_tests.cpp
<https://reviews.apache.org/r/10172/#comment39900>

    again, you could use FUTURE_PROTOBUF here



src/tests/fault_tolerance_tests.cpp
<https://reviews.apache.org/r/10172/#comment39901>

    Excellent test!


- Vinod Kone


On April 16, 2013, 12:58 a.m., Ben Mahler wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/10172/
> -----------------------------------------------------------
> 
> (Updated April 16, 2013, 12:58 a.m.)
> 
> 
> Review request for mesos, Benjamin Hindman and Vinod Kone.
> 
> 
> Description
> -------
> 
> See above. This is a fix of MESOS-305.
> 
> This also fixes MESOS-362.
> 
> 
> This addresses bugs MESOS-305 and MESOS-362.
>     https://issues.apache.org/jira/browse/MESOS-305
>     https://issues.apache.org/jira/browse/MESOS-362
> 
> 
> Diffs
> -----
> 
>   src/detector/detector.cpp 7a8355162d543e017505dd58efd2d7bf96f99623 
>   src/master/http.cpp 71b04f01f45ee73d9c246f469e1368223903abed 
>   src/master/master.hpp 9776a7cb8448e41e5d52288e3c637737cee15a08 
>   src/master/master.cpp 5b0e8c03c516f9fc8bb729c21e876bdde89baf9c 
>   src/tests/fault_tolerance_tests.cpp bfb30344ca02cd42c442a373d44d6a3fa287c1e3 
>   src/tests/master_detector_tests.cpp 980f3c720301b83af668e10f479adb9cce4f0c9f 
> 
> Diff: https://reviews.apache.org/r/10172/diff/
> 
> 
> Testing
> -------
> 
> make check
> 
> Added tests for the partitioned slave re-registration.
> ./bin/mesos-tests.sh --gtest_filter="FaultToleranceTest.PartitionedSlaveReregistration" --verbose --gtest_break_on_failure --gtest_repeat=3000
> 
> Ran into MESOS-406, but otherwise no issues.
> 
> Will be adding ZK master detector tests shortly to test that the NoMasterDetectedMessages are being sent.
> 
> 
> Thanks,
> 
> Ben Mahler
> 
>


Re: Review Request: Send NoMasterDetectedMessage on session timeout to non-contending detectors. Added a disconnected slave map to the master to track disconnected slaves, in order to disallow slave re-registration after a network partition.

Posted by Ben Mahler <be...@gmail.com>.

> On April 29, 2013, 10:53 p.m., Benjamin Hindman wrote:
> > src/master/master.hpp, lines 232-233
> > <https://reviews.apache.org/r/10172/diff/5/?file=284292#file284292line232>
> >
> >     Why are you calling these *PIDs?
> 
> Ben Mahler wrote:
>     Whoops, leftover from before, s/PIDs/Hosts/?

I removed these in my next review, sorry about that: https://reviews.apache.org/r/10534


- Ben


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/10172/#review19906
-----------------------------------------------------------


On April 24, 2013, 11:55 p.m., Ben Mahler wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/10172/
> -----------------------------------------------------------
> 
> (Updated April 24, 2013, 11:55 p.m.)
> 
> 
> Review request for mesos, Benjamin Hindman and Vinod Kone.
> 
> 
> Description
> -------
> 
> See above. This is a fix of MESOS-305.
> 
> This also fixes MESOS-362.
> 
> 
> This addresses bugs MESOS-305 and MESOS-362.
>     https://issues.apache.org/jira/browse/MESOS-305
>     https://issues.apache.org/jira/browse/MESOS-362
> 
> 
> Diffs
> -----
> 
>   src/detector/detector.cpp 7a8355162d543e017505dd58efd2d7bf96f99623 
>   src/master/http.cpp 71b04f01f45ee73d9c246f469e1368223903abed 
>   src/master/master.hpp 9776a7cb8448e41e5d52288e3c637737cee15a08 
>   src/master/master.cpp c3b26b136a529eee34e9cdf9700176c232f6e436 
>   src/tests/fault_tolerance_tests.cpp d3476f7f5d1b7e2ed45385f5145eaccd6c114d21 
>   src/tests/master_detector_tests.cpp b042d6ffb0c2e58c6c338de2b2534fc6b63f5f08 
>   src/tests/zookeeper_tests.cpp 125b16566d5cd59732fef67d80617724ff71433b 
> 
> Diff: https://reviews.apache.org/r/10172/diff/
> 
> 
> Testing
> -------
> 
> make check
> 
> Added tests for the partitioned slave re-registration.
> ./bin/mesos-tests.sh --gtest_filter="FaultToleranceTest.PartitionedSlaveReregistration" --verbose --gtest_break_on_failure --gtest_repeat=3000
> 
> Ran into MESOS-406, but otherwise no issues.
> 
> Will be adding ZK master detector tests shortly to test that the NoMasterDetectedMessages are being sent.
> 
> 
> Thanks,
> 
> Ben Mahler
> 
>


Re: Review Request: Send NoMasterDetectedMessage on session timeout to non-contending detectors. Added a disconnected slave map to the master to track disconnected slaves, in order to disallow slave re-registration after a network partition.

Posted by Ben Mahler <be...@gmail.com>.

> On April 29, 2013, 10:53 p.m., Benjamin Hindman wrote:
> > src/master/master.hpp, lines 232-233
> > <https://reviews.apache.org/r/10172/diff/5/?file=284292#file284292line232>
> >
> >     Why are you calling these *PIDs?

Whoops, leftover from before, s/PIDs/Hosts/?


- Ben


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/10172/#review19906
-----------------------------------------------------------


On April 24, 2013, 11:55 p.m., Ben Mahler wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/10172/
> -----------------------------------------------------------
> 
> (Updated April 24, 2013, 11:55 p.m.)
> 
> 
> Review request for mesos, Benjamin Hindman and Vinod Kone.
> 
> 
> Description
> -------
> 
> See above. This is a fix of MESOS-305.
> 
> This also fixes MESOS-362.
> 
> 
> This addresses bugs MESOS-305 and MESOS-362.
>     https://issues.apache.org/jira/browse/MESOS-305
>     https://issues.apache.org/jira/browse/MESOS-362
> 
> 
> Diffs
> -----
> 
>   src/detector/detector.cpp 7a8355162d543e017505dd58efd2d7bf96f99623 
>   src/master/http.cpp 71b04f01f45ee73d9c246f469e1368223903abed 
>   src/master/master.hpp 9776a7cb8448e41e5d52288e3c637737cee15a08 
>   src/master/master.cpp c3b26b136a529eee34e9cdf9700176c232f6e436 
>   src/tests/fault_tolerance_tests.cpp d3476f7f5d1b7e2ed45385f5145eaccd6c114d21 
>   src/tests/master_detector_tests.cpp b042d6ffb0c2e58c6c338de2b2534fc6b63f5f08 
>   src/tests/zookeeper_tests.cpp 125b16566d5cd59732fef67d80617724ff71433b 
> 
> Diff: https://reviews.apache.org/r/10172/diff/
> 
> 
> Testing
> -------
> 
> make check
> 
> Added tests for the partitioned slave re-registration.
> ./bin/mesos-tests.sh --gtest_filter="FaultToleranceTest.PartitionedSlaveReregistration" --verbose --gtest_break_on_failure --gtest_repeat=3000
> 
> Ran into MESOS-406, but otherwise no issues.
> 
> Will be adding ZK master detector tests shortly to test that the NoMasterDetectedMessages are being sent.
> 
> 
> Thanks,
> 
> Ben Mahler
> 
>


Re: Review Request: Send NoMasterDetectedMessage on session timeout to non-contending detectors. Added a disconnected slave map to the master to track disconnected slaves, in order to disallow slave re-registration after a network partition.

Posted by Benjamin Hindman <be...@berkeley.edu>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/10172/#review19906
-----------------------------------------------------------

Ship it!



src/master/master.hpp
<https://reviews.apache.org/r/10172/#comment41066>

    Why are you calling these *PIDs?


- Benjamin Hindman


On April 24, 2013, 11:55 p.m., Ben Mahler wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/10172/
> -----------------------------------------------------------
> 
> (Updated April 24, 2013, 11:55 p.m.)
> 
> 
> Review request for mesos, Benjamin Hindman and Vinod Kone.
> 
> 
> Description
> -------
> 
> See above. This is a fix of MESOS-305.
> 
> This also fixes MESOS-362.
> 
> 
> This addresses bugs MESOS-305 and MESOS-362.
>     https://issues.apache.org/jira/browse/MESOS-305
>     https://issues.apache.org/jira/browse/MESOS-362
> 
> 
> Diffs
> -----
> 
>   src/detector/detector.cpp 7a8355162d543e017505dd58efd2d7bf96f99623 
>   src/master/http.cpp 71b04f01f45ee73d9c246f469e1368223903abed 
>   src/master/master.hpp 9776a7cb8448e41e5d52288e3c637737cee15a08 
>   src/master/master.cpp c3b26b136a529eee34e9cdf9700176c232f6e436 
>   src/tests/fault_tolerance_tests.cpp d3476f7f5d1b7e2ed45385f5145eaccd6c114d21 
>   src/tests/master_detector_tests.cpp b042d6ffb0c2e58c6c338de2b2534fc6b63f5f08 
>   src/tests/zookeeper_tests.cpp 125b16566d5cd59732fef67d80617724ff71433b 
> 
> Diff: https://reviews.apache.org/r/10172/diff/
> 
> 
> Testing
> -------
> 
> make check
> 
> Added tests for the partitioned slave re-registration.
> ./bin/mesos-tests.sh --gtest_filter="FaultToleranceTest.PartitionedSlaveReregistration" --verbose --gtest_break_on_failure --gtest_repeat=3000
> 
> Ran into MESOS-406, but otherwise no issues.
> 
> Will be adding ZK master detector tests shortly to test that the NoMasterDetectedMessages are being sent.
> 
> 
> Thanks,
> 
> Ben Mahler
> 
>


Re: Review Request: Send NoMasterDetectedMessage on session timeout to non-contending detectors. Added a disconnected slave map to the master to track disconnected slaves, in order to disallow slave re-registration after a network partition.

Posted by Ben Mahler <be...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/10172/
-----------------------------------------------------------

(Updated April 30, 2013, 1:30 a.m.)


Review request for mesos, Benjamin Hindman and Vinod Kone.


Changes
-------

Rebased off trunk.


Description
-------

See above. This is a fix of MESOS-305.

This also fixes MESOS-362.


This addresses bugs MESOS-305 and MESOS-362.
    https://issues.apache.org/jira/browse/MESOS-305
    https://issues.apache.org/jira/browse/MESOS-362


Diffs (updated)
-----

  src/detector/detector.cpp 7a8355162d543e017505dd58efd2d7bf96f99623 
  src/master/http.cpp 71b04f01f45ee73d9c246f469e1368223903abed 
  src/master/master.hpp 4a8aaee5a9970c0dd5cb022f04e48fb308241e20 
  src/master/master.cpp ff2f9546b3e5c885da0a5986606beaca57ba4d5c 
  src/tests/fault_tolerance_tests.cpp 70e2d558af72cc267240042577cf9f0fbfebe6d6 
  src/tests/master_detector_tests.cpp b042d6ffb0c2e58c6c338de2b2534fc6b63f5f08 
  src/tests/zookeeper_tests.cpp 125b16566d5cd59732fef67d80617724ff71433b 

Diff: https://reviews.apache.org/r/10172/diff/


Testing
-------

make check

Added tests for the partitioned slave re-registration.
./bin/mesos-tests.sh --gtest_filter="FaultToleranceTest.PartitionedSlaveReregistration" --verbose --gtest_break_on_failure --gtest_repeat=3000

Ran into MESOS-406, but otherwise no issues.

Will be adding ZK master detector tests shortly to test that the NoMasterDetectedMessages are being sent.


Thanks,

Ben Mahler


Re: Review Request: Send NoMasterDetectedMessage on session timeout to non-contending detectors. Added a disconnected slave map to the master to track disconnected slaves, in order to disallow slave re-registration after a network partition.

Posted by Ben Mahler <be...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/10172/
-----------------------------------------------------------

(Updated April 24, 2013, 11:55 p.m.)


Review request for mesos, Benjamin Hindman and Vinod Kone.


Changes
-------

Rebased.


Description
-------

See above. This is a fix of MESOS-305.

This also fixes MESOS-362.


This addresses bugs MESOS-305 and MESOS-362.
    https://issues.apache.org/jira/browse/MESOS-305
    https://issues.apache.org/jira/browse/MESOS-362


Diffs (updated)
-----

  src/detector/detector.cpp 7a8355162d543e017505dd58efd2d7bf96f99623 
  src/master/http.cpp 71b04f01f45ee73d9c246f469e1368223903abed 
  src/master/master.hpp 9776a7cb8448e41e5d52288e3c637737cee15a08 
  src/master/master.cpp c3b26b136a529eee34e9cdf9700176c232f6e436 
  src/tests/fault_tolerance_tests.cpp d3476f7f5d1b7e2ed45385f5145eaccd6c114d21 
  src/tests/master_detector_tests.cpp b042d6ffb0c2e58c6c338de2b2534fc6b63f5f08 
  src/tests/zookeeper_tests.cpp 125b16566d5cd59732fef67d80617724ff71433b 

Diff: https://reviews.apache.org/r/10172/diff/


Testing
-------

make check

Added tests for the partitioned slave re-registration.
./bin/mesos-tests.sh --gtest_filter="FaultToleranceTest.PartitionedSlaveReregistration" --verbose --gtest_break_on_failure --gtest_repeat=3000

Ran into MESOS-406, but otherwise no issues.

Will be adding ZK master detector tests shortly to test that the NoMasterDetectedMessages are being sent.


Thanks,

Ben Mahler


Re: Review Request: Send NoMasterDetectedMessage on session timeout to non-contending detectors. Added a disconnected slave map to the master to track disconnected slaves, in order to disallow slave re-registration after a network partition.

Posted by Ben Mahler <be...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/10172/
-----------------------------------------------------------

(Updated April 19, 2013, 8:55 p.m.)


Review request for mesos, Benjamin Hindman and Vinod Kone.


Changes
-------

Vinod's review.


Description
-------

See above. This is a fix of MESOS-305.

This also fixes MESOS-362.


This addresses bugs MESOS-305 and MESOS-362.
    https://issues.apache.org/jira/browse/MESOS-305
    https://issues.apache.org/jira/browse/MESOS-362


Diffs (updated)
-----

  src/detector/detector.cpp 7a8355162d543e017505dd58efd2d7bf96f99623 
  src/master/http.cpp 71b04f01f45ee73d9c246f469e1368223903abed 
  src/master/master.hpp 9776a7cb8448e41e5d52288e3c637737cee15a08 
  src/master/master.cpp 5b0e8c03c516f9fc8bb729c21e876bdde89baf9c 
  src/tests/fault_tolerance_tests.cpp 0348f20a8f4333f7d2f3786c33e55713cbcbcbe0 
  src/tests/master_detector_tests.cpp 980f3c720301b83af668e10f479adb9cce4f0c9f 
  src/tests/zookeeper_tests.cpp 0855f5aee0baef22c4ecbed1b88f14f16bfff532 

Diff: https://reviews.apache.org/r/10172/diff/


Testing
-------

make check

Added tests for the partitioned slave re-registration.
./bin/mesos-tests.sh --gtest_filter="FaultToleranceTest.PartitionedSlaveReregistration" --verbose --gtest_break_on_failure --gtest_repeat=3000

Ran into MESOS-406, but otherwise no issues.

Will be adding ZK master detector tests shortly to test that the NoMasterDetectedMessages are being sent.


Thanks,

Ben Mahler


Re: Review Request: Send NoMasterDetectedMessage on session timeout to non-contending detectors. Added a disconnected slave map to the master to track disconnected slaves, in order to disallow slave re-registration after a network partition.

Posted by Vinod Kone <vi...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/10172/#review19468
-----------------------------------------------------------

Ship it!



src/master/master.cpp
<https://reviews.apache.org/r/10172/#comment40214>

    not yours, but could you add slave id, pid and hostname to the log message?



src/master/master.cpp
<https://reviews.apache.org/r/10172/#comment40213>

    ditto


- Vinod Kone


On April 19, 2013, 6:54 p.m., Ben Mahler wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/10172/
> -----------------------------------------------------------
> 
> (Updated April 19, 2013, 6:54 p.m.)
> 
> 
> Review request for mesos, Benjamin Hindman and Vinod Kone.
> 
> 
> Description
> -------
> 
> See above. This is a fix of MESOS-305.
> 
> This also fixes MESOS-362.
> 
> 
> This addresses bugs MESOS-305 and MESOS-362.
>     https://issues.apache.org/jira/browse/MESOS-305
>     https://issues.apache.org/jira/browse/MESOS-362
> 
> 
> Diffs
> -----
> 
>   src/detector/detector.cpp 7a8355162d543e017505dd58efd2d7bf96f99623 
>   src/master/http.cpp 71b04f01f45ee73d9c246f469e1368223903abed 
>   src/master/master.hpp 9776a7cb8448e41e5d52288e3c637737cee15a08 
>   src/master/master.cpp 5b0e8c03c516f9fc8bb729c21e876bdde89baf9c 
>   src/tests/fault_tolerance_tests.cpp 0348f20a8f4333f7d2f3786c33e55713cbcbcbe0 
>   src/tests/master_detector_tests.cpp 980f3c720301b83af668e10f479adb9cce4f0c9f 
>   src/tests/zookeeper_tests.cpp 0855f5aee0baef22c4ecbed1b88f14f16bfff532 
> 
> Diff: https://reviews.apache.org/r/10172/diff/
> 
> 
> Testing
> -------
> 
> make check
> 
> Added tests for the partitioned slave re-registration.
> ./bin/mesos-tests.sh --gtest_filter="FaultToleranceTest.PartitionedSlaveReregistration" --verbose --gtest_break_on_failure --gtest_repeat=3000
> 
> Ran into MESOS-406, but otherwise no issues.
> 
> Will be adding ZK master detector tests shortly to test that the NoMasterDetectedMessages are being sent.
> 
> 
> Thanks,
> 
> Ben Mahler
> 
>


Re: Review Request: Send NoMasterDetectedMessage on session timeout to non-contending detectors. Added a disconnected slave map to the master to track disconnected slaves, in order to disallow slave re-registration after a network partition.

Posted by Ben Mahler <be...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/10172/
-----------------------------------------------------------

(Updated April 19, 2013, 6:54 p.m.)


Review request for mesos, Benjamin Hindman and Vinod Kone.


Changes
-------

Vinod + rebase.


Description
-------

See above. This is a fix of MESOS-305.

This also fixes MESOS-362.


This addresses bugs MESOS-305 and MESOS-362.
    https://issues.apache.org/jira/browse/MESOS-305
    https://issues.apache.org/jira/browse/MESOS-362


Diffs (updated)
-----

  src/detector/detector.cpp 7a8355162d543e017505dd58efd2d7bf96f99623 
  src/master/http.cpp 71b04f01f45ee73d9c246f469e1368223903abed 
  src/master/master.hpp 9776a7cb8448e41e5d52288e3c637737cee15a08 
  src/master/master.cpp 5b0e8c03c516f9fc8bb729c21e876bdde89baf9c 
  src/tests/fault_tolerance_tests.cpp 0348f20a8f4333f7d2f3786c33e55713cbcbcbe0 
  src/tests/master_detector_tests.cpp 980f3c720301b83af668e10f479adb9cce4f0c9f 
  src/tests/zookeeper_tests.cpp 0855f5aee0baef22c4ecbed1b88f14f16bfff532 

Diff: https://reviews.apache.org/r/10172/diff/


Testing
-------

make check

Added tests for the partitioned slave re-registration.
./bin/mesos-tests.sh --gtest_filter="FaultToleranceTest.PartitionedSlaveReregistration" --verbose --gtest_break_on_failure --gtest_repeat=3000

Ran into MESOS-406, but otherwise no issues.

Will be adding ZK master detector tests shortly to test that the NoMasterDetectedMessages are being sent.


Thanks,

Ben Mahler


Re: Review Request: Send NoMasterDetectedMessage on session timeout to non-contending detectors. Added a disconnected slave map to the master to track disconnected slaves, in order to disallow slave re-registration after a network partition.

Posted by Ben Mahler <be...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/10172/
-----------------------------------------------------------

(Updated April 16, 2013, 12:58 a.m.)


Review request for mesos, Benjamin Hindman and Vinod Kone.


Changes
-------

Vinod's review. Also, updated the test to use the new abstractions!


Summary (updated)
-----------------

Send NoMasterDetectedMessage on session timeout to non-contending detectors. Added a disconnected slave map to the master to track disconnected slaves, in order to disallow slave re-registration after a network partition.


Description
-------

See above. This is a fix of MESOS-305.

This also fixes MESOS-362.


This addresses bugs MESOS-305 and MESOS-362.
    https://issues.apache.org/jira/browse/MESOS-305
    https://issues.apache.org/jira/browse/MESOS-362


Diffs (updated)
-----

  src/detector/detector.cpp 7a8355162d543e017505dd58efd2d7bf96f99623 
  src/master/http.cpp 71b04f01f45ee73d9c246f469e1368223903abed 
  src/master/master.hpp 9776a7cb8448e41e5d52288e3c637737cee15a08 
  src/master/master.cpp 5b0e8c03c516f9fc8bb729c21e876bdde89baf9c 
  src/tests/fault_tolerance_tests.cpp bfb30344ca02cd42c442a373d44d6a3fa287c1e3 
  src/tests/master_detector_tests.cpp 980f3c720301b83af668e10f479adb9cce4f0c9f 

Diff: https://reviews.apache.org/r/10172/diff/


Testing
-------

make check

Added tests for the partitioned slave re-registration.
./bin/mesos-tests.sh --gtest_filter="FaultToleranceTest.PartitionedSlaveReregistration" --verbose --gtest_break_on_failure --gtest_repeat=3000

Ran into MESOS-406, but otherwise no issues.

Will be adding ZK master detector tests shortly to test that the NoMasterDetectedMessages are being sent.


Thanks,

Ben Mahler


Re: Review Request: Send NoMasterDetectedMessage to non-contending detectors. Added a disconnected slave map to the master to track disconnected slaves, in order to disallow slave re-registration after a network partition.

Posted by Ben Mahler <be...@gmail.com>.

> On March 29, 2013, 9:34 p.m., Vinod Kone wrote:
> > src/master/http.cpp, line 268
> > <https://reviews.apache.org/r/10172/diff/1/?file=275912#file275912line268>
> >
> >     what is the difference between activated and connected slaves?

Sent out another review that fixes this.


> On March 29, 2013, 9:34 p.m., Vinod Kone wrote:
> > src/master/master.hpp, lines 232-233
> > <https://reviews.apache.org/r/10172/diff/1/?file=275913#file275913line232>
> >
> >     kill slavePIDs. just use slaves.
> >     
> >     use hashset<SlaveID> deactivated.
> >     
> >     kill active or connected. just maintain one variable.

Fixed in the separate review I sent out.


> On March 29, 2013, 9:34 p.m., Vinod Kone wrote:
> > src/tests/master_detector_tests.cpp, line 92
> > <https://reviews.apache.org/r/10172/diff/1/?file=275916#file275916line92>
> >
> >     why write stuff to the work directory?
> >     
> >     i thought you added a "sandbox" in MesosTest for this stuff?

After discussing in an earlier, we removed the sandbox as the slave work directory is effectively a sandbox already.


- Ben


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/10172/#review18532
-----------------------------------------------------------


On March 29, 2013, 1:38 a.m., Ben Mahler wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/10172/
> -----------------------------------------------------------
> 
> (Updated March 29, 2013, 1:38 a.m.)
> 
> 
> Review request for mesos, Benjamin Hindman and Vinod Kone.
> 
> 
> Description
> -------
> 
> See above. This is a fix of MESOS-305.
> 
> This also fixes MESOS-362.
> 
> 
> This addresses bugs MESOS-305 and MESOS-362.
>     https://issues.apache.org/jira/browse/MESOS-305
>     https://issues.apache.org/jira/browse/MESOS-362
> 
> 
> Diffs
> -----
> 
>   src/detector/detector.cpp 7a8355162d543e017505dd58efd2d7bf96f99623 
>   src/master/http.cpp 71b04f01f45ee73d9c246f469e1368223903abed 
>   src/master/master.hpp 9776a7cb8448e41e5d52288e3c637737cee15a08 
>   src/master/master.cpp 5b0e8c03c516f9fc8bb729c21e876bdde89baf9c 
>   src/tests/fault_tolerance_tests.cpp 9d3f8b1bfb58d459b1719d2ba1dbb2e93858fc92 
>   src/tests/master_detector_tests.cpp fe3b91fb375e0b09f8f2de3e69e736cd5f5b94ba 
> 
> Diff: https://reviews.apache.org/r/10172/diff/
> 
> 
> Testing
> -------
> 
> make check
> 
> Added tests for the partitioned slave re-registration.
> ./bin/mesos-tests.sh --gtest_filter="FaultToleranceTest.PartitionedSlaveReregistration" --verbose --gtest_break_on_failure --gtest_repeat=3000
> 
> Ran into MESOS-406, but otherwise no issues.
> 
> Will be adding ZK master detector tests shortly to test that the NoMasterDetectedMessages are being sent.
> 
> 
> Thanks,
> 
> Ben Mahler
> 
>


Re: Review Request: Send NoMasterDetectedMessage to non-contending detectors. Added a disconnected slave map to the master to track disconnected slaves, in order to disallow slave re-registration after a network partition.

Posted by Vinod Kone <vi...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/10172/#review18532
-----------------------------------------------------------



src/master/http.cpp
<https://reviews.apache.org/r/10172/#comment38840>

    what is the difference between activated and connected slaves?



src/master/master.hpp
<https://reviews.apache.org/r/10172/#comment38847>

    kill slavePIDs. just use slaves.
    
    use hashset<SlaveID> deactivated.
    
    kill active or connected. just maintain one variable.



src/tests/fault_tolerance_tests.cpp
<https://reviews.apache.org/r/10172/#comment38842>

    s/disconnected/removed/ right?



src/tests/fault_tolerance_tests.cpp
<https://reviews.apache.org/r/10172/#comment38843>

    s/disconnect/remove/



src/tests/fault_tolerance_tests.cpp
<https://reviews.apache.org/r/10172/#comment38849>

    new line?



src/tests/fault_tolerance_tests.cpp
<https://reviews.apache.org/r/10172/#comment38848>

    new line?



src/tests/fault_tolerance_tests.cpp
<https://reviews.apache.org/r/10172/#comment38851>

    new line?



src/tests/fault_tolerance_tests.cpp
<https://reviews.apache.org/r/10172/#comment38852>

    s/, this/. This/



src/tests/fault_tolerance_tests.cpp
<https://reviews.apache.org/r/10172/#comment38856>

    kill



src/tests/fault_tolerance_tests.cpp
<https://reviews.apache.org/r/10172/#comment38857>

    Can you expand on this comment. Say something on the lines of simulating the detector behavior during a network partition.



src/tests/fault_tolerance_tests.cpp
<https://reviews.apache.org/r/10172/#comment38855>

    kill



src/tests/fault_tolerance_tests.cpp
<https://reviews.apache.org/r/10172/#comment38853>

    s/registration/re-registration/



src/tests/master_detector_tests.cpp
<https://reviews.apache.org/r/10172/#comment38858>

    why write stuff to the work directory?
    
    i thought you added a "sandbox" in MesosTest for this stuff?


- Vinod Kone


On March 29, 2013, 1:38 a.m., Ben Mahler wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/10172/
> -----------------------------------------------------------
> 
> (Updated March 29, 2013, 1:38 a.m.)
> 
> 
> Review request for mesos, Benjamin Hindman and Vinod Kone.
> 
> 
> Description
> -------
> 
> See above. This is a fix of MESOS-305.
> 
> This also fixes MESOS-362.
> 
> 
> This addresses bugs MESOS-305 and MESOS-362.
>     https://issues.apache.org/jira/browse/MESOS-305
>     https://issues.apache.org/jira/browse/MESOS-362
> 
> 
> Diffs
> -----
> 
>   src/detector/detector.cpp 7a8355162d543e017505dd58efd2d7bf96f99623 
>   src/master/http.cpp 71b04f01f45ee73d9c246f469e1368223903abed 
>   src/master/master.hpp 9776a7cb8448e41e5d52288e3c637737cee15a08 
>   src/master/master.cpp 5b0e8c03c516f9fc8bb729c21e876bdde89baf9c 
>   src/tests/fault_tolerance_tests.cpp 9d3f8b1bfb58d459b1719d2ba1dbb2e93858fc92 
>   src/tests/master_detector_tests.cpp fe3b91fb375e0b09f8f2de3e69e736cd5f5b94ba 
> 
> Diff: https://reviews.apache.org/r/10172/diff/
> 
> 
> Testing
> -------
> 
> make check
> 
> Added tests for the partitioned slave re-registration.
> ./bin/mesos-tests.sh --gtest_filter="FaultToleranceTest.PartitionedSlaveReregistration" --verbose --gtest_break_on_failure --gtest_repeat=3000
> 
> Ran into MESOS-406, but otherwise no issues.
> 
> Will be adding ZK master detector tests shortly to test that the NoMasterDetectedMessages are being sent.
> 
> 
> Thanks,
> 
> Ben Mahler
> 
>


Re: Review Request: Send NoMasterDetectedMessage to non-contending detectors. Added a disconnected slave map to the master to track disconnected slaves, in order to disallow slave re-registration after a network partition.

Posted by Ben Mahler <be...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/10172/
-----------------------------------------------------------

(Updated March 29, 2013, 1:38 a.m.)


Review request for mesos, Benjamin Hindman and Vinod Kone.


Description
-------

See above. This is a fix of MESOS-305.

This also fixes MESOS-362.


This addresses bugs MESOS-305 and MESOS-362.
    https://issues.apache.org/jira/browse/MESOS-305
    https://issues.apache.org/jira/browse/MESOS-362


Diffs
-----

  src/detector/detector.cpp 7a8355162d543e017505dd58efd2d7bf96f99623 
  src/master/http.cpp 71b04f01f45ee73d9c246f469e1368223903abed 
  src/master/master.hpp 9776a7cb8448e41e5d52288e3c637737cee15a08 
  src/master/master.cpp 5b0e8c03c516f9fc8bb729c21e876bdde89baf9c 
  src/tests/fault_tolerance_tests.cpp 9d3f8b1bfb58d459b1719d2ba1dbb2e93858fc92 
  src/tests/master_detector_tests.cpp fe3b91fb375e0b09f8f2de3e69e736cd5f5b94ba 

Diff: https://reviews.apache.org/r/10172/diff/


Testing (updated)
-------

make check

Added tests for the partitioned slave re-registration.
./bin/mesos-tests.sh --gtest_filter="FaultToleranceTest.PartitionedSlaveReregistration" --verbose --gtest_break_on_failure --gtest_repeat=3000

Ran into MESOS-406, but otherwise no issues.

Will be adding ZK master detector tests shortly to test that the NoMasterDetectedMessages are being sent.


Thanks,

Ben Mahler