You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@mesos.apache.org by Vinod Kone <vi...@gmail.com> on 2013/08/06 22:44:28 UTC

Re: Review Request 13259: Fixed master to properly handle a disconnected slave.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/13259/
-----------------------------------------------------------

(Updated Aug. 6, 2013, 8:44 p.m.)


Review request for mesos, Benjamin Hindman and Ben Mahler.


Changes
-------

updated Summary and some minor fixes. still waiting for a review.


Summary (updated)
-----------------

Fixed master to properly handle a disconnected slave.


Bugs: MESOS-614
    https://issues.apache.org/jira/browse/MESOS-614


Repository: mesos-git


Description (updated)
-------

When the master is waiting for a disconected checkpointed slave to "reregister", but the restarted slave tries to "register" (because it failed recovery) the master should remove the old slave. Also, we remove the disconnected slave from the allocator so that we don't offer its resources.


Diffs (updated)
-----

  src/master/hierarchical_allocator_process.hpp ce06e7437b760fa6356ef86ae6800907282865ed 
  src/master/master.hpp 8ec0c17c71b7b4679d4f712d0fb742d420c9152d 
  src/master/master.cpp b0a2757af3ec83ead53374504fe24d3a8f7673ad 
  src/tests/slave_recovery_tests.cpp c451e0f4c571a646d375aa89e806e1a4058d39e7 

Diff: https://reviews.apache.org/r/13259/diff/


Testing
-------

make check


Thanks,

Vinod Kone

Re: Review Request 13259: Fixed master to properly handle a disconnected slave.

Posted by Ben Mahler <be...@gmail.com>.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/13259/#review24755
-----------------------------------------------------------

Ship it!



src/master/master.cpp
<https://reviews.apache.org/r/13259/#comment48871>

    break; after this line?


- Ben Mahler


On Aug. 6, 2013, 8:44 p.m., Vinod Kone wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/13259/
> -----------------------------------------------------------
> 
> (Updated Aug. 6, 2013, 8:44 p.m.)
> 
> 
> Review request for mesos, Benjamin Hindman and Ben Mahler.
> 
> 
> Bugs: MESOS-614
>     https://issues.apache.org/jira/browse/MESOS-614
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> When the master is waiting for a disconected checkpointed slave to "reregister", but the restarted slave tries to "register" (because it failed recovery) the master should remove the old slave. Also, we remove the disconnected slave from the allocator so that we don't offer its resources.
> 
> 
> Diffs
> -----
> 
>   src/master/hierarchical_allocator_process.hpp ce06e7437b760fa6356ef86ae6800907282865ed 
>   src/master/master.hpp 8ec0c17c71b7b4679d4f712d0fb742d420c9152d 
>   src/master/master.cpp b0a2757af3ec83ead53374504fe24d3a8f7673ad 
>   src/tests/slave_recovery_tests.cpp c451e0f4c571a646d375aa89e806e1a4058d39e7 
> 
> Diff: https://reviews.apache.org/r/13259/diff/
> 
> 
> Testing
> -------
> 
> make check
> 
> 
> Thanks,
> 
> Vinod Kone
> 
>

Re: Review Request 13259: Fixed master to properly handle a disconnected slave.

Posted by Vinod Kone <vi...@gmail.com>.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/13259/
-----------------------------------------------------------

(Updated Aug. 7, 2013, 2:54 a.m.)


Review request for mesos, Benjamin Hindman and Ben Mahler.


Changes
-------

bens'. nnfr.


Bugs: MESOS-614
    https://issues.apache.org/jira/browse/MESOS-614


Repository: mesos-git


Description
-------

When the master is waiting for a disconected checkpointed slave to "reregister", but the restarted slave tries to "register" (because it failed recovery) the master should remove the old slave. Also, we remove the disconnected slave from the allocator so that we don't offer its resources.


Diffs (updated)
-----

  src/master/hierarchical_allocator_process.hpp ce06e7437b760fa6356ef86ae6800907282865ed 
  src/master/master.hpp 8ec0c17c71b7b4679d4f712d0fb742d420c9152d 
  src/master/master.cpp b0a2757af3ec83ead53374504fe24d3a8f7673ad 
  src/tests/slave_recovery_tests.cpp c451e0f4c571a646d375aa89e806e1a4058d39e7 

Diff: https://reviews.apache.org/r/13259/diff/


Testing
-------

make check


Thanks,

Vinod Kone

Re: Review Request 13259: Fixed master to properly handle a disconnected slave.

Posted by Vinod Kone <vi...@gmail.com>.


> On Aug. 6, 2013, 10:15 p.m., Benjamin Hindman wrote:
> > src/tests/slave_recovery_tests.cpp, line 1456
> > <https://reviews.apache.org/r/13259/diff/4/?file=337956#file337956line1456>
> >
> >     Let's make it clear (here and elsewhere) that we're shutting down the executor manually because otherwise it won't get cleaned up rather than we're shutting down the executor in order to cause some event to occur in the test (which was my first impression).

FYI, we currently need the manual killing of executors in all of the tests because of a bug in the 'Cluster' dealing with shutdown of slaves. I will fix it in a subsequent review. Added TODOs in this patch.


- Vinod


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/13259/#review24746
-----------------------------------------------------------


On Aug. 7, 2013, 2:54 a.m., Vinod Kone wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/13259/
> -----------------------------------------------------------
> 
> (Updated Aug. 7, 2013, 2:54 a.m.)
> 
> 
> Review request for mesos, Benjamin Hindman and Ben Mahler.
> 
> 
> Bugs: MESOS-614
>     https://issues.apache.org/jira/browse/MESOS-614
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> When the master is waiting for a disconected checkpointed slave to "reregister", but the restarted slave tries to "register" (because it failed recovery) the master should remove the old slave. Also, we remove the disconnected slave from the allocator so that we don't offer its resources.
> 
> 
> Diffs
> -----
> 
>   src/master/hierarchical_allocator_process.hpp ce06e7437b760fa6356ef86ae6800907282865ed 
>   src/master/master.hpp 8ec0c17c71b7b4679d4f712d0fb742d420c9152d 
>   src/master/master.cpp b0a2757af3ec83ead53374504fe24d3a8f7673ad 
>   src/tests/slave_recovery_tests.cpp c451e0f4c571a646d375aa89e806e1a4058d39e7 
> 
> Diff: https://reviews.apache.org/r/13259/diff/
> 
> 
> Testing
> -------
> 
> make check
> 
> 
> Thanks,
> 
> Vinod Kone
> 
>

Re: Review Request 13259: Fixed master to properly handle a disconnected slave.

Posted by Benjamin Hindman <be...@berkeley.edu>.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/13259/#review24746
-----------------------------------------------------------

Ship it!



src/master/hierarchical_allocator_process.hpp
<https://reviews.apache.org/r/13259/#comment48850>

    Can we keep this a CHECK by not invoking slaveRemoved from the master if the slave is disconnected?



src/master/master.hpp
<https://reviews.apache.org/r/13259/#comment48852>

    Can we highlight that we care about disconnected slaves because with slave recovery the slave might reconnect?



src/master/master.cpp
<https://reviews.apache.org/r/13259/#comment48853>

    Can we add a TODO that explores Allocator::slaveDisconnected?



src/master/master.cpp
<https://reviews.apache.org/r/13259/#comment48856>

    Yes, this is a bug in the allocator. Let's do Allocator::resourcesRecovered for now with a TODO that reevaluates this after we fix the bug in the allocator.



src/master/master.cpp
<https://reviews.apache.org/r/13259/#comment48861>

    This sounds like a bug too. We should add a test for this.



src/tests/slave_recovery_tests.cpp
<https://reviews.apache.org/r/13259/#comment48862>

    s/isolator1/isolator/



src/tests/slave_recovery_tests.cpp
<https://reviews.apache.org/r/13259/#comment48863>

    Let's make it clear (here and elsewhere) that we're shutting down the executor manually because otherwise it won't get cleaned up rather than we're shutting down the executor in order to cause some event to occur in the test (which was my first impression).


- Benjamin Hindman


On Aug. 6, 2013, 8:44 p.m., Vinod Kone wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/13259/
> -----------------------------------------------------------
> 
> (Updated Aug. 6, 2013, 8:44 p.m.)
> 
> 
> Review request for mesos, Benjamin Hindman and Ben Mahler.
> 
> 
> Bugs: MESOS-614
>     https://issues.apache.org/jira/browse/MESOS-614
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> When the master is waiting for a disconected checkpointed slave to "reregister", but the restarted slave tries to "register" (because it failed recovery) the master should remove the old slave. Also, we remove the disconnected slave from the allocator so that we don't offer its resources.
> 
> 
> Diffs
> -----
> 
>   src/master/hierarchical_allocator_process.hpp ce06e7437b760fa6356ef86ae6800907282865ed 
>   src/master/master.hpp 8ec0c17c71b7b4679d4f712d0fb742d420c9152d 
>   src/master/master.cpp b0a2757af3ec83ead53374504fe24d3a8f7673ad 
>   src/tests/slave_recovery_tests.cpp c451e0f4c571a646d375aa89e806e1a4058d39e7 
> 
> Diff: https://reviews.apache.org/r/13259/diff/
> 
> 
> Testing
> -------
> 
> make check
> 
> 
> Thanks,
> 
> Vinod Kone
> 
>