You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@mesos.apache.org by Vinod Kone <vi...@gmail.com> on 2013/08/06 22:44:28 UTC
Re: Review Request 13259: Fixed master to properly handle a disconnected
slave.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/13259/
-----------------------------------------------------------
(Updated Aug. 6, 2013, 8:44 p.m.)
Review request for mesos, Benjamin Hindman and Ben Mahler.
Changes
-------
updated Summary and some minor fixes. still waiting for a review.
Summary (updated)
-----------------
Fixed master to properly handle a disconnected slave.
Bugs: MESOS-614
https://issues.apache.org/jira/browse/MESOS-614
Repository: mesos-git
Description (updated)
-------
When the master is waiting for a disconected checkpointed slave to "reregister", but the restarted slave tries to "register" (because it failed recovery) the master should remove the old slave. Also, we remove the disconnected slave from the allocator so that we don't offer its resources.
Diffs (updated)
-----
src/master/hierarchical_allocator_process.hpp ce06e7437b760fa6356ef86ae6800907282865ed
src/master/master.hpp 8ec0c17c71b7b4679d4f712d0fb742d420c9152d
src/master/master.cpp b0a2757af3ec83ead53374504fe24d3a8f7673ad
src/tests/slave_recovery_tests.cpp c451e0f4c571a646d375aa89e806e1a4058d39e7
Diff: https://reviews.apache.org/r/13259/diff/
Testing
-------
make check
Thanks,
Vinod Kone
Re: Review Request 13259: Fixed master to properly handle a disconnected
slave.
Posted by Ben Mahler <be...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/13259/#review24755
-----------------------------------------------------------
Ship it!
src/master/master.cpp
<https://reviews.apache.org/r/13259/#comment48871>
break; after this line?
- Ben Mahler
On Aug. 6, 2013, 8:44 p.m., Vinod Kone wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/13259/
> -----------------------------------------------------------
>
> (Updated Aug. 6, 2013, 8:44 p.m.)
>
>
> Review request for mesos, Benjamin Hindman and Ben Mahler.
>
>
> Bugs: MESOS-614
> https://issues.apache.org/jira/browse/MESOS-614
>
>
> Repository: mesos-git
>
>
> Description
> -------
>
> When the master is waiting for a disconected checkpointed slave to "reregister", but the restarted slave tries to "register" (because it failed recovery) the master should remove the old slave. Also, we remove the disconnected slave from the allocator so that we don't offer its resources.
>
>
> Diffs
> -----
>
> src/master/hierarchical_allocator_process.hpp ce06e7437b760fa6356ef86ae6800907282865ed
> src/master/master.hpp 8ec0c17c71b7b4679d4f712d0fb742d420c9152d
> src/master/master.cpp b0a2757af3ec83ead53374504fe24d3a8f7673ad
> src/tests/slave_recovery_tests.cpp c451e0f4c571a646d375aa89e806e1a4058d39e7
>
> Diff: https://reviews.apache.org/r/13259/diff/
>
>
> Testing
> -------
>
> make check
>
>
> Thanks,
>
> Vinod Kone
>
>
Re: Review Request 13259: Fixed master to properly handle a disconnected
slave.
Posted by Vinod Kone <vi...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/13259/
-----------------------------------------------------------
(Updated Aug. 7, 2013, 2:54 a.m.)
Review request for mesos, Benjamin Hindman and Ben Mahler.
Changes
-------
bens'. nnfr.
Bugs: MESOS-614
https://issues.apache.org/jira/browse/MESOS-614
Repository: mesos-git
Description
-------
When the master is waiting for a disconected checkpointed slave to "reregister", but the restarted slave tries to "register" (because it failed recovery) the master should remove the old slave. Also, we remove the disconnected slave from the allocator so that we don't offer its resources.
Diffs (updated)
-----
src/master/hierarchical_allocator_process.hpp ce06e7437b760fa6356ef86ae6800907282865ed
src/master/master.hpp 8ec0c17c71b7b4679d4f712d0fb742d420c9152d
src/master/master.cpp b0a2757af3ec83ead53374504fe24d3a8f7673ad
src/tests/slave_recovery_tests.cpp c451e0f4c571a646d375aa89e806e1a4058d39e7
Diff: https://reviews.apache.org/r/13259/diff/
Testing
-------
make check
Thanks,
Vinod Kone
Re: Review Request 13259: Fixed master to properly handle a disconnected
slave.
Posted by Vinod Kone <vi...@gmail.com>.
> On Aug. 6, 2013, 10:15 p.m., Benjamin Hindman wrote:
> > src/tests/slave_recovery_tests.cpp, line 1456
> > <https://reviews.apache.org/r/13259/diff/4/?file=337956#file337956line1456>
> >
> > Let's make it clear (here and elsewhere) that we're shutting down the executor manually because otherwise it won't get cleaned up rather than we're shutting down the executor in order to cause some event to occur in the test (which was my first impression).
FYI, we currently need the manual killing of executors in all of the tests because of a bug in the 'Cluster' dealing with shutdown of slaves. I will fix it in a subsequent review. Added TODOs in this patch.
- Vinod
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/13259/#review24746
-----------------------------------------------------------
On Aug. 7, 2013, 2:54 a.m., Vinod Kone wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/13259/
> -----------------------------------------------------------
>
> (Updated Aug. 7, 2013, 2:54 a.m.)
>
>
> Review request for mesos, Benjamin Hindman and Ben Mahler.
>
>
> Bugs: MESOS-614
> https://issues.apache.org/jira/browse/MESOS-614
>
>
> Repository: mesos-git
>
>
> Description
> -------
>
> When the master is waiting for a disconected checkpointed slave to "reregister", but the restarted slave tries to "register" (because it failed recovery) the master should remove the old slave. Also, we remove the disconnected slave from the allocator so that we don't offer its resources.
>
>
> Diffs
> -----
>
> src/master/hierarchical_allocator_process.hpp ce06e7437b760fa6356ef86ae6800907282865ed
> src/master/master.hpp 8ec0c17c71b7b4679d4f712d0fb742d420c9152d
> src/master/master.cpp b0a2757af3ec83ead53374504fe24d3a8f7673ad
> src/tests/slave_recovery_tests.cpp c451e0f4c571a646d375aa89e806e1a4058d39e7
>
> Diff: https://reviews.apache.org/r/13259/diff/
>
>
> Testing
> -------
>
> make check
>
>
> Thanks,
>
> Vinod Kone
>
>
Re: Review Request 13259: Fixed master to properly handle a disconnected
slave.
Posted by Benjamin Hindman <be...@berkeley.edu>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/13259/#review24746
-----------------------------------------------------------
Ship it!
src/master/hierarchical_allocator_process.hpp
<https://reviews.apache.org/r/13259/#comment48850>
Can we keep this a CHECK by not invoking slaveRemoved from the master if the slave is disconnected?
src/master/master.hpp
<https://reviews.apache.org/r/13259/#comment48852>
Can we highlight that we care about disconnected slaves because with slave recovery the slave might reconnect?
src/master/master.cpp
<https://reviews.apache.org/r/13259/#comment48853>
Can we add a TODO that explores Allocator::slaveDisconnected?
src/master/master.cpp
<https://reviews.apache.org/r/13259/#comment48856>
Yes, this is a bug in the allocator. Let's do Allocator::resourcesRecovered for now with a TODO that reevaluates this after we fix the bug in the allocator.
src/master/master.cpp
<https://reviews.apache.org/r/13259/#comment48861>
This sounds like a bug too. We should add a test for this.
src/tests/slave_recovery_tests.cpp
<https://reviews.apache.org/r/13259/#comment48862>
s/isolator1/isolator/
src/tests/slave_recovery_tests.cpp
<https://reviews.apache.org/r/13259/#comment48863>
Let's make it clear (here and elsewhere) that we're shutting down the executor manually because otherwise it won't get cleaned up rather than we're shutting down the executor in order to cause some event to occur in the test (which was my first impression).
- Benjamin Hindman
On Aug. 6, 2013, 8:44 p.m., Vinod Kone wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/13259/
> -----------------------------------------------------------
>
> (Updated Aug. 6, 2013, 8:44 p.m.)
>
>
> Review request for mesos, Benjamin Hindman and Ben Mahler.
>
>
> Bugs: MESOS-614
> https://issues.apache.org/jira/browse/MESOS-614
>
>
> Repository: mesos-git
>
>
> Description
> -------
>
> When the master is waiting for a disconected checkpointed slave to "reregister", but the restarted slave tries to "register" (because it failed recovery) the master should remove the old slave. Also, we remove the disconnected slave from the allocator so that we don't offer its resources.
>
>
> Diffs
> -----
>
> src/master/hierarchical_allocator_process.hpp ce06e7437b760fa6356ef86ae6800907282865ed
> src/master/master.hpp 8ec0c17c71b7b4679d4f712d0fb742d420c9152d
> src/master/master.cpp b0a2757af3ec83ead53374504fe24d3a8f7673ad
> src/tests/slave_recovery_tests.cpp c451e0f4c571a646d375aa89e806e1a4058d39e7
>
> Diff: https://reviews.apache.org/r/13259/diff/
>
>
> Testing
> -------
>
> make check
>
>
> Thanks,
>
> Vinod Kone
>
>