You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@mesos.apache.org by Ben Mahler <be...@gmail.com> on 2014/05/01 23:52:11 UTC

Review Request 20981: Updated the Registrar to abort permanently upon encountering a Failure.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/20981/
-----------------------------------------------------------

Review request for mesos, Benjamin Hindman and Vinod Kone.


Bugs: MESOS-1274
    https://issues.apache.org/jira/browse/MESOS-1274


Repository: mesos-git


Description
-------

It's possible for a backed-up master (many items in its queue) to have many operations enqueued in the Registrar.

In this event, the Master won't commit suicide until the initial failure is processed. However, in the interim, subsequent operations are potentially being performed against the Registrar. This could lead to fighting between Masters if a "demoted" Master re-attempts to acquire log-leadership! This scenario can occur if the "demoted" master has a large queue and the demotion event is towards the back of the Master's queue.

It would be preferable to ensure that after losing log leadership, the "demoted" master does not try to re-acquire log leadership and write to the log.

This is the motivation for this patch.


Diffs
-----

  src/master/registrar.cpp e4b0b3968a7a50bc951717160769daf2c1850b65 
  src/tests/registrar_tests.cpp 917a470f326523fbf11e245f4156fc8ce1d974d5 

Diff: https://reviews.apache.org/r/20981/diff/


Testing
-------

Added a test and improved the existing tests.


Thanks,

Ben Mahler


Re: Review Request 20981: Updated the Registrar to abort permanently upon encountering a Failure.

Posted by Ben Mahler <be...@gmail.com>.

> On May 2, 2014, 12:04 a.m., Vinod Kone wrote:
> > src/master/registrar.cpp, line 371
> > <https://reviews.apache.org/r/20981/diff/1/?file=573051#file573051line371>
> >
> >     How about handling error condition in all the RegistrarProcess methods? If a method cannot be invoked when in error it should've a check.

Ok, I think instead of adding CHECKs throughout the code, it would be nice to codify the *core* of our semantics here, which is: We won't be calling State::store ever again if an Error occurs. I'll add this in update()!


> On May 2, 2014, 12:04 a.m., Vinod Kone wrote:
> > src/master/registrar.cpp, line 297
> > <https://reviews.apache.org/r/20981/diff/1/?file=573051#file573051line297>
> >
> >     No check for error here?

As far as 'recover' is concerned there are two cases:

  (1) Nothing has happened yet, so start recovering!
  (2) We're already recover{ing|ed}, return existing future!

In both of these cases, there's no need to be concerned about whether we've aborted. In case (1), nothing has happened yet. In case (2), we don't do any work.

If 'recover' fails:

  (a) All future attempts to recover() will fail (as the existing future is returned).
  (b) All future operations will fail, as they are gated behind the 'recovered' future.

If we didn't have to inject 'MasterInfo' we could simplify the code quite a bit. For now, I like the idea of capturing the assumptions! I'll add the CHECK in update().

I also moved the LOG(INFO) inside the 'if' condition to capture the fact that we're not really "recovering" otherwise.


- Ben


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/20981/#review41975
-----------------------------------------------------------


On May 1, 2014, 9:52 p.m., Ben Mahler wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/20981/
> -----------------------------------------------------------
> 
> (Updated May 1, 2014, 9:52 p.m.)
> 
> 
> Review request for mesos, Benjamin Hindman and Vinod Kone.
> 
> 
> Bugs: MESOS-1274
>     https://issues.apache.org/jira/browse/MESOS-1274
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> It's possible for a backed-up master (many items in its queue) to have many operations enqueued in the Registrar.
> 
> In this event, the Master won't commit suicide until the initial failure is processed. However, in the interim, subsequent operations are potentially being performed against the Registrar. This could lead to fighting between Masters if a "demoted" Master re-attempts to acquire log-leadership! This scenario can occur if the "demoted" master has a large queue and the demotion event is towards the back of the Master's queue.
> 
> It would be preferable to ensure that after losing log leadership, the "demoted" master does not try to re-acquire log leadership and write to the log.
> 
> This is the motivation for this patch.
> 
> 
> Diffs
> -----
> 
>   src/master/registrar.cpp e4b0b3968a7a50bc951717160769daf2c1850b65 
>   src/tests/registrar_tests.cpp 917a470f326523fbf11e245f4156fc8ce1d974d5 
> 
> Diff: https://reviews.apache.org/r/20981/diff/
> 
> 
> Testing
> -------
> 
> Added a test and improved the existing tests.
> 
> 
> Thanks,
> 
> Ben Mahler
> 
>


Re: Review Request 20981: Updated the Registrar to abort permanently upon encountering a Failure.

Posted by Vinod Kone <vi...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/20981/#review41975
-----------------------------------------------------------



src/master/registrar.cpp
<https://reviews.apache.org/r/20981/#comment75674>

    No check for error here?



src/master/registrar.cpp
<https://reviews.apache.org/r/20981/#comment75660>

    How about handling error condition in all the RegistrarProcess methods? If a method cannot be invoked when in error it should've a check.


- Vinod Kone


On May 1, 2014, 9:52 p.m., Ben Mahler wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/20981/
> -----------------------------------------------------------
> 
> (Updated May 1, 2014, 9:52 p.m.)
> 
> 
> Review request for mesos, Benjamin Hindman and Vinod Kone.
> 
> 
> Bugs: MESOS-1274
>     https://issues.apache.org/jira/browse/MESOS-1274
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> It's possible for a backed-up master (many items in its queue) to have many operations enqueued in the Registrar.
> 
> In this event, the Master won't commit suicide until the initial failure is processed. However, in the interim, subsequent operations are potentially being performed against the Registrar. This could lead to fighting between Masters if a "demoted" Master re-attempts to acquire log-leadership! This scenario can occur if the "demoted" master has a large queue and the demotion event is towards the back of the Master's queue.
> 
> It would be preferable to ensure that after losing log leadership, the "demoted" master does not try to re-acquire log leadership and write to the log.
> 
> This is the motivation for this patch.
> 
> 
> Diffs
> -----
> 
>   src/master/registrar.cpp e4b0b3968a7a50bc951717160769daf2c1850b65 
>   src/tests/registrar_tests.cpp 917a470f326523fbf11e245f4156fc8ce1d974d5 
> 
> Diff: https://reviews.apache.org/r/20981/diff/
> 
> 
> Testing
> -------
> 
> Added a test and improved the existing tests.
> 
> 
> Thanks,
> 
> Ben Mahler
> 
>


Re: Review Request 20981: Updated the Registrar to abort permanently upon encountering a Failure.

Posted by Ben Mahler <be...@gmail.com>.

> On May 1, 2014, 10:01 p.m., Dominic Hamon wrote:
> > src/master/registrar.cpp, line 333
> > <https://reviews.apache.org/r/20981/diff/1/?file=573051#file573051line333>
> >
> >     worth adding a gauge for this size?

I think so, was planning to take that up in a separate review. :)


> On May 1, 2014, 10:01 p.m., Dominic Hamon wrote:
> > src/master/registrar.cpp, line 451
> > <https://reviews.apache.org/r/20981/diff/1/?file=573051#file573051line451>
> >
> >     can you use CHECK_READY here instead? it's almost duplicating the logic though it looks like you have some extra cases.

Hm, I don't understand this comment, can you clarify? Maybe provide a code snippet?


- Ben


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/20981/#review41962
-----------------------------------------------------------


On May 1, 2014, 9:52 p.m., Ben Mahler wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/20981/
> -----------------------------------------------------------
> 
> (Updated May 1, 2014, 9:52 p.m.)
> 
> 
> Review request for mesos, Benjamin Hindman and Vinod Kone.
> 
> 
> Bugs: MESOS-1274
>     https://issues.apache.org/jira/browse/MESOS-1274
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> It's possible for a backed-up master (many items in its queue) to have many operations enqueued in the Registrar.
> 
> In this event, the Master won't commit suicide until the initial failure is processed. However, in the interim, subsequent operations are potentially being performed against the Registrar. This could lead to fighting between Masters if a "demoted" Master re-attempts to acquire log-leadership! This scenario can occur if the "demoted" master has a large queue and the demotion event is towards the back of the Master's queue.
> 
> It would be preferable to ensure that after losing log leadership, the "demoted" master does not try to re-acquire log leadership and write to the log.
> 
> This is the motivation for this patch.
> 
> 
> Diffs
> -----
> 
>   src/master/registrar.cpp e4b0b3968a7a50bc951717160769daf2c1850b65 
>   src/tests/registrar_tests.cpp 917a470f326523fbf11e245f4156fc8ce1d974d5 
> 
> Diff: https://reviews.apache.org/r/20981/diff/
> 
> 
> Testing
> -------
> 
> Added a test and improved the existing tests.
> 
> 
> Thanks,
> 
> Ben Mahler
> 
>


Re: Review Request 20981: Updated the Registrar to abort permanently upon encountering a Failure.

Posted by Dominic Hamon <dh...@twopensource.com>.

> On May 1, 2014, 3:01 p.m., Dominic Hamon wrote:
> > src/master/registrar.cpp, line 333
> > <https://reviews.apache.org/r/20981/diff/1/?file=573051#file573051line333>
> >
> >     worth adding a gauge for this size?
> 
> Ben Mahler wrote:
>     I think so, was planning to take that up in a separate review. :)

+1


> On May 1, 2014, 3:01 p.m., Dominic Hamon wrote:
> > src/master/registrar.cpp, line 451
> > <https://reviews.apache.org/r/20981/diff/1/?file=573051#file573051line451>
> >
> >     can you use CHECK_READY here instead? it's almost duplicating the logic though it looks like you have some extra cases.
> 
> Ben Mahler wrote:
>     Hm, I don't understand this comment, can you clarify? Maybe provide a code snippet?

CHECK_READY on store will abort with a reasonable message depending on the Future's state. While you have some extra conditions that would need to be added in a separate check (CHECK_SOME on store.get()?) I don't think my comment applies anyway as 'abort' in this context is actually calling LOG(ERROR) and fail. I expected 'abort' in this context to be calling, well, abort!

So ignore my comment about using CHECK_READY, but maybe consider a different name for abort?


- Dominic


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/20981/#review41962
-----------------------------------------------------------


On May 1, 2014, 6:09 p.m., Ben Mahler wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/20981/
> -----------------------------------------------------------
> 
> (Updated May 1, 2014, 6:09 p.m.)
> 
> 
> Review request for mesos, Benjamin Hindman and Vinod Kone.
> 
> 
> Bugs: MESOS-1274
>     https://issues.apache.org/jira/browse/MESOS-1274
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> It's possible for a backed-up master (many items in its queue) to have many operations enqueued in the Registrar.
> 
> In this event, the Master won't commit suicide until the initial failure is processed. However, in the interim, subsequent operations are potentially being performed against the Registrar. This could lead to fighting between Masters if a "demoted" Master re-attempts to acquire log-leadership! This scenario can occur if the "demoted" master has a large queue and the demotion event is towards the back of the Master's queue.
> 
> It would be preferable to ensure that after losing log leadership, the "demoted" master does not try to re-acquire log leadership and write to the log.
> 
> This is the motivation for this patch.
> 
> 
> Diffs
> -----
> 
>   src/master/registrar.cpp fecc314df04c552212522168f7a5a17b77482e34 
>   src/tests/registrar_tests.cpp 917a470f326523fbf11e245f4156fc8ce1d974d5 
> 
> Diff: https://reviews.apache.org/r/20981/diff/
> 
> 
> Testing
> -------
> 
> Added a test and improved the existing tests.
> 
> 
> Thanks,
> 
> Ben Mahler
> 
>


Re: Review Request 20981: Updated the Registrar to abort permanently upon encountering a Failure.

Posted by Dominic Hamon <dh...@twopensource.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/20981/#review41962
-----------------------------------------------------------


drive-by low priority thoughts:


src/master/registrar.cpp
<https://reviews.apache.org/r/20981/#comment75630>

    worth adding a gauge for this size?



src/master/registrar.cpp
<https://reviews.apache.org/r/20981/#comment75632>

    can you use CHECK_READY here instead? it's almost duplicating the logic though it looks like you have some extra cases.


- Dominic Hamon


On May 1, 2014, 2:52 p.m., Ben Mahler wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/20981/
> -----------------------------------------------------------
> 
> (Updated May 1, 2014, 2:52 p.m.)
> 
> 
> Review request for mesos, Benjamin Hindman and Vinod Kone.
> 
> 
> Bugs: MESOS-1274
>     https://issues.apache.org/jira/browse/MESOS-1274
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> It's possible for a backed-up master (many items in its queue) to have many operations enqueued in the Registrar.
> 
> In this event, the Master won't commit suicide until the initial failure is processed. However, in the interim, subsequent operations are potentially being performed against the Registrar. This could lead to fighting between Masters if a "demoted" Master re-attempts to acquire log-leadership! This scenario can occur if the "demoted" master has a large queue and the demotion event is towards the back of the Master's queue.
> 
> It would be preferable to ensure that after losing log leadership, the "demoted" master does not try to re-acquire log leadership and write to the log.
> 
> This is the motivation for this patch.
> 
> 
> Diffs
> -----
> 
>   src/master/registrar.cpp e4b0b3968a7a50bc951717160769daf2c1850b65 
>   src/tests/registrar_tests.cpp 917a470f326523fbf11e245f4156fc8ce1d974d5 
> 
> Diff: https://reviews.apache.org/r/20981/diff/
> 
> 
> Testing
> -------
> 
> Added a test and improved the existing tests.
> 
> 
> Thanks,
> 
> Ben Mahler
> 
>


Re: Review Request 20981: Updated the Registrar to abort permanently upon encountering a Failure.

Posted by Vinod Kone <vi...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/20981/#review42003
-----------------------------------------------------------

Ship it!


Ship It!

- Vinod Kone


On May 2, 2014, 1:09 a.m., Ben Mahler wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/20981/
> -----------------------------------------------------------
> 
> (Updated May 2, 2014, 1:09 a.m.)
> 
> 
> Review request for mesos, Benjamin Hindman and Vinod Kone.
> 
> 
> Bugs: MESOS-1274
>     https://issues.apache.org/jira/browse/MESOS-1274
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> It's possible for a backed-up master (many items in its queue) to have many operations enqueued in the Registrar.
> 
> In this event, the Master won't commit suicide until the initial failure is processed. However, in the interim, subsequent operations are potentially being performed against the Registrar. This could lead to fighting between Masters if a "demoted" Master re-attempts to acquire log-leadership! This scenario can occur if the "demoted" master has a large queue and the demotion event is towards the back of the Master's queue.
> 
> It would be preferable to ensure that after losing log leadership, the "demoted" master does not try to re-acquire log leadership and write to the log.
> 
> This is the motivation for this patch.
> 
> 
> Diffs
> -----
> 
>   src/master/registrar.cpp fecc314df04c552212522168f7a5a17b77482e34 
>   src/tests/registrar_tests.cpp 917a470f326523fbf11e245f4156fc8ce1d974d5 
> 
> Diff: https://reviews.apache.org/r/20981/diff/
> 
> 
> Testing
> -------
> 
> Added a test and improved the existing tests.
> 
> 
> Thanks,
> 
> Ben Mahler
> 
>


Re: Review Request 20981: Updated the Registrar to abort permanently upon encountering a Failure.

Posted by Ben Mahler <be...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/20981/
-----------------------------------------------------------

(Updated May 2, 2014, 1:09 a.m.)


Review request for mesos, Benjamin Hindman and Vinod Kone.


Changes
-------

Vinod's review: Added a CHECK to ensure we do not perform additional State::store calls once we're in the "error" state.


Bugs: MESOS-1274
    https://issues.apache.org/jira/browse/MESOS-1274


Repository: mesos-git


Description
-------

It's possible for a backed-up master (many items in its queue) to have many operations enqueued in the Registrar.

In this event, the Master won't commit suicide until the initial failure is processed. However, in the interim, subsequent operations are potentially being performed against the Registrar. This could lead to fighting between Masters if a "demoted" Master re-attempts to acquire log-leadership! This scenario can occur if the "demoted" master has a large queue and the demotion event is towards the back of the Master's queue.

It would be preferable to ensure that after losing log leadership, the "demoted" master does not try to re-acquire log leadership and write to the log.

This is the motivation for this patch.


Diffs (updated)
-----

  src/master/registrar.cpp fecc314df04c552212522168f7a5a17b77482e34 
  src/tests/registrar_tests.cpp 917a470f326523fbf11e245f4156fc8ce1d974d5 

Diff: https://reviews.apache.org/r/20981/diff/


Testing
-------

Added a test and improved the existing tests.


Thanks,

Ben Mahler