You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@mesos.apache.org by Benjamin Hindman <be...@berkeley.edu> on 2013/01/24 21:25:05 UTC

Re: Review Request: Slave Restart (Part 8): Recover status update manager

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/8736/#review15590
-----------------------------------------------------------



src/slave/flags.hpp
<https://reviews.apache.org/r/8736/#comment33753>

    I'd like to make sure we understand and clearly document the semantics of what it means to have the default '--recovery=reconnect' when the slave doesn't actually have any old slave state to reconnect to.



src/slave/slave.cpp
<https://reviews.apache.org/r/8736/#comment33754>

    s/result/state/
    
    I liked your previous name of the variable better.



src/slave/slave.cpp
<https://reviews.apache.org/r/8736/#comment33674>

    s/if(/if (/



src/slave/slave.cpp
<https://reviews.apache.org/r/8736/#comment33756>

    I do not like the semantics of an error getting returned and us deciding that we've "successfully" recovered! Considering using a Result to distinguish _no_ slave state to recover from an actual error. This needs to get done everywhere that an error is swallowed (even though it's logged).



src/slave/slave.cpp
<https://reviews.apache.org/r/8736/#comment33757>

    Invoking recover on the status update manager is asynchronous. In the event there was an error recovering, I don't think we should set 'recovered = true' so optimistically!



src/slave/status_update_manager.hpp
<https://reviews.apache.org/r/8736/#comment33758>

    s/udpates/updates/



src/slave/status_update_manager.cpp
<https://reviews.apache.org/r/8736/#comment33761>

    Is there a non-hacky way to do this? If so, can you leave a TODO that describes what you're thinking for posterity?



src/slave/status_update_manager.cpp
<https://reviews.apache.org/r/8736/#comment33760>

    s/Re-try/Retry/
    
    It's a word! ;)



src/slave/status_update_manager.cpp
<https://reviews.apache.org/r/8736/#comment33762>

    I don't like that semantics of using the "checkpointing" flag to "hack" the replaying of checkpoints. It seems like replying a stream should be a fundamental primitive, and shouldn't require a breach of isolation. Let's really separate the concerns here such that the StatusUpdateManager doesn't need to concern itself with the internals of the StatusUpdateStream's implementation.



src/slave/status_update_manager.cpp
<https://reviews.apache.org/r/8736/#comment33764>

    This variable is not a boolean (or doesn't not have a boolean operator), please don't treat it as such.



src/tests/slave_recovery_tests.cpp
<https://reviews.apache.org/r/8736/#comment33765>

    Indentation.



src/tests/slave_recovery_tests.cpp
<https://reviews.apache.org/r/8736/#comment33766>

    Not Python.



src/tests/slave_recovery_tests.cpp
<https://reviews.apache.org/r/8736/#comment33768>

    As in other places in our code, you should make the 'recover' paramater an Option to explicitly distinguish some or none argument.



src/tests/slave_recovery_tests.cpp
<https://reviews.apache.org/r/8736/#comment33769>

    Indentation, here and everywhere else!



src/tests/slave_recovery_tests.cpp
<https://reviews.apache.org/r/8736/#comment33770>

    Newline.


- Benjamin Hindman


On Dec. 25, 2012, 7:57 a.m., Vinod Kone wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/8736/
> -----------------------------------------------------------
> 
> (Updated Dec. 25, 2012, 7:57 a.m.)
> 
> 
> Review request for mesos, Benjamin Hindman and Ben Mahler.
> 
> 
> Description
> -------
> 
> Status update manager now recovers the update stream and re-sends any pending updates.
> 
> 
> Diffs
> -----
> 
>   src/master/master.cpp fbf7f0775392e3d21a6986203ed8f11729ffda8c 
>   src/slave/flags.hpp 39e57f4104ee7a1538436ebbb9493581e28c99dd 
>   src/slave/slave.cpp 9755b46f97173d6fcc9ab1fd63e0e4814b3bc018 
>   src/slave/status_update_manager.hpp PRE-CREATION 
>   src/slave/status_update_manager.cpp PRE-CREATION 
>   src/tests/slave_recovery_tests.cpp PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/8736/diff/
> 
> 
> Testing
> -------
> 
> make check
> 
> 
> Thanks,
> 
> Vinod Kone
> 
>


Re: Review Request: Slave Restart (Part 8): Recover status update manager

Posted by Vinod Kone <vi...@gmail.com>.

> On Jan. 24, 2013, 8:25 p.m., Benjamin Hindman wrote:
> > src/slave/flags.hpp, line 111
> > <https://reviews.apache.org/r/8736/diff/2/?file=243208#file243208line111>
> >
> >     I'd like to make sure we understand and clearly document the semantics of what it means to have the default '--recovery=reconnect' when the slave doesn't actually have any old slave state to reconnect to.

updated the comment. lemme if it looks ok.


> On Jan. 24, 2013, 8:25 p.m., Benjamin Hindman wrote:
> > src/slave/slave.cpp, line 1456
> > <https://reviews.apache.org/r/8736/diff/2/?file=243209#file243209line1456>
> >
> >     I do not like the semantics of an error getting returned and us deciding that we've "successfully" recovered! Considering using a Result to distinguish _no_ slave state to recover from an actual error. This needs to get done everywhere that an error is swallowed (even though it's logged).

we always exit when recovery fails now.


> On Jan. 24, 2013, 8:25 p.m., Benjamin Hindman wrote:
> > src/slave/slave.cpp, lines 1466-1468
> > <https://reviews.apache.org/r/8736/diff/2/?file=243209#file243209line1466>
> >
> >     Invoking recover on the status update manager is asynchronous. In the event there was an error recovering, I don't think we should set 'recovered = true' so optimistically!

Made 'recover' asynchronous.


> On Jan. 24, 2013, 8:25 p.m., Benjamin Hindman wrote:
> > src/slave/status_update_manager.cpp, line 175
> > <https://reviews.apache.org/r/8736/diff/2/?file=243211#file243211line175>
> >
> >     I don't like that semantics of using the "checkpointing" flag to "hack" the replaying of checkpoints. It seems like replying a stream should be a fundamental primitive, and shouldn't require a breach of isolation. Let's really separate the concerns here such that the StatusUpdateManager doesn't need to concern itself with the internals of the StatusUpdateStream's implementation.

agreed, refactored.


> On Jan. 24, 2013, 8:25 p.m., Benjamin Hindman wrote:
> > src/slave/status_update_manager.cpp, line 138
> > <https://reviews.apache.org/r/8736/diff/2/?file=243211#file243211line138>
> >
> >     Is there a non-hacky way to do this? If so, can you leave a TODO that describes what you're thinking for posterity?

“I never put off till tomorrow what I can possibly do - the day after.” -- O.W.

fixed.


- Vinod


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/8736/#review15590
-----------------------------------------------------------


On Dec. 25, 2012, 7:57 a.m., Vinod Kone wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/8736/
> -----------------------------------------------------------
> 
> (Updated Dec. 25, 2012, 7:57 a.m.)
> 
> 
> Review request for mesos, Benjamin Hindman and Ben Mahler.
> 
> 
> Description
> -------
> 
> Status update manager now recovers the update stream and re-sends any pending updates.
> 
> 
> Diffs
> -----
> 
>   src/master/master.cpp fbf7f0775392e3d21a6986203ed8f11729ffda8c 
>   src/slave/flags.hpp 39e57f4104ee7a1538436ebbb9493581e28c99dd 
>   src/slave/slave.cpp 9755b46f97173d6fcc9ab1fd63e0e4814b3bc018 
>   src/slave/status_update_manager.hpp PRE-CREATION 
>   src/slave/status_update_manager.cpp PRE-CREATION 
>   src/tests/slave_recovery_tests.cpp PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/8736/diff/
> 
> 
> Testing
> -------
> 
> make check
> 
> 
> Thanks,
> 
> Vinod Kone
> 
>