You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@mesos.apache.org by Vinod Kone <vi...@gmail.com> on 2013/04/12 23:27:06 UTC

Review Request: Fixed status update manager to explicitly handle checkpointing and non-checkpointing updates.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/10438/
-----------------------------------------------------------

Review request for mesos, Benjamin Hindman and Ben Mahler.


Description
-------

This was split off https://reviews.apache.org/r/10142/.

While I simplified statusUpdate() and StatusUpdateManager a lot, this diff introduced quite a few changes.

So, I will leave it up to you, to take another look.


Diffs
-----

  src/slave/slave.cpp 325231458a6883019436e7cc5a37f85f0f5735fa 
  src/slave/status_update_manager.hpp e6ca40c5c05c0952cf76fb1db7eff2e4270c0d24 
  src/slave/status_update_manager.cpp 044d245f370ef23ddc67fadbf7f8fe9d75dd662a 

Diff: https://reviews.apache.org/r/10438/diff/


Testing
-------

make check


Thanks,

Vinod Kone


Re: Review Request: Fixed status update manager to explicitly handle checkpointing and non-checkpointing updates.

Posted by Vinod Kone <vi...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/10438/
-----------------------------------------------------------

(Updated April 17, 2013, 6:39 a.m.)


Review request for mesos, Benjamin Hindman and Ben Mahler.


Changes
-------

comments and rebase. no need for review.


Description
-------

This was split off https://reviews.apache.org/r/10142/.

While I simplified statusUpdate() and StatusUpdateManager a lot, this diff introduced quite a few changes.

So, I will leave it up to you, to take another look.


Diffs (updated)
-----

  src/slave/slave.cpp 325231458a6883019436e7cc5a37f85f0f5735fa 
  src/slave/status_update_manager.hpp e6ca40c5c05c0952cf76fb1db7eff2e4270c0d24 
  src/slave/status_update_manager.cpp 044d245f370ef23ddc67fadbf7f8fe9d75dd662a 

Diff: https://reviews.apache.org/r/10438/diff/


Testing
-------

make check


Thanks,

Vinod Kone


Re: Review Request: Fixed status update manager to explicitly handle checkpointing and non-checkpointing updates.

Posted by Ben Mahler <be...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/10438/#review19277
-----------------------------------------------------------

Ship it!



src/slave/status_update_manager.hpp
<https://reviews.apache.org/r/10438/#comment39933>

    I'm going to defer on this, but we should come up with a better naming scheme or signatures that indicate clearly the difference between these two.



src/slave/status_update_manager.hpp
<https://reviews.apache.org/r/10438/#comment39934>

    const?


- Ben Mahler


On April 16, 2013, 12:31 a.m., Vinod Kone wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/10438/
> -----------------------------------------------------------
> 
> (Updated April 16, 2013, 12:31 a.m.)
> 
> 
> Review request for mesos, Benjamin Hindman and Ben Mahler.
> 
> 
> Description
> -------
> 
> This was split off https://reviews.apache.org/r/10142/.
> 
> While I simplified statusUpdate() and StatusUpdateManager a lot, this diff introduced quite a few changes.
> 
> So, I will leave it up to you, to take another look.
> 
> 
> Diffs
> -----
> 
>   src/slave/slave.cpp 325231458a6883019436e7cc5a37f85f0f5735fa 
>   src/slave/status_update_manager.hpp e6ca40c5c05c0952cf76fb1db7eff2e4270c0d24 
>   src/slave/status_update_manager.cpp 044d245f370ef23ddc67fadbf7f8fe9d75dd662a 
> 
> Diff: https://reviews.apache.org/r/10438/diff/
> 
> 
> Testing
> -------
> 
> make check
> 
> 
> Thanks,
> 
> Vinod Kone
> 
>


Re: Review Request: Fixed status update manager to explicitly handle checkpointing and non-checkpointing updates.

Posted by Vinod Kone <vi...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/10438/
-----------------------------------------------------------

(Updated April 16, 2013, 12:31 a.m.)


Review request for mesos, Benjamin Hindman and Ben Mahler.


Changes
-------

benm's


Description
-------

This was split off https://reviews.apache.org/r/10142/.

While I simplified statusUpdate() and StatusUpdateManager a lot, this diff introduced quite a few changes.

So, I will leave it up to you, to take another look.


Diffs (updated)
-----

  src/slave/slave.cpp 325231458a6883019436e7cc5a37f85f0f5735fa 
  src/slave/status_update_manager.hpp e6ca40c5c05c0952cf76fb1db7eff2e4270c0d24 
  src/slave/status_update_manager.cpp 044d245f370ef23ddc67fadbf7f8fe9d75dd662a 

Diff: https://reviews.apache.org/r/10438/diff/


Testing
-------

make check


Thanks,

Vinod Kone


Re: Review Request: Fixed status update manager to explicitly handle checkpointing and non-checkpointing updates.

Posted by Vinod Kone <vi...@gmail.com>.

> On April 14, 2013, 11:51 p.m., Ben Mahler wrote:
> > src/slave/slave.cpp, line 590
> > <https://reviews.apache.org/r/10438/diff/1/?file=280875#file280875line590>
> >
> >     What happened here?

fat finger. fixed. thanks!


> On April 14, 2013, 11:51 p.m., Ben Mahler wrote:
> > src/slave/slave.cpp, line 630
> > <https://reviews.apache.org/r/10438/diff/1/?file=280875#file280875line630>
> >
> >     Why not? Seems like frameworks would want a LOST update when launching a task and we can't launch it.
> >     
> >     The alternative to me seems to be to wait for the framework to terminate first before launching further tasks.
> >     
> >     Shouldn't you do one of these options?

Note that a framework is in terminating if it is shutting down. As such, there is no point in sending it further updates. Makes sense?


> On April 14, 2013, 11:51 p.m., Ben Mahler wrote:
> > src/slave/slave.cpp, line 800
> > <https://reviews.apache.org/r/10438/diff/1/?file=280875#file280875line800>
> >
> >     Based on this comment, seems like we should be sending LOST when we can't find the framework as well, how are these different cases?

Because an uknown framework most likely means the framework doesn't exist and we cannot expect any ACKS for the updates. If we send an update, the status update manager will keep retrying forever! Whereas, with an unknown executor, we still expect ACKs from the framework. Makes sense?


> On April 14, 2013, 11:51 p.m., Ben Mahler wrote:
> > src/slave/slave.cpp, line 1421
> > <https://reviews.apache.org/r/10438/diff/1/?file=280875#file280875line1421>
> >
> >     Seems like this case can wrap on the paren:
> >     
> >     dispatch(isolator,
> >                   &Isolator::resourcesChanged,
> >                   ...

done


> On April 14, 2013, 11:51 p.m., Ben Mahler wrote:
> > src/slave/slave.cpp, line 1438
> > <https://reviews.apache.org/r/10438/diff/1/?file=280875#file280875line1438>
> >
> >     One confusing thing here, is that I look at the signatures
> >     
> >     update(Update, SlaveID, ExecutorID, UUID)
> >     update(Update, SlaveID)
> >     
> >     From your comments, the first one checkpoints and reliably sends the update. The second just retries the update?
> >     
> >     This seems unintuitive, or am I missing something here?

Unintuitive in terms of naming or why we do it this way? If it is the former, then I agree. Didn't want to call the first one checkpoint() and/or the second one retry(), because that is confusing too. At least, update is complementary to acknowledgement().

I'm open to suggestions though.


> On April 14, 2013, 11:51 p.m., Ben Mahler wrote:
> > src/slave/status_update_manager.hpp, line 81
> > <https://reviews.apache.org/r/10438/diff/1/?file=280876#file280876line81>
> >
> >     Whether sounds like a boolean return value.

For me Try<Nothing> is like a bool, with an added advantage of giving the error. FWIW, I think we discussed about this earlier (x months ago) and IIRC decided to keep it.

What do you suggest? 


> On April 14, 2013, 11:51 p.m., Ben Mahler wrote:
> > src/slave/status_update_manager.hpp, line 90
> > <https://reviews.apache.org/r/10438/diff/1/?file=280876#file280876line90>
> >
> >     So.. given my earlier comment, why isn't this called retry?

I don't like retry() because 1) it is not complement to acknowledgement() and 2) doesn't tell me its an update.


> On April 14, 2013, 11:51 p.m., Ben Mahler wrote:
> > src/slave/status_update_manager.hpp, line 92
> > <https://reviews.apache.org/r/10438/diff/1/?file=280876#file280876line92>
> >
> >     Ditto.

see above.


> On April 14, 2013, 11:51 p.m., Ben Mahler wrote:
> > src/slave/slave.cpp, line 1397
> > <https://reviews.apache.org/r/10438/diff/1/?file=280875#file280875line1397>
> >
> >     Again, we send the update when we can't find the executor, but we don't send one when we can't find the framework?

see above


- Vinod


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/10438/#review19138
-----------------------------------------------------------


On April 12, 2013, 9:27 p.m., Vinod Kone wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/10438/
> -----------------------------------------------------------
> 
> (Updated April 12, 2013, 9:27 p.m.)
> 
> 
> Review request for mesos, Benjamin Hindman and Ben Mahler.
> 
> 
> Description
> -------
> 
> This was split off https://reviews.apache.org/r/10142/.
> 
> While I simplified statusUpdate() and StatusUpdateManager a lot, this diff introduced quite a few changes.
> 
> So, I will leave it up to you, to take another look.
> 
> 
> Diffs
> -----
> 
>   src/slave/slave.cpp 325231458a6883019436e7cc5a37f85f0f5735fa 
>   src/slave/status_update_manager.hpp e6ca40c5c05c0952cf76fb1db7eff2e4270c0d24 
>   src/slave/status_update_manager.cpp 044d245f370ef23ddc67fadbf7f8fe9d75dd662a 
> 
> Diff: https://reviews.apache.org/r/10438/diff/
> 
> 
> Testing
> -------
> 
> make check
> 
> 
> Thanks,
> 
> Vinod Kone
> 
>


Re: Review Request: Fixed status update manager to explicitly handle checkpointing and non-checkpointing updates.

Posted by Ben Mahler <be...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/10438/#review19138
-----------------------------------------------------------



src/slave/slave.cpp
<https://reviews.apache.org/r/10438/#comment39685>

    What happened here?



src/slave/slave.cpp
<https://reviews.apache.org/r/10438/#comment39686>

    Why not? Seems like frameworks would want a LOST update when launching a task and we can't launch it.
    
    The alternative to me seems to be to wait for the framework to terminate first before launching further tasks.
    
    Shouldn't you do one of these options?



src/slave/slave.cpp
<https://reviews.apache.org/r/10438/#comment39687>

    Based on this comment, seems like we should be sending LOST when we can't find the framework as well, how are these different cases?



src/slave/slave.cpp
<https://reviews.apache.org/r/10438/#comment39688>

    Ditto to my earlier question.



src/slave/slave.cpp
<https://reviews.apache.org/r/10438/#comment39689>

    Again, we send the update when we can't find the executor, but we don't send one when we can't find the framework?



src/slave/slave.cpp
<https://reviews.apache.org/r/10438/#comment39690>

    Any CHECKs like this should print the state so that we can see it when it fails:
    
    CHECK(....) << executor->state;
    
    Which makes me think we should explicitly number our enum members.



src/slave/slave.cpp
<https://reviews.apache.org/r/10438/#comment39691>

    Seems like this case can wrap on the paren:
    
    dispatch(isolator,
                  &Isolator::resourcesChanged,
                  ...



src/slave/slave.cpp
<https://reviews.apache.org/r/10438/#comment39692>

    One confusing thing here, is that I look at the signatures
    
    update(Update, SlaveID, ExecutorID, UUID)
    update(Update, SlaveID)
    
    From your comments, the first one checkpoints and reliably sends the update. The second just retries the update?
    
    This seems unintuitive, or am I missing something here?



src/slave/status_update_manager.hpp
<https://reviews.apache.org/r/10438/#comment39693>

    Whether sounds like a boolean return value.



src/slave/status_update_manager.hpp
<https://reviews.apache.org/r/10438/#comment39695>

    So.. given my earlier comment, why isn't this called retry?



src/slave/status_update_manager.hpp
<https://reviews.apache.org/r/10438/#comment39694>

    Ditto.



src/slave/status_update_manager.cpp
<https://reviews.apache.org/r/10438/#comment39696>

    s/checkpoint status/checkpoint value/?


- Ben Mahler


On April 12, 2013, 9:27 p.m., Vinod Kone wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/10438/
> -----------------------------------------------------------
> 
> (Updated April 12, 2013, 9:27 p.m.)
> 
> 
> Review request for mesos, Benjamin Hindman and Ben Mahler.
> 
> 
> Description
> -------
> 
> This was split off https://reviews.apache.org/r/10142/.
> 
> While I simplified statusUpdate() and StatusUpdateManager a lot, this diff introduced quite a few changes.
> 
> So, I will leave it up to you, to take another look.
> 
> 
> Diffs
> -----
> 
>   src/slave/slave.cpp 325231458a6883019436e7cc5a37f85f0f5735fa 
>   src/slave/status_update_manager.hpp e6ca40c5c05c0952cf76fb1db7eff2e4270c0d24 
>   src/slave/status_update_manager.cpp 044d245f370ef23ddc67fadbf7f8fe9d75dd662a 
> 
> Diff: https://reviews.apache.org/r/10438/diff/
> 
> 
> Testing
> -------
> 
> make check
> 
> 
> Thanks,
> 
> Vinod Kone
> 
>