You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@mesos.apache.org by Vinod Kone <vi...@gmail.com> on 2014/10/14 20:03:15 UTC

Review Request 26699: Updated slave re-registration to send unacknowledged task states.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/26699/
-----------------------------------------------------------

Review request for mesos, Adam B, Ben Mahler, and Niklas Nielsen.


Bugs: MESOS-1799 and MESOS-1817
    https://issues.apache.org/jira/browse/MESOS-1799
    https://issues.apache.org/jira/browse/MESOS-1817


Repository: mesos-git


Description
-------

Slave re-registration now sends both the latest state and unacknowledged state to the master.


Diffs
-----

  src/slave/slave.hpp 342b09fc084c20d98d096bb129830440179c092c 
  src/slave/slave.cpp 0e342ed35e3db3b68f9f32b6cf4ace23e4a4db38 
  src/tests/fault_tolerance_tests.cpp a75910d4f486230ba3f1d8927e5f1e5fda6e287b 
  src/tests/slave_tests.cpp f585bdd20ae1af466f2c1b4d85331ac67451552f 

Diff: https://reviews.apache.org/r/26699/diff/


Testing
-------

make check

Ran new test 1000 times.


Thanks,

Vinod Kone


Re: Review Request 26699: Updated slave re-registration to send unacknowledged task states.

Posted by Vinod Kone <vi...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/26699/
-----------------------------------------------------------

(Updated Oct. 21, 2014, 10:22 p.m.)


Review request for mesos, Adam B, Ben Mahler, and Niklas Nielsen.


Changes
-------

benm's. NNFR.


Bugs: MESOS-1799 and MESOS-1817
    https://issues.apache.org/jira/browse/MESOS-1799
    https://issues.apache.org/jira/browse/MESOS-1817


Repository: mesos-git


Description
-------

Slave re-registration now sends both the latest state and unacknowledged state to the master.


Diffs (updated)
-----

  src/messages/messages.proto 0dfc1b78dcf728bbf2b6094fb6111400c7af7362 
  src/slave/slave.cpp 7b5474ae898f369fc55670cf1dbbee684bc0ae4b 
  src/tests/fault_tolerance_tests.cpp a75910d4f486230ba3f1d8927e5f1e5fda6e287b 
  src/tests/slave_tests.cpp 759670ad91c2f8afbd4b9dcb2f367f50482358e9 

Diff: https://reviews.apache.org/r/26699/diff/


Testing
-------

make check

Ran new test 1000 times.


Thanks,

Vinod Kone


Re: Review Request 26699: Updated slave re-registration to send unacknowledged task states.

Posted by Ben Mahler <be...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/26699/#review57631
-----------------------------------------------------------

Ship it!



src/slave/slave.cpp
<https://reviews.apache.org/r/26699/#comment98428>

    s/if(/if (/
    
    Doesn't the style checker catch this?


- Ben Mahler


On Oct. 20, 2014, 11:52 p.m., Vinod Kone wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/26699/
> -----------------------------------------------------------
> 
> (Updated Oct. 20, 2014, 11:52 p.m.)
> 
> 
> Review request for mesos, Adam B, Ben Mahler, and Niklas Nielsen.
> 
> 
> Bugs: MESOS-1799 and MESOS-1817
>     https://issues.apache.org/jira/browse/MESOS-1799
>     https://issues.apache.org/jira/browse/MESOS-1817
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> Slave re-registration now sends both the latest state and unacknowledged state to the master.
> 
> 
> Diffs
> -----
> 
>   src/messages/messages.proto 0dfc1b78dcf728bbf2b6094fb6111400c7af7362 
>   src/slave/slave.cpp 7b5474ae898f369fc55670cf1dbbee684bc0ae4b 
>   src/tests/fault_tolerance_tests.cpp a75910d4f486230ba3f1d8927e5f1e5fda6e287b 
>   src/tests/slave_tests.cpp 0ded545b4cc7dfa2989d072d74ab5b1d7a2a8b9c 
> 
> Diff: https://reviews.apache.org/r/26699/diff/
> 
> 
> Testing
> -------
> 
> make check
> 
> Ran new test 1000 times.
> 
> 
> Thanks,
> 
> Vinod Kone
> 
>


Re: Review Request 26699: Updated slave re-registration to send unacknowledged task states.

Posted by Vinod Kone <vi...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/26699/
-----------------------------------------------------------

(Updated Oct. 20, 2014, 11:52 p.m.)


Review request for mesos, Adam B, Ben Mahler, and Niklas Nielsen.


Changes
-------

addressed comments.


Bugs: MESOS-1799 and MESOS-1817
    https://issues.apache.org/jira/browse/MESOS-1799
    https://issues.apache.org/jira/browse/MESOS-1817


Repository: mesos-git


Description
-------

Slave re-registration now sends both the latest state and unacknowledged state to the master.


Diffs (updated)
-----

  src/messages/messages.proto 0dfc1b78dcf728bbf2b6094fb6111400c7af7362 
  src/slave/slave.cpp 7b5474ae898f369fc55670cf1dbbee684bc0ae4b 
  src/tests/fault_tolerance_tests.cpp a75910d4f486230ba3f1d8927e5f1e5fda6e287b 
  src/tests/slave_tests.cpp 0ded545b4cc7dfa2989d072d74ab5b1d7a2a8b9c 

Diff: https://reviews.apache.org/r/26699/diff/


Testing
-------

make check

Ran new test 1000 times.


Thanks,

Vinod Kone


Re: Review Request 26699: Updated slave re-registration to send unacknowledged task states.

Posted by Vinod Kone <vi...@gmail.com>.

> On Oct. 17, 2014, 6:58 p.m., Ben Mahler wrote:
> > src/messages/messages.proto, line 55
> > <https://reviews.apache.org/r/26699/diff/3/?file=723857#file723857line55>
> >
> >     Wonder if it's time for a UUID wrapper message type akin to what we did with all of our _ID types..
> >     
> >     Just putting it out there as potential TODO material. :)

great idea. added TODO.


> On Oct. 17, 2014, 6:58 p.m., Ben Mahler wrote:
> > src/slave/slave.cpp, lines 2357-2361
> > <https://reviews.apache.org/r/26699/diff/3/?file=723858#file723858line2357>
> >
> >     Could you add a little comment about where we're looking here? In particular, why queued tasks and completed tasks are ignored.
> >     
> >     Or, should there be a CHECK for no queued task matching this?

Added a comment. Can't do a CHECK because updates are sent by executors.


- Vinod


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/26699/#review57185
-----------------------------------------------------------


On Oct. 17, 2014, 12:26 a.m., Vinod Kone wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/26699/
> -----------------------------------------------------------
> 
> (Updated Oct. 17, 2014, 12:26 a.m.)
> 
> 
> Review request for mesos, Adam B, Ben Mahler, and Niklas Nielsen.
> 
> 
> Bugs: MESOS-1799 and MESOS-1817
>     https://issues.apache.org/jira/browse/MESOS-1799
>     https://issues.apache.org/jira/browse/MESOS-1817
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> Slave re-registration now sends both the latest state and unacknowledged state to the master.
> 
> 
> Diffs
> -----
> 
>   src/messages/messages.proto 8de7f9699f43aa2780f4a39fed087abcf5e5af99 
>   src/slave/slave.cpp 0e342ed35e3db3b68f9f32b6cf4ace23e4a4db38 
>   src/tests/fault_tolerance_tests.cpp a75910d4f486230ba3f1d8927e5f1e5fda6e287b 
>   src/tests/slave_tests.cpp f585bdd20ae1af466f2c1b4d85331ac67451552f 
> 
> Diff: https://reviews.apache.org/r/26699/diff/
> 
> 
> Testing
> -------
> 
> make check
> 
> Ran new test 1000 times.
> 
> 
> Thanks,
> 
> Vinod Kone
> 
>


Re: Review Request 26699: Updated slave re-registration to send unacknowledged task states.

Posted by Ben Mahler <be...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/26699/#review57185
-----------------------------------------------------------


Looks good, mainly just comments about adding more comments for myself and for posterity. :)


src/messages/messages.proto
<https://reviews.apache.org/r/26699/#comment97711>

    Wonder if it's time for a UUID wrapper message type akin to what we did with all of our _ID types..
    
    Just putting it out there as potential TODO material. :)



src/slave/slave.cpp
<https://reviews.apache.org/r/26699/#comment97715>

    I intentionally didn't include newlines here because I wanted to capture this as one small logical block of code (add all the tasks) and newlines were not used below for completed tasks either (granted that one does look less readable) :)



src/slave/slave.cpp
<https://reviews.apache.org/r/26699/#comment97718>

    Hm.. could this note expand on how it's possible for them not to exist? Needs some non-local reasoning? Is it due to recovery?



src/slave/slave.cpp
<https://reviews.apache.org/r/26699/#comment97721>

    Maybe a newline after this? A bit dense IMO :)



src/slave/slave.cpp
<https://reviews.apache.org/r/26699/#comment97722>

    Could you add a little comment about where we're looking here? In particular, why queued tasks and completed tasks are ignored.
    
    Or, should there be a CHECK for no queued task matching this?



src/slave/slave.cpp
<https://reviews.apache.org/r/26699/#comment97720>

    Since the goal of this block is to set the unacked state, could this comment be moved above the note?
    
    The note threw me off a bit because I initially associated it with the _goal_ of this block of code, which is actually just to set the unacked state.
    
    The other important thing to comment on is why we're setting the state here as opposed to when we first receive the update, for posterity. Seems really non-obvious to me.



src/tests/fault_tolerance_tests.cpp
<https://reviews.apache.org/r/26699/#comment97717>

    How did this get in here? ;)


- Ben Mahler


On Oct. 17, 2014, 12:26 a.m., Vinod Kone wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/26699/
> -----------------------------------------------------------
> 
> (Updated Oct. 17, 2014, 12:26 a.m.)
> 
> 
> Review request for mesos, Adam B, Ben Mahler, and Niklas Nielsen.
> 
> 
> Bugs: MESOS-1799 and MESOS-1817
>     https://issues.apache.org/jira/browse/MESOS-1799
>     https://issues.apache.org/jira/browse/MESOS-1817
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> Slave re-registration now sends both the latest state and unacknowledged state to the master.
> 
> 
> Diffs
> -----
> 
>   src/messages/messages.proto 8de7f9699f43aa2780f4a39fed087abcf5e5af99 
>   src/slave/slave.cpp 0e342ed35e3db3b68f9f32b6cf4ace23e4a4db38 
>   src/tests/fault_tolerance_tests.cpp a75910d4f486230ba3f1d8927e5f1e5fda6e287b 
>   src/tests/slave_tests.cpp f585bdd20ae1af466f2c1b4d85331ac67451552f 
> 
> Diff: https://reviews.apache.org/r/26699/diff/
> 
> 
> Testing
> -------
> 
> make check
> 
> Ran new test 1000 times.
> 
> 
> Thanks,
> 
> Vinod Kone
> 
>


Re: Review Request 26699: Updated slave re-registration to send unacknowledged task states.

Posted by Vinod Kone <vi...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/26699/
-----------------------------------------------------------

(Updated Oct. 17, 2014, 12:26 a.m.)


Review request for mesos, Adam B, Ben Mahler, and Niklas Nielsen.


Changes
-------

Updated the semantics per the latest discussion.


Bugs: MESOS-1799 and MESOS-1817
    https://issues.apache.org/jira/browse/MESOS-1799
    https://issues.apache.org/jira/browse/MESOS-1817


Repository: mesos-git


Description
-------

Slave re-registration now sends both the latest state and unacknowledged state to the master.


Diffs (updated)
-----

  src/messages/messages.proto 8de7f9699f43aa2780f4a39fed087abcf5e5af99 
  src/slave/slave.cpp 0e342ed35e3db3b68f9f32b6cf4ace23e4a4db38 
  src/tests/fault_tolerance_tests.cpp a75910d4f486230ba3f1d8927e5f1e5fda6e287b 
  src/tests/slave_tests.cpp f585bdd20ae1af466f2c1b4d85331ac67451552f 

Diff: https://reviews.apache.org/r/26699/diff/


Testing
-------

make check

Ran new test 1000 times.


Thanks,

Vinod Kone


Re: Review Request 26699: Updated slave re-registration to send unacknowledged task states.

Posted by Vinod Kone <vi...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/26699/
-----------------------------------------------------------

(Updated Oct. 15, 2014, 9:29 p.m.)


Review request for mesos, Adam B, Ben Mahler, and Niklas Nielsen.


Changes
-------

Addressed comments.


Bugs: MESOS-1799 and MESOS-1817
    https://issues.apache.org/jira/browse/MESOS-1799
    https://issues.apache.org/jira/browse/MESOS-1817


Repository: mesos-git


Description
-------

Slave re-registration now sends both the latest state and unacknowledged state to the master.


Diffs (updated)
-----

  src/slave/slave.hpp 342b09fc084c20d98d096bb129830440179c092c 
  src/slave/slave.cpp 0e342ed35e3db3b68f9f32b6cf4ace23e4a4db38 
  src/tests/fault_tolerance_tests.cpp a75910d4f486230ba3f1d8927e5f1e5fda6e287b 
  src/tests/slave_tests.cpp f585bdd20ae1af466f2c1b4d85331ac67451552f 

Diff: https://reviews.apache.org/r/26699/diff/


Testing
-------

make check

Ran new test 1000 times.


Thanks,

Vinod Kone


Re: Review Request 26699: Updated slave re-registration to send unacknowledged task states.

Posted by Vinod Kone <vi...@gmail.com>.

> On Oct. 15, 2014, 6:26 p.m., Niklas Nielsen wrote:
> > src/slave/slave.cpp, line 1014
> > <https://reviews.apache.org/r/26699/diff/1/?file=720970#file720970line1014>
> >
> >     What are your guarantees that task_id() is in the states map? Maybe guard it?

it is inside a "if states contains task id" condition. am i missing something?


- Vinod


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/26699/#review56758
-----------------------------------------------------------


On Oct. 14, 2014, 6:03 p.m., Vinod Kone wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/26699/
> -----------------------------------------------------------
> 
> (Updated Oct. 14, 2014, 6:03 p.m.)
> 
> 
> Review request for mesos, Adam B, Ben Mahler, and Niklas Nielsen.
> 
> 
> Bugs: MESOS-1799 and MESOS-1817
>     https://issues.apache.org/jira/browse/MESOS-1799
>     https://issues.apache.org/jira/browse/MESOS-1817
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> Slave re-registration now sends both the latest state and unacknowledged state to the master.
> 
> 
> Diffs
> -----
> 
>   src/slave/slave.hpp 342b09fc084c20d98d096bb129830440179c092c 
>   src/slave/slave.cpp 0e342ed35e3db3b68f9f32b6cf4ace23e4a4db38 
>   src/tests/fault_tolerance_tests.cpp a75910d4f486230ba3f1d8927e5f1e5fda6e287b 
>   src/tests/slave_tests.cpp f585bdd20ae1af466f2c1b4d85331ac67451552f 
> 
> Diff: https://reviews.apache.org/r/26699/diff/
> 
> 
> Testing
> -------
> 
> make check
> 
> Ran new test 1000 times.
> 
> 
> Thanks,
> 
> Vinod Kone
> 
>


Re: Review Request 26699: Updated slave re-registration to send unacknowledged task states.

Posted by Niklas Nielsen <ni...@qni.dk>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/26699/#review56758
-----------------------------------------------------------



src/slave/slave.cpp
<https://reviews.apache.org/r/26699/#comment97147>

    What are your guarantees that task_id() is in the states map? Maybe guard it?


- Niklas Nielsen


On Oct. 14, 2014, 11:03 a.m., Vinod Kone wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/26699/
> -----------------------------------------------------------
> 
> (Updated Oct. 14, 2014, 11:03 a.m.)
> 
> 
> Review request for mesos, Adam B, Ben Mahler, and Niklas Nielsen.
> 
> 
> Bugs: MESOS-1799 and MESOS-1817
>     https://issues.apache.org/jira/browse/MESOS-1799
>     https://issues.apache.org/jira/browse/MESOS-1817
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> Slave re-registration now sends both the latest state and unacknowledged state to the master.
> 
> 
> Diffs
> -----
> 
>   src/slave/slave.hpp 342b09fc084c20d98d096bb129830440179c092c 
>   src/slave/slave.cpp 0e342ed35e3db3b68f9f32b6cf4ace23e4a4db38 
>   src/tests/fault_tolerance_tests.cpp a75910d4f486230ba3f1d8927e5f1e5fda6e287b 
>   src/tests/slave_tests.cpp f585bdd20ae1af466f2c1b4d85331ac67451552f 
> 
> Diff: https://reviews.apache.org/r/26699/diff/
> 
> 
> Testing
> -------
> 
> make check
> 
> Ran new test 1000 times.
> 
> 
> Thanks,
> 
> Vinod Kone
> 
>


Re: Review Request 26699: Updated slave re-registration to send unacknowledged task states.

Posted by Vinod Kone <vi...@gmail.com>.

> On Oct. 15, 2014, 4:03 a.m., Adam B wrote:
> > src/slave/slave.cpp, lines 938-948
> > <https://reviews.apache.org/r/26699/diff/1/?file=720970#file720970line938>
> >
> >     Couldn't the Slave and the SUM get out of sync here? Right now, the SUM will flush its pending status updates as soon as a new master is detected.
> >     I'm imagining a scenario where the SUM is flushing status updates and the slave handles a status ACK interleaved with a slave re-registration delivering stale or out-of-sync task states.
> >     Wouldn't it just be better if the SUM didn't flush until after the slave has successfully re-registered?

Definitely thought about this race. 

Yes, it would be better if SUM did the flush after re-registration but I think it is still a race because re-registration could happen due to ZK blips where updates and acks are in flight.

I added a comment on why it is safe. Let me know if you still have concerns.


> On Oct. 15, 2014, 4:03 a.m., Adam B wrote:
> > src/tests/slave_tests.cpp, lines 1088-1089
> > <https://reviews.apache.org/r/26699/diff/1/?file=720972#file720972line1088>
> >
> >     Verify that it's actually a TASK_RUNNING?

I typically only test for things that the test is verifying, to avoid bloating the test.


- Vinod


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/26699/#review56604
-----------------------------------------------------------


On Oct. 14, 2014, 6:03 p.m., Vinod Kone wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/26699/
> -----------------------------------------------------------
> 
> (Updated Oct. 14, 2014, 6:03 p.m.)
> 
> 
> Review request for mesos, Adam B, Ben Mahler, and Niklas Nielsen.
> 
> 
> Bugs: MESOS-1799 and MESOS-1817
>     https://issues.apache.org/jira/browse/MESOS-1799
>     https://issues.apache.org/jira/browse/MESOS-1817
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> Slave re-registration now sends both the latest state and unacknowledged state to the master.
> 
> 
> Diffs
> -----
> 
>   src/slave/slave.hpp 342b09fc084c20d98d096bb129830440179c092c 
>   src/slave/slave.cpp 0e342ed35e3db3b68f9f32b6cf4ace23e4a4db38 
>   src/tests/fault_tolerance_tests.cpp a75910d4f486230ba3f1d8927e5f1e5fda6e287b 
>   src/tests/slave_tests.cpp f585bdd20ae1af466f2c1b4d85331ac67451552f 
> 
> Diff: https://reviews.apache.org/r/26699/diff/
> 
> 
> Testing
> -------
> 
> make check
> 
> Ran new test 1000 times.
> 
> 
> Thanks,
> 
> Vinod Kone
> 
>


Re: Review Request 26699: Updated slave re-registration to send unacknowledged task states.

Posted by Vinod Kone <vi...@gmail.com>.

> On Oct. 15, 2014, 4:03 a.m., Adam B wrote:
> > src/slave/slave.cpp, lines 938-948
> > <https://reviews.apache.org/r/26699/diff/1/?file=720970#file720970line938>
> >
> >     Couldn't the Slave and the SUM get out of sync here? Right now, the SUM will flush its pending status updates as soon as a new master is detected.
> >     I'm imagining a scenario where the SUM is flushing status updates and the slave handles a status ACK interleaved with a slave re-registration delivering stale or out-of-sync task states.
> >     Wouldn't it just be better if the SUM didn't flush until after the slave has successfully re-registered?
> 
> Vinod Kone wrote:
>     Definitely thought about this race. 
>     
>     Yes, it would be better if SUM did the flush after re-registration but I think it is still a race because re-registration could happen due to ZK blips where updates and acks are in flight.
>     
>     I added a comment on why it is safe. Let me know if you still have concerns.

Actually, after thinking more about this and discussing with BenM, there is fundamentally still a race between slave sending re-registered message and SUM sending its update. To fix this, I will make couple changes.
1) Instead of SUM directly sending updates to the master, it will send to the slave which will forward it to the master.
2) Slave updates the latest state of the task when it gets an update from the executor and unacknowledged state when it gets an update from the SUM.
3) The latest state in StatusUpdate will be set by the slave (instead of SUM) before forwarding it to the master.

These changes should guarantee that re-registration and update messages are always in sync w.r.t to the latest and unacknowledged state of a task. As a side benefit of 1), updates won't be sent to the master when slave is in the middle of re-registration.

Thoughts?


- Vinod


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/26699/#review56604
-----------------------------------------------------------


On Oct. 15, 2014, 9:29 p.m., Vinod Kone wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/26699/
> -----------------------------------------------------------
> 
> (Updated Oct. 15, 2014, 9:29 p.m.)
> 
> 
> Review request for mesos, Adam B, Ben Mahler, and Niklas Nielsen.
> 
> 
> Bugs: MESOS-1799 and MESOS-1817
>     https://issues.apache.org/jira/browse/MESOS-1799
>     https://issues.apache.org/jira/browse/MESOS-1817
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> Slave re-registration now sends both the latest state and unacknowledged state to the master.
> 
> 
> Diffs
> -----
> 
>   src/slave/slave.hpp 342b09fc084c20d98d096bb129830440179c092c 
>   src/slave/slave.cpp 0e342ed35e3db3b68f9f32b6cf4ace23e4a4db38 
>   src/tests/fault_tolerance_tests.cpp a75910d4f486230ba3f1d8927e5f1e5fda6e287b 
>   src/tests/slave_tests.cpp f585bdd20ae1af466f2c1b4d85331ac67451552f 
> 
> Diff: https://reviews.apache.org/r/26699/diff/
> 
> 
> Testing
> -------
> 
> make check
> 
> Ran new test 1000 times.
> 
> 
> Thanks,
> 
> Vinod Kone
> 
>


Re: Review Request 26699: Updated slave re-registration to send unacknowledged task states.

Posted by Adam B <ad...@mesosphere.io>.

> On Oct. 14, 2014, 9:03 p.m., Adam B wrote:
> > src/slave/slave.cpp, lines 938-948
> > <https://reviews.apache.org/r/26699/diff/1/?file=720970#file720970line938>
> >
> >     Couldn't the Slave and the SUM get out of sync here? Right now, the SUM will flush its pending status updates as soon as a new master is detected.
> >     I'm imagining a scenario where the SUM is flushing status updates and the slave handles a status ACK interleaved with a slave re-registration delivering stale or out-of-sync task states.
> >     Wouldn't it just be better if the SUM didn't flush until after the slave has successfully re-registered?
> 
> Vinod Kone wrote:
>     Definitely thought about this race. 
>     
>     Yes, it would be better if SUM did the flush after re-registration but I think it is still a race because re-registration could happen due to ZK blips where updates and acks are in flight.
>     
>     I added a comment on why it is safe. Let me know if you still have concerns.
> 
> Vinod Kone wrote:
>     Actually, after thinking more about this and discussing with BenM, there is fundamentally still a race between slave sending re-registered message and SUM sending its update. To fix this, I will make couple changes.
>     1) Instead of SUM directly sending updates to the master, it will send to the slave which will forward it to the master.
>     2) Slave updates the latest state of the task when it gets an update from the executor and unacknowledged state when it gets an update from the SUM.
>     3) The latest state in StatusUpdate will be set by the slave (instead of SUM) before forwarding it to the master.
>     
>     These changes should guarantee that re-registration and update messages are always in sync w.r.t to the latest and unacknowledged state of a task. As a side benefit of 1), updates won't be sent to the master when slave is in the middle of re-registration.
>     
>     Thoughts?

1) I like the design idea: clear API boundaries. Funnel all communication through the slave, and internal actor processes like SUM won't actually communicate outside the slave node.
2,3) I wonder if maybe the slave should update the latest state of the task when it acks the update from the executor, rather than on first receipt, in case we need to ensure that state has been checkpointed? Really just a matter of where in statusUpdate (or its continuations) you insert the line to update the task's latest state.
I'll be able to reason more clearly about it after seeing it in code.


- Adam


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/26699/#review56604
-----------------------------------------------------------


On Oct. 15, 2014, 2:29 p.m., Vinod Kone wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/26699/
> -----------------------------------------------------------
> 
> (Updated Oct. 15, 2014, 2:29 p.m.)
> 
> 
> Review request for mesos, Adam B, Ben Mahler, and Niklas Nielsen.
> 
> 
> Bugs: MESOS-1799 and MESOS-1817
>     https://issues.apache.org/jira/browse/MESOS-1799
>     https://issues.apache.org/jira/browse/MESOS-1817
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> Slave re-registration now sends both the latest state and unacknowledged state to the master.
> 
> 
> Diffs
> -----
> 
>   src/slave/slave.hpp 342b09fc084c20d98d096bb129830440179c092c 
>   src/slave/slave.cpp 0e342ed35e3db3b68f9f32b6cf4ace23e4a4db38 
>   src/tests/fault_tolerance_tests.cpp a75910d4f486230ba3f1d8927e5f1e5fda6e287b 
>   src/tests/slave_tests.cpp f585bdd20ae1af466f2c1b4d85331ac67451552f 
> 
> Diff: https://reviews.apache.org/r/26699/diff/
> 
> 
> Testing
> -------
> 
> make check
> 
> Ran new test 1000 times.
> 
> 
> Thanks,
> 
> Vinod Kone
> 
>


Re: Review Request 26699: Updated slave re-registration to send unacknowledged task states.

Posted by Vinod Kone <vi...@gmail.com>.

> On Oct. 15, 2014, 4:03 a.m., Adam B wrote:
> > src/slave/slave.cpp, lines 938-948
> > <https://reviews.apache.org/r/26699/diff/1/?file=720970#file720970line938>
> >
> >     Couldn't the Slave and the SUM get out of sync here? Right now, the SUM will flush its pending status updates as soon as a new master is detected.
> >     I'm imagining a scenario where the SUM is flushing status updates and the slave handles a status ACK interleaved with a slave re-registration delivering stale or out-of-sync task states.
> >     Wouldn't it just be better if the SUM didn't flush until after the slave has successfully re-registered?
> 
> Vinod Kone wrote:
>     Definitely thought about this race. 
>     
>     Yes, it would be better if SUM did the flush after re-registration but I think it is still a race because re-registration could happen due to ZK blips where updates and acks are in flight.
>     
>     I added a comment on why it is safe. Let me know if you still have concerns.
> 
> Vinod Kone wrote:
>     Actually, after thinking more about this and discussing with BenM, there is fundamentally still a race between slave sending re-registered message and SUM sending its update. To fix this, I will make couple changes.
>     1) Instead of SUM directly sending updates to the master, it will send to the slave which will forward it to the master.
>     2) Slave updates the latest state of the task when it gets an update from the executor and unacknowledged state when it gets an update from the SUM.
>     3) The latest state in StatusUpdate will be set by the slave (instead of SUM) before forwarding it to the master.
>     
>     These changes should guarantee that re-registration and update messages are always in sync w.r.t to the latest and unacknowledged state of a task. As a side benefit of 1), updates won't be sent to the master when slave is in the middle of re-registration.
>     
>     Thoughts?
> 
> Adam B wrote:
>     1) I like the design idea: clear API boundaries. Funnel all communication through the slave, and internal actor processes like SUM won't actually communicate outside the slave node.
>     2,3) I wonder if maybe the slave should update the latest state of the task when it acks the update from the executor, rather than on first receipt, in case we need to ensure that state has been checkpointed? Really just a matter of where in statusUpdate (or its continuations) you insert the line to update the task's latest state.
>     I'll be able to reason more clearly about it after seeing it in code.

For 2,3) I kept the old semantics for now i.e., state changes happen in statusUpdate() instead of _statusUpdate(), since I didn't want to change the containerizer update semantics.


- Vinod


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/26699/#review56604
-----------------------------------------------------------


On Oct. 17, 2014, 12:26 a.m., Vinod Kone wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/26699/
> -----------------------------------------------------------
> 
> (Updated Oct. 17, 2014, 12:26 a.m.)
> 
> 
> Review request for mesos, Adam B, Ben Mahler, and Niklas Nielsen.
> 
> 
> Bugs: MESOS-1799 and MESOS-1817
>     https://issues.apache.org/jira/browse/MESOS-1799
>     https://issues.apache.org/jira/browse/MESOS-1817
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> Slave re-registration now sends both the latest state and unacknowledged state to the master.
> 
> 
> Diffs
> -----
> 
>   src/messages/messages.proto 8de7f9699f43aa2780f4a39fed087abcf5e5af99 
>   src/slave/slave.cpp 0e342ed35e3db3b68f9f32b6cf4ace23e4a4db38 
>   src/tests/fault_tolerance_tests.cpp a75910d4f486230ba3f1d8927e5f1e5fda6e287b 
>   src/tests/slave_tests.cpp f585bdd20ae1af466f2c1b4d85331ac67451552f 
> 
> Diff: https://reviews.apache.org/r/26699/diff/
> 
> 
> Testing
> -------
> 
> make check
> 
> Ran new test 1000 times.
> 
> 
> Thanks,
> 
> Vinod Kone
> 
>


Re: Review Request 26699: Updated slave re-registration to send unacknowledged task states.

Posted by Adam B <ad...@mesosphere.io>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/26699/#review56604
-----------------------------------------------------------


Looking good. I'm glad to see some attention paid to the status update mechanism. We've run into a couple of issues in this area lately.


src/slave/slave.cpp
<https://reviews.apache.org/r/26699/#comment96990>

    Couldn't the Slave and the SUM get out of sync here? Right now, the SUM will flush its pending status updates as soon as a new master is detected.
    I'm imagining a scenario where the SUM is flushing status updates and the slave handles a status ACK interleaved with a slave re-registration delivering stale or out-of-sync task states.
    Wouldn't it just be better if the SUM didn't flush until after the slave has successfully re-registered?



src/slave/slave.cpp
<https://reviews.apache.org/r/26699/#comment96987>

    Please update the comment to reflect that a launched or terminated task will also contain the state from its oldest pending/unacknowledged status update.



src/tests/slave_tests.cpp
<https://reviews.apache.org/r/26699/#comment97039>

    Times(1) is implicit on a vanilla EXPECT_CALL



src/tests/slave_tests.cpp
<https://reviews.apache.org/r/26699/#comment97040>

    Verify that it's actually a TASK_RUNNING?


- Adam B


On Oct. 14, 2014, 11:03 a.m., Vinod Kone wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/26699/
> -----------------------------------------------------------
> 
> (Updated Oct. 14, 2014, 11:03 a.m.)
> 
> 
> Review request for mesos, Adam B, Ben Mahler, and Niklas Nielsen.
> 
> 
> Bugs: MESOS-1799 and MESOS-1817
>     https://issues.apache.org/jira/browse/MESOS-1799
>     https://issues.apache.org/jira/browse/MESOS-1817
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> Slave re-registration now sends both the latest state and unacknowledged state to the master.
> 
> 
> Diffs
> -----
> 
>   src/slave/slave.hpp 342b09fc084c20d98d096bb129830440179c092c 
>   src/slave/slave.cpp 0e342ed35e3db3b68f9f32b6cf4ace23e4a4db38 
>   src/tests/fault_tolerance_tests.cpp a75910d4f486230ba3f1d8927e5f1e5fda6e287b 
>   src/tests/slave_tests.cpp f585bdd20ae1af466f2c1b4d85331ac67451552f 
> 
> Diff: https://reviews.apache.org/r/26699/diff/
> 
> 
> Testing
> -------
> 
> make check
> 
> Ran new test 1000 times.
> 
> 
> Thanks,
> 
> Vinod Kone
> 
>