You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@mesos.apache.org by Brenden Matthews <br...@diddyinc.com> on 2013/11/21 01:07:11 UTC

Review Request 15745: Fixed task reconciliation for lost tasks.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/15745/
-----------------------------------------------------------

Review request for mesos and Niklas Nielsen.


Repository: mesos-git


Description
-------

Fixed task reconciliation for lost tasks.

If a slave is known but the task cannot be found, we should assume that
the task has been lost.  It's possible that the following events
occurred:

 1) Framework disconnected from master
 2) Master terminated framework's tasks
 3) Framework reconnects to master, and (incorrectly) assumes tasks are
 still running


Diffs
-----

  src/master/master.cpp a08d01208ff7bbb878b2d50d8406efee4de86171 

Diff: https://reviews.apache.org/r/15745/diff/


Testing
-------

`make check` & tested in staging cluster.


Thanks,

Brenden Matthews

Re: Review Request 15745: Fixed some task reconciliation cases.

Posted by Brenden Matthews <br...@diddyinc.com>.


> On Nov. 21, 2013, 10:03 p.m., Ben Mahler wrote:
> > Slaves continue to retry status updates until an acknowledgement is received (the scheduler driver sends an acknowledgement once Scheduler::statusUpdate was invoked). So in Case 1 you've described, if a task transitions while a framework if failing over, the slave will continue to retry sending the status update. When the framework reconnects to the master, the retried status update would be delivered and so reconciliation of this would be unnecessary and possibly error prone:
> > 
> > Normally:
> > 
> > Framework fails over with T1 in STARTING.
> > T1 transitions from STARTING->RUNNING->FINISHED in the slave and master.
> > Framework reconnects thinking T1 is STARTING.
> > First the slave retries RUNNING and this now gets processed by the Framework. Acknowledgement is sent to the slave.
> > Then the slave retries FINISHED and this now gets processed by the Framework. Acknowledgement is sent to the slave.
> > 
> > Framework sees the following order: STARTING -> RUNNING -> FINISHED
> > 
> > Assuming reconciliation Case 1 here:
> > Framework fails over with T1 in STARTING.
> > T1 transitions from STARTING->RUNNING->FINISHED in the slave and master.
> > Framework reconnects thinking T1 is STARTING.
> > Framework reconciles and master sends FINISHED.
> > The slave retries RUNNING and this now gets processed by the Framework. Acknowledgement is sent to the slave.
> > Then the slave retries FINISHED and this now gets processed by the Framework. Acknowledgement is sent to the slave.
> > 
> > Framework sees the following order: STARTING -> FINISHED -> RUNNING -> FINISHED
> > 
> > Case 1 here can unfortunately lead to out-of-order update delivery to frameworks.

I think the latter option is still the lesser of two evils.

When the framework makes a call to reconcileTasks(), the master should always inform the framework of any differences between the framework and the master's current accounting for task states.  In the case you describe, it seems to be the expected behaviour.  I wouldn't expect the master to replay the queue of task statuses.

Alternatively, you could use the task status timestamp to replay every status update since the last status update which the framework knows about.  Thoughts?


- Brenden


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/15745/#review29249
-----------------------------------------------------------


On Nov. 21, 2013, 12:30 a.m., Brenden Matthews wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/15745/
> -----------------------------------------------------------
> 
> (Updated Nov. 21, 2013, 12:30 a.m.)
> 
> 
> Review request for mesos and Niklas Nielsen.
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> Fixed some task reconciliation cases.
> 
> Case 1:
> 
> If a slave is known but the task cannot be found, we should assume that
> the task has been lost.  It's possible that the following events
> occurred:
> 
>  1) Framework disconnected from master
>  2) Master terminated framework's tasks
>  3) Framework reconnects to master, and (incorrectly) assumes tasks are
>  still running
> 
> Case 2:
> 
> If a framework loses track of running tasks, the master should inform
> the framework of which tasks it knows to be running, in addition to any
> which have had a state change.
> 
> Review: https://reviews.apache.org/r/15745
> 
> 
> Diffs
> -----
> 
>   src/master/master.cpp a08d01208ff7bbb878b2d50d8406efee4de86171 
> 
> Diff: https://reviews.apache.org/r/15745/diff/
> 
> 
> Testing
> -------
> 
> `make check` & tested in staging cluster.
> 
> 
> Thanks,
> 
> Brenden Matthews
> 
>

Re: Review Request 15745: Fixed some task reconciliation cases.

Posted by Ben Mahler <be...@gmail.com>.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/15745/#review29249
-----------------------------------------------------------

Slaves continue to retry status updates until an acknowledgement is received (the scheduler driver sends an acknowledgement once Scheduler::statusUpdate was invoked). So in Case 1 you've described, if a task transitions while a framework if failing over, the slave will continue to retry sending the status update. When the framework reconnects to the master, the retried status update would be delivered and so reconciliation of this would be unnecessary and possibly error prone:

Normally:

Framework fails over with T1 in STARTING.
T1 transitions from STARTING->RUNNING->FINISHED in the slave and master.
Framework reconnects thinking T1 is STARTING.
First the slave retries RUNNING and this now gets processed by the Framework. Acknowledgement is sent to the slave.
Then the slave retries FINISHED and this now gets processed by the Framework. Acknowledgement is sent to the slave.

Framework sees the following order: STARTING -> RUNNING -> FINISHED

Assuming reconciliation Case 1 here:
Framework fails over with T1 in STARTING.
T1 transitions from STARTING->RUNNING->FINISHED in the slave and master.
Framework reconnects thinking T1 is STARTING.
Framework reconciles and master sends FINISHED.
The slave retries RUNNING and this now gets processed by the Framework. Acknowledgement is sent to the slave.
Then the slave retries FINISHED and this now gets processed by the Framework. Acknowledgement is sent to the slave.

Framework sees the following order: STARTING -> FINISHED -> RUNNING -> FINISHED

Case 1 here can unfortunately lead to out-of-order update delivery to frameworks.

- Ben Mahler

On Nov. 21, 2013, 12:30 a.m., Brenden Matthews wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/15745/
> -----------------------------------------------------------
> 
> (Updated Nov. 21, 2013, 12:30 a.m.)
> 
> 
> Review request for mesos and Niklas Nielsen.
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> Fixed some task reconciliation cases.
> 
> Case 1:
> 
> If a slave is known but the task cannot be found, we should assume that
> the task has been lost.  It's possible that the following events
> occurred:
> 
>  1) Framework disconnected from master
>  2) Master terminated framework's tasks
>  3) Framework reconnects to master, and (incorrectly) assumes tasks are
>  still running
> 
> Case 2:
> 
> If a framework loses track of running tasks, the master should inform
> the framework of which tasks it knows to be running, in addition to any
> which have had a state change.
> 
> Review: https://reviews.apache.org/r/15745
> 
> 
> Diffs
> -----
> 
>   src/master/master.cpp a08d01208ff7bbb878b2d50d8406efee4de86171 
> 
> Diff: https://reviews.apache.org/r/15745/diff/
> 
> 
> Testing
> -------
> 
> `make check` & tested in staging cluster.
> 
> 
> Thanks,
> 
> Brenden Matthews
> 
>

Re: Review Request 15745: Fixed some task reconciliation cases.

Posted by Brenden Matthews <br...@diddyinc.com>.


> On Nov. 22, 2013, 8:04 p.m., Niklas Nielsen wrote:
> > Did we get to a conclusion regarding case 1)? and could we write a test which exercises the new scenarios?
> 
> Brenden Matthews wrote:
>     If I get some time, I'll write a test.  I've been testing it in production for a few days though.
>     
>     Not sure about consensus.  Would like to hear from the others.
> 
> Benjamin Hindman wrote:
>     Regarding Case 1, is the framework not receiving the status updates from the slave? That seems more severe. When we added reconcileTasks we specifically decided that we would not send status updates for all possible tasks precisely because we could get into some incorrect situations.
>     
>     Regarding Case 2, why is a framework losing track of running tasks? That's either a bug in the framework or it isn't keeping track of tasks in the first place. Maybe we need a different API call that returns the list of tasks and statuses that the master knows about?

The original problem I tried to solve with this actually turned out to be caused by a bug in marathon ( https://github.com/mesosphere/marathon/commit/1a39f8a37b4db34c088a1669d43a400122c48ba4 ).

That said, it seems confusing to me that the reconciliation wouldn't include updates for tasks which either the master or the framework don't know about.

I'm fine with also having a separate API call.  What about using the status timestamps to avoid some of the incorrect situations?


- Brenden


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/15745/#review29305
-----------------------------------------------------------


On Nov. 22, 2013, 12:30 a.m., Brenden Matthews wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/15745/
> -----------------------------------------------------------
> 
> (Updated Nov. 22, 2013, 12:30 a.m.)
> 
> 
> Review request for mesos and Niklas Nielsen.
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> Fixed some task reconciliation cases.
> 
> Case 1:
> 
> If a slave is known but the task cannot be found, we should assume that
> the task has been lost.  It's possible that the following events
> occurred:
> 
>  1) Framework disconnected from master
>  2) Master terminated framework's tasks
>  3) Framework reconnects to master, and (incorrectly) assumes tasks are
>  still running
> 
> Case 2:
> 
> If a framework loses track of running tasks, the master should inform
> the framework of which tasks it knows to be running, in addition to any
> which have had a state change.
> 
> Review: https://reviews.apache.org/r/15745
> 
> 
> Diffs
> -----
> 
>   src/master/master.cpp a08d01208ff7bbb878b2d50d8406efee4de86171 
> 
> Diff: https://reviews.apache.org/r/15745/diff/
> 
> 
> Testing
> -------
> 
> `make check` & tested in staging cluster.
> 
> 
> Thanks,
> 
> Brenden Matthews
> 
>

Re: Review Request 15745: Fixed some task reconciliation cases.

Posted by Benjamin Hindman <be...@berkeley.edu>.


> On Nov. 22, 2013, 8:04 p.m., Niklas Nielsen wrote:
> > Did we get to a conclusion regarding case 1)? and could we write a test which exercises the new scenarios?
> 
> Brenden Matthews wrote:
>     If I get some time, I'll write a test.  I've been testing it in production for a few days though.
>     
>     Not sure about consensus.  Would like to hear from the others.
> 
> Benjamin Hindman wrote:
>     Regarding Case 1, is the framework not receiving the status updates from the slave? That seems more severe. When we added reconcileTasks we specifically decided that we would not send status updates for all possible tasks precisely because we could get into some incorrect situations.
>     
>     Regarding Case 2, why is a framework losing track of running tasks? That's either a bug in the framework or it isn't keeping track of tasks in the first place. Maybe we need a different API call that returns the list of tasks and statuses that the master knows about?
> 
> Brenden Matthews wrote:
>     The original problem I tried to solve with this actually turned out to be caused by a bug in marathon ( https://github.com/mesosphere/marathon/commit/1a39f8a37b4db34c088a1669d43a400122c48ba4 ).
>     
>     That said, it seems confusing to me that the reconciliation wouldn't include updates for tasks which either the master or the framework don't know about.
>     
>     I'm fine with also having a separate API call.  What about using the status timestamps to avoid some of the incorrect situations?
> 
> Niklas Nielsen wrote:
>     Is this patch still relevant? It seems that improving reconciliation guarantees is already a part of the post-registrar tasks. If not, can we drop it? :)
> 
> Brenden Matthews wrote:
>     We can drop this for now, though I'll probably keep it in my branch.  Hopefully we can get all the reconciliation reconciled.

:-)


- Benjamin


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/15745/#review29305
-----------------------------------------------------------


On Nov. 22, 2013, 12:30 a.m., Brenden Matthews wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/15745/
> -----------------------------------------------------------
> 
> (Updated Nov. 22, 2013, 12:30 a.m.)
> 
> 
> Review request for mesos and Niklas Nielsen.
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> Fixed some task reconciliation cases.
> 
> Case 1:
> 
> If a slave is known but the task cannot be found, we should assume that
> the task has been lost.  It's possible that the following events
> occurred:
> 
>  1) Framework disconnected from master
>  2) Master terminated framework's tasks
>  3) Framework reconnects to master, and (incorrectly) assumes tasks are
>  still running
> 
> Case 2:
> 
> If a framework loses track of running tasks, the master should inform
> the framework of which tasks it knows to be running, in addition to any
> which have had a state change.
> 
> Review: https://reviews.apache.org/r/15745
> 
> 
> Diffs
> -----
> 
>   src/master/master.cpp a08d01208ff7bbb878b2d50d8406efee4de86171 
> 
> Diff: https://reviews.apache.org/r/15745/diff/
> 
> 
> Testing
> -------
> 
> `make check` & tested in staging cluster.
> 
> 
> Thanks,
> 
> Brenden Matthews
> 
>

Re: Review Request 15745: Fixed some task reconciliation cases.

Posted by Benjamin Hindman <be...@berkeley.edu>.


> On Nov. 22, 2013, 8:04 p.m., Niklas Nielsen wrote:
> > Did we get to a conclusion regarding case 1)? and could we write a test which exercises the new scenarios?
> 
> Brenden Matthews wrote:
>     If I get some time, I'll write a test.  I've been testing it in production for a few days though.
>     
>     Not sure about consensus.  Would like to hear from the others.

Regarding Case 1, is the framework not receiving the status updates from the slave? That seems more severe. When we added reconcileTasks we specifically decided that we would not send status updates for all possible tasks precisely because we could get into some incorrect situations.

Regarding Case 2, why is a framework losing track of running tasks? That's either a bug in the framework or it isn't keeping track of tasks in the first place. Maybe we need a different API call that returns the list of tasks and statuses that the master knows about?


- Benjamin


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/15745/#review29305
-----------------------------------------------------------


On Nov. 22, 2013, 12:30 a.m., Brenden Matthews wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/15745/
> -----------------------------------------------------------
> 
> (Updated Nov. 22, 2013, 12:30 a.m.)
> 
> 
> Review request for mesos and Niklas Nielsen.
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> Fixed some task reconciliation cases.
> 
> Case 1:
> 
> If a slave is known but the task cannot be found, we should assume that
> the task has been lost.  It's possible that the following events
> occurred:
> 
>  1) Framework disconnected from master
>  2) Master terminated framework's tasks
>  3) Framework reconnects to master, and (incorrectly) assumes tasks are
>  still running
> 
> Case 2:
> 
> If a framework loses track of running tasks, the master should inform
> the framework of which tasks it knows to be running, in addition to any
> which have had a state change.
> 
> Review: https://reviews.apache.org/r/15745
> 
> 
> Diffs
> -----
> 
>   src/master/master.cpp a08d01208ff7bbb878b2d50d8406efee4de86171 
> 
> Diff: https://reviews.apache.org/r/15745/diff/
> 
> 
> Testing
> -------
> 
> `make check` & tested in staging cluster.
> 
> 
> Thanks,
> 
> Brenden Matthews
> 
>

Re: Review Request 15745: Fixed some task reconciliation cases.

Posted by Brenden Matthews <br...@diddyinc.com>.


> On Nov. 22, 2013, 8:04 p.m., Niklas Nielsen wrote:
> > Did we get to a conclusion regarding case 1)? and could we write a test which exercises the new scenarios?
> 
> Brenden Matthews wrote:
>     If I get some time, I'll write a test.  I've been testing it in production for a few days though.
>     
>     Not sure about consensus.  Would like to hear from the others.
> 
> Benjamin Hindman wrote:
>     Regarding Case 1, is the framework not receiving the status updates from the slave? That seems more severe. When we added reconcileTasks we specifically decided that we would not send status updates for all possible tasks precisely because we could get into some incorrect situations.
>     
>     Regarding Case 2, why is a framework losing track of running tasks? That's either a bug in the framework or it isn't keeping track of tasks in the first place. Maybe we need a different API call that returns the list of tasks and statuses that the master knows about?
> 
> Brenden Matthews wrote:
>     The original problem I tried to solve with this actually turned out to be caused by a bug in marathon ( https://github.com/mesosphere/marathon/commit/1a39f8a37b4db34c088a1669d43a400122c48ba4 ).
>     
>     That said, it seems confusing to me that the reconciliation wouldn't include updates for tasks which either the master or the framework don't know about.
>     
>     I'm fine with also having a separate API call.  What about using the status timestamps to avoid some of the incorrect situations?
> 
> Niklas Nielsen wrote:
>     Is this patch still relevant? It seems that improving reconciliation guarantees is already a part of the post-registrar tasks. If not, can we drop it? :)

We can drop this for now, though I'll probably keep it in my branch.  Hopefully we can get all the reconciliation reconciled.


- Brenden


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/15745/#review29305
-----------------------------------------------------------


On Nov. 22, 2013, 12:30 a.m., Brenden Matthews wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/15745/
> -----------------------------------------------------------
> 
> (Updated Nov. 22, 2013, 12:30 a.m.)
> 
> 
> Review request for mesos and Niklas Nielsen.
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> Fixed some task reconciliation cases.
> 
> Case 1:
> 
> If a slave is known but the task cannot be found, we should assume that
> the task has been lost.  It's possible that the following events
> occurred:
> 
>  1) Framework disconnected from master
>  2) Master terminated framework's tasks
>  3) Framework reconnects to master, and (incorrectly) assumes tasks are
>  still running
> 
> Case 2:
> 
> If a framework loses track of running tasks, the master should inform
> the framework of which tasks it knows to be running, in addition to any
> which have had a state change.
> 
> Review: https://reviews.apache.org/r/15745
> 
> 
> Diffs
> -----
> 
>   src/master/master.cpp a08d01208ff7bbb878b2d50d8406efee4de86171 
> 
> Diff: https://reviews.apache.org/r/15745/diff/
> 
> 
> Testing
> -------
> 
> `make check` & tested in staging cluster.
> 
> 
> Thanks,
> 
> Brenden Matthews
> 
>

Re: Review Request 15745: Fixed some task reconciliation cases.

Posted by Brenden Matthews <br...@diddyinc.com>.


> On Nov. 22, 2013, 8:04 p.m., Niklas Nielsen wrote:
> > Did we get to a conclusion regarding case 1)? and could we write a test which exercises the new scenarios?

If I get some time, I'll write a test.  I've been testing it in production for a few days though.

Not sure about consensus.  Would like to hear from the others.


- Brenden


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/15745/#review29305
-----------------------------------------------------------


On Nov. 22, 2013, 12:30 a.m., Brenden Matthews wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/15745/
> -----------------------------------------------------------
> 
> (Updated Nov. 22, 2013, 12:30 a.m.)
> 
> 
> Review request for mesos and Niklas Nielsen.
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> Fixed some task reconciliation cases.
> 
> Case 1:
> 
> If a slave is known but the task cannot be found, we should assume that
> the task has been lost.  It's possible that the following events
> occurred:
> 
>  1) Framework disconnected from master
>  2) Master terminated framework's tasks
>  3) Framework reconnects to master, and (incorrectly) assumes tasks are
>  still running
> 
> Case 2:
> 
> If a framework loses track of running tasks, the master should inform
> the framework of which tasks it knows to be running, in addition to any
> which have had a state change.
> 
> Review: https://reviews.apache.org/r/15745
> 
> 
> Diffs
> -----
> 
>   src/master/master.cpp a08d01208ff7bbb878b2d50d8406efee4de86171 
> 
> Diff: https://reviews.apache.org/r/15745/diff/
> 
> 
> Testing
> -------
> 
> `make check` & tested in staging cluster.
> 
> 
> Thanks,
> 
> Brenden Matthews
> 
>

Re: Review Request 15745: Fixed some task reconciliation cases.

Posted by Niklas Nielsen <ni...@qni.dk>.


> On Nov. 22, 2013, 12:04 p.m., Niklas Nielsen wrote:
> > Did we get to a conclusion regarding case 1)? and could we write a test which exercises the new scenarios?
> 
> Brenden Matthews wrote:
>     If I get some time, I'll write a test.  I've been testing it in production for a few days though.
>     
>     Not sure about consensus.  Would like to hear from the others.
> 
> Benjamin Hindman wrote:
>     Regarding Case 1, is the framework not receiving the status updates from the slave? That seems more severe. When we added reconcileTasks we specifically decided that we would not send status updates for all possible tasks precisely because we could get into some incorrect situations.
>     
>     Regarding Case 2, why is a framework losing track of running tasks? That's either a bug in the framework or it isn't keeping track of tasks in the first place. Maybe we need a different API call that returns the list of tasks and statuses that the master knows about?
> 
> Brenden Matthews wrote:
>     The original problem I tried to solve with this actually turned out to be caused by a bug in marathon ( https://github.com/mesosphere/marathon/commit/1a39f8a37b4db34c088a1669d43a400122c48ba4 ).
>     
>     That said, it seems confusing to me that the reconciliation wouldn't include updates for tasks which either the master or the framework don't know about.
>     
>     I'm fine with also having a separate API call.  What about using the status timestamps to avoid some of the incorrect situations?

Is this patch still relevant? It seems that improving reconciliation guarantees is already a part of the post-registrar tasks. If not, can we drop it? :)


- Niklas


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/15745/#review29305
-----------------------------------------------------------


On Nov. 21, 2013, 4:30 p.m., Brenden Matthews wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/15745/
> -----------------------------------------------------------
> 
> (Updated Nov. 21, 2013, 4:30 p.m.)
> 
> 
> Review request for mesos and Niklas Nielsen.
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> Fixed some task reconciliation cases.
> 
> Case 1:
> 
> If a slave is known but the task cannot be found, we should assume that
> the task has been lost.  It's possible that the following events
> occurred:
> 
>  1) Framework disconnected from master
>  2) Master terminated framework's tasks
>  3) Framework reconnects to master, and (incorrectly) assumes tasks are
>  still running
> 
> Case 2:
> 
> If a framework loses track of running tasks, the master should inform
> the framework of which tasks it knows to be running, in addition to any
> which have had a state change.
> 
> Review: https://reviews.apache.org/r/15745
> 
> 
> Diffs
> -----
> 
>   src/master/master.cpp a08d01208ff7bbb878b2d50d8406efee4de86171 
> 
> Diff: https://reviews.apache.org/r/15745/diff/
> 
> 
> Testing
> -------
> 
> `make check` & tested in staging cluster.
> 
> 
> Thanks,
> 
> Brenden Matthews
> 
>

Re: Review Request 15745: Fixed some task reconciliation cases.

Posted by Niklas Nielsen <ni...@qni.dk>.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/15745/#review29305
-----------------------------------------------------------


Did we get to a conclusion regarding case 1)? and could we write a test which exercises the new scenarios?

- Niklas Nielsen


On Nov. 22, 2013, 12:30 a.m., Brenden Matthews wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/15745/
> -----------------------------------------------------------
> 
> (Updated Nov. 22, 2013, 12:30 a.m.)
> 
> 
> Review request for mesos and Niklas Nielsen.
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> Fixed some task reconciliation cases.
> 
> Case 1:
> 
> If a slave is known but the task cannot be found, we should assume that
> the task has been lost.  It's possible that the following events
> occurred:
> 
>  1) Framework disconnected from master
>  2) Master terminated framework's tasks
>  3) Framework reconnects to master, and (incorrectly) assumes tasks are
>  still running
> 
> Case 2:
> 
> If a framework loses track of running tasks, the master should inform
> the framework of which tasks it knows to be running, in addition to any
> which have had a state change.
> 
> Review: https://reviews.apache.org/r/15745
> 
> 
> Diffs
> -----
> 
>   src/master/master.cpp a08d01208ff7bbb878b2d50d8406efee4de86171 
> 
> Diff: https://reviews.apache.org/r/15745/diff/
> 
> 
> Testing
> -------
> 
> `make check` & tested in staging cluster.
> 
> 
> Thanks,
> 
> Brenden Matthews
> 
>

Re: Review Request 15745: Fixed some task reconciliation cases.

Posted by Brenden Matthews <br...@diddyinc.com>.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/15745/
-----------------------------------------------------------

(Updated Nov. 22, 2013, 12:30 a.m.)


Review request for mesos and Niklas Nielsen.


Changes
-------

Fixed indentation.


Repository: mesos-git


Description
-------

Fixed some task reconciliation cases.

Case 1:

If a slave is known but the task cannot be found, we should assume that
the task has been lost.  It's possible that the following events
occurred:

 1) Framework disconnected from master
 2) Master terminated framework's tasks
 3) Framework reconnects to master, and (incorrectly) assumes tasks are
 still running

Case 2:

If a framework loses track of running tasks, the master should inform
the framework of which tasks it knows to be running, in addition to any
which have had a state change.

Review: https://reviews.apache.org/r/15745


Diffs (updated)
-----

  src/master/master.cpp a08d01208ff7bbb878b2d50d8406efee4de86171 

Diff: https://reviews.apache.org/r/15745/diff/


Testing
-------

`make check` & tested in staging cluster.


Thanks,

Brenden Matthews

Re: Review Request 15745: Fixed some task reconciliation cases.

Posted by Brenden Matthews <br...@diddyinc.com>.


> On Nov. 21, 2013, 7:59 p.m., Niklas Nielsen wrote:
> > src/master/master.cpp, line 1597
> > <https://reviews.apache.org/r/15745/diff/2/?file=389363#file389363line1597>
> >
> >     What happens if status.task_id() does not exist in tasks? Is it silently ignored or does it fail?

In this case I'm assuming it has been set.  If not, it will just return an empty value.


- Brenden


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/15745/#review29244
-----------------------------------------------------------


On Nov. 21, 2013, 12:30 a.m., Brenden Matthews wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/15745/
> -----------------------------------------------------------
> 
> (Updated Nov. 21, 2013, 12:30 a.m.)
> 
> 
> Review request for mesos and Niklas Nielsen.
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> Fixed some task reconciliation cases.
> 
> Case 1:
> 
> If a slave is known but the task cannot be found, we should assume that
> the task has been lost.  It's possible that the following events
> occurred:
> 
>  1) Framework disconnected from master
>  2) Master terminated framework's tasks
>  3) Framework reconnects to master, and (incorrectly) assumes tasks are
>  still running
> 
> Case 2:
> 
> If a framework loses track of running tasks, the master should inform
> the framework of which tasks it knows to be running, in addition to any
> which have had a state change.
> 
> Review: https://reviews.apache.org/r/15745
> 
> 
> Diffs
> -----
> 
>   src/master/master.cpp a08d01208ff7bbb878b2d50d8406efee4de86171 
> 
> Diff: https://reviews.apache.org/r/15745/diff/
> 
> 
> Testing
> -------
> 
> `make check` & tested in staging cluster.
> 
> 
> Thanks,
> 
> Brenden Matthews
> 
>

Re: Review Request 15745: Fixed some task reconciliation cases.

Posted by Brenden Matthews <br...@diddyinc.com>.


> On Nov. 21, 2013, 7:59 p.m., Niklas Nielsen wrote:
> > src/master/master.cpp, line 1597
> > <https://reviews.apache.org/r/15745/diff/2/?file=389363#file389363line1597>
> >
> >     What happens if status.task_id() does not exist in tasks? Is it silently ignored or does it fail?
> 
> Brenden Matthews wrote:
>     In this case I'm assuming it has been set.  If not, it will just return an empty value.

Err, that didn't even make sense.  The docs at http://www.boost.org/doc/libs/1_38_0/doc/html/boost/unordered_map.html say it "erases all keys matching K", which means it does nothing in the case where the key is not present.


- Brenden


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/15745/#review29244
-----------------------------------------------------------


On Nov. 22, 2013, 12:30 a.m., Brenden Matthews wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/15745/
> -----------------------------------------------------------
> 
> (Updated Nov. 22, 2013, 12:30 a.m.)
> 
> 
> Review request for mesos and Niklas Nielsen.
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> Fixed some task reconciliation cases.
> 
> Case 1:
> 
> If a slave is known but the task cannot be found, we should assume that
> the task has been lost.  It's possible that the following events
> occurred:
> 
>  1) Framework disconnected from master
>  2) Master terminated framework's tasks
>  3) Framework reconnects to master, and (incorrectly) assumes tasks are
>  still running
> 
> Case 2:
> 
> If a framework loses track of running tasks, the master should inform
> the framework of which tasks it knows to be running, in addition to any
> which have had a state change.
> 
> Review: https://reviews.apache.org/r/15745
> 
> 
> Diffs
> -----
> 
>   src/master/master.cpp a08d01208ff7bbb878b2d50d8406efee4de86171 
> 
> Diff: https://reviews.apache.org/r/15745/diff/
> 
> 
> Testing
> -------
> 
> `make check` & tested in staging cluster.
> 
> 
> Thanks,
> 
> Brenden Matthews
> 
>

Re: Review Request 15745: Fixed some task reconciliation cases.

Posted by Niklas Nielsen <ni...@qni.dk>.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/15745/#review29244
-----------------------------------------------------------


Awesome Brenden! Can we write tests that covers the new scenarios?
Also, can you add Vinod and/or Ben M. as reviewer?


src/master/master.cpp
<https://reviews.apache.org/r/15745/#comment56384>

    What happens if status.task_id() does not exist in tasks? Is it silently ignored or does it fail?



src/master/master.cpp
<https://reviews.apache.org/r/15745/#comment56382>

    This should be 2 indents right?



src/master/master.cpp
<https://reviews.apache.org/r/15745/#comment56383>

    Same here.


- Niklas Nielsen


On Nov. 21, 2013, 12:30 a.m., Brenden Matthews wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/15745/
> -----------------------------------------------------------
> 
> (Updated Nov. 21, 2013, 12:30 a.m.)
> 
> 
> Review request for mesos and Niklas Nielsen.
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> Fixed some task reconciliation cases.
> 
> Case 1:
> 
> If a slave is known but the task cannot be found, we should assume that
> the task has been lost.  It's possible that the following events
> occurred:
> 
>  1) Framework disconnected from master
>  2) Master terminated framework's tasks
>  3) Framework reconnects to master, and (incorrectly) assumes tasks are
>  still running
> 
> Case 2:
> 
> If a framework loses track of running tasks, the master should inform
> the framework of which tasks it knows to be running, in addition to any
> which have had a state change.
> 
> Review: https://reviews.apache.org/r/15745
> 
> 
> Diffs
> -----
> 
>   src/master/master.cpp a08d01208ff7bbb878b2d50d8406efee4de86171 
> 
> Diff: https://reviews.apache.org/r/15745/diff/
> 
> 
> Testing
> -------
> 
> `make check` & tested in staging cluster.
> 
> 
> Thanks,
> 
> Brenden Matthews
> 
>

Re: Review Request 15745: Fixed some task reconciliation cases.

Posted by Brenden Matthews <br...@diddyinc.com>.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/15745/
-----------------------------------------------------------

(Updated Nov. 21, 2013, 12:30 a.m.)


Review request for mesos and Niklas Nielsen.


Changes
-------

Amended my previous commit.


Summary (updated)
-----------------

Fixed some task reconciliation cases.


Repository: mesos-git


Description (updated)
-------

Fixed some task reconciliation cases.

Case 1:

If a slave is known but the task cannot be found, we should assume that
the task has been lost.  It's possible that the following events
occurred:

 1) Framework disconnected from master
 2) Master terminated framework's tasks
 3) Framework reconnects to master, and (incorrectly) assumes tasks are
 still running

Case 2:

If a framework loses track of running tasks, the master should inform
the framework of which tasks it knows to be running, in addition to any
which have had a state change.

Review: https://reviews.apache.org/r/15745


Diffs (updated)
-----

  src/master/master.cpp a08d01208ff7bbb878b2d50d8406efee4de86171 

Diff: https://reviews.apache.org/r/15745/diff/


Testing
-------

`make check` & tested in staging cluster.


Thanks,

Brenden Matthews