Posted to dev@mesos.apache.org by Ben Mahler <be...@gmail.com> on 2014/10/14 02:34:30 UTC

Review Request 26669: Added a document for reconciliation.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/26669/
-----------------------------------------------------------

Review request for mesos, Benjamin Hindman, Niklas Nielsen, and Vinod Kone.


Bugs: MESOS-681
    https://issues.apache.org/jira/browse/MESOS-681


Repository: mesos-git


Description
-------

Please see the rendered markdown here; it will be easier to review:
https://gist.github.com/bmahler/18409fc4f052df43f403

Please send your high level thoughts :)


Diffs
-----

  docs/reconciliation.md PRE-CREATION 

Diff: https://reviews.apache.org/r/26669/diff/


Testing
-------


N/A


Thanks,

Ben Mahler


Re: Review Request 26669: Added a document for reconciliation.

Posted by Ben Mahler <be...@gmail.com>.

> On Oct. 15, 2014, 9:49 p.m., Tobias Weingartner wrote:
> > docs/reconciliation.md, lines 60-61
> > <https://reviews.apache.org/r/26669/diff/1/?file=719858#file719858line60>
> >
> >     Does this result in tons of TASK_LOST right after a fail-over?

TASK_LOST updates are sent only for tasks that are no longer known. When this happens, it is because the framework believed the task was non-terminal. After a failover, the likely source is tasks that were dropped during the failover, which should be a fairly small number.
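
As a rough illustration of that rule, here is a toy model in Python. It is not Mesos code: the function name `explicit_reconcile_reply` and the dictionary standing in for the master's task table are assumptions made for this sketch only. It shows that a known task gets its latest state back and only an unknown task yields TASK_LOST.

```python
def explicit_reconcile_reply(master_tasks, queried_task_ids):
    """Toy model of the reply to an explicit reconciliation request.

    master_tasks: hypothetical mapping of task id -> latest known state.
    queried_task_ids: task ids the framework believes are non-terminal.
    """
    reply = {}
    for task_id in queried_task_ids:
        # Known tasks report their latest state; unknown tasks are reported lost.
        reply[task_id] = master_tasks.get(task_id, "TASK_LOST")
    return reply
```

With this model, a failover that dropped one task out of many produces a single TASK_LOST, not a flood.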


> On Oct. 15, 2014, 9:49 p.m., Tobias Weingartner wrote:
> > docs/reconciliation.md, line 85
> > <https://reviews.apache.org/r/26669/diff/1/?file=719858#file719858line85>
> >
> >     This may never finish... what is a framework to do if this does not finish?

It is guaranteed to eventually complete; if it does not, that is a bug in Mesos.

In the case of a serious regression that causes this to never complete, backoff is advised to ensure that this does not overload the system.
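
A minimal sketch of such a retry loop with backoff, in Python. Everything named here is hypothetical: `send_reconcile` and `poll_updates` stand in for whatever the framework uses to issue a reconciliation request and to collect the task ids for which status updates have arrived, and the delay values are placeholders rather than recommendations.

```python
import time


def reconcile_with_backoff(send_reconcile, poll_updates, pending,
                           sleep=time.sleep, base_delay=1.0,
                           max_delay=60.0, max_attempts=10):
    """Retry reconciliation for `pending` task ids, doubling the wait each round.

    Returns the set of task ids that were still unreconciled when the
    attempts ran out (empty on success).
    """
    remaining = set(pending)
    delay = base_delay
    for _ in range(max_attempts):
        if not remaining:
            break
        # Ask only about tasks that have not been reconciled yet.
        send_reconcile(sorted(remaining))
        # Give the updates time to arrive before checking.
        sleep(delay)
        remaining -= set(poll_updates())
        # Exponential backoff, capped so the wait stays bounded.
        delay = min(delay * 2, max_delay)
    return remaining
```

Capping the number of attempts is a choice made for this sketch so that the loop terminates even in the regression case; a framework could equally keep retrying at the capped delay.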


> On Oct. 15, 2014, 9:49 p.m., Tobias Weingartner wrote:
> > docs/reconciliation.md, lines 96-99
> > <https://reviews.apache.org/r/26669/diff/1/?file=719858#file719858line96>
> >
> >     one recon per master/cluster, or per framework?

This document is aimed at framework developers, so per framework.

Per master/cluster is impossible for a framework to achieve without some form of distributed consensus across frameworks.


> On Oct. 15, 2014, 9:49 p.m., Tobias Weingartner wrote:
> > docs/reconciliation.md, lines 107-108
> > <https://reviews.apache.org/r/26669/diff/1/?file=719858#file719858line107>
> >
> >     Why do we need both of these?
> >     
> >     Does one not imply the other?

Almost. The second point captures the fact that a ZK blip does not trigger a disconnection, but it does trigger a re-registration.

I could elaborate here, but this information is more relevant to a Mesos developer or operational engineer, whereas this doc is aimed at framework developers.


- Ben


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/26669/#review56818
-----------------------------------------------------------


Re: Review Request 26669: Added a document for reconciliation.

Posted by Tobias Weingartner <tw...@twopensource.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/26669/#review56818
-----------------------------------------------------------



docs/reconciliation.md
<https://reviews.apache.org/r/26669/#comment97250>

    Does this result in tons of TASK_LOST right after a fail-over?



docs/reconciliation.md
<https://reviews.apache.org/r/26669/#comment97252>

    This may never finish... what is a framework to do if this does not finish?



docs/reconciliation.md
<https://reviews.apache.org/r/26669/#comment97253>

    one recon per master/cluster, or per framework?



docs/reconciliation.md
<https://reviews.apache.org/r/26669/#comment97254>

    Why do we need both of these?
    
    Does one not imply the other?


- Tobias Weingartner


Re: Review Request 26669: Added a document for reconciliation.

Posted by Ben Mahler <be...@gmail.com>.

> On Oct. 14, 2014, 8:40 p.m., Vinod Kone wrote:
> > docs/reconciliation.md, line 27
> > <https://reviews.apache.org/r/26669/diff/1/?file=719858#file719858line27>
> >
> >     Also, 
> >     * slave fails before launching the task?

We'll actually catch that case typically, because the slave either gets removed or re-registers.


- Ben


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/26669/#review56566
-----------------------------------------------------------


Re: Review Request 26669: Added a document for reconciliation.

Posted by Vinod Kone <vi...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/26669/#review56566
-----------------------------------------------------------

Ship it!


LGTM.


docs/reconciliation.md
<https://reviews.apache.org/r/26669/#comment96911>

    Also, 
    * slave fails before launching the task?



docs/reconciliation.md
<https://reviews.apache.org/r/26669/#comment96914>

    Maybe to be explicit, say
    
    If `remaining` is non-empty, wait for some time (e.g., exponential backoff) then go to 3.


- Vinod Kone


Re: Review Request 26669: Added a document for reconciliation.

Posted by Ben Mahler <be...@gmail.com>.

> On Oct. 14, 2014, 4:44 p.m., Dominic Hamon wrote:
> > docs/reconciliation.md, line 35
> > <https://reviews.apache.org/r/26669/diff/1/?file=719858#file719858line35>
> >
> >     can we say here that it's the responsibility of the master and work toward that goal?

If the client side (today this is the driver) sends a message on some socket that closes, that message is dropped and the master is unaware. For now I think framework developers should be aware of this.

With the new API we'll definitely need to carefully reconsider whether we can use a bidirectional socket and whether the responsibility can reside solely on the Master. I have not thought through that, so I'm not sure.


> On Oct. 14, 2014, 4:44 p.m., Dominic Hamon wrote:
> > docs/reconciliation.md, line 48
> > <https://reviews.apache.org/r/26669/diff/1/?file=719858#file719858line48>
> >
> >     if there's a jira ticket for this, please link it.

I'll just remove this specific reference for now, since we don't have the new API available to make this kind of thing possible.


> On Oct. 14, 2014, 4:44 p.m., Dominic Hamon wrote:
> > docs/reconciliation.md, line 9
> > <https://reviews.apache.org/r/26669/diff/1/?file=719858#file719858line9>
> >
> >     is there a link to other documentation that could come in here to describe the programming model in more depth?

Not really, but there should be! :)


> On Oct. 14, 2014, 4:44 p.m., Dominic Hamon wrote:
> > docs/reconciliation.md, line 75
> > <https://reviews.apache.org/r/26669/diff/1/?file=719858#file719858line75>
> >
> >     it's unfortunate that we put the onus on framework developers to manage this algorithm and the exponential backoff.
> >     
> >     I wonder if we can move to a model where a reconcile request is made and then a stream of replies is sent including a terminal empty set.
> >     
> >     (this is speculative only, obviously.. i think the document is fine as it describes the current state).

Yeah, see the discussion on the mailing list w/ Sharma Podila.

I think the ergonomics could definitely be improved, but then what does the framework do if it finds it did not receive the expected updates? Would there still be a retry on the framework side to be resilient in this case?

At least going forward with pure language bindings, someone could implement this technique and it can be shared across different frameworks.


- Ben


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/26669/#review56539
-----------------------------------------------------------


Re: Review Request 26669: Added a document for reconciliation.

Posted by Dominic Hamon <dh...@twopensource.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/26669/#review56539
-----------------------------------------------------------



docs/reconciliation.md
<https://reviews.apache.org/r/26669/#comment96894>

    is there a link to other documentation that could come in here to describe the programming model in more depth?



docs/reconciliation.md
<https://reviews.apache.org/r/26669/#comment96893>

    can we say here that it's the responsibility of the master and work toward that goal?



docs/reconciliation.md
<https://reviews.apache.org/r/26669/#comment96895>

    if there's a jira ticket for this, please link it.



docs/reconciliation.md
<https://reviews.apache.org/r/26669/#comment96896>

    it's unfortunate that we put the onus on framework developers to manage this algorithm and the exponential backoff.
    
    I wonder if we can move to a model where a reconcile request is made and then a stream of replies is sent including a terminal empty set.
    
    (this is speculative only, obviously.. i think the document is fine as it describes the current state).


- Dominic Hamon

