Posted to user@mesos.apache.org by Benjamin Mahler <be...@gmail.com> on 2014/10/15 23:05:02 UTC

Reconciliation Document

Hi all,

I've sent a review out for a document describing reconciliation, you can
see the draft here:
https://gist.github.com/bmahler/18409fc4f052df43f403

Would love to gather high level feedback on it from framework developers.
Feel free to reply here, or on the review:
https://reviews.apache.org/r/26669/

Thanks!
Ben

Re: Reconciliation Document

Posted by Sharma Podila <sp...@netflix.com>.
Inline...

On Tue, Oct 21, 2014 at 12:52 PM, Benjamin Mahler <benjamin.mahler@gmail.com
> wrote:

> Inline.
>
> On Thu, Oct 16, 2014 at 7:43 PM, Sharma Podila <sp...@netflix.com>
> wrote:
>
>> Response inline, below.
>>
>> On Thu, Oct 16, 2014 at 5:41 PM, Benjamin Mahler <
>> benjamin.mahler@gmail.com> wrote:
>>
>>> Thanks for the thoughtful questions, I will take these into account in
>>> the document.
>>>
>>> Addressing each question in order:
>>>
>>> *(1) Why the retry?*
>>>
>>> It could be once per (re-)registration in the future.
>>>
>>> Some requests are temporarily unanswerable. For example, if reconciling
>>> task T on slave S, and slave S has not yet re-registered, we cannot reply
>>> until the slave is re-registered or removed. Also, if a slave is
>>> transitioning (being removed), we want to make sure that operation finishes
>>> before we can answer.
>>>
>>> It's possible to keep the request around and trigger an event once we
>>> can answer. However, we chose to drop and remain silent for these tasks.
>>> This is both for implementation simplicity and as a defense against OOMing
>>> from too many pending reconciliation requests.
>>>
>>
>> I was thinking that the state machine that maintains the state of tasks
>> always has answers for the current state. Therefore, I don't expect any
>> blocking. For example, if S hasn't yet re-registered, the state machine
>> must think that the state of T is still 'running' until either the slave
>> re-registers and informs of the task being lost, or a timeout occurs after
>> which the master decides the slave is gone. At that point a new status
>> update can be sent. I don't see a reason why reconcile needs to wait until
>> the slave re-registers here. Maybe I am missing something else? Same with
>> transitioning... the state information is always available, say, as
>> running, until the transition happens. This results in two status updates,
>> but is always correct.
>>
>
> Task state in Mesos is persisted in the leaves of the system (the slaves)
> for scalability reasons. So when a new master starts up, it doesn't know
> anything about tasks; this state is bootstrapped from the slaves as they
> re-register. This interim period of state recovery is when frameworks may
> not receive answers to reconciliation requests, depending on whether the
> particular slave has re-registered.
>
> In your second case, once a slave is removed, we will send the LOST update
> for all non-terminal tasks on the slave. There's little benefit of replying
> to a reconciliation request while it's being removed, because LOST updates
> are coming shortly thereafter. You can think of these LOST updates as the
> reply to the reconciliation request, as far as the scheduler is concerned.
>
> I think the two takeaways here are:
>
> (1) Ultimately while it is possible to avoid the need for retries on the
> framework side, it introduces too much complexity in the master and gives
> us no flexibility in ignoring or dropping messages. Even in such a world,
> the retries would be a valid resiliency measure for frameworks to insulate
> themselves against anything being dropped.
>
> (2) For now, we want to encourage framework developers to think about
> these kinds of issues, we want them to implement their frameworks in a
> resilient manner. And so in general we haven't chosen to provide a crutch
> when it requires a lot of complexity in Mesos. Today we can't add these
> ergonomic improvements in the scheduler driver because it has no
> persistence. Hopefully as the project moves forward, we can have these kind
> of framework side ergonomic improvements be contained in pure language
> bindings to Mesos. A nice stateful language binding can hide this from you.
> :)
>

OK. The only thought I have is that it could be somewhat useful to have the
master send back a (new) state of 'PendingSlaveUpdate' instead of going
silent. This way the reconcile process finishes immediately. The framework
would then retry later for tasks that got this state. Although, figuring
out the timeout after which to retry is still the same issue.
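Presumably the retry would use the capped exponential backoff suggested
earlier in the thread. A rough sketch of such a schedule, with made-up
values (nothing here is prescribed by Mesos):

```java
import java.util.ArrayList;
import java.util.List;

public class ReconcileBackoff {
    // Capped exponential backoff for reconciliation retries.
    // The initial delay and cap are illustrative values only.
    static List<Long> backoffScheduleMillis(long initialMillis, long maxMillis, int attempts) {
        List<Long> delays = new ArrayList<>();
        long delay = initialMillis;
        for (int i = 0; i < attempts; i++) {
            delays.add(delay);
            delay = Math.min(delay * 2, maxMillis); // double each retry, up to the cap
        }
        return delays;
    }

    public static void main(String[] args) {
        // e.g. start at 15s, cap at 10 minutes
        System.out.println(backoffScheduleMillis(15_000, 600_000, 8));
    }
}
```

Backing off keeps a backlogged master from being hammered with repeated
requests, while the cap bounds how long the framework waits between
attempts.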

This brings up another question. Say a slave is 'missing' and hasn't
re-registered with the master yet. What is the expected behavior when the
framework asks the master to kill a task on that slave? Since the slave is
disconnected, the kill request isn't delivered to the executor on that
slave. Is the framework notified of this failure to deliver the kill
request?

This has implications for a framework's task reconcile logic. After a
certain number of reconciliations, the framework would want to treat the
task as terminally lost and resubmit a replacement. For safety, I'd kill
the existing task before resubmitting the replacement. I am guessing
frameworks should not assume guaranteed delivery of the kill request, so it
is possible that the task may continue running after the slave reconnects,
which implies that the framework is now consuming double the resources for
the "same" task. I understand this is out of scope for the master and that
tasks/frameworks should use external logic to guarantee only one instance
of a task runs. I just want to know the expectations of the kill request.
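The way I would handle that uncertainty, sketched roughly (this is a
hypothetical helper, not part of any Mesos binding; in practice requestKill
would also call driver.killTask, and a retry loop would re-issue kills
since delivery is best-effort):

```java
import java.util.HashSet;
import java.util.Set;

public class SafeReplacer {
    // Tasks we have asked to kill but whose terminal update hasn't arrived.
    private final Set<String> pendingKills = new HashSet<>();
    // Tasks whose kill reached a terminal state, so a replacement is safe.
    private final Set<String> readyToReplace = new HashSet<>();

    // Record the kill intent (a real framework would also call driver.killTask).
    public void requestKill(String taskId) {
        pendingKills.add(taskId);
    }

    // Kills still awaiting confirmation; re-issue these periodically.
    public Set<String> killsToRetry() {
        return new HashSet<>(pendingKills);
    }

    // Only a terminal status update makes it safe to launch a replacement,
    // avoiding two copies of the "same" task consuming resources at once.
    public void onTerminalUpdate(String taskId) {
        if (pendingKills.remove(taskId)) {
            readyToReplace.add(taskId);
        }
    }

    public Set<String> tasksSafeToReplace() {
        return new HashSet<>(readyToReplace);
    }

    public static void main(String[] args) {
        SafeReplacer r = new SafeReplacer();
        r.requestKill("task-1");
        r.onTerminalUpdate("task-1");
        System.out.println(r.tasksSafeToReplace());
    }
}
```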

Thanks.

Sharma

Re: Reconciliation Document

Posted by Benjamin Mahler <be...@gmail.com>.
Thanks for looking at this more thoroughly; I didn't realize at first that
you have multiple frameworks running! Would you mind forking this thread or
filing a ticket so we don't pollute this thread further?

You can see the flood on the master side here:
$ grep Singularity master_log | cut -d" " -f2 | cut -d"." -f1 | uniq -c
...
   1 18:59:56
   1 19:02:13
 128 19:23:53
   3 19:23:54
   3 19:23:56
...

It doesn't look like the master was blocked during this period:
$ cat master_log | cut -d" " -f2 | cut -d"." -f1 | uniq -c
<stuff happening throughout period>

Do you do any blocking of the scheduler driver thread? If you look at the
logs you can see that you were actually receiving reconciliation updates,
but really slowly (compare the log line timestamp to the status update
timestamp here:

$ grep "is now" scheduler_log

On Mon, Nov 3, 2014 at 5:27 PM, Whitney Sorenson <ws...@hubspot.com>
wrote:

> Hey Ben,
>
> Thanks for looking into this. After looking at the code and the logs
> again, I do not think it is a scheduler issue. The scheduler implements,
> and even in this case logs out, precisely the algorithm defined here:
> https://gist.github.com/bmahler/18409fc4f052df43f403#algorithm - it looks
> like some odd delay happened in either the driver or the master processing
> the calls, possibly due to some network issue?
>
> To your first point, we can easily match up the number of tasks we receive
> reconciliation updates for against the corresponding number we request as
> it changes:
>
> INFO  [2014-10-30 18:59:21,761]
> com.hubspot.singularity.scheduler.SingularityTaskReconciliation: Starting a
> reconciliation cycle - 25 current active tasks
>
> Then we see 13 task updates come in (not 25). So, at the very next
> check (#5->#3 in your algorithm):
>
> INFO  [2014-10-30 18:59:51,772]
> com.hubspot.singularity.scheduler.SingularityTaskReconciliation: Requesting
> reconciliation of 12 taskStatuses, task reconciliation has been running for
> 00:30.012
>
> So, after the first check has been sent we ask for the delta - the 12
> statuses we haven't seen.
>
> And we can find the corresponding master line validating that:
>
> I1030 18:59:52.006842  1590 master.cpp:3349] Performing explicit task state reconciliation for 12 tasks of framework Singularity
>
> Regarding the task you followed, we get an update for that task at 19:05
> (6 minutes after the first request):
>
> ---
> INFO  [2014-10-30 19:05:51,843]
> com.hubspot.singularity.scheduler.SingularityTaskReconciliation: Requesting
> reconciliation of 10 taskStatuses, task reconciliation has been running for
> 06:30.083
>
> DEBUG [2014-10-30 19:05:52,510]
> com.hubspot.singularity.mesos.SingularityMesosScheduler: Task
> ci-tagupdate-stryker.2014.10.21T22.57.44-1414434875917-1-10-us_west_2c is
> now TASK_LOST (Reconciliation: Task is unknown) at 1414695592000
>
> INFO  [2014-10-30 19:06:21,847]
> com.hubspot.singularity.scheduler.SingularityTaskReconciliation: Requesting
> reconciliation of 9 taskStatuses, task reconciliation has been running for
> 07:00.087
> ---
>
> So, the scheduler is recognizing that it has an update for that task, and
> is no longer requesting an update for that particular task.
>
> I agree it is especially curious that we see a flood of master actions
> (not necessarily scheduler requests) at 19:23:53, which is also when
> reconciliation ends because the scheduler finally gets all of the task
> updates it has been asking for. We can also see a gap in the master log
> of mentions from Singularity - it does appear that all these
> reconciliations were either processed or delivered in a flood all at once,
> after ~20 minutes. We can see from the scheduler logs, however, that they
> were not sent all at once. Considering the scheduler has no code path that
> would request a flood of reconciliation requests, and because it has been
> reliably (as you can see from the logs) re-requesting reconciliation every
> 30 seconds, perhaps there is another reason that the master suddenly
> decides to send a flood of updates.
>
> Another interesting thing to note is that we get 16 updates for the task
> you pulled out at 19:23, which is unsurprisingly one more than the number
> of times reconciliation is called before the first 19:05 status update
> shown above (remember it was called about 50 times in total.)
>
> If we look at the master logs, we also see another framework, I believe
> Marathon, continuously requesting updates for 26 tasks during the time
> between 19:05->19:23. (You also bolded this line from it - notice it was
> processing 26, not 25, initial updates: I1030 19:23:39.238561  1590
> master.cpp:3349] Performing explicit task state reconciliation for 26
> tasks of framework 20140724-231003-419644938-5050-1707-0000)
>
> For some reason, the master was recording those calls but not the calls
> from Singularity. In either case, it didn't appear to be making progress
> for its tasks either.
>
>
>
> On Mon, Nov 3, 2014 at 6:15 PM, Benjamin Mahler <be...@gmail.com>
> wrote:
> >
> > I'm pretty confused by what's occurring in your scheduler, let's start
> by looking at a particular task:
> https://gist.github.com/bmahler/6f6bdb0385ec245b2346
> >
> > You received an update from reconciliation at: 19:05:52
> >
> > And yet the scheduler continued to think it was reconciling it, and
> later at 19:23:53, a flood of 16 reconciliation requests for this task is
> sent to the master, which leads to 16 LOST updates for this task within
> the same second. Do you know what went wrong in the scheduler here?
> >
> > On the master side I see the flooding of reconciliation requests:
> > https://gist.github.com/bmahler/d82433764ee6ce33faac
> >
> >
> > The only time the master does not reply is when a slave is
> transitioning, and I don't see that in the logs, so for example here where
> the scheduler went from 26 to 12, all 26 of them should have been
> reconciled on the scheduler side:
> >
> > I1030 19:23:37.980578  1593 http.cpp:334] HTTP request for
> '/master/redirect'
> > I1030 19:23:38.056701  1590 http.cpp:466] HTTP request for
> '/master/state.json'
> > I1030 19:23:38.441155  1590 http.cpp:466] HTTP request for
> '/master/state.json'
> > I1030 19:23:38.826906  1590 http.cpp:466] HTTP request for
> '/master/state.json'
> > I1030 19:23:39.238561  1590 master.cpp:3349] Performing explicit task
> state reconciliation for 26 tasks of framework
> 20140724-231003-419644938-5050-1707-0000
> > I1030 19:23:42.177489  1590 http.cpp:466] HTTP request for
> '/master/state.json'
> > I1030 19:23:44.478252  1592 http.cpp:466] HTTP request for
> '/master/state.json'
> > I1030 19:23:44.854184  1592 http.cpp:466] HTTP request for
> '/master/state.json'
> > I1030 19:23:48.172827  1590 http.cpp:466] HTTP request for
> '/master/state.json'
> > I1030 19:23:48.556999  1590 http.cpp:466] HTTP request for
> '/master/state.json'
> > I1030 19:23:53.175287  1587 http.cpp:466] HTTP request for
> '/master/state.json'
> > I1030 19:23:53.686470  1592 master.cpp:3349] Performing explicit task
> state reconciliation for 12 tasks of framework Singularity
> >
> > So far I'm not seeing an issue in the master. You can turn up verbose
> logging on the master to see exactly which status updates are being sent
> back (GLOG_v=1 or use the /help/logging endpoint).
> >
> > ----
> > Also note that the timestamps are only processed at the second
> granularity in your scheduler FWICT, since you're casting to a long here:
> >
> https://github.com/HubSpot/Singularity/blob/master/SingularityService/src/main/java/com/hubspot/singularity/mesos/SingularityMesosScheduler.java#L254
> >
> > On Mon, Nov 3, 2014 at 12:24 PM, Steven Schlansker <
> sschlansker@opentable.com> wrote:
> >>
> >> Here's the Singularity log again, just to have them in the same email:
> >>
> https://gist.githubusercontent.com/stevenschlansker/50dbe2e068c8156a12de/raw/bd4bee96aab770f0899885d826c5b7bca76225e4/gistfile1.txt
> >> and the master log from the same time period:
> >>
> https://gist.githubusercontent.com/stevenschlansker/1577a1fc269525459571/raw/5cd53f53acc8e3b27490b0ea9af04812d624bc50/gistfile1.txt
> >>
> >>
> >> On Nov 3, 2014, at 10:46 AM, Benjamin Mahler <be...@gmail.com>
> wrote:
> >>
> >> > Thanks! Do you have the master logs?
> >> >
> >> > On Mon, Nov 3, 2014 at 10:13 AM, Steven Schlansker <
> sschlansker@opentable.com> wrote:
> >> > Hi,
> >> > I'm the poor end user in question :)
> >> >
> >> > I have the Singularity logs from task reconciliation saved here:
> >> >
> https://gist.githubusercontent.com/stevenschlansker/50dbe2e068c8156a12de/raw/bd4bee96aab770f0899885d826c5b7bca76225e4/gistfile1.txt
> >> >
> >> > The last line in the log file sums it up pretty well -
> >> > INFO  [2014-10-30 19:24:21,948]
> com.hubspot.singularity.scheduler.SingularityTaskReconciliation: Task
> reconciliation ended after 50 checks and 25:00.188
> >> >
> >> > On Nov 3, 2014, at 10:02 AM, Benjamin Mahler <
> benjamin.mahler@gmail.com> wrote:
> >> >
> >> > > I don't think this is related to your retry timeout, but it's very
> difficult to diagnose this without logs or a more thorough description of
> what occurred. Do you have the logs?
> >> > >
> >> > > user saw it take 30 minutes to eventually reconcile 25 task statuses
> >> > >
> >> > > What exactly did the user see to infer that this was related
> to reconciling the statuses?
> >> > >
> >> > > On Thu, Oct 30, 2014 at 3:26 PM, Whitney Sorenson <
> wsorenson@hubspot.com> wrote:
> >> > > Ben,
> >> > >
> >> > > What's a reasonable initial timeout and cap for reconciliation when
> the # of slaves and tasks involved is in the tens/hundreds?
> >> > >
> >> > > I ask because in Singularity we are using a fixed 30 seconds and
> one user saw it take 30 minutes to eventually reconcile 25 task statuses
> (after seeing all slaves crash and a master failover -- although that's
> another issue.)
> >> > >
> >> > >
> >> > >
> >> > >
> >> > >
> >> > > On Tue, Oct 21, 2014 at 3:52 PM, Benjamin Mahler <
> benjamin.mahler@gmail.com> wrote:
> >> > > Inline.
> >> > >
> >> > > On Thu, Oct 16, 2014 at 7:43 PM, Sharma Podila <sp...@netflix.com>
> wrote:
> >> > > Response inline, below.
> >> > >
> >> > > On Thu, Oct 16, 2014 at 5:41 PM, Benjamin Mahler <
> benjamin.mahler@gmail.com> wrote:
> >> > > Thanks for the thoughtful questions, I will take these into account
> in the document.
> >> > >
> >> > > Addressing each question in order:
> >> > >
> >> > > (1) Why the retry?
> >> > >
> >> > > It could be once per (re-)registration in the future.
> >> > >
> >> > > Some requests are temporarily unanswerable. For example, if
> reconciling task T on slave S, and slave S has not yet re-registered, we
> cannot reply until the slave is re-registered or removed. Also, if a slave
> is transitioning (being removed), we want to make sure that operation
> finishes before we can answer.
> >> > >
> >> > > It's possible to keep the request around and trigger an event once
> we can answer. However, we chose to drop and remain silent for these tasks.
> This is both for implementation simplicity and as a defense against OOMing
> from too many pending reconciliation requests.
> >> > >
> >> > > I was thinking that the state machine that maintains the state of
> tasks always has answers for the current state. Therefore, I don't expect
> any blocking. For example, if S hasn't yet re-registered, the state machine
> must think that the state of T is still 'running' until either the slave
> re-registers and informs of the task being lost, or a timeout occurs after
> which master decides the slave is gone. At which point a new status update
> can be sent. I don't see a reason why reconcile needs to wait until slave
> re-registers here. Maybe I am missing something else? Same with
> transitioning... the state information is always available, say, as
> running, until transition happens. This results in two status updates, but
> always correct.
> >> > >
> >> > > Task state in Mesos is persisted in the leaves of the system (the
> slaves) for scalability reasons. So when a new master starts up, it doesn't
> know anything about tasks; this state is bootstrapped from the slaves as
> they re-register. This interim period of state recovery is when frameworks
> may not receive answers to reconciliation requests, depending on whether
> the particular slave has re-registered.
> >> > >
> >> > > In your second case, once a slave is removed, we will send the LOST
> update for all non-terminal tasks on the slave. There's little benefit of
> replying to a reconciliation request while it's being removed, because LOST
> updates are coming shortly thereafter. You can think of these LOST updates
> as the reply to the reconciliation request, as far as the scheduler is
> concerned.
> >> > >
> >> > > I think the two takeaways here are:
> >> > >
> >> > > (1) Ultimately while it is possible to avoid the need for retries
> on the framework side, it introduces too much complexity in the master and
> gives us no flexibility in ignoring or dropping messages. Even in such a
> world, the retries would be a valid resiliency measure for frameworks to
> insulate themselves against anything being dropped.
> >> > >
> >> > > (2) For now, we want to encourage framework developers to think
> about these kinds of issues, we want them to implement their frameworks in
> a resilient manner. And so in general we haven't chosen to provide a crutch
> when it requires a lot of complexity in Mesos. Today we can't add these
> ergonomic improvements in the scheduler driver because it has no
> persistence. Hopefully as the project moves forward, we can have these kind
> of framework side ergonomic improvements be contained in pure language
> bindings to Mesos. A nice stateful language binding can hide this from you.
> :)
> >> > >
> >> > >
> >> > >
> >> > >
> >> > >
> >> > > (2) Any time bound guarantees?
> >> > >
> >> > > No guarantees on exact timing, but you are guaranteed to eventually
> receive an answer.
> >> > >
> >> > > This is why exponential backoff is important, to tolerate
> variability in timing and avoid snowballing if a backlog ever occurs.
> >> > >
> >> > > For suggesting an initial timeout, I need to digress a bit.
> Currently the driver does not explicitly expose the event queue to the
> scheduler, and so when you call reconcile, you may have an event queue in
> the driver full of status updates. Because of this lack of visibility,
> picking an initial timeout will depend on your scheduler's update
> processing speed and scale (# expected status updates). Again, backoff is
> recommended to handle this.
> >> > >
> >> > > We were considering exposing Java bindings for the newer Event/Call
> API. It makes the queue explicit, which lets you avoid reconciling while
> you have a queue full of updates.
> >> > >
> >> > > Here is what the C++ interface looks like:
> >> > >
> https://github.com/apache/mesos/blob/0.20.1/include/mesos/scheduler.hpp#L478
> >> > >
> >> > > Does this interest you?
> >> > >
> >> > > I am interpreting this (correct me as needed) to mean that the Java
> callback statusUpdate() receives a queue instead of the current version
> with just one TaskStatus argument? I suppose this could be useful, yes. In
> that case, the acknowledgements of receiving the task status is sent to
> master once per the entire queue of task status. Which may be OK.
> >> > >
> >> > > You would always receive a queue of events, which you can store and
> process asynchronously (the key to enabling this was making
> acknowledgements explicit). Sorry for the tangent, keep an eye out for
> discussions related to the new API / HTTP API changes.
> >> > >
> >> > >
> >> > >
> >> > >
> >> > >
> >> > > (3) After timeout with no answer, I would be tempted to kill the
> task.
> >> > >
> >> > > You will eventually receive an answer, so if you decide to kill the
> task because you have not received an answer soon enough, you may make the
> wrong decision. This is up to you.
> >> > >
> >> > > In particular, I would caution against making decisions without
> feedback because it can lead to a snowball effect if tasks are treated
> independently. In the event of a backlog, what's to stop you from killing
> all tasks because you haven't received any answers?
> >> > >
> >> > > I would recommend that you only use this kind of timeout as a last
> resort, when not receiving a response after a large amount of time and a
> large number of reconciliation requests.
> >> > >
> >> > > Yes, that is the timeout value I was after. However, based on my
> response to #1, this could be short, couldn't it?
> >> > >
> >> > > Yes it could be on the order of seconds to start with.
> >> > >
> >> > >
> >> > >
> >> > >
> >> > >
> >> > > (4) Does rate limiting affect this?
> >> > >
> >> > > When enabled, rate limiting currently only operates on the rate of
> incoming messages from a particular framework, so the number of updates
> sent back has no effect on the limiting.
> >> > >
> >> > > That sounds good. Although, just to be paranoid, what if there's a
> problematic framework that restarts frequently (due to a bug, for example)?
> This would keep Mesos master busy sending reconcile task updates to it
> constantly.
> >> > >
> >> > > You're right, it's an orthogonal problem to address since it
> applies broadly to other messages (e.g. framework sending 100MB tasks).
> >> > >
> >> > >
> >> > > Thanks.
> >> > >
> >> > > Sharma
> >> > >
> >> > >
> >> > >
> >> > >
> >> > > On Wed, Oct 15, 2014 at 3:22 PM, Sharma Podila <sp...@netflix.com>
> wrote:
> >> > > Looks like a good step forward.
> >> > >
> >> > > What is the reason for the algorithm having to call reconcile tasks
> multiple times after waiting some time in step 6? Shouldn't it be just once
> per (re)registration?
> >> > >
> >> > > Are there time bound guarantees within which a task update will be
> sent out after a reconcile request is sent? In the algorithm for task
> reconciliation, what would be a good timeout after which we conclude that
> we got no task update from the master? Upon such a timeout, I would be
> tempted to conclude that the task has disappeared. In which case, I would
> call driver.killTask() (to be sure its marked as gone), mark my task as
> terminated, then submit a replacement task.
> >> > >
> >> > > Does the "rate limiting" feature (in the works?) affect task
> reconciliation due to the volume of task updates sent back?
> >> > >
> >> > > Thanks.
> >> > >
> >> > >
> >> > > On Wed, Oct 15, 2014 at 2:05 PM, Benjamin Mahler <
> benjamin.mahler@gmail.com> wrote:
> >> > > Hi all,
> >> > >
> >> > > I've sent a review out for a document describing reconciliation,
> you can see the draft here:
> >> > > https://gist.github.com/bmahler/18409fc4f052df43f403
> >> > >
> >> > > Would love to gather high level feedback on it from framework
> developers. Feel free to reply here, or on the review:
> >> > > https://reviews.apache.org/r/26669/
> >> > >
> >> > > Thanks!
> >> > > Ben
> >> > >
> >> > >
> >> > >
> >> > >
> >> > >
> >> > >
> >> >
> >> >
> >>
> >
>

Re: Reconciliation Document

Posted by Whitney Sorenson <ws...@hubspot.com>.
Hey Ben,

Thanks for looking into this. After looking at the code and the logs again,
I do not think it is a scheduler issue. The scheduler implements, and even
in this case logs out, precisely the algorithm defined here:
https://gist.github.com/bmahler/18409fc4f052df43f403#algorithm - it looks
like some odd delay happened in either the driver or the master processing
the calls, possibly due to some network issue?

To your first point, we can easily match up the number of tasks we receive
reconciliation updates for against the corresponding number we request as
it changes:

INFO  [2014-10-30 18:59:21,761]
com.hubspot.singularity.scheduler.SingularityTaskReconciliation: Starting a
reconciliation cycle - 25 current active tasks

Then we see 13 task updates come in (not 25). So, at the very next check
(#5->#3 in your algorithm):

INFO  [2014-10-30 18:59:51,772]
com.hubspot.singularity.scheduler.SingularityTaskReconciliation: Requesting
reconciliation of 12 taskStatuses, task reconciliation has been running for
00:30.012

So, after the first check has been sent we ask for the delta - the 12
statuses we haven't seen.

And we can find the corresponding master line validating that:

I1030 18:59:52.006842  1590 master.cpp:3349] Performing explicit task
state reconciliation for 12 tasks of framework Singularity

Regarding the task you followed, we get an update for that task at 19:05 (6
minutes after the first request):

---
INFO  [2014-10-30 19:05:51,843]
com.hubspot.singularity.scheduler.SingularityTaskReconciliation: Requesting
reconciliation of 10 taskStatuses, task reconciliation has been running for
06:30.083

DEBUG [2014-10-30 19:05:52,510]
com.hubspot.singularity.mesos.SingularityMesosScheduler: Task
ci-tagupdate-stryker.2014.10.21T22.57.44-1414434875917-1-10-us_west_2c is
now TASK_LOST (Reconciliation: Task is unknown) at 1414695592000

INFO  [2014-10-30 19:06:21,847]
com.hubspot.singularity.scheduler.SingularityTaskReconciliation: Requesting
reconciliation of 9 taskStatuses, task reconciliation has been running for
07:00.087
---

So, the scheduler is recognizing that it has an update for that task, and
is no longer requesting an update for that particular task.

I agree it is especially curious that we see a flood of master actions (not
necessarily scheduler requests) at 19:23:53, which is also when
reconciliation ends because the scheduler finally gets all of the task
updates it has been asking for. We can also see a gap in the master log of
mentions from Singularity - it does appear that all these reconciliations
were either processed or delivered in a flood all at once, after ~20
minutes. We can see from the scheduler logs, however, that they were not
sent all at once. Considering the scheduler has no code path that would
request a flood of reconciliation requests, and because it has been
reliably (as you can see from the logs) re-requesting reconciliation every
30 seconds, perhaps there is another reason that the master suddenly
decides to send a flood of updates.
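For clarity, the cycle our scheduler runs is essentially the following (a
stripped-down sketch, not the actual Singularity code; the driver call and
the 30-second timer are omitted):

```java
import java.util.HashSet;
import java.util.Set;

public class ReconciliationCycle {
    // Task IDs for which we still expect a reconciliation status update.
    private final Set<String> awaiting;

    public ReconciliationCycle(Set<String> activeTaskIds) {
        // Start a cycle expecting a status for every active task.
        this.awaiting = new HashSet<>(activeTaskIds);
    }

    // Each incoming status update removes that task from the pending set.
    public void onStatusUpdate(String taskId) {
        awaiting.remove(taskId);
    }

    // On each tick (every 30 seconds in our case), re-request only the
    // delta: the tasks whose statuses have not yet come back.
    public Set<String> nextRequest() {
        return new HashSet<>(awaiting);
    }

    // The cycle ends once every expected status has been seen.
    public boolean done() {
        return awaiting.isEmpty();
    }

    public static void main(String[] args) {
        ReconciliationCycle c = new ReconciliationCycle(Set.of("a", "b", "c"));
        c.onStatusUpdate("b");
        System.out.println(c.nextRequest()); // the remaining delta
    }
}
```

This is why the request counts in the logs shrink (25 -> 12 -> 10 -> 9):
each cycle asks only about the statuses still outstanding.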

Another interesting thing to note is that we get 16 updates for the task
you pulled out at 19:23, which is unsurprisingly one more than the number
of times reconciliation is called before the first 19:05 status update
shown above (remember it was called about 50 times in total.)

If we look at the master logs, we also see another framework, I believe
Marathon, continuously requesting updates for 26 tasks during the time
between 19:05->19:23. (You also bolded this line from it - notice it was
processing 26, not 25, initial updates: I1030 19:23:39.238561  1590
master.cpp:3349] Performing explicit task state reconciliation for 26 tasks
of framework 20140724-231003-419644938-5050-1707-0000)

For some reason, the master was recording those calls but not the calls
from Singularity. In either case, it didn't appear to be making progress
for its tasks either.



On Mon, Nov 3, 2014 at 6:15 PM, Benjamin Mahler <be...@gmail.com>
wrote:
>
> I'm pretty confused by what's occurring in your scheduler, let's start by
looking at a particular task:
https://gist.github.com/bmahler/6f6bdb0385ec245b2346
>
> You received an update from reconciliation at: 19:05:52
>
> And yet the scheduler continued to think it was reconciling it, and later
at 19:23:53, a flood of 16 reconciliation requests for this task is sent
to the master, which leads to 16 LOST updates for this task within the same
second. Do you know what went wrong in the scheduler here?
>
> On the master side I see the flooding of reconciliation requests:
> https://gist.github.com/bmahler/d82433764ee6ce33faac
>
>
> The only time the master does not reply is when a slave is transitioning,
and I don't see that in the logs, so for example here where the scheduler
went from 26 to 12, all 26 of them should have been reconciled on the
scheduler side:
>
> I1030 19:23:37.980578  1593 http.cpp:334] HTTP request for
'/master/redirect'
> I1030 19:23:38.056701  1590 http.cpp:466] HTTP request for
'/master/state.json'
> I1030 19:23:38.441155  1590 http.cpp:466] HTTP request for
'/master/state.json'

Re: Reconciliation Document

Posted by Benjamin Mahler <be...@gmail.com>.
I'm pretty confused by what's occurring in your scheduler, let's start by
looking at a particular task:
https://gist.github.com/bmahler/6f6bdb0385ec245b2346

You received an update from reconciliation at: 19:05:52

Yet the scheduler continued to think it was reconciling it, and later, at
19:23:53, a flood of 16 reconciliation requests for this task was sent to
the master, leading to 16 LOST updates for this task within the same
second. Do you know what went wrong in the scheduler here?

On the master side I see the flooding of reconciliation requests:
https://gist.github.com/bmahler/d82433764ee6ce33faac
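
For what it's worth, here is a hedged sketch of one way a scheduler can avoid
this kind of flooding (the class and method names are illustrative, not
Singularity's actual code): keep a single set of task IDs still awaiting
reconciliation, drop each task from the set as its status update arrives, and
retry only the remainder with exponentially increasing delays.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch: track tasks pending reconciliation and back off
// between retries so duplicate requests are never in flight at once.
public class ReconciliationTracker {
    private final Set<String> pending = ConcurrentHashMap.newKeySet();
    private long backoffMs = 5_000;                     // initial retry delay
    private static final long MAX_BACKOFF_MS = 300_000; // 5 minute cap

    // Start a reconciliation pass for the given task IDs.
    public void begin(Set<String> taskIds) {
        pending.clear();
        pending.addAll(taskIds);
        backoffMs = 5_000;
    }

    // Called from the statusUpdate() callback: drop the task from the
    // pending set so it is not re-requested on the next retry.
    public void onStatusUpdate(String taskId) {
        pending.remove(taskId);
    }

    // Called by a retry timer: returns the task IDs still unanswered
    // (empty when done) and doubles the delay for the next attempt.
    public Set<String> nextRetry() {
        Set<String> remaining = new HashSet<>(pending);
        backoffMs = Math.min(backoffMs * 2, MAX_BACKOFF_MS);
        return remaining;
    }

    public long currentBackoffMs() {
        return backoffMs;
    }
}
```

With something like this in place, only one reconciliation request per task is
outstanding at a time, and the retry interval grows instead of staying fixed.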


The only time the master does not reply is when a slave is transitioning,
and I don't see that in the logs. So, for example, here, where the
scheduler's request went from 26 tasks down to 12, all 26 of them should
have been reconciled on the scheduler side:

I1030 19:23:37.980578  1593 http.cpp:334] HTTP request for
'/master/redirect'
I1030 19:23:38.056701  1590 http.cpp:466] HTTP request for
'/master/state.json'
I1030 19:23:38.441155  1590 http.cpp:466] HTTP request for
'/master/state.json'
I1030 19:23:38.826906  1590 http.cpp:466] HTTP request for
'/master/state.json'
*I1030 19:23:39.238561  1590 master.cpp:3349] Performing explicit task
state reconciliation for 26 tasks of framework
20140724-231003-419644938-5050-1707-0000*
I1030 19:23:42.177489  1590 http.cpp:466] HTTP request for
'/master/state.json'
I1030 19:23:44.478252  1592 http.cpp:466] HTTP request for
'/master/state.json'
I1030 19:23:44.854184  1592 http.cpp:466] HTTP request for
'/master/state.json'
I1030 19:23:48.172827  1590 http.cpp:466] HTTP request for
'/master/state.json'
I1030 19:23:48.556999  1590 http.cpp:466] HTTP request for
'/master/state.json'
I1030 19:23:53.175287  1587 http.cpp:466] HTTP request for
'/master/state.json'
*I1030 19:23:53.686470  1592 master.cpp:3349] Performing explicit task
state reconciliation for 12 tasks of framework Singularity*

So far I'm not seeing an issue in the master. You can turn up verbose
logging on the master to see exactly which status updates are being sent
back (GLOG_v=1 or use the /help/logging endpoint).

----
Also note that, from what I can tell, the timestamps are only processed at
second granularity in your scheduler, since you're casting to a long here:
https://github.com/HubSpot/Singularity/blob/master/SingularityService/src/main/java/com/hubspot/singularity/mesos/SingularityMesosScheduler.java#L254
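
To make the effect of that cast concrete: Mesos reports a TaskStatus
timestamp as seconds since the epoch in a double, so truncating it to a long
collapses every update that arrives within the same wall-clock second to the
same value. A minimal standalone illustration (the literals below are made
up, not taken from these logs):

```java
// Two status updates that arrive ~900ms apart, expressed as
// Mesos-style seconds-since-epoch doubles:
public class TimestampGranularity {
    public static void main(String[] args) {
        double first  = 1414697033.100;  // e.g. 19:23:53.100
        double second = 1414697033.999;  // e.g. 19:23:53.999, clearly later

        // Truncating to whole seconds makes the two indistinguishable,
        // so an "is this update newer?" check on the truncated values
        // treats them as simultaneous:
        assert (long) first == (long) second;

        // Keeping millisecond precision preserves the ordering:
        long firstMs  = (long) (first * 1000);
        long secondMs = (long) (second * 1000);
        assert firstMs < secondMs;
    }
}
```

So a comparison along the lines of `(long) status.getTimestamp() > (long)
lastSeenTimestamp` can silently treat a genuinely newer update as stale;
comparing the raw doubles (or millisecond longs) avoids that.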

On Mon, Nov 3, 2014 at 12:24 PM, Steven Schlansker <
sschlansker@opentable.com> wrote:

> Here's the Singularity log again, just to have them in the same email:
>
> https://gist.githubusercontent.com/stevenschlansker/50dbe2e068c8156a12de/raw/bd4bee96aab770f0899885d826c5b7bca76225e4/gistfile1.txt
> and the master log from the same time period:
>
> https://gist.githubusercontent.com/stevenschlansker/1577a1fc269525459571/raw/5cd53f53acc8e3b27490b0ea9af04812d624bc50/gistfile1.txt

Re: Reconciliation Document

Posted by Steven Schlansker <ss...@opentable.com>.
Here's the Singularity log again, just to have both logs in the same email:
https://gist.githubusercontent.com/stevenschlansker/50dbe2e068c8156a12de/raw/bd4bee96aab770f0899885d826c5b7bca76225e4/gistfile1.txt
and the master log from the same time period:
https://gist.githubusercontent.com/stevenschlansker/1577a1fc269525459571/raw/5cd53f53acc8e3b27490b0ea9af04812d624bc50/gistfile1.txt


On Nov 3, 2014, at 10:46 AM, Benjamin Mahler <be...@gmail.com> wrote:

> Thanks! Do you have the master logs?
> 
> On Mon, Nov 3, 2014 at 10:13 AM, Steven Schlansker <ss...@opentable.com> wrote:
> Hi,
> I'm the poor end user in question :)
> 
> I have the Singularity logs from task reconciliation saved here:
> https://gist.githubusercontent.com/stevenschlansker/50dbe2e068c8156a12de/raw/bd4bee96aab770f0899885d826c5b7bca76225e4/gistfile1.txt
> 
> The last line in the log file sums it up pretty well -
> INFO  [2014-10-30 19:24:21,948] com.hubspot.singularity.scheduler.SingularityTaskReconciliation: Task reconciliation ended after 50 checks and 25:00.188
> 
> On Nov 3, 2014, at 10:02 AM, Benjamin Mahler <be...@gmail.com> wrote:
> 
> > I don't think this is related to your retry timeout, but it's very difficult to diagnose this without logs or a more thorough description of what occurred. Do you have the logs?
> >
> > user saw it take 30 minutes to eventually reconcile 25 task statuses
> >
> > What exactly did the user see to infer this that this was related to reconciling the statuses?
> >
> > On Thu, Oct 30, 2014 at 3:26 PM, Whitney Sorenson <ws...@hubspot.com> wrote:
> > Ben,
> >
> > What's a reasonable initial timeout and cap for reconciliation when the # of slaves and tasks involved is in the tens/hundreds?
> >
> > I ask because in Singularity we are using a fixed 30 seconds and one user saw it take 30 minutes to eventually reconcile 25 task statuses (after seeing all slaves crash and a master failover -- although that's another issue.)
> >
> >
> >
> >
> >
> > On Tue, Oct 21, 2014 at 3:52 PM, Benjamin Mahler <be...@gmail.com> wrote:
> > Inline.
> >
> > On Thu, Oct 16, 2014 at 7:43 PM, Sharma Podila <sp...@netflix.com> wrote:
> > Response inline, below.
> >
> > On Thu, Oct 16, 2014 at 5:41 PM, Benjamin Mahler <be...@gmail.com> wrote:
> > Thanks for the thoughtful questions, I will take these into account in the document.
> >
> > Addressing each question in order:
> >
> > (1) Why the retry?
> >
> > It could be once per (re-)registration in the future.
> >
> > Some requests are temporarily unanswerable. For example, if reconciling task T on slave S, and slave S has not yet re-registered, we cannot reply until the slave is re-registered or removed. Also, if a slave is transitioning (being removed), we want to make sure that operation finishes before we can answer.
> >
> > It's possible to keep the request around and trigger an event once we can answer. However, we chose to drop and remain silent for these tasks. This is both for implementation simplicity and as a defense against OOMing from too many pending reconciliation requests.
> >
> > ​I was thinking that the state machine that maintains the state of tasks always has answers for the current state. Therefore, I don't expect any blocking. For example, if S hasn't yet re-registered. the state machine must think that the state of T is still 'running' until either the slave re-registers and informs of the task being lost, or a timeout occurs after which master decides the slave is gone. At which point a new status update can be sent. I don't see a reason why reconcile needs to wait until slave re-registers here. Maybe I am missing something else?​ Same with transitioning... the state information is always available, say, as running, until transition happens. This results in two status updates, but always correct.
> >
> > Task state in Mesos is persisted in the leaves of the system (the slaves) for scalability reasons. So when a new master starts up, it doesn't know anything about tasks; this state is bootstrapped from the slaves as they re-register. This interim period of state recovery is when frameworks may not receive answers to reconciliation requests, depending on whether the particular slave has re-registered.
> >
> > In your second case, once a slave is removed, we will send the LOST update for all non-terminal tasks on the slave. There's little benefit of replying to a reconciliation request while it's being removed, because LOST updates are coming shortly thereafter. You can think of these LOST updates as the reply to the reconciliation request, as far as the scheduler is concerned.
> >
> > I think the two takeaways here are:
> >
> > (1) Ultimately while it is possible to avoid the need for retries on the framework side, it introduces too much complexity in the master and gives us no flexibility in ignoring or dropping messages. Even in such a world, the retries would be a valid resiliency measure for frameworks to insulate themselves against anything being dropped.
> >
> > (2) For now, we want to encourage framework developers to think about these kinds of issues, we want them to implement their frameworks in a resilient manner. And so in general we haven't chosen to provide a crutch when it requires a lot of complexity in Mesos. Today we can't add these ergonomic improvements in the scheduler driver because it has no persistence. Hopefully as the project moves forward, we can have these kind of framework side ergonomic improvements be contained in pure language bindings to Mesos. A nice stateful language binding can hide this from you. :)
> >
> >
> >
> >
> >
> > (2) Any time bound guarantees?
> >
> > No guarantees on exact timing, but you are guaranteed to eventually receive an answer.
> >
> > This is why exponential backoff is important, to tolerate variability in timing and avoid snowballing if a backlog ever occurs.
> >
> > For suggesting an initial timeout, I need to digress a bit. Currently the driver does not explicitly expose the event queue to the scheduler, and so when you call reconcile, you may have an event queue in the driver full of status updates. Because of this lack of visibility, picking an initial timeout will depend on your scheduler's update processing speed and scale (# expected status updates). Again, backoff is recommended to handle this.
> >
> > We were considering exposing Java bindings for the newer Event/Call API. It makes the queue explicit, which lets you avoid reconciling while you have a queue full of updates.
> >
> > Here is what the C++ interface looks like:
> > https://github.com/apache/mesos/blob/0.20.1/include/mesos/scheduler.hpp#L478
> >
> > Does this interest you?
> >
> > I am interpreting this (correct me as needed) to mean that the Java callback ​statusUpdate() receives a queue instead of the current version with just one TaskStatus argument? I suppose this could be useful, yes. In that case, the acknowledgements of receiving the task status is sent to master once per the entire queue of task status. Which may be OK.
> >
> > You would always receive a queue of events, which you can store and process asynchronously (the key to enabling this was making acknowledgements explicit). Sorry for the tangent, keep an eye out for discussions related to the new API / HTTP API changes.
> >
> >
> >
> >
> >
> > (3) After timeout with no answer, I would be tempted to kill the task.
> >
> > You will eventually receive an answer, so if you decide to kill the task because you have not received an answer soon enough, you may make the wrong decision. This is up to you.
> >
> > In particular, I would caution against making decisions without feedback because it can lead to a snowball effect if tasks are treated independently. In the event of a backlog, what's to stop you from killing all tasks because you haven't received any answers?
> >
> > I would recommend that you only use this kind of timeout as a last resort, when not receiving a response after a large amount of time and a large number of reconciliation requests.
> >
> > ​Yes, that is the timeout value I was after. However, based on my response to #1, this could be short, isn't it?​
> >
> > Yes it could be on the order of seconds to start with.
> >
> >
> >
> >
> >
> > (4) Does rate limiting affect this?
> >
> > When enabled, rate limiting currently only operates on the rate of incoming messages from a particular framework, so the number of updates sent back has no effect on the limiting.
> >
> > That sounds good. Although, just to be paranoid, what if there's a problematic framework that restarts frequently (due to a bug, for example)? This would keep the Mesos master busy sending reconcile task updates to it constantly.
> >
> > You're right, it's an orthogonal problem to address since it applies broadly to other messages (e.g. framework sending 100MB tasks).
> >
> >
> > Thanks.
> >
> > Sharma
> >
> >
> >
> >
> > On Wed, Oct 15, 2014 at 3:22 PM, Sharma Podila <sp...@netflix.com> wrote:
> > Looks like a good step forward.
> >
> > What is the reason for the algorithm having to call reconcile tasks multiple times after waiting some time in step 6? Shouldn't it be just once per (re)registration?
> >
> > Are there time bound guarantees within which a task update will be sent out after a reconcile request is sent? In the algorithm for task reconciliation, what would be a good timeout after which we conclude that we got no task update from the master? Upon such a timeout, I would be tempted to conclude that the task has disappeared. In which case, I would call driver.killTask() (to be sure it's marked as gone), mark my task as terminated, then submit a replacement task.
> >
> > Does the "rate limiting" feature (in the works?) affect task reconciliation due to the volume of task updates sent back?
> >
> > Thanks.
> >
> >
> > On Wed, Oct 15, 2014 at 2:05 PM, Benjamin Mahler <be...@gmail.com> wrote:
> > Hi all,
> >
> > I've sent a review out for a document describing reconciliation, you can see the draft here:
> > https://gist.github.com/bmahler/18409fc4f052df43f403
> >
> > Would love to gather high level feedback on it from framework developers. Feel free to reply here, or on the review:
> > https://reviews.apache.org/r/26669/
> >
> > Thanks!
> > Ben
> >
> >
> >
> >
> >
> >
> 
> 
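The retry algorithm discussed in the thread above (reconcile, wait with exponential backoff, then re-request only the tasks still lacking a status update) can be sketched roughly as follows. This is an illustrative outline, not the Mesos driver API itself: the 15-second initial delay, the 5-minute cap, and the use of plain strings for task IDs are all assumptions for the example.

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch of the reconciliation retry logic discussed in
// this thread. The delay constants are assumed example values, not
// recommendations from the Mesos documentation.
public class ReconcilerSketch {
    static final long INITIAL_DELAY_MS = 15_000;  // assumed starting delay
    static final long MAX_DELAY_MS = 300_000;     // assumed 5-minute cap

    // Exponential backoff with a cap: 15s, 30s, 60s, ..., up to 5 minutes.
    static long backoffDelayMs(int attempt) {
        long delay = INITIAL_DELAY_MS << Math.min(attempt, 16);
        return Math.min(delay, MAX_DELAY_MS);
    }

    // After each round, re-request only the tasks for which no status
    // update has arrived yet (step 6 of the algorithm in the document).
    static Set<String> remaining(Set<String> pending, Set<String> updated) {
        Set<String> left = new HashSet<>(pending);
        left.removeAll(updated);
        return left;
    }
}
```

Each retry would pass remaining(...) to a driver.reconcileTasks()-style call and then wait backoffDelayMs(attempt) before the next round; the loop ends once the pending set is empty.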


Re: Reconciliation Document

Posted by Benjamin Mahler <be...@gmail.com>.
Thanks! Do you have the master logs?

On Mon, Nov 3, 2014 at 10:13 AM, Steven Schlansker <
sschlansker@opentable.com> wrote:

> Hi,
> I'm the poor end user in question :)
>
> I have the Singularity logs from task reconciliation saved here:
>
> https://gist.githubusercontent.com/stevenschlansker/50dbe2e068c8156a12de/raw/bd4bee96aab770f0899885d826c5b7bca76225e4/gistfile1.txt
>
> The last line in the log file sums it up pretty well -
> INFO  [2014-10-30 19:24:21,948]
> com.hubspot.singularity.scheduler.SingularityTaskReconciliation: Task
> reconciliation ended after 50 checks and 25:00.188
>
> On Nov 3, 2014, at 10:02 AM, Benjamin Mahler <be...@gmail.com>
> wrote:
>
> > I don't think this is related to your retry timeout, but it's very
> difficult to diagnose this without logs or a more thorough description of
> what occurred. Do you have the logs?
> >
> > user saw it take 30 minutes to eventually reconcile 25 task statuses
> >
> > What exactly did the user see to infer that this was related to
> reconciling the statuses?
> >
> > On Thu, Oct 30, 2014 at 3:26 PM, Whitney Sorenson <ws...@hubspot.com>
> wrote:
> > Ben,
> >
> > What's a reasonable initial timeout and cap for reconciliation when the
> # of slaves and tasks involved is in the tens/hundreds?
> >
> > I ask because in Singularity we are using a fixed 30 seconds and one
> user saw it take 30 minutes to eventually reconcile 25 task statuses (after
> seeing all slaves crash and a master failover -- although that's another
> issue.)
> >
> >
> >
> >
> >
> > On Tue, Oct 21, 2014 at 3:52 PM, Benjamin Mahler <
> benjamin.mahler@gmail.com> wrote:
> > Inline.
> >
> > On Thu, Oct 16, 2014 at 7:43 PM, Sharma Podila <sp...@netflix.com>
> wrote:
> > Response inline, below.
> >
> > On Thu, Oct 16, 2014 at 5:41 PM, Benjamin Mahler <
> benjamin.mahler@gmail.com> wrote:
> > Thanks for the thoughtful questions, I will take these into account in
> the document.
> >
> > Addressing each question in order:
> >
> > (1) Why the retry?
> >
> > It could be once per (re-)registration in the future.
> >
> > Some requests are temporarily unanswerable. For example, if reconciling
> task T on slave S, and slave S has not yet re-registered, we cannot reply
> until the slave is re-registered or removed. Also, if a slave is
> transitioning (being removed), we want to make sure that operation finishes
> before we can answer.
> >
> > It's possible to keep the request around and trigger an event once we
> can answer. However, we chose to drop and remain silent for these tasks.
> This is both for implementation simplicity and as a defense against OOMing
> from too many pending reconciliation requests.
> >
> > I was thinking that the state machine that maintains the state of tasks
> always has answers for the current state. Therefore, I don't expect any
> blocking. For example, if S hasn't yet re-registered, the state machine
> must think that the state of T is still 'running' until either the slave
> re-registers and informs of the task being lost, or a timeout occurs after
> which the master decides the slave is gone. At that point a new status update
> can be sent. I don't see a reason why reconcile needs to wait until the slave
> re-registers here. Maybe I am missing something else? Same with
> transitioning... the state information is always available, say, as
> running, until the transition happens. This results in two status updates, but
> always correct.
> >
> > Task state in Mesos is persisted in the leaves of the system (the
> slaves) for scalability reasons. So when a new master starts up, it doesn't
> know anything about tasks; this state is bootstrapped from the slaves as
> they re-register. This interim period of state recovery is when frameworks
> may not receive answers to reconciliation requests, depending on whether
> the particular slave has re-registered.
> >
> > In your second case, once a slave is removed, we will send the LOST
> update for all non-terminal tasks on the slave. There's little benefit of
> replying to a reconciliation request while it's being removed, because LOST
> updates are coming shortly thereafter. You can think of these LOST updates
> as the reply to the reconciliation request, as far as the scheduler is
> concerned.
> >
> > I think the two takeaways here are:
> >
> > (1) Ultimately while it is possible to avoid the need for retries on the
> framework side, it introduces too much complexity in the master and gives
> us no flexibility in ignoring or dropping messages. Even in such a world,
> the retries would be a valid resiliency measure for frameworks to insulate
> themselves against anything being dropped.
> >
> > (2) For now, we want to encourage framework developers to think about
> these kinds of issues, we want them to implement their frameworks in a
> resilient manner. And so in general we haven't chosen to provide a crutch
> when it requires a lot of complexity in Mesos. Today we can't add these
> ergonomic improvements in the scheduler driver because it has no
> persistence. Hopefully as the project moves forward, we can have these kind
> of framework side ergonomic improvements be contained in pure language
> bindings to Mesos. A nice stateful language binding can hide this from you.
> :)
> >
> >
> >
> >
> >
> > (2) Any time bound guarantees?
> >
> > No guarantees on exact timing, but you are guaranteed to eventually
> receive an answer.
> >
> > This is why exponential backoff is important, to tolerate variability in
> timing and avoid snowballing if a backlog ever occurs.
> >
> > For suggesting an initial timeout, I need to digress a bit. Currently
> the driver does not explicitly expose the event queue to the scheduler, and
> so when you call reconcile, you may have an event queue in the driver full
> of status updates. Because of this lack of visibility, picking an initial
> timeout will depend on your scheduler's update processing speed and scale
> (# expected status updates). Again, backoff is recommended to handle this.
> >
> > We were considering exposing Java bindings for the newer Event/Call API.
> It makes the queue explicit, which lets you avoid reconciling while you
> have a queue full of updates.
> >
> > Here is what the C++ interface looks like:
> >
> https://github.com/apache/mesos/blob/0.20.1/include/mesos/scheduler.hpp#L478
> >
> > Does this interest you?
> >
> > I am interpreting this (correct me as needed) to mean that the Java
> callback statusUpdate() receives a queue instead of the current version
> with just one TaskStatus argument? I suppose this could be useful, yes. In
> that case, the acknowledgements of receiving the task statuses are sent to
> the master once per the entire queue of task statuses, which may be OK.
> >
> > You would always receive a queue of events, which you can store and
> process asynchronously (the key to enabling this was making
> acknowledgements explicit). Sorry for the tangent, keep an eye out for
> discussions related to the new API / HTTP API changes.
> >
> >
> >
> >
> >
> > (3) After timeout with no answer, I would be tempted to kill the task.
> >
> > You will eventually receive an answer, so if you decide to kill the task
> because you have not received an answer soon enough, you may make the wrong
> decision. This is up to you.
> >
> > In particular, I would caution against making decisions without feedback
> because it can lead to a snowball effect if tasks are treated
> independently. In the event of a backlog, what's to stop you from killing
> all tasks because you haven't received any answers?
> >
> > I would recommend that you only use this kind of timeout as a last
> resort, when not receiving a response after a large amount of time and a
> large number of reconciliation requests.
> >
> > Yes, that is the timeout value I was after. However, based on my
> response to #1, this could be short, couldn't it?
> >
> > Yes it could be on the order of seconds to start with.
> >
> >
> >
> >
> >
> > (4) Does rate limiting affect this?
> >
> > When enabled, rate limiting currently only operates on the rate of
> incoming messages from a particular framework, so the number of updates
> sent back has no effect on the limiting.
> >
> > That sounds good. Although, just to be paranoid, what if there's a
> problematic framework that restarts frequently (due to a bug, for
> example)? This would keep the Mesos master busy sending reconcile task updates
> to it constantly.
> >
> > You're right, it's an orthogonal problem to address since it applies
> broadly to other messages (e.g. framework sending 100MB tasks).
> >
> >
> > Thanks.
> >
> > Sharma
> >
> >
> >
> >
> > On Wed, Oct 15, 2014 at 3:22 PM, Sharma Podila <sp...@netflix.com>
> wrote:
> > Looks like a good step forward.
> >
> > What is the reason for the algorithm having to call reconcile tasks
> multiple times after waiting some time in step 6? Shouldn't it be just once
> per (re)registration?
> >
> > Are there time bound guarantees within which a task update will be sent
> out after a reconcile request is sent? In the algorithm for task
> reconciliation, what would be a good timeout after which we conclude that
> we got no task update from the master? Upon such a timeout, I would be
> tempted to conclude that the task has disappeared. In which case, I would
> > call driver.killTask() (to be sure it's marked as gone), mark my task as
> terminated, then submit a replacement task.
> >
> > Does the "rate limiting" feature (in the works?) affect task
> reconciliation due to the volume of task updates sent back?
> >
> > Thanks.
> >
> >
> > On Wed, Oct 15, 2014 at 2:05 PM, Benjamin Mahler <
> benjamin.mahler@gmail.com> wrote:
> > Hi all,
> >
> > I've sent a review out for a document describing reconciliation, you can
> see the draft here:
> > https://gist.github.com/bmahler/18409fc4f052df43f403
> >
> > Would love to gather high level feedback on it from framework
> developers. Feel free to reply here, or on the review:
> > https://reviews.apache.org/r/26669/
> >
> > Thanks!
> > Ben
> >
> >
> >
> >
> >
> >
>
>
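The explicit event-queue model described in the quoted discussion (buffer incoming events, process them asynchronously, and acknowledge each status update explicitly) might look roughly like this. Event and the acknowledgement hook are stand-ins for illustration, not the real Event/Call API types.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Sketch of the explicit event-queue model discussed in the quoted
// thread. The Event type and the acknowledgement hook are assumed
// stand-ins, not the actual Mesos scheduler API.
public class EventQueueSketch {
    interface Event { String taskId(); }

    private final Queue<Event> queue = new ArrayDeque<>();
    private int acked = 0;

    // Transport thread enqueues; the scheduler drains later.
    synchronized void received(Event e) { queue.add(e); }

    synchronized int pending() { return queue.size(); }

    // Process everything buffered so far, acknowledging each update
    // only after it has actually been handled.
    synchronized int drain() {
        int handled = 0;
        while (!queue.isEmpty()) {
            acknowledge(queue.poll());
            handled++;
        }
        return handled;
    }

    private void acknowledge(Event e) { acked++; }

    int acknowledged() { return acked; }
}
```

Because acknowledgements are decoupled from delivery, the scheduler can inspect pending() and, for example, defer issuing a reconciliation request while the queue is still full of updates.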

Re: Reconciliation Document

Posted by Steven Schlansker <ss...@opentable.com>.
Hi,
I'm the poor end user in question :)

I have the Singularity logs from task reconciliation saved here:
https://gist.githubusercontent.com/stevenschlansker/50dbe2e068c8156a12de/raw/bd4bee96aab770f0899885d826c5b7bca76225e4/gistfile1.txt

The last line in the log file sums it up pretty well -
INFO  [2014-10-30 19:24:21,948] com.hubspot.singularity.scheduler.SingularityTaskReconciliation: Task reconciliation ended after 50 checks and 25:00.188

On Nov 3, 2014, at 10:02 AM, Benjamin Mahler <be...@gmail.com> wrote:

> I don't think this is related to your retry timeout, but it's very difficult to diagnose this without logs or a more thorough description of what occurred. Do you have the logs?
> 
> user saw it take 30 minutes to eventually reconcile 25 task statuses 
> 
> What exactly did the user see to infer that this was related to reconciling the statuses?
> 
> On Thu, Oct 30, 2014 at 3:26 PM, Whitney Sorenson <ws...@hubspot.com> wrote:
> Ben,
> 
> What's a reasonable initial timeout and cap for reconciliation when the # of slaves and tasks involved is in the tens/hundreds?
> 
> I ask because in Singularity we are using a fixed 30 seconds and one user saw it take 30 minutes to eventually reconcile 25 task statuses (after seeing all slaves crash and a master failover -- although that's another issue.)
> 
> 
> 
>  
> 
> On Tue, Oct 21, 2014 at 3:52 PM, Benjamin Mahler <be...@gmail.com> wrote:
> Inline.
> 
> On Thu, Oct 16, 2014 at 7:43 PM, Sharma Podila <sp...@netflix.com> wrote:
> Response inline, below.
> 
> On Thu, Oct 16, 2014 at 5:41 PM, Benjamin Mahler <be...@gmail.com> wrote:
> Thanks for the thoughtful questions, I will take these into account in the document.
> 
> Addressing each question in order:
> 
> (1) Why the retry?
> 
> It could be once per (re-)registration in the future.
> 
> Some requests are temporarily unanswerable. For example, if reconciling task T on slave S, and slave S has not yet re-registered, we cannot reply until the slave is re-registered or removed. Also, if a slave is transitioning (being removed), we want to make sure that operation finishes before we can answer.
> 
> It's possible to keep the request around and trigger an event once we can answer. However, we chose to drop and remain silent for these tasks. This is both for implementation simplicity and as a defense against OOMing from too many pending reconciliation requests.
> 
> I was thinking that the state machine that maintains the state of tasks always has answers for the current state. Therefore, I don't expect any blocking. For example, if S hasn't yet re-registered, the state machine must think that the state of T is still 'running' until either the slave re-registers and informs of the task being lost, or a timeout occurs after which the master decides the slave is gone. At that point a new status update can be sent. I don't see a reason why reconcile needs to wait until the slave re-registers here. Maybe I am missing something else? Same with transitioning... the state information is always available, say, as running, until the transition happens. This results in two status updates, but always correct.
> 
> Task state in Mesos is persisted in the leaves of the system (the slaves) for scalability reasons. So when a new master starts up, it doesn't know anything about tasks; this state is bootstrapped from the slaves as they re-register. This interim period of state recovery is when frameworks may not receive answers to reconciliation requests, depending on whether the particular slave has re-registered.
> 
> In your second case, once a slave is removed, we will send the LOST update for all non-terminal tasks on the slave. There's little benefit of replying to a reconciliation request while it's being removed, because LOST updates are coming shortly thereafter. You can think of these LOST updates as the reply to the reconciliation request, as far as the scheduler is concerned.
> 
> I think the two takeaways here are:
> 
> (1) Ultimately while it is possible to avoid the need for retries on the framework side, it introduces too much complexity in the master and gives us no flexibility in ignoring or dropping messages. Even in such a world, the retries would be a valid resiliency measure for frameworks to insulate themselves against anything being dropped.
> 
> (2) For now, we want to encourage framework developers to think about these kinds of issues, we want them to implement their frameworks in a resilient manner. And so in general we haven't chosen to provide a crutch when it requires a lot of complexity in Mesos. Today we can't add these ergonomic improvements in the scheduler driver because it has no persistence. Hopefully as the project moves forward, we can have these kind of framework side ergonomic improvements be contained in pure language bindings to Mesos. A nice stateful language binding can hide this from you. :)
>  
> 
>  
> 
> 
> (2) Any time bound guarantees?
> 
> No guarantees on exact timing, but you are guaranteed to eventually receive an answer.
> 
> This is why exponential backoff is important, to tolerate variability in timing and avoid snowballing if a backlog ever occurs.
> 
> For suggesting an initial timeout, I need to digress a bit. Currently the driver does not explicitly expose the event queue to the scheduler, and so when you call reconcile, you may have an event queue in the driver full of status updates. Because of this lack of visibility, picking an initial timeout will depend on your scheduler's update processing speed and scale (# expected status updates). Again, backoff is recommended to handle this.
> 
> We were considering exposing Java bindings for the newer Event/Call API. It makes the queue explicit, which lets you avoid reconciling while you have a queue full of updates.
> 
> Here is what the C++ interface looks like:
> https://github.com/apache/mesos/blob/0.20.1/include/mesos/scheduler.hpp#L478
> 
> Does this interest you?
> 
> I am interpreting this (correct me as needed) to mean that the Java callback statusUpdate() receives a queue instead of the current version with just one TaskStatus argument? I suppose this could be useful, yes. In that case, the acknowledgements of receiving the task statuses are sent to the master once per the entire queue of task statuses, which may be OK.
> 
> You would always receive a queue of events, which you can store and process asynchronously (the key to enabling this was making acknowledgements explicit). Sorry for the tangent, keep an eye out for discussions related to the new API / HTTP API changes.
>  
> 
>  
> 
> 
> (3) After timeout with no answer, I would be tempted to kill the task.
> 
> You will eventually receive an answer, so if you decide to kill the task because you have not received an answer soon enough, you may make the wrong decision. This is up to you.
> 
> In particular, I would caution against making decisions without feedback because it can lead to a snowball effect if tasks are treated independently. In the event of a backlog, what's to stop you from killing all tasks because you haven't received any answers?
> 
> I would recommend that you only use this kind of timeout as a last resort, when not receiving a response after a large amount of time and a large number of reconciliation requests.
> 
> Yes, that is the timeout value I was after. However, based on my response to #1, this could be short, couldn't it?
> 
> Yes it could be on the order of seconds to start with.
>  
> 
>  
> 
> 
> (4) Does rate limiting affect this?
> 
> When enabled, rate limiting currently only operates on the rate of incoming messages from a particular framework, so the number of updates sent back has no effect on the limiting.
> 
> That sounds good. Although, just to be paranoid, what if there's a problematic framework that restarts frequently (due to a bug, for example)? This would keep the Mesos master busy sending reconcile task updates to it constantly.
> 
> You're right, it's an orthogonal problem to address since it applies broadly to other messages (e.g. framework sending 100MB tasks).
>  
> 
> Thanks.
> 
> Sharma
> 
>  
> 
> 
> On Wed, Oct 15, 2014 at 3:22 PM, Sharma Podila <sp...@netflix.com> wrote:
> Looks like a good step forward.
> 
> What is the reason for the algorithm having to call reconcile tasks multiple times after waiting some time in step 6? Shouldn't it be just once per (re)registration? 
> 
> Are there time bound guarantees within which a task update will be sent out after a reconcile request is sent? In the algorithm for task reconciliation, what would be a good timeout after which we conclude that we got no task update from the master? Upon such a timeout, I would be tempted to conclude that the task has disappeared. In which case, I would call driver.killTask() (to be sure it's marked as gone), mark my task as terminated, then submit a replacement task.
> 
> Does the "rate limiting" feature (in the works?) affect task reconciliation due to the volume of task updates sent back? 
> 
> Thanks.
> 
> 
> On Wed, Oct 15, 2014 at 2:05 PM, Benjamin Mahler <be...@gmail.com> wrote:
> Hi all,
> 
> I've sent a review out for a document describing reconciliation, you can see the draft here:
> https://gist.github.com/bmahler/18409fc4f052df43f403
> 
> Would love to gather high level feedback on it from framework developers. Feel free to reply here, or on the review:
> https://reviews.apache.org/r/26669/
> 
> Thanks!
> Ben
> 
> 
> 
> 
> 
> 


Re: Reconciliation Document

Posted by Benjamin Mahler <be...@gmail.com>.
I don't think this is related to your retry timeout, but it's very
difficult to diagnose this without logs or a more thorough description of
what occurred. Do you have the logs?

user saw it take 30 minutes to eventually reconcile 25 task statuses


What exactly did the user see to infer that this was related to
reconciling the statuses?

On Thu, Oct 30, 2014 at 3:26 PM, Whitney Sorenson <ws...@hubspot.com>
wrote:

> Ben,
>
> What's a reasonable initial timeout and cap for reconciliation when the #
> of slaves and tasks involved is in the tens/hundreds?
>
> I ask because in Singularity we are using a fixed 30 seconds and one user
> saw it take 30 minutes to eventually reconcile 25 task statuses (after
> seeing all slaves crash and a master failover -- although that's another
> issue.)
>
>
>
>
>
> On Tue, Oct 21, 2014 at 3:52 PM, Benjamin Mahler <
> benjamin.mahler@gmail.com> wrote:
>
>> Inline.
>>
>> On Thu, Oct 16, 2014 at 7:43 PM, Sharma Podila <sp...@netflix.com>
>> wrote:
>>
>>> Response inline, below.
>>>
>>> On Thu, Oct 16, 2014 at 5:41 PM, Benjamin Mahler <
>>> benjamin.mahler@gmail.com> wrote:
>>>
>>>> Thanks for the thoughtful questions, I will take these into account in
>>>> the document.
>>>>
>>>> Addressing each question in order:
>>>>
>>>> *(1) Why the retry?*
>>>>
>>>> It could be once per (re-)registration in the future.
>>>>
>>>> Some requests are temporarily unanswerable. For example, if reconciling
>>>> task T on slave S, and slave S has not yet re-registered, we cannot reply
>>>> until the slave is re-registered or removed. Also, if a slave is
>>>> transitioning (being removed), we want to make sure that operation finishes
>>>> before we can answer.
>>>>
>>>> It's possible to keep the request around and trigger an event once we
>>>> can answer. However, we chose to drop and remain silent for these tasks.
>>>> This is both for implementation simplicity and as a defense against OOMing
>>>> from too many pending reconciliation requests.
>>>>
>>>
>>> I was thinking that the state machine that maintains the state of tasks
>>> always has answers for the current state. Therefore, I don't expect any
>>> blocking. For example, if S hasn't yet re-registered, the state machine
>>> must think that the state of T is still 'running' until either the slave
>>> re-registers and informs of the task being lost, or a timeout occurs after
>>> which the master decides the slave is gone. At that point a new status update
>>> can be sent. I don't see a reason why reconcile needs to wait until the slave
>>> re-registers here. Maybe I am missing something else? Same with
>>> transitioning... the state information is always available, say, as
>>> running, until the transition happens. This results in two status updates, but
>>> always correct.
>>>
>>
>> Task state in Mesos is persisted in the leaves of the system (the slaves)
>> for scalability reasons. So when a new master starts up, it doesn't know
>> anything about tasks; this state is bootstrapped from the slaves as they
>> re-register. This interim period of state recovery is when frameworks may
>> not receive answers to reconciliation requests, depending on whether the
>> particular slave has re-registered.
>>
>> In your second case, once a slave is removed, we will send the LOST
>> update for all non-terminal tasks on the slave. There's little benefit of
>> replying to a reconciliation request while it's being removed, because LOST
>> updates are coming shortly thereafter. You can think of these LOST updates
>> as the reply to the reconciliation request, as far as the scheduler is
>> concerned.
>>
>> I think the two takeaways here are:
>>
>> (1) Ultimately while it is possible to avoid the need for retries on the
>> framework side, it introduces too much complexity in the master and gives
>> us no flexibility in ignoring or dropping messages. Even in such a world,
>> the retries would be a valid resiliency measure for frameworks to insulate
>> themselves against anything being dropped.
>>
>> (2) For now, we want to encourage framework developers to think about
>> these kinds of issues, we want them to implement their frameworks in a
>> resilient manner. And so in general we haven't chosen to provide a crutch
>> when it requires a lot of complexity in Mesos. Today we can't add these
>> ergonomic improvements in the scheduler driver because it has no
>> persistence. Hopefully as the project moves forward, we can have these kind
>> of framework side ergonomic improvements be contained in pure language
>> bindings to Mesos. A nice stateful language binding can hide this from you.
>> :)
>>
>>
>>>
>>>
>>>
>>>>
>>>>
>>>> *(2) Any time bound guarantees?*
>>>>
>>>> No guarantees on exact timing, but you are guaranteed to eventually
>>>> receive an answer.
>>>>
>>>> This is why exponential backoff is important, to tolerate variability
>>>> in timing and avoid snowballing if a backlog ever occurs.
>>>>
>>>> For suggesting an initial timeout, I need to digress a bit. Currently
>>>> the driver does not explicitly expose the event queue to the scheduler, and
>>>> so when you call reconcile, you may have an event queue in the driver full
>>>> of status updates. Because of this lack of visibility, picking an initial
>>>> timeout will depend on your scheduler's update processing speed and scale
>>>> (# expected status updates). Again, backoff is recommended to handle this.
>>>>
>>>> We were considering exposing Java bindings for the newer Event/Call
>>>> API. It makes the queue explicit, which lets you avoid reconciling while
>>>> you have a queue full of updates.
>>>>
>>>> Here is what the C++ interface looks like:
>>>>
>>>> https://github.com/apache/mesos/blob/0.20.1/include/mesos/scheduler.hpp#L478
>>>>
>>>> Does this interest you?
>>>>
>>>
>>> I am interpreting this (correct me as needed) to mean that the Java
>>> callback statusUpdate() receives a queue instead of the current version
>>> with just one TaskStatus argument? I suppose this could be useful, yes. In
>>> that case, the acknowledgements of receiving the task statuses are sent to
>>> the master once per the entire queue of task statuses, which may be OK.
>>>
>>
>> You would always receive a queue of events, which you can store and
>> process asynchronously (the key to enabling this was making
>> acknowledgements explicit). Sorry for the tangent, keep an eye out for
>> discussions related to the new API / HTTP API changes.
>>
>>
>>>
>>>
>>>
>>>>
>>>>
>>>> *(3) After timeout with no answer, I would be tempted to kill the task.*
>>>>
>>>> You will eventually receive an answer, so if you decide to kill the
>>>> task because you have not received an answer soon enough, you may make the
>>>> wrong decision. This is up to you.
>>>>
>>>> In particular, I would caution against making decisions without
>>>> feedback because it can lead to a snowball effect if tasks are treated
>>>> independently. In the event of a backlog, what's to stop you from killing
>>>> all tasks because you haven't received any answers?
>>>>
>>>> I would recommend that you only use this kind of timeout as a last
>>>> resort, when not receiving a response after a large amount of time and a
>>>> large number of reconciliation requests.
>>>>
>>>
>>> Yes, that is the timeout value I was after. However, based on my
>>> response to #1, this could be short, couldn't it?
>>>
>>
>> Yes it could be on the order of seconds to start with.
>>
>>
>>>
>>>
>>>
>>>>
>>>>
>>>> *(4) Does rate limiting affect this?*
>>>>
>>>> When enabled, rate limiting currently only operates on the rate of
>>>> incoming messages from a particular framework, so the number of updates
>>>> sent back has no effect on the limiting.
>>>>
>>>
>>> That sounds good. Although, just to be paranoid, what if there's a
>>> problematic framework that restarts frequently (due to a bug, for
>>> example)? This would keep the Mesos master busy sending reconcile task updates
>>> to it constantly.
>>>
>>
>> You're right, it's an orthogonal problem to address since it applies
>> broadly to other messages (e.g. framework sending 100MB tasks).
>>
>>
>>>
>>> Thanks.
>>>
>>> Sharma
>>>
>>>
>>>
>>>>
>>>>
>>>> On Wed, Oct 15, 2014 at 3:22 PM, Sharma Podila <sp...@netflix.com>
>>>> wrote:
>>>>
>>>>> Looks like a good step forward.
>>>>>
>>>>> What is the reason for the algorithm having to call reconcile tasks
>>>>> multiple times after waiting some time in step 6? Shouldn't it be just once
>>>>> per (re)registration?
>>>>>
>>>>
>>>>> Are there time bound guarantees within which a task update will be
>>>>> sent out after a reconcile request is sent? In the algorithm for task
>>>>> reconciliation, what would be a good timeout after which we conclude that
>>>>> we got no task update from the master? Upon such a timeout, I would be
>>>>> tempted to conclude that the task has disappeared. In which case, I would
>>>>> call driver.killTask() (to be sure it's marked as gone), mark my task as
>>>>> terminated, then submit a replacement task.
>>>>>
>>>>> Does the "rate limiting" feature (in the works?) affect task
>>>>> reconciliation due to the volume of task updates sent back?
>>>>>
>>>>> Thanks.
>>>>>
>>>>>
>>>>> On Wed, Oct 15, 2014 at 2:05 PM, Benjamin Mahler <
>>>>> benjamin.mahler@gmail.com> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> I've sent a review out for a document describing reconciliation, you
>>>>>> can see the draft here:
>>>>>> https://gist.github.com/bmahler/18409fc4f052df43f403
>>>>>>
>>>>>> Would love to gather high level feedback on it from framework
>>>>>> developers. Feel free to reply here, or on the review:
>>>>>> https://reviews.apache.org/r/26669/
>>>>>>
>>>>>> Thanks!
>>>>>> Ben
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Reconciliation Document

Posted by Whitney Sorenson <ws...@hubspot.com>.
Ben,

What's a reasonable initial timeout and cap for reconciliation when the #
of slaves and tasks involved is in the tens/hundreds?

I ask because in Singularity we are using a fixed 30 seconds and one user
saw it take 30 minutes to eventually reconcile 25 task statuses (after
seeing all slaves crash and a master failover -- although that's another
issue.)
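For a rough sense of the numbers: a fixed 30-second check interval repeated 50 times comes to exactly 25 minutes, which is consistent with the reconciliation log quoted in this thread, whereas a capped exponential backoff issues far fewer requests over the same window. The values below (30-second initial delay, 5-minute cap) are assumptions for illustration, not Mesos defaults or official recommendations.

```java
// Compare a fixed-interval reconciliation schedule with a capped
// exponential backoff over the same wall-clock window. All values
// used here are illustrative assumptions.
public class ReconcileTiming {
    // Total wall time spent by n fixed-interval checks.
    static int fixedTotalSeconds(int checks, int intervalSeconds) {
        return checks * intervalSeconds;
    }

    // How many retries a doubling, capped backoff fits into a window.
    static int retriesWithin(int windowSeconds, int initialSeconds, int capSeconds) {
        int elapsed = 0;
        int retries = 0;
        int delay = initialSeconds;
        while (elapsed + delay <= windowSeconds) {
            elapsed += delay;
            retries++;
            delay = Math.min(delay * 2, capSeconds);
        }
        return retries;
    }
}
```

With these assumed values, 50 fixed checks take 1500 seconds (25 minutes), while a backoff starting at 30 seconds and capped at 5 minutes makes only 7 requests in the same window.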





On Tue, Oct 21, 2014 at 3:52 PM, Benjamin Mahler <be...@gmail.com>
wrote:

> Inline.
>
> On Thu, Oct 16, 2014 at 7:43 PM, Sharma Podila <sp...@netflix.com>
> wrote:
>
>> Response inline, below.
>>
>> On Thu, Oct 16, 2014 at 5:41 PM, Benjamin Mahler <
>> benjamin.mahler@gmail.com> wrote:
>>
>>> Thanks for the thoughtful questions, I will take these into account in
>>> the document.
>>>
>>> Addressing each question in order:
>>>
>>> *(1) Why the retry?*
>>>
>>> It could be once per (re-)registration in the future.
>>>
>>> Some requests are temporarily unanswerable. For example, if reconciling
>>> task T on slave S, and slave S has not yet re-registered, we cannot reply
>>> until the slave is re-registered or removed. Also, if a slave is
>>> transitioning (being removed), we want to make sure that operation finishes
>>> before we can answer.
>>>
>>> It's possible to keep the request around and trigger an event once we
>>> can answer. However, we chose to drop and remain silent for these tasks.
>>> This is both for implementation simplicity and as a defense against OOMing
>>> from too many pending reconciliation requests.
>>>
>>
>> ​I was thinking that the state machine that maintains the state of tasks
>> always has answers for the current state. Therefore, I don't expect any
>> blocking. For example, if S hasn't yet re-registered. the state machine
>> must think that the state of T is still 'running' until either the slave
>> re-registers and informs of the task being lost, or a timeout occurs after
>> which master decides the slave is gone. At which point a new status update
>> can be sent. I don't see a reason why reconcile needs to wait until slave
>> re-registers here. Maybe I am missing something else?​ Same with
>> transitioning... the state information is always available, say, as
>> running, until transition happens. This results in two status updates, but
>> always correct.
>>
>
> Task state in Mesos is persisted in the leaves of the system (the slaves)
> for scalability reasons. So when a new master starts up, it doesn't know
> anything about tasks; this state is bootstrapped from the slaves as they
> re-register. This interim period of state recovery is when frameworks may
> not receive answers to reconciliation requests, depending on whether the
> particular slave has re-registered.
>
> In your second case, once a slave is removed, we will send the LOST update
> for all non-terminal tasks on the slave. There's little benefit of replying
> to a reconciliation request while it's being removed, because LOST updates
> are coming shortly thereafter. You can think of these LOST updates as the
> reply to the reconciliation request, as far as the scheduler is concerned.
>
> I think the two takeaways here are:
>
> (1) Ultimately while it is possible to avoid the need for retries on the
> framework side, it introduces too much complexity in the master and gives
> us no flexibility in ignoring or dropping messages. Even in such a world,
> the retries would be a valid resiliency measure for frameworks to insulate
> themselves against anything being dropped.
>
> (2) For now, we want to encourage framework developers to think about
> these kinds of issues, we want them to implement their frameworks in a
> resilient manner. And so in general we haven't chosen to provide a crutch
> when it requires a lot of complexity in Mesos. Today we can't add these
> ergonomic improvements in the scheduler driver because it has no
> persistence. Hopefully as the project moves forward, we can have these kind
> of framework side ergonomic improvements be contained in pure language
> bindings to Mesos. A nice stateful language binding can hide this from you.
> :)
>
>
>>
>>
>>
>>>
>>>
>>> *(2) Any time bound guarantees?*
>>>
>>> No guarantees on exact timing, but you are guaranteed to eventually
>>> receive an answer.
>>>
>>> This is why exponential backoff is important, to tolerate variability in
>>> timing and avoid snowballing if a backlog ever occurs.
>>>
>>> For suggesting an initial timeout, I need to digress a bit. Currently
>>> the driver does not explicitly expose the event queue to the scheduler, and
>>> so when you call reconcile, you may have an event queue in the driver full
>>> of status updates. Because of this lack of visibility, picking an initial
>>> timeout will depend on your scheduler's update processing speed and scale
>>> (# expected status updates). Again, backoff is recommended to handle this.
>>>
>>> We were considering exposing Java bindings for the newer Event/Call API.
>>> It makes the queue explicit, which lets you avoid reconciling while you
>>> have a queue full of updates.
>>>
>>> Here is what the C++ interface looks like:
>>>
>>> https://github.com/apache/mesos/blob/0.20.1/include/mesos/scheduler.hpp#L478
>>>
>>> Does this interest you?
>>>
>>
>> I am interpreting this (correct me as needed) to mean that the Java
>> callback ​statusUpdate() receives a queue instead of the current version
>> with just one TaskStatus argument? I suppose this could be useful, yes. In
>> that case, the acknowledgements of receiving the task status is sent to
>> master once per the entire queue of task status. Which may be OK.
>>
>
> You would always receive a queue of events, which you can store and
> process asynchronously (the key to enabling this was making
> acknowledgements explicit). Sorry for the tangent, keep an eye out for
> discussions related to the new API / HTTP API changes.
>
>
>>
>>
>>
>>>
>>>
>>> *(3) After timeout with no answer, I would be tempted to kill the task.*
>>>
>>> You will eventually receive an answer, so if you decide to kill the task
>>> because you have not received an answer soon enough, you may make the wrong
>>> decision. This is up to you.
>>>
>>> In particular, I would caution against making decisions without feedback
>>> because it can lead to a snowball effect if tasks are treated
>>> independently. In the event of a backlog, what's to stop you from killing
>>> all tasks because you haven't received any answers?
>>>
>>> I would recommend that you only use this kind of timeout as a last
>>> resort, when not receiving a response after a large amount of time and a
>>> large number of reconciliation requests.
>>>
>>
>> ​Yes, that is the timeout value I was after. However, based on my
>> response to #1, this could be short, isn't it?​
>>
>
> Yes it could be on the order of seconds to start with.
>
>
>>
>>
>>
>>>
>>>
>>> *(4) Does rate limiting affect this?*
>>>
>>> When enabled, rate limiting currently only operates on the rate of
>>> incoming messages from a particular framework, so the number of updates
>>> sent back has no effect on the limiting.
>>>
>>
>> ​That sounds good. Although, just to be paranoid, what if there's a
>> problematic framework ​that restarts frequently (due to a bug, for
>> example)? This would keep Mesos master busy sending reconcile task updates
>> to it constantly.
>>
>
> You're right, it's an orthogonal problem to address since it applies
> broadly to other messages (e.g. framework sending 100MB tasks).
>
>
>>
>> Thanks.
>>
>> Sharma
>>
>>
>>
>>>
>>>
>>> On Wed, Oct 15, 2014 at 3:22 PM, Sharma Podila <sp...@netflix.com>
>>> wrote:
>>>
>>>> Looks like a good step forward.
>>>>
>>>> What is the reason for the algorithm having to call reconcile tasks
>>>> multiple times after waiting some time in step 6? Shouldn't it be just once
>>>> per (re)registration?
>>>>
>>>
>>>> Are there time bound guarantees within which a task update will be sent
>>>> out after a reconcile request is sent? In the algorithm for task
>>>> reconciliation, what would be a good timeout after which we conclude that
>>>> we got no task update from the master? Upon such a timeout, I would be
>>>> tempted to conclude that the task has disappeared. In which case, I would
>>>> call driver.killTask() (to be sure its marked as gone), mark my task as
>>>> terminated, then submit a replacement task.
>>>>
>>>> Does the "rate limiting" feature (in the works?) affect task
>>>> reconciliation due to the volume of task updates sent back?
>>>>
>>>> Thanks.
>>>>
>>>>
>>>> On Wed, Oct 15, 2014 at 2:05 PM, Benjamin Mahler <
>>>> benjamin.mahler@gmail.com> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I've sent a review out for a document describing reconciliation, you
>>>>> can see the draft here:
>>>>> https://gist.github.com/bmahler/18409fc4f052df43f403
>>>>>
>>>>> Would love to gather high level feedback on it from framework
>>>>> developers. Feel free to reply here, or on the review:
>>>>> https://reviews.apache.org/r/26669/
>>>>>
>>>>> Thanks!
>>>>> Ben
>>>>>
>>>>
>>>>
>>>
>>
>

Re: Reconciliation Document

Posted by Benjamin Mahler <be...@gmail.com>.
Inline.

On Thu, Oct 16, 2014 at 7:43 PM, Sharma Podila <sp...@netflix.com> wrote:

> Response inline, below.
>
> On Thu, Oct 16, 2014 at 5:41 PM, Benjamin Mahler <
> benjamin.mahler@gmail.com> wrote:
>
>> Thanks for the thoughtful questions, I will take these into account in
>> the document.
>>
>> Addressing each question in order:
>>
>> *(1) Why the retry?*
>>
>> It could be once per (re-)registration in the future.
>>
>> Some requests are temporarily unanswerable. For example, if reconciling
>> task T on slave S, and slave S has not yet re-registered, we cannot reply
>> until the slave is re-registered or removed. Also, if a slave is
>> transitioning (being removed), we want to make sure that operation finishes
>> before we can answer.
>>
>> It's possible to keep the request around and trigger an event once we can
>> answer. However, we chose to drop and remain silent for these tasks. This
>> is both for implementation simplicity and as a defense against OOMing from
>> too many pending reconciliation requests.
>>
>
> ​I was thinking that the state machine that maintains the state of tasks
> always has answers for the current state. Therefore, I don't expect any
> blocking. For example, if S hasn't yet re-registered. the state machine
> must think that the state of T is still 'running' until either the slave
> re-registers and informs of the task being lost, or a timeout occurs after
> which master decides the slave is gone. At which point a new status update
> can be sent. I don't see a reason why reconcile needs to wait until slave
> re-registers here. Maybe I am missing something else?​ Same with
> transitioning... the state information is always available, say, as
> running, until transition happens. This results in two status updates, but
> always correct.
>

Task state in Mesos is persisted in the leaves of the system (the slaves)
for scalability reasons. So when a new master starts up, it doesn't know
anything about tasks; this state is bootstrapped from the slaves as they
re-register. This interim period of state recovery is when frameworks may
not receive answers to reconciliation requests, depending on whether the
particular slave has re-registered.

In your second case, once a slave is removed, we will send the LOST update
for all non-terminal tasks on the slave. There's little benefit in replying
to a reconciliation request while the slave is being removed, because LOST
updates are coming shortly thereafter. You can think of these LOST updates as
the reply to the reconciliation request, as far as the scheduler is concerned.

I think the two takeaways here are:

(1) Ultimately while it is possible to avoid the need for retries on the
framework side, it introduces too much complexity in the master and gives
us no flexibility in ignoring or dropping messages. Even in such a world,
the retries would be a valid resiliency measure for frameworks to insulate
themselves against anything being dropped.

(2) For now, we want to encourage framework developers to think about these
kinds of issues; we want them to implement their frameworks in a resilient
manner. And so in general we haven't chosen to provide a crutch when it
requires a lot of complexity in Mesos. Today we can't add these ergonomic
improvements in the scheduler driver because it has no persistence.
Hopefully, as the project moves forward, we can have these kinds of
framework-side ergonomic improvements contained in pure language bindings
to Mesos. A nice stateful language binding can hide this from you. :)


>
>
>
>>
>>
>> *(2) Any time bound guarantees?*
>>
>> No guarantees on exact timing, but you are guaranteed to eventually
>> receive an answer.
>>
>> This is why exponential backoff is important, to tolerate variability in
>> timing and avoid snowballing if a backlog ever occurs.
>>
>> For suggesting an initial timeout, I need to digress a bit. Currently the
>> driver does not explicitly expose the event queue to the scheduler, and so
>> when you call reconcile, you may have an event queue in the driver full of
>> status updates. Because of this lack of visibility, picking an initial
>> timeout will depend on your scheduler's update processing speed and scale
>> (# expected status updates). Again, backoff is recommended to handle this.
>>
>> We were considering exposing Java bindings for the newer Event/Call API.
>> It makes the queue explicit, which lets you avoid reconciling while you
>> have a queue full of updates.
>>
>> Here is what the C++ interface looks like:
>>
>> https://github.com/apache/mesos/blob/0.20.1/include/mesos/scheduler.hpp#L478
>>
>> Does this interest you?
>>
>
> I am interpreting this (correct me as needed) to mean that the Java
> callback ​statusUpdate() receives a queue instead of the current version
> with just one TaskStatus argument? I suppose this could be useful, yes. In
> that case, the acknowledgements of receiving the task status is sent to
> master once per the entire queue of task status. Which may be OK.
>

You would always receive a queue of events, which you can store and process
asynchronously (the key to enabling this was making acknowledgements
explicit). Sorry for the tangent, keep an eye out for discussions related
to the new API / HTTP API changes.


>
>
>
>>
>>
>> *(3) After timeout with no answer, I would be tempted to kill the task.*
>>
>> You will eventually receive an answer, so if you decide to kill the task
>> because you have not received an answer soon enough, you may make the wrong
>> decision. This is up to you.
>>
>> In particular, I would caution against making decisions without feedback
>> because it can lead to a snowball effect if tasks are treated
>> independently. In the event of a backlog, what's to stop you from killing
>> all tasks because you haven't received any answers?
>>
>> I would recommend that you only use this kind of timeout as a last
>> resort, when not receiving a response after a large amount of time and a
>> large number of reconciliation requests.
>>
>
> ​Yes, that is the timeout value I was after. However, based on my response
> to #1, this could be short, isn't it?​
>

Yes, it could be on the order of seconds to start with.


>
>
>
>>
>>
>> *(4) Does rate limiting affect this?*
>>
>> When enabled, rate limiting currently only operates on the rate of
>> incoming messages from a particular framework, so the number of updates
>> sent back has no effect on the limiting.
>>
>
> ​That sounds good. Although, just to be paranoid, what if there's a
> problematic framework ​that restarts frequently (due to a bug, for
> example)? This would keep Mesos master busy sending reconcile task updates
> to it constantly.
>

You're right, it's an orthogonal problem to address since it applies
broadly to other messages (e.g. framework sending 100MB tasks).


>
> Thanks.
>
> Sharma
>
>
>
>>
>>
>> On Wed, Oct 15, 2014 at 3:22 PM, Sharma Podila <sp...@netflix.com>
>> wrote:
>>
>>> Looks like a good step forward.
>>>
>>> What is the reason for the algorithm having to call reconcile tasks
>>> multiple times after waiting some time in step 6? Shouldn't it be just once
>>> per (re)registration?
>>>
>>
>>> Are there time bound guarantees within which a task update will be sent
>>> out after a reconcile request is sent? In the algorithm for task
>>> reconciliation, what would be a good timeout after which we conclude that
>>> we got no task update from the master? Upon such a timeout, I would be
>>> tempted to conclude that the task has disappeared. In which case, I would
>>> call driver.killTask() (to be sure its marked as gone), mark my task as
>>> terminated, then submit a replacement task.
>>>
>>> Does the "rate limiting" feature (in the works?) affect task
>>> reconciliation due to the volume of task updates sent back?
>>>
>>> Thanks.
>>>
>>>
>>> On Wed, Oct 15, 2014 at 2:05 PM, Benjamin Mahler <
>>> benjamin.mahler@gmail.com> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I've sent a review out for a document describing reconciliation, you
>>>> can see the draft here:
>>>> https://gist.github.com/bmahler/18409fc4f052df43f403
>>>>
>>>> Would love to gather high level feedback on it from framework
>>>> developers. Feel free to reply here, or on the review:
>>>> https://reviews.apache.org/r/26669/
>>>>
>>>> Thanks!
>>>> Ben
>>>>
>>>
>>>
>>
>

Re: Reconciliation Document

Posted by Sharma Podila <sp...@netflix.com>.
Response inline, below.

On Thu, Oct 16, 2014 at 5:41 PM, Benjamin Mahler <be...@gmail.com>
wrote:

> Thanks for the thoughtful questions, I will take these into account in the
> document.
>
> Addressing each question in order:
>
> *(1) Why the retry?*
>
> It could be once per (re-)registration in the future.
>
> Some requests are temporarily unanswerable. For example, if reconciling
> task T on slave S, and slave S has not yet re-registered, we cannot reply
> until the slave is re-registered or removed. Also, if a slave is
> transitioning (being removed), we want to make sure that operation finishes
> before we can answer.
>
> It's possible to keep the request around and trigger an event once we can
> answer. However, we chose to drop and remain silent for these tasks. This
> is both for implementation simplicity and as a defense against OOMing from
> too many pending reconciliation requests.
>

​I was thinking that the state machine that maintains the state of tasks
always has answers for the current state. Therefore, I don't expect any
blocking. For example, if S hasn't yet re-registered, the state machine
must think that the state of T is still 'running' until either the slave
re-registers and informs of the task being lost, or a timeout occurs after
which the master decides the slave is gone. At which point a new status
update can be sent. I don't see a reason why reconcile needs to wait until
the slave re-registers here. Maybe I am missing something else? Same with
transitioning... the state information is always available, say, as
running, until transition happens. This results in two status updates, but
always correct.



>
>
> *(2) Any time bound guarantees?*
>
> No guarantees on exact timing, but you are guaranteed to eventually
> receive an answer.
>
> This is why exponential backoff is important, to tolerate variability in
> timing and avoid snowballing if a backlog ever occurs.
>
> For suggesting an initial timeout, I need to digress a bit. Currently the
> driver does not explicitly expose the event queue to the scheduler, and so
> when you call reconcile, you may have an event queue in the driver full of
> status updates. Because of this lack of visibility, picking an initial
> timeout will depend on your scheduler's update processing speed and scale
> (# expected status updates). Again, backoff is recommended to handle this.
>
> We were considering exposing Java bindings for the newer Event/Call API.
> It makes the queue explicit, which lets you avoid reconciling while you
> have a queue full of updates.
>
> Here is what the C++ interface looks like:
>
> https://github.com/apache/mesos/blob/0.20.1/include/mesos/scheduler.hpp#L478
>
> Does this interest you?
>

I am interpreting this (correct me as needed) to mean that the Java
callback statusUpdate() receives a queue instead of the current version
with just one TaskStatus argument? I suppose this could be useful, yes. In
that case, the acknowledgements of receiving the task statuses are sent to
the master once per the entire queue of task statuses. Which may be OK.



>
>
> *(3) After timeout with no answer, I would be tempted to kill the task.*
>
> You will eventually receive an answer, so if you decide to kill the task
> because you have not received an answer soon enough, you may make the wrong
> decision. This is up to you.
>
> In particular, I would caution against making decisions without feedback
> because it can lead to a snowball effect if tasks are treated
> independently. In the event of a backlog, what's to stop you from killing
> all tasks because you haven't received any answers?
>
> I would recommend that you only use this kind of timeout as a last resort,
> when not receiving a response after a large amount of time and a large
> number of reconciliation requests.
>

​Yes, that is the timeout value I was after. However, based on my response
to #1, this could be short, couldn't it?



>
>
> *(4) Does rate limiting affect this?*
>
> When enabled, rate limiting currently only operates on the rate of
> incoming messages from a particular framework, so the number of updates
> sent back has no effect on the limiting.
>

​That sounds good. Although, just to be paranoid, what if there's a
problematic framework that restarts frequently (due to a bug, for
example)? This would keep Mesos master busy sending reconcile task updates
to it constantly.

Thanks.

Sharma



>
>
> On Wed, Oct 15, 2014 at 3:22 PM, Sharma Podila <sp...@netflix.com>
> wrote:
>
>> Looks like a good step forward.
>>
>> What is the reason for the algorithm having to call reconcile tasks
>> multiple times after waiting some time in step 6? Shouldn't it be just once
>> per (re)registration?
>>
>
>> Are there time bound guarantees within which a task update will be sent
>> out after a reconcile request is sent? In the algorithm for task
>> reconciliation, what would be a good timeout after which we conclude that
>> we got no task update from the master? Upon such a timeout, I would be
>> tempted to conclude that the task has disappeared. In which case, I would
>> call driver.killTask() (to be sure its marked as gone), mark my task as
>> terminated, then submit a replacement task.
>>
>> Does the "rate limiting" feature (in the works?) affect task
>> reconciliation due to the volume of task updates sent back?
>>
>> Thanks.
>>
>>
>> On Wed, Oct 15, 2014 at 2:05 PM, Benjamin Mahler <
>> benjamin.mahler@gmail.com> wrote:
>>
>>> Hi all,
>>>
>>> I've sent a review out for a document describing reconciliation, you can
>>> see the draft here:
>>> https://gist.github.com/bmahler/18409fc4f052df43f403
>>>
>>> Would love to gather high level feedback on it from framework
>>> developers. Feel free to reply here, or on the review:
>>> https://reviews.apache.org/r/26669/
>>>
>>> Thanks!
>>> Ben
>>>
>>
>>
>

Re: Reconciliation Document

Posted by Benjamin Mahler <be...@gmail.com>.
Thanks for the thoughtful questions, I will take these into account in the
document.

Addressing each question in order:

*(1) Why the retry?*

It could be once per (re-)registration in the future.

Some requests are temporarily unanswerable. For example, if reconciling
task T on slave S, and slave S has not yet re-registered, we cannot reply
until the slave is re-registered or removed. Also, if a slave is
transitioning (being removed), we want to make sure that operation finishes
before we can answer.

It's possible to keep the request around and trigger an event once we can
answer. However, we chose to drop and remain silent for these tasks. This
is both for implementation simplicity and as a defense against OOMing from
too many pending reconciliation requests.


*(2) Any time bound guarantees?*

No guarantees on exact timing, but you are guaranteed to eventually receive
an answer.

This is why exponential backoff is important, to tolerate variability in
timing and avoid snowballing if a backlog ever occurs.

For suggesting an initial timeout, I need to digress a bit. Currently the
driver does not explicitly expose the event queue to the scheduler, and so
when you call reconcile, you may have an event queue in the driver full of
status updates. Because of this lack of visibility, picking an initial
timeout will depend on your scheduler's update processing speed and scale
(# expected status updates). Again, backoff is recommended to handle this.
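To make the backoff recommendation concrete, here is a minimal sketch of a
retry schedule. The initial delay, doubling factor, and cap are assumptions
for illustration, not values prescribed by Mesos:

```java
// Illustrative sketch of an exponential backoff schedule for
// reconciliation retries. INITIAL_MS and MAX_MS are assumed values,
// not anything prescribed by Mesos.
public class ReconcileBackoff {
    static final long INITIAL_MS = 15_000;  // first retry after 15 seconds
    static final long MAX_MS = 600_000;     // cap the delay at 10 minutes

    // Delay to wait before reconciliation attempt number `attempt`
    // (0-based), doubling each time until the cap is reached.
    static long delayMs(int attempt) {
        long delay = INITIAL_MS << Math.min(attempt, 30); // clamp the shift
        return Math.min(delay, MAX_MS);
    }

    public static void main(String[] args) {
        for (int i = 0; i < 8; i++) {
            System.out.println("attempt " + i + ": wait " + delayMs(i) + " ms");
        }
    }
}
```

Adding randomized jitter on top of each delay would further spread retries
from many schedulers, at the cost of slightly less predictable timing.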

We were considering exposing Java bindings for the newer Event/Call API. It
makes the queue explicit, which lets you avoid reconciling while you have a
queue full of updates.

Here is what the C++ interface looks like:
https://github.com/apache/mesos/blob/0.20.1/include/mesos/scheduler.hpp#L478

Does this interest you?


*(3) After timeout with no answer, I would be tempted to kill the task.*

You will eventually receive an answer, so if you decide to kill the task
because you have not received an answer soon enough, you may make the wrong
decision. This is up to you.

In particular, I would caution against making decisions without feedback
because it can lead to a snowball effect if tasks are treated
independently. In the event of a backlog, what's to stop you from killing
all tasks because you haven't received any answers?

I would recommend that you only use this kind of timeout as a last resort,
when not receiving a response after a large amount of time and a large
number of reconciliation requests.
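The "last resort" guidance above can be expressed as a small policy object:
only give up on a task when both an attempt count and a wall-clock window
are exceeded. The thresholds here are illustrative assumptions; pick values
that match your scheduler's scale:

```java
// Illustrative sketch: treat a silent task as lost only after both many
// unanswered reconciliation attempts AND a long wall-clock window have
// elapsed, so a transient backlog cannot trigger a snowball of kills.
// MIN_ATTEMPTS and MIN_ELAPSED_MS are assumed thresholds.
import java.util.HashMap;
import java.util.Map;

public class LastResortPolicy {
    static final int MIN_ATTEMPTS = 10;
    static final long MIN_ELAPSED_MS = 60 * 60 * 1000; // one hour

    private final Map<String, Integer> attempts = new HashMap<>();
    private final Map<String, Long> firstAskedMs = new HashMap<>();

    // Record that we reconciled taskId at time nowMs with no answer yet.
    void recordAttempt(String taskId, long nowMs) {
        attempts.merge(taskId, 1, Integer::sum);
        firstAskedMs.putIfAbsent(taskId, nowMs);
    }

    // An answer arrived; the task is no longer in doubt.
    void recordAnswer(String taskId) {
        attempts.remove(taskId);
        firstAskedMs.remove(taskId);
    }

    // Give up only when BOTH thresholds are exceeded.
    boolean shouldGiveUp(String taskId, long nowMs) {
        Integer n = attempts.get(taskId);
        Long first = firstAskedMs.get(taskId);
        return n != null && first != null
            && n >= MIN_ATTEMPTS && (nowMs - first) >= MIN_ELAPSED_MS;
    }
}
```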


*(4) Does rate limiting affect this?*

When enabled, rate limiting currently only operates on the rate of incoming
messages from a particular framework, so the number of updates sent back
has no effect on the limiting.


On Wed, Oct 15, 2014 at 3:22 PM, Sharma Podila <sp...@netflix.com> wrote:

> Looks like a good step forward.
>
> What is the reason for the algorithm having to call reconcile tasks
> multiple times after waiting some time in step 6? Shouldn't it be just once
> per (re)registration?
>

> Are there time bound guarantees within which a task update will be sent
> out after a reconcile request is sent? In the algorithm for task
> reconciliation, what would be a good timeout after which we conclude that
> we got no task update from the master? Upon such a timeout, I would be
> tempted to conclude that the task has disappeared. In which case, I would
> call driver.killTask() (to be sure its marked as gone), mark my task as
> terminated, then submit a replacement task.
>
> Does the "rate limiting" feature (in the works?) affect task
> reconciliation due to the volume of task updates sent back?
>
> Thanks.
>
>
> On Wed, Oct 15, 2014 at 2:05 PM, Benjamin Mahler <
> benjamin.mahler@gmail.com> wrote:
>
>> Hi all,
>>
>> I've sent a review out for a document describing reconciliation, you can
>> see the draft here:
>> https://gist.github.com/bmahler/18409fc4f052df43f403
>>
>> Would love to gather high level feedback on it from framework developers.
>> Feel free to reply here, or on the review:
>> https://reviews.apache.org/r/26669/
>>
>> Thanks!
>> Ben
>>
>
>

Re: Reconciliation Document

Posted by Sharma Podila <sp...@netflix.com>.
Looks like a good step forward.

What is the reason for the algorithm having to call reconcile tasks
multiple times after waiting some time in step 6? Shouldn't it be just once
per (re)registration?

Are there time-bound guarantees within which a task update will be sent out
after a reconcile request is sent? In the algorithm for task
reconciliation, what would be a good timeout after which we conclude that
we got no task update from the master? Upon such a timeout, I would be
tempted to conclude that the task has disappeared. In which case, I would
call driver.killTask() (to be sure it's marked as gone), mark my task as
terminated, then submit a replacement task.

Does the "rate limiting" feature (in the works?) affect task reconciliation
due to the volume of task updates sent back?

Thanks.


On Wed, Oct 15, 2014 at 2:05 PM, Benjamin Mahler <be...@gmail.com>
wrote:

> Hi all,
>
> I've sent a review out for a document describing reconciliation, you can
> see the draft here:
> https://gist.github.com/bmahler/18409fc4f052df43f403
>
> Would love to gather high level feedback on it from framework developers.
> Feel free to reply here, or on the review:
> https://reviews.apache.org/r/26669/
>
> Thanks!
> Ben
>

Re: Reconciliation Document

Posted by Connor Doyle <co...@mesosphere.io>.
Thanks for writing this up Ben! I have a couple suggestions about additional details that could be helpful to explain.

First, could you go a little more in-depth about how this process works for terminated tasks? For example, how does reconciliation behave for tasks running on a slave that has become disconnected from the master? An overview of the various timeouts involved would also be really awesome.

Second, what happens when a framework attempts to reconcile a task that is completely unknown to Mesos? An example scenario could be that a task died, the terminal status update was ACKed, but the scheduler failed over before this information could be persisted. What task status (if any) does Mesos respond with?
--
Connor Doyle
http://mesosphere.io


On Oct 15, 2014, at 14:05, Benjamin Mahler <be...@gmail.com> wrote:

> Hi all,
> 
> I've sent a review out for a document describing reconciliation, you can see the draft here:
> https://gist.github.com/bmahler/18409fc4f052df43f403
> 
> Would love to gather high level feedback on it from framework developers. Feel free to reply here, or on the review:
> https://reviews.apache.org/r/26669/
> 
> Thanks!
> Ben


Re: Reconciliation Document

Posted by Benjamin Mahler <be...@gmail.com>.
Connor,

Frameworks should only be reconciling tasks that they believe are
non-terminal (but it's a race, of course). If you attempt to reconcile a
terminal or unknown task, currently you will receive TASK_LOST. However,
you will receive the actual terminal update before receiving this TASK_LOST.

When considering reconciliation and status update retries, the status
update stream will look like the following:

[STAGING, STAGING, ..., RUNNING, RUNNING, ..., FINISHED, FINISHED, ...,
LOST, LOST, ...]

The invariant here is that the stream is ordered, and may contain
duplicates within the order. With reconciliation, you may now receive LOST
updates at any point after you receive the initial terminal update.
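Under this invariant, a scheduler can safely ignore duplicates and any LOST
updates that arrive after a terminal state has already been recorded. A
minimal sketch, with task states simplified to strings rather than the real
Mesos protobuf enums:

```java
// Hypothetical scheduler-side filter for the update stream described
// above: the stream is ordered but may contain duplicates, and
// reconciliation may inject TASK_LOST after a terminal update.
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class UpdateDeduper {
    static final Set<String> TERMINAL =
        Set.of("FINISHED", "FAILED", "KILLED", "LOST");
    private final Map<String, String> lastState = new HashMap<>();

    // Returns true if the update carries new information and should be
    // acted upon; duplicates and post-terminal LOST updates are ignored.
    boolean accept(String taskId, String state) {
        String prev = lastState.get(taskId);
        if (prev != null && TERMINAL.contains(prev)) {
            return false; // already terminal; ignore retries and late LOST
        }
        if (state.equals(prev)) {
            return false; // duplicate of the current state
        }
        lastState.put(taskId, state);
        return true;
    }

    public static void main(String[] args) {
        UpdateDeduper d = new UpdateDeduper();
        String[] stream = {"STAGING", "STAGING", "RUNNING",
                           "FINISHED", "FINISHED", "LOST"};
        for (String s : stream) {
            System.out.println(s + " -> " + d.accept("task-1", s));
        }
    }
}
```

Note that acknowledgements should still be sent for the ignored updates,
otherwise the slave will keep retrying them.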

Given that a more formalized description of status update invariants is
needed, I think a separate document is warranted. :)

On Thu, Oct 16, 2014 at 6:04 PM, Connor Doyle <co...@mesosphere.io> wrote:

> Thanks for writing this up Ben! I have a couple suggestions about
> additional details that could be helpful to explain.
>
> First, could you go a little more in-depth about how this process works
> for terminated tasks? For example, how does reconciliation behave for tasks
> running on a slave that has become disconnected from the master? An
> overview of the various timeouts involved would also be really awesome.
>
> Second, what happens when a framework attempts to reconcile a task that is
> completely unknown to Mesos? An example scenario could be that a task died,
> the terminal status update was ACKed, but the scheduler failed over before
> this information could be persisted. What task status (if any) does Mesos
> respond with?
> --
> Connor Doyle
> http://mesosphere.io
>
>
> On Oct 15, 2014, at 14:05, Benjamin Mahler <be...@gmail.com>
> wrote:
>
> > Hi all,
> >
> > I've sent a review out for a document describing reconciliation, you can
> see the draft here:
> > https://gist.github.com/bmahler/18409fc4f052df43f403
> >
> > Would love to gather high level feedback on it from framework
> developers. Feel free to reply here, or on the review:
> > https://reviews.apache.org/r/26669/
> >
> > Thanks!
> > Ben
>
>