You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@mesos.apache.org by Benno Evers <be...@mesosphere.com> on 2019/01/03 22:30:08 UTC

Re: Discussion: Scheduler API for Operation Reconciliation

Hi Chun-Hung,

> imagine that there are 1k nodes and 10 active + 10 gone LRPs per node,
then the master need to maintain 20k entries for LRPs.

How big would the required additional storage be in this scenario? Even if
it's 1KiB per LRP, using 20 MiB of extra memory doesn't sound too bad for
such a big custer.

In general, it seems hard to discuss the trade-offs between your proposals
without looking at the users of that API - do you know if there are ayn
frameworks out there that already use
 operation reconciliation, and if so what do they do based on the
reconciliation response?

As far as I know, we don't have any formal guarantees on which operations
status changes the framework will receive without reconciliation. So
putting on my framework-implementer hat it seems like I'd have no choice
but to implement a continously polling background loop anyways if I care
about knowing the latest operation statuses. If this is indeed the case,
having a synchronous `RECONCILE_OPERATIONS` would seem to have little
additional benefit.

Best regards,
Benno

On Wed, Dec 12, 2018 at 4:07 AM Chun-Hung Hsiao <ch...@apache.org> wrote:

> Hi folks,
>
> Recently I've being discussing the problems of the current design of the
> experimental
> `RECONCILE_OPERATIONS` scheduler API with a couple people. The discussion
> was started
> from MESOS-9318 <https://issues.apache.org/jira/browse/MESOS-9318>: when a
> framework receives an `OPERATION_UNKNOWN`, it doesn't know
> if it should retry the operation or not (further details described below).
> As the discussion
> evolves, we realize there are more issues to consider, design-wise and
> implementation-wise, so
> I'd like to reach out to the community to get valuable opinions from you
> guys.
>
> Before I jump right into the issues I'd like to discuss, let me fill you
> guys in with some
> background of operation reconciliation. Since the design of this feature
> was informed by the
> pre-existing implementation of task reconciliation, I'll begin there.
>
> *Task Reconciliation: Design*
>
> The scheduler API has a `RECONCILE` call for a framework to query the
> current statuses
> of its tasks. This call supports the following modes:
>
>    - *Explicit reconciliation*: The framework specifies the list of tasks
>    it wants to know
>    about, and expects status updates for these tasks.
>
>    - *Implicit reconciliation*: The framework does not specify a list of
>    tasks, and simply
>    expects status updates for all tasks the master knows about.
>
> In both cases, the master looks into its in-memory task bookkeeping and
> sends
> *one or more`UPDATE` events* to respond to the reconciliation request.
>
> *Task Reconciliation: Problems*
>
> This API design of task reconciliation has the following shortcomings:
>
>    - (1) There is no clear boundary of when the "reconciliation response"
>    ends, and thus
>    there is
> *no 1-1 correspondence between the reconciliation request and the
> response*.
>    For explicit reconciliation, the framework might wait for an extended
> period
>    of time before it receives all status updates; for implicit
>    reconciliation, there is no way for
>    a framework to tell if it has learned about all of its tasks, which
>    could be inconvenient if
>    the framework has lost its task bookkeeping.
>
>    - (2) The "reconciliation response" may be outdated. If an agent
>    reregisters after a task
>    reconciliation has been responded,
> *the framework wouldn't learn about the tasks **from this recovered agent*.
>    Mesos relies on the framework to call the `RECONCILE` call
>    *periodically* to get up-to-date task statuses.
>
>
>
> *Operation Reconciliation: Design & Problems*
>
> When designing operation reconciliation, we made the `RECONCILE_OPERATIONS`
> call
> *asynchronous request-response style call* that returns a 200 OK with a
> list of operation status
> to avoid (1). However, this design does not resolve (2), and also
> introduces new problems:
>
>    - (3) *The synchronous response could race with the event stream* and
>    the framework
>    does not know which contains the latest operation status.
>
>    - (4) To ensure scalability, the master does not manage local resource
>    providers (LRPs);
>    the agents do. So the master cannot tell if an LRP is temporarily
>    unreachable/recovering
>    or permanently gone. As a result, if the framework explicitly reconciles
>    an LRP operation
>    that the master does not know about, it can only reply
>    `OPERATION_UNKNOWN`, but
>    then *the framework would not know if the operation would come back in
>    the future*,
>    and thus cannot decide if it should reissue another operation, which
>    leads to MESOS-9318 <https://issues.apache.org/jira/browse/MESOS-9318>.
>
>    Note that this is less of a problem for explicit task reconciliation,
>    because in most cases
>    the master can infer task statuses from agent statuses, and in the rare
>    cases that it
>    replies `TASK_UNKNOWN`, it is generally safe for the framework to
>    relaunch another
>    task.
>
>
> *The Open Question*
>
> Now, the big question here is:
> *are the benefits of a synchronous request-responsestyle
> `RECONCILE_OPERATIONS` call worth the complexity it introduces* in order to
> address (3) and (4) in the code? To explain what the complexity would be,
> let me lay out a
> couple proposals we've been discussing:
>
> I. Keep `RECONCILE_OPERATIONS` synchronous
>
> To address (3), we could add a *timestamp* to every operation status as
> well as the
> reconciliation response, so the framework can infer which one is the latest
> status, and if
> it receives a stale operation status update after the reconciliation
> response, it can just
> ack the status update without updating its bookkeeping. But, the framework
> needs to
> deal with a corner case:
>
> *when it receives a reconciliation response containing aterminal operation
> status, it may or may not receive one or more status updatesfor that
> operation later *because of the race.
>
>
> To address (4), we could either: (a) surface the unreachable/gone LRPs to
> the master, or
> (b) forward the explicit reconciliation request to the corresponding agent.
> The complexity
> of (a) is that
> *it might not be scalable for the master to maintain the list ofunreachable
> and gone LRPs*: imagine that there are 1k nodes and 10 active + 10 gone
> LRPs per node, then the master need to maintain 20k entries for LRPs. The
> complexity
> of (b) is that the response wouldn't be computed based on the master's
> state; instead,
> *the master needs to wait for the agent's reply to respond to the
> framework*.
> Note
> that it's probably not scalable to forward implicit reconciliation requests
> to all agents, so
> implicit reconciliation might have to still be responded based on the
> master's state.
>
>
> II. Make `RECONCILE_OPERATIONS` "semi-synchronous"
>
> Instead of returning a 200 OK, the master could return a 202 Accepted with
> an empty
> body, and then
> *reply a single event containing the operation status of all
> requestedoperations in the event stream asynchronously*. Although the
> framework loses the
> 1-1 correspondence between the request and the response, there's still a
> clear boundary
> for a reconciliation response. The advantage of this approach compared to
> proposal I is
> that we don't have a race between the reconciliation response and the event
> stream, so
> no timestamp is required. Still, we have to address (4) through either (a)
> or (b) described
> above, thus the complexity remains. That said, this approach fits with (b)
> better since no
> synchronous response is needed.
>
>
> III. Make `RECONCILE_OPERATIONS` an asynchronous trigger
>
> This would be similar to what we have for task reconciliation. The master
> would return a
> 202 Accepted, and then send
> *one or more `UPDATE_OPERATION_STATUS` events*based on its state for an
> implicit reconciliation, or
> *forward the request to some agent*for an explicit reconciliation. In other
> words, this call plays the role of a trigger of the
> operation status updates. This approach is the simplest in terms of the
> implementation,
> but the trade-off is that the framework needs to live with (1).
>
>
> So far we haven't discussed much about (2) for operation reconciliation, so
> let's also briefly talk
> about it. Potentially (2) can be addressed by making the agent *actively
> push *
> *operation statusupdates to the framework when an LRP is resubscribed*, so
> the framework won't need to do
> periodic operation reconciliation. If we do this in the future, it would
> also be more aligned with
> proposal II or III.
>
> So the question again: is it worth the complexity to keep
> `RECONCILE_OPERATIONS`
> synchronous? I'd like to hear the opinions from the community so we can
> drive towards a better
> API design!
>
> Best,
> Chun-Hung
>


-- 
Benno Evers
Software Engineer, Mesosphere

Re: Discussion: Scheduler API for Operation Reconciliation

Posted by Greg Mann <gr...@mesosphere.io>.
Hey folks,
Sorry to let this thread die out! I wanted to loop back and confirm our
planned approach. We would like to change the v1 scheduler API so that the
RECONCILE_OPERATIONS call no longer receives a synchronous HTTP response,
but instead results in an asynchronous stream of operation status updates
on the scheduler event stream. This mirrors what we currently do for task
reconciliation.

Feel free to chime in on this thread if you have any
questions/comments/concerns. I've added this item to the API working group
agenda
<https://docs.google.com/document/d/1JrF7pA6gcBZ6iyeP5YgDG62ifn0cZIBWw1f_Ler6fLM/edit?usp=sharing>for
this coming Tuesday, March 5. Feel free to join that meeting to participate
in a discussion!

Cheers,
Greg

On Thu, Jan 24, 2019 at 7:10 PM Chun-Hung Hsiao <ch...@mesosphere.io>
wrote:

> I chatted with Jie and Gaston, and here is a brief summary:
>
> 1. The ordering issue between the synchronous response and the event stream
> would lead to extra complication for a framework, and thus the benefit
> doesn't seem to worth the complication.
> 2. However, we should consider not forwarding the reconciliation requests
> to the agents. The status updates doesn't require a trigger, and if the
> agent could report gone and unregistered RPs to the master, the master can
> respond to the reconciliation request itself.
> The only problem I see is that frameworks may see
> `OPERATION_GONE_BY_OPERATOR` -> `OPERATION_UNREACHABLE` ->
> `OPERATION_GONE_BY_OPERATOR`, since the master does not persist gone RPs.
>
> To address the original problem of MESOS-9318, we could do the following:
> (1) Agent is gone => `OPERATION_GONE_BY_OPERATOR`
> (2) Agent is unreachable => `OPERATION_UNREACHABLE`
> (3) Agent is not registered => `OPERATION_RECOVERING`
> (4) Agent is unknown => `OPERATION_UNKNOWN`
> (5) Agent is registered, RP is gone => `OPERATION_GONE_BY_OPERATOR`
> (6) Agent is registered, RP is not registered => `OPERATION_UNREACHABLE` or
> `OPERATION_RECOVERING`
> (7) Agent is registered, RP is unknown => `OPERATION_UNKNOWN`
> (8) Agent is registered, RP is registered => maybe `OPERATION_UNKNOWN`?
>
> So it seems a number of people agree with going with the asynchronous
> responses through the event stream. Please reply if you have other
> opinions!
>
> On Thu, Jan 24, 2019 at 1:39 PM James DeFelice <ja...@gmail.com>
> wrote:
>
> > I've attempted to implement support for operation status reconciliation
> in
> > a framework that I've been building. Option (III) seems most convenient
> > from my perspective as well. A single source of updates:
> >
> > (a) Leads to a cleaner framework design; I've had to poke a few holes in
> > the framework's initial design to deal with multiple event sources,
> leading
> > to increased complexity.
> >
> > (b) Allows frameworks to consume events in the order they arrive (and
> > pushes the responsibility for event ordering back to Mesos). Multiple
> event
> > sources that the framework needs to (possibly) reorder based on a
> timestamp
> > would add further complexity that we should avoid pushing onto framework
> > writers.
> >
> > Some other thoughts:
> >
> > (c) I've implemented a background polling loop for exactly the reason
> that
> > Benno pointed out. An asychronous API call for operation status
> > reconciliation would be fine with me.
> >
> > (d) API consistency is important. Framework devs are used to the way that
> > the task status reconciliation API works, and have come up with solutions
> > for dealing with the lack of boundaries for streams of explicit
> > reconciliation events. The synchronous response defined for the currently
> > published operation status reconciliation call isn't consistent with the
> > rest of the v1 scheduler API, which generated a bit of extra work (for
> me)
> > in the low-level mesos v1 http client lib. Consistency should be a
> primary
> > goal when extending existing API sets.
> >
> > (e) There are probably other ways to solve the problem of a "lack of
> > boundaries within the event stream" for explicit reconciliation requests.
> > If this is this a problem that other framework devs need solved then
> let's
> > address it as a separate issue - and aim to resolve it in a consistent
> way
> > for both task and operation status event streams.
> >
> > (f) It sounds like option (III) would let Mesos send back smarter
> > operation statuses in agent/RP failover cases (UNREACHABLE vs. UNKNOWN).
> > Anything to limit the number of scenarios where UNKNOWN is returned to
> > frameworks sounds good to me.
> >
> > -James
> >
> >
> >
> > On Wed, Jan 16, 2019 at 4:15 PM Benjamin Bannier <
> > benjamin.bannier@mesosphere.io> wrote:
> >
> >> Hi,
> >>
> >> have we reached a conclusion here?
> >>
> >> From the Mesos side of things I would be strongly in favor of proposal
> >> (III). This is not only consistent with what we do with task status
> >> updates, but also would allow us to provide improved operation status
> >> (e.g., `OPERATION_UNREACHABLE` instead of just `OPERATION_UNKNOWN` to
> >> better distinguish non-terminal from terminal operation states. To
> >> accomplish that we wouldn’t need to introduce extra information leakage
> >> (e.g., explicitly keeping master up to date on local resource provider
> >> state and associated internal consistency complications).
> >>
> >> This approach should also simplify framework development as a framework
> >> would only need to watch a single channel to see operation status
> updates
> >> (no need to reconcile different information sources). The benefits of
> >> better status updates and simpler implementation IMO outweigh any
> benefits
> >> of the current approach (disclaimer: I filed the slightly inflammatory
> >> MESOS-9448).
> >>
> >> What is keeping us from moving forward with (III) at this point?
> >>
> >>
> >> Cheers,
> >>
> >> Benjamin
> >>
> >> > On Jan 3, 2019, at 11:30 PM, Benno Evers <be...@mesosphere.com>
> wrote:
> >> >
> >> > Hi Chun-Hung,
> >> >
> >> > > imagine that there are 1k nodes and 10 active + 10 gone LRPs per
> >> node, then the master need to maintain 20k entries for LRPs.
> >> >
> >> > How big would the required additional storage be in this scenario?
> Even
> >> if it's 1KiB per LRP, using 20 MiB of extra memory doesn't sound too bad
> >> for such a big custer.
> >> >
> >> > In general, it seems hard to discuss the trade-offs between your
> >> proposals without looking at the users of that API - do you know if
> there
> >> are ayn frameworks out there that already use
> >> >  operation reconciliation, and if so what do they do based on the
> >> reconciliation response?
> >> >
> >> > As far as I know, we don't have any formal guarantees on which
> >> operations status changes the framework will receive without
> >> reconciliation. So putting on my framework-implementer hat it seems like
> >> I'd have no choice but to implement a continously polling background
> loop
> >> anyways if I care about knowing the latest operation statuses. If this
> is
> >> indeed the case, having a synchronous `RECONCILE_OPERATIONS` would seem
> to
> >> have little additional benefit.
> >> >
> >> > Best regards,
> >> > Benno
> >> >
> >> > On Wed, Dec 12, 2018 at 4:07 AM Chun-Hung Hsiao <ch...@apache.org>
> >> wrote:
> >> > Hi folks,
> >> >
> >> > Recently I've being discussing the problems of the current design of
> the
> >> > experimental
> >> > `RECONCILE_OPERATIONS` scheduler API with a couple people. The
> >> discussion
> >> > was started
> >> > from MESOS-9318 <https://issues.apache.org/jira/browse/MESOS-9318>:
> >> when a
> >> > framework receives an `OPERATION_UNKNOWN`, it doesn't know
> >> > if it should retry the operation or not (further details described
> >> below).
> >> > As the discussion
> >> > evolves, we realize there are more issues to consider, design-wise and
> >> > implementation-wise, so
> >> > I'd like to reach out to the community to get valuable opinions from
> you
> >> > guys.
> >> >
> >> > Before I jump right into the issues I'd like to discuss, let me fill
> you
> >> > guys in with some
> >> > background of operation reconciliation. Since the design of this
> feature
> >> > was informed by the
> >> > pre-existing implementation of task reconciliation, I'll begin there.
> >> >
> >> > *Task Reconciliation: Design*
> >> >
> >> > The scheduler API has a `RECONCILE` call for a framework to query the
> >> > current statuses
> >> > of its tasks. This call supports the following modes:
> >> >
> >> >    - *Explicit reconciliation*: The framework specifies the list of
> >> tasks
> >> >    it wants to know
> >> >    about, and expects status updates for these tasks.
> >> >
> >> >    - *Implicit reconciliation*: The framework does not specify a list
> of
> >> >    tasks, and simply
> >> >    expects status updates for all tasks the master knows about.
> >> >
> >> > In both cases, the master looks into its in-memory task bookkeeping
> and
> >> > sends
> >> > *one or more`UPDATE` events* to respond to the reconciliation request.
> >> >
> >> > *Task Reconciliation: Problems*
> >> >
> >> > This API design of task reconciliation has the following shortcomings:
> >> >
> >> >    - (1) There is no clear boundary of when the "reconciliation
> >> response"
> >> >    ends, and thus
> >> >    there is
> >> > *no 1-1 correspondence between the reconciliation request and the
> >> response*.
> >> >    For explicit reconciliation, the framework might wait for an
> >> extended period
> >> >    of time before it receives all status updates; for implicit
> >> >    reconciliation, there is no way for
> >> >    a framework to tell if it has learned about all of its tasks, which
> >> >    could be inconvenient if
> >> >    the framework has lost its task bookkeeping.
> >> >
> >> >    - (2) The "reconciliation response" may be outdated. If an agent
> >> >    reregisters after a task
> >> >    reconciliation has been responded,
> >> > *the framework wouldn't learn about the tasks **from this recovered
> >> agent*.
> >> >    Mesos relies on the framework to call the `RECONCILE` call
> >> >    *periodically* to get up-to-date task statuses.
> >> >
> >> >
> >> >
> >> > *Operation Reconciliation: Design & Problems*
> >> >
> >> > When designing operation reconciliation, we made the
> >> `RECONCILE_OPERATIONS`
> >> > call
> >> > *asynchronous request-response style call* that returns a 200 OK with
> a
> >> > list of operation status
> >> > to avoid (1). However, this design does not resolve (2), and also
> >> > introduces new problems:
> >> >
> >> >    - (3) *The synchronous response could race with the event stream*
> and
> >> >    the framework
> >> >    does not know which contains the latest operation status.
> >> >
> >> >    - (4) To ensure scalability, the master does not manage local
> >> resource
> >> >    providers (LRPs);
> >> >    the agents do. So the master cannot tell if an LRP is temporarily
> >> >    unreachable/recovering
> >> >    or permanently gone. As a result, if the framework explicitly
> >> reconciles
> >> >    an LRP operation
> >> >    that the master does not know about, it can only reply
> >> >    `OPERATION_UNKNOWN`, but
> >> >    then *the framework would not know if the operation would come back
> >> in
> >> >    the future*,
> >> >    and thus cannot decide if it should reissue another operation,
> which
> >> >    leads to MESOS-9318 <
> >> https://issues.apache.org/jira/browse/MESOS-9318>.
> >> >
> >> >    Note that this is less of a problem for explicit task
> reconciliation,
> >> >    because in most cases
> >> >    the master can infer task statuses from agent statuses, and in the
> >> rare
> >> >    cases that it
> >> >    replies `TASK_UNKNOWN`, it is generally safe for the framework to
> >> >    relaunch another
> >> >    task.
> >> >
> >> >
> >> > *The Open Question*
> >> >
> >> > Now, the big question here is:
> >> > *are the benefits of a synchronous request-responsestyle
> >> > `RECONCILE_OPERATIONS` call worth the complexity it introduces* in
> >> order to
> >> > address (3) and (4) in the code? To explain what the complexity would
> >> be,
> >> > let me lay out a
> >> > couple proposals we've been discussing:
> >> >
> >> > I. Keep `RECONCILE_OPERATIONS` synchronous
> >> >
> >> > To address (3), we could add a *timestamp* to every operation status
> as
> >> > well as the
> >> > reconciliation response, so the framework can infer which one is the
> >> latest
> >> > status, and if
> >> > it receives a stale operation status update after the reconciliation
> >> > response, it can just
> >> > ack the status update without updating its bookkeeping. But, the
> >> framework
> >> > needs to
> >> > deal with a corner case:
> >> >
> >> > *when it receives a reconciliation response containing aterminal
> >> operation
> >> > status, it may or may not receive one or more status updatesfor that
> >> > operation later *because of the race.
> >> >
> >> >
> >> > To address (4), we could either: (a) surface the unreachable/gone LRPs
> >> to
> >> > the master, or
> >> > (b) forward the explicit reconciliation request to the corresponding
> >> agent.
> >> > The complexity
> >> > of (a) is that
> >> > *it might not be scalable for the master to maintain the list
> >> ofunreachable
> >> > and gone LRPs*: imagine that there are 1k nodes and 10 active + 10
> gone
> >> > LRPs per node, then the master need to maintain 20k entries for LRPs.
> >> The
> >> > complexity
> >> > of (b) is that the response wouldn't be computed based on the master's
> >> > state; instead,
> >> > *the master needs to wait for the agent's reply to respond to the
> >> framework*.
> >> > Note
> >> > that it's probably not scalable to forward implicit reconciliation
> >> requests
> >> > to all agents, so
> >> > implicit reconciliation might have to still be responded based on the
> >> > master's state.
> >> >
> >> >
> >> > II. Make `RECONCILE_OPERATIONS` "semi-synchronous"
> >> >
> >> > Instead of returning a 200 OK, the master could return a 202 Accepted
> >> with
> >> > an empty
> >> > body, and then
> >> > *reply a single event containing the operation status of all
> >> > requestedoperations in the event stream asynchronously*. Although the
> >> > framework loses the
> >> > 1-1 correspondence between the request and the response, there's
> still a
> >> > clear boundary
> >> > for a reconciliation response. The advantage of this approach compared
> >> to
> >> > proposal I is
> >> > that we don't have a race between the reconciliation response and the
> >> event
> >> > stream, so
> >> > no timestamp is required. Still, we have to address (4) through either
> >> (a)
> >> > or (b) described
> >> > above, thus the complexity remains. That said, this approach fits with
> >> (b)
> >> > better since no
> >> > synchronous response is needed.
> >> >
> >> >
> >> > III. Make `RECONCILE_OPERATIONS` an asynchronous trigger
> >> >
> >> > This would be similar to what we have for task reconciliation. The
> >> master
> >> > would return a
> >> > 202 Accepted, and then send
> >> > *one or more `UPDATE_OPERATION_STATUS` events*based on its state for
> an
> >> > implicit reconciliation, or
> >> > *forward the request to some agent*for an explicit reconciliation. In
> >> other
> >> > words, this call plays the role of a trigger of the
> >> > operation status updates. This approach is the simplest in terms of
> the
> >> > implementation,
> >> > but the trade-off is that the framework needs to live with (1).
> >> >
> >> >
> >> > So far we haven't discussed much about (2) for operation
> >> reconciliation, so
> >> > let's also briefly talk
> >> > about it. Potentially (2) can be addressed by making the agent
> *actively
> >> > push *
> >> > *operation statusupdates to the framework when an LRP is
> resubscribed*,
> >> so
> >> > the framework won't need to do
> >> > periodic operation reconciliation. If we do this in the future, it
> would
> >> > also be more aligned with
> >> > proposal II or III.
> >> >
> >> > So the question again: is it worth the complexity to keep
> >> > `RECONCILE_OPERATIONS`
> >> > synchronous? I'd like to hear the opinions from the community so we
> can
> >> > drive towards a better
> >> > API design!
> >> >
> >> > Best,
> >> > Chun-Hung
> >> >
> >> >
> >> > --
> >> > Benno Evers
> >> > Software Engineer, Mesosphere
> >>
> >>
> >
> > --
> > James DeFelice
> > 585.241.9488 (voice)
> > 650.649.6071 (fax)
> >
>

Re: Discussion: Scheduler API for Operation Reconciliation

Posted by Greg Mann <gr...@mesosphere.io>.
Hey folks,
Sorry to let this thread die out! I wanted to loop back and confirm our
planned approach. We would like to change the v1 scheduler API so that the
RECONCILE_OPERATIONS call no longer receives a synchronous HTTP response,
but instead results in an asynchronous stream of operation status updates
on the scheduler event stream. This mirrors what we currently do for task
reconciliation.

Feel free to chime in on this thread if you have any
questions/comments/concerns. I've added this item to the API working group
agenda
<https://docs.google.com/document/d/1JrF7pA6gcBZ6iyeP5YgDG62ifn0cZIBWw1f_Ler6fLM/edit?usp=sharing>for
this coming Tuesday, March 5. Feel free to join that meeting to participate
in a discussion!

Cheers,
Greg

On Thu, Jan 24, 2019 at 7:10 PM Chun-Hung Hsiao <ch...@mesosphere.io>
wrote:

> I chatted with Jie and Gaston, and here is a brief summary:
>
> 1. The ordering issue between the synchronous response and the event stream
> would lead to extra complication for a framework, and thus the benefit
> doesn't seem to worth the complication.
> 2. However, we should consider not forwarding the reconciliation requests
> to the agents. The status updates doesn't require a trigger, and if the
> agent could report gone and unregistered RPs to the master, the master can
> respond to the reconciliation request itself.
> The only problem I see is that frameworks may see
> `OPERATION_GONE_BY_OPERATOR` -> `OPERATION_UNREACHABLE` ->
> `OPERATION_GONE_BY_OPERATOR`, since the master does not persist gone RPs.
>
> To address the original problem of MESOS-9318, we could do the following:
> (1) Agent is gone => `OPERATION_GONE_BY_OPERATOR`
> (2) Agent is unreachable => `OPERATION_UNREACHABLE`
> (3) Agent is not registered => `OPERATION_RECOVERING`
> (4) Agent is unknown => `OPERATION_UNKNOWN`
> (5) Agent is registered, RP is gone => `OPERATION_GONE_BY_OPERATOR`
> (6) Agent is registered, RP is not registered => `OPERATION_UNREACHABLE` or
> `OPERATION_RECOVERING`
> (7) Agent is registered, RP is unknown => `OPERATION_UNKNOWN`
> (8) Agent is registered, RP is registered => maybe `OPERATION_UNKNOWN`?
>
> So it seems a number of people agree with going with the asynchronous
> responses through the event stream. Please reply if you have other
> opinions!
>
> On Thu, Jan 24, 2019 at 1:39 PM James DeFelice <ja...@gmail.com>
> wrote:
>
> > I've attempted to implement support for operation status reconciliation
> in
> > a framework that I've been building. Option (III) seems most convenient
> > from my perspective as well. A single source of updates:
> >
> > (a) Leads to a cleaner framework design; I've had to poke a few holes in
> > the framework's initial design to deal with multiple event sources,
> leading
> > to increased complexity.
> >
> > (b) Allows frameworks to consume events in the order they arrive (and
> > pushes the responsibility for event ordering back to Mesos). Multiple
> event
> > sources that the framework needs to (possibly) reorder based on a
> timestamp
> > would add further complexity that we should avoid pushing onto framework
> > writers.
> >
> > Some other thoughts:
> >
> > (c) I've implemented a background polling loop for exactly the reason
> that
> > Benno pointed out. An asychronous API call for operation status
> > reconciliation would be fine with me.
> >
> > (d) API consistency is important. Framework devs are used to the way that
> > the task status reconciliation API works, and have come up with solutions
> > for dealing with the lack of boundaries for streams of explicit
> > reconciliation events. The synchronous response defined for the currently
> > published operation status reconciliation call isn't consistent with the
> > rest of the v1 scheduler API, which generated a bit of extra work (for
> me)
> > in the low-level mesos v1 http client lib. Consistency should be a
> primary
> > goal when extending existing API sets.
> >
> > (e) There are probably other ways to solve the problem of a "lack of
> > boundaries within the event stream" for explicit reconciliation requests.
> > If this is this a problem that other framework devs need solved then
> let's
> > address it as a separate issue - and aim to resolve it in a consistent
> way
> > for both task and operation status event streams.
> >
> > (f) It sounds like option (III) would let Mesos send back smarter
> > operation statuses in agent/RP failover cases (UNREACHABLE vs. UNKNOWN).
> > Anything to limit the number of scenarios where UNKNOWN is returned to
> > frameworks sounds good to me.
> >
> > -James
> >
> >
> >
> > On Wed, Jan 16, 2019 at 4:15 PM Benjamin Bannier <
> > benjamin.bannier@mesosphere.io> wrote:
> >
> >> Hi,
> >>
> >> have we reached a conclusion here?
> >>
> >> From the Mesos side of things I would be strongly in favor of proposal
> >> (III). This is not only consistent with what we do with task status
> >> updates, but also would allow us to provide improved operation status
> >> (e.g., `OPERATION_UNREACHABLE` instead of just `OPERATION_UNKNOWN` to
> >> better distinguish non-terminal from terminal operation states. To
> >> accomplish that we wouldn’t need to introduce extra information leakage
> >> (e.g., explicitly keeping master up to date on local resource provider
> >> state and associated internal consistency complications).
> >>
> >> This approach should also simplify framework development as a framework
> >> would only need to watch a single channel to see operation status
> updates
> >> (no need to reconcile different information sources). The benefits of
> >> better status updates and simpler implementation IMO outweigh any
> benefits
> >> of the current approach (disclaimer: I filed the slightly inflammatory
> >> MESOS-9448).
> >>
> >> What is keeping us from moving forward with (III) at this point?
> >>
> >>
> >> Cheers,
> >>
> >> Benjamin
> >>
> >> > On Jan 3, 2019, at 11:30 PM, Benno Evers <be...@mesosphere.com>
> wrote:
> >> >
> >> > Hi Chun-Hung,
> >> >
> >> > > imagine that there are 1k nodes and 10 active + 10 gone LRPs per
> >> node, then the master need to maintain 20k entries for LRPs.
> >> >
> >> > How big would the required additional storage be in this scenario?
> Even
> >> if it's 1KiB per LRP, using 20 MiB of extra memory doesn't sound too bad
> >> for such a big custer.
> >> >
> >> > In general, it seems hard to discuss the trade-offs between your
> >> proposals without looking at the users of that API - do you know if
> there
> >> are ayn frameworks out there that already use
> >> >  operation reconciliation, and if so what do they do based on the
> >> reconciliation response?
> >> >
> >> > As far as I know, we don't have any formal guarantees on which
> >> operations status changes the framework will receive without
> >> reconciliation. So putting on my framework-implementer hat it seems like
> >> I'd have no choice but to implement a continously polling background
> loop
> >> anyways if I care about knowing the latest operation statuses. If this
> is
> >> indeed the case, having a synchronous `RECONCILE_OPERATIONS` would seem
> to
> >> have little additional benefit.
> >> >
> >> > Best regards,
> >> > Benno
> >> >
> >> > On Wed, Dec 12, 2018 at 4:07 AM Chun-Hung Hsiao <ch...@apache.org>
> >> wrote:
> >> > Hi folks,
> >> >
> >> > Recently I've being discussing the problems of the current design of
> the
> >> > experimental
> >> > `RECONCILE_OPERATIONS` scheduler API with a couple people. The
> >> discussion
> >> > was started
> >> > from MESOS-9318 <https://issues.apache.org/jira/browse/MESOS-9318>:
> >> when a
> >> > framework receives an `OPERATION_UNKNOWN`, it doesn't know
> >> > if it should retry the operation or not (further details described
> >> below).
> >> > As the discussion
> >> > evolves, we realize there are more issues to consider, design-wise and
> >> > implementation-wise, so
> >> > I'd like to reach out to the community to get valuable opinions from
> you
> >> > guys.
> >> >
> >> > Before I jump right into the issues I'd like to discuss, let me fill
> you
> >> > guys in with some
> >> > background of operation reconciliation. Since the design of this
> feature
> >> > was informed by the
> >> > pre-existing implementation of task reconciliation, I'll begin there.
> >> >
> >> > *Task Reconciliation: Design*
> >> >
> >> > The scheduler API has a `RECONCILE` call for a framework to query the
> >> > current statuses
> >> > of its tasks. This call supports the following modes:
> >> >
> >> >    - *Explicit reconciliation*: The framework specifies the list of
> >> tasks
> >> >    it wants to know
> >> >    about, and expects status updates for these tasks.
> >> >
> >> >    - *Implicit reconciliation*: The framework does not specify a list
> of
> >> >    tasks, and simply
> >> >    expects status updates for all tasks the master knows about.
> >> >
> >> > In both cases, the master looks into its in-memory task bookkeeping
> and
> >> > sends
> >> > *one or more`UPDATE` events* to respond to the reconciliation request.
> >> >
> >> > *Task Reconciliation: Problems*
> >> >
> >> > This API design of task reconciliation has the following shortcomings:
> >> >
> >> >    - (1) There is no clear boundary of when the "reconciliation
> >> response"
> >> >    ends, and thus
> >> >    there is
> >> > *no 1-1 correspondence between the reconciliation request and the
> >> response*.
> >> >    For explicit reconciliation, the framework might wait for an
> >> extended period
> >> >    of time before it receives all status updates; for implicit
> >> >    reconciliation, there is no way for
> >> >    a framework to tell if it has learned about all of its tasks, which
> >> >    could be inconvenient if
> >> >    the framework has lost its task bookkeeping.
> >> >
> >> >    - (2) The "reconciliation response" may be outdated. If an agent
> >> >    reregisters after a task
> >> >    reconciliation has been responded,
> >> > *the framework wouldn't learn about the tasks **from this recovered
> >> agent*.
> >> >    Mesos relies on the framework to call the `RECONCILE` call
> >> >    *periodically* to get up-to-date task statuses.
> >> >
> >> >
> >> >
> >> > *Operation Reconciliation: Design & Problems*
> >> >
> >> > When designing operation reconciliation, we made the
> >> `RECONCILE_OPERATIONS`
> >> > call
> >> > *asynchronous request-response style call* that returns a 200 OK with
> a
> >> > list of operation status
> >> > to avoid (1). However, this design does not resolve (2), and also
> >> > introduces new problems:
> >> >
> >> >    - (3) *The synchronous response could race with the event stream*
> and
> >> >    the framework
> >> >    does not know which contains the latest operation status.
> >> >
> >> >    - (4) To ensure scalability, the master does not manage local
> >> resource
> >> >    providers (LRPs);
> >> >    the agents do. So the master cannot tell if an LRP is temporarily
> >> >    unreachable/recovering
> >> >    or permanently gone. As a result, if the framework explicitly
> >> reconciles
> >> >    an LRP operation
> >> >    that the master does not know about, it can only reply
> >> >    `OPERATION_UNKNOWN`, but
> >> >    then *the framework would not know if the operation would come back
> >> in
> >> >    the future*,
> >> >    and thus cannot decide if it should reissue another operation,
> which
> >> >    leads to MESOS-9318 <
> >> https://issues.apache.org/jira/browse/MESOS-9318>.
> >> >
> >> >    Note that this is less of a problem for explicit task
> reconciliation,
> >> >    because in most cases
> >> >    the master can infer task statuses from agent statuses, and in the
> >> rare
> >> >    cases that it
> >> >    replies `TASK_UNKNOWN`, it is generally safe for the framework to
> >> >    relaunch another
> >> >    task.
> >> >
> >> >
> >> > *The Open Question*
> >> >
> >> > Now, the big question here is:
> >> > *are the benefits of a synchronous request-responsestyle
> >> > `RECONCILE_OPERATIONS` call worth the complexity it introduces* in
> >> order to
> >> > address (3) and (4) in the code? To explain what the complexity would
> >> be,
> >> > let me lay out a
> >> > couple proposals we've been discussing:
> >> >
> >> > I. Keep `RECONCILE_OPERATIONS` synchronous
> >> >
> >> > To address (3), we could add a *timestamp* to every operation status
> as
> >> > well as the
> >> > reconciliation response, so the framework can infer which one is the
> >> latest
> >> > status, and if
> >> > it receives a stale operation status update after the reconciliation
> >> > response, it can just
> >> > ack the status update without updating its bookkeeping. But, the
> >> framework
> >> > needs to
> >> > deal with a corner case:
> >> >
> >> > *when it receives a reconciliation response containing aterminal
> >> operation
> >> > status, it may or may not receive one or more status updatesfor that
> >> > operation later *because of the race.
> >> >
> >> >
> >> > To address (4), we could either: (a) surface the unreachable/gone LRPs
> >> to
> >> > the master, or
> >> > (b) forward the explicit reconciliation request to the corresponding
> >> agent.
> >> > The complexity
> >> > of (a) is that
> >> > *it might not be scalable for the master to maintain the list
> >> ofunreachable
> >> > and gone LRPs*: imagine that there are 1k nodes and 10 active + 10
> gone
> >> > LRPs per node, then the master need to maintain 20k entries for LRPs.
> >> The
> >> > complexity
> >> > of (b) is that the response wouldn't be computed based on the master's
> >> > state; instead,
> >> > *the master needs to wait for the agent's reply to respond to the
> >> framework*.
> >> > Note
> >> > that it's probably not scalable to forward implicit reconciliation
> >> requests
> >> > to all agents, so
> >> > implicit reconciliation might have to still be responded based on the
> >> > master's state.
> >> >
> >> >
> >> > II. Make `RECONCILE_OPERATIONS` "semi-synchronous"
> >> >
> >> > Instead of returning a 200 OK, the master could return a 202 Accepted
> >> with
> >> > an empty
> >> > body, and then
> >> > *reply a single event containing the operation status of all
> >> > requestedoperations in the event stream asynchronously*. Although the
> >> > framework loses the
> >> > 1-1 correspondence between the request and the response, there's
> still a
> >> > clear boundary
> >> > for a reconciliation response. The advantage of this approach compared
> >> to
> >> > proposal I is
> >> > that we don't have a race between the reconciliation response and the
> >> event
> >> > stream, so
> >> > no timestamp is required. Still, we have to address (4) through either
> >> (a)
> >> > or (b) described
> >> > above, thus the complexity remains. That said, this approach fits with
> >> (b)
> >> > better since no
> >> > synchronous response is needed.
> >> >
> >> >
> >> > III. Make `RECONCILE_OPERATIONS` an asynchronous trigger
> >> >
> >> > This would be similar to what we have for task reconciliation. The
> >> master
> >> > would return a
> >> > 202 Accepted, and then send
> >> > *one or more `UPDATE_OPERATION_STATUS` events*based on its state for
> an
> >> > implicit reconciliation, or
> >> > *forward the request to some agent*for an explicit reconciliation. In
> >> other
> >> > words, this call plays the role of a trigger of the
> >> > operation status updates. This approach is the simplest in terms of
> the
> >> > implementation,
> >> > but the trade-off is that the framework needs to live with (1).
> >> >
> >> >
> >> > So far we haven't discussed much about (2) for operation
> >> reconciliation, so
> >> > let's also briefly talk
> >> > about it. Potentially (2) can be addressed by making the agent
> *actively
> >> > push *
> >> > *operation statusupdates to the framework when an LRP is
> resubscribed*,
> >> so
> >> > the framework won't need to do
> >> > periodic operation reconciliation. If we do this in the future, it
> would
> >> > also be more aligned with
> >> > proposal II or III.
> >> >
> >> > So the question again: is it worth the complexity to keep
> >> > `RECONCILE_OPERATIONS`
> >> > synchronous? I'd like to hear the opinions from the community so we
> can
> >> > drive towards a better
> >> > API design!
> >> >
> >> > Best,
> >> > Chun-Hung
> >> >
> >> >
> >> > --
> >> > Benno Evers
> >> > Software Engineer, Mesosphere
> >>
> >>
> >
> > --
> > James DeFelice
> > 585.241.9488 (voice)
> > 650.649.6071 (fax)
> >
>

Re: Discussion: Scheduler API for Operation Reconciliation

Posted by Chun-Hung Hsiao <ch...@mesosphere.io>.
I chatted with Jie and Gaston, and here is a brief summary:

1. The ordering issue between the synchronous response and the event stream
would lead to extra complication for a framework, and thus the benefit
doesn't seem to worth the complication.
2. However, we should consider not forwarding the reconciliation requests
to the agents. The status updates doesn't require a trigger, and if the
agent could report gone and unregistered RPs to the master, the master can
respond to the reconciliation request itself.
The only problem I see is that frameworks may see
`OPERATION_GONE_BY_OPERATOR` -> `OPERATION_UNREACHABLE` ->
`OPERATION_GONE_BY_OPERATOR`, since the master does not persist gone RPs.

To address the original problem of MESOS-9318, we could do the following:
(1) Agent is gone => `OPERATION_GONE_BY_OPERATOR`
(2) Agent is unreachable => `OPERATION_UNREACHABLE`
(3) Agent is not registered => `OPERATION_RECOVERING`
(4) Agent is unknown => `OPERATION_UNKNOWN`
(5) Agent is registered, RP is gone => `OPERATION_GONE_BY_OPERATOR`
(6) Agent is registered, RP is not registered => `OPERATION_UNREACHABLE` or
`OPERATION_RECOVERING`
(7) Agent is registered, RP is unknown => `OPERATION_UNKNOWN`
(8) Agent is registered, RP is registered => maybe `OPERATION_UNKNOWN`?

So it seems a number of people agree with going with the asynchronous
responses through the event stream. Please reply if you have other opinions!

On Thu, Jan 24, 2019 at 1:39 PM James DeFelice <ja...@gmail.com>
wrote:

> I've attempted to implement support for operation status reconciliation in
> a framework that I've been building. Option (III) seems most convenient
> from my perspective as well. A single source of updates:
>
> (a) Leads to a cleaner framework design; I've had to poke a few holes in
> the framework's initial design to deal with multiple event sources, leading
> to increased complexity.
>
> (b) Allows frameworks to consume events in the order they arrive (and
> pushes the responsibility for event ordering back to Mesos). Multiple event
> sources that the framework needs to (possibly) reorder based on a timestamp
> would add further complexity that we should avoid pushing onto framework
> writers.
>
> Some other thoughts:
>
> (c) I've implemented a background polling loop for exactly the reason that
> Benno pointed out. An asychronous API call for operation status
> reconciliation would be fine with me.
>
> (d) API consistency is important. Framework devs are used to the way that
> the task status reconciliation API works, and have come up with solutions
> for dealing with the lack of boundaries for streams of explicit
> reconciliation events. The synchronous response defined for the currently
> published operation status reconciliation call isn't consistent with the
> rest of the v1 scheduler API, which generated a bit of extra work (for me)
> in the low-level mesos v1 http client lib. Consistency should be a primary
> goal when extending existing API sets.
>
> (e) There are probably other ways to solve the problem of a "lack of
> boundaries within the event stream" for explicit reconciliation requests.
> If this is this a problem that other framework devs need solved then let's
> address it as a separate issue - and aim to resolve it in a consistent way
> for both task and operation status event streams.
>
> (f) It sounds like option (III) would let Mesos send back smarter
> operation statuses in agent/RP failover cases (UNREACHABLE vs. UNKNOWN).
> Anything to limit the number of scenarios where UNKNOWN is returned to
> frameworks sounds good to me.
>
> -James
>
>
>
> On Wed, Jan 16, 2019 at 4:15 PM Benjamin Bannier <
> benjamin.bannier@mesosphere.io> wrote:
>
>> Hi,
>>
>> have we reached a conclusion here?
>>
>> From the Mesos side of things I would be strongly in favor of proposal
>> (III). This is not only consistent with what we do with task status
>> updates, but also would allow us to provide improved operation status
>> (e.g., `OPERATION_UNREACHABLE` instead of just `OPERATION_UNKNOWN` to
>> better distinguish non-terminal from terminal operation states. To
>> accomplish that we wouldn’t need to introduce extra information leakage
>> (e.g., explicitly keeping master up to date on local resource provider
>> state and associated internal consistency complications).
>>
>> This approach should also simplify framework development as a framework
>> would only need to watch a single channel to see operation status updates
>> (no need to reconcile different information sources). The benefits of
>> better status updates and simpler implementation IMO outweigh any benefits
>> of the current approach (disclaimer: I filed the slightly inflammatory
>> MESOS-9448).
>>
>> What is keeping us from moving forward with (III) at this point?
>>
>>
>> Cheers,
>>
>> Benjamin
>>
>> > On Jan 3, 2019, at 11:30 PM, Benno Evers <be...@mesosphere.com> wrote:
>> >
>> > Hi Chun-Hung,
>> >
>> > > imagine that there are 1k nodes and 10 active + 10 gone LRPs per
>> node, then the master need to maintain 20k entries for LRPs.
>> >
>> > How big would the required additional storage be in this scenario? Even
>> if it's 1KiB per LRP, using 20 MiB of extra memory doesn't sound too bad
>> for such a big custer.
>> >
>> > In general, it seems hard to discuss the trade-offs between your
>> proposals without looking at the users of that API - do you know if there
>> are ayn frameworks out there that already use
>> >  operation reconciliation, and if so what do they do based on the
>> reconciliation response?
>> >
>> > As far as I know, we don't have any formal guarantees on which
>> operations status changes the framework will receive without
>> reconciliation. So putting on my framework-implementer hat it seems like
>> I'd have no choice but to implement a continously polling background loop
>> anyways if I care about knowing the latest operation statuses. If this is
>> indeed the case, having a synchronous `RECONCILE_OPERATIONS` would seem to
>> have little additional benefit.
>> >
>> > Best regards,
>> > Benno
>> >
>> > On Wed, Dec 12, 2018 at 4:07 AM Chun-Hung Hsiao <ch...@apache.org>
>> wrote:
>> > Hi folks,
>> >
>> > Recently I've being discussing the problems of the current design of the
>> > experimental
>> > `RECONCILE_OPERATIONS` scheduler API with a couple people. The
>> discussion
>> > was started
>> > from MESOS-9318 <https://issues.apache.org/jira/browse/MESOS-9318>:
>> when a
>> > framework receives an `OPERATION_UNKNOWN`, it doesn't know
>> > if it should retry the operation or not (further details described
>> below).
>> > As the discussion
>> > evolves, we realize there are more issues to consider, design-wise and
>> > implementation-wise, so
>> > I'd like to reach out to the community to get valuable opinions from you
>> > guys.
>> >
>> > Before I jump right into the issues I'd like to discuss, let me fill you
>> > guys in with some
>> > background of operation reconciliation. Since the design of this feature
>> > was informed by the
>> > pre-existing implementation of task reconciliation, I'll begin there.
>> >
>> > *Task Reconciliation: Design*
>> >
>> > The scheduler API has a `RECONCILE` call for a framework to query the
>> > current statuses
>> > of its tasks. This call supports the following modes:
>> >
>> >    - *Explicit reconciliation*: The framework specifies the list of
>> tasks
>> >    it wants to know
>> >    about, and expects status updates for these tasks.
>> >
>> >    - *Implicit reconciliation*: The framework does not specify a list of
>> >    tasks, and simply
>> >    expects status updates for all tasks the master knows about.
>> >
>> > In both cases, the master looks into its in-memory task bookkeeping and
>> > sends
>> > *one or more`UPDATE` events* to respond to the reconciliation request.
>> >
>> > *Task Reconciliation: Problems*
>> >
>> > This API design of task reconciliation has the following shortcomings:
>> >
>> >    - (1) There is no clear boundary of when the "reconciliation
>> response"
>> >    ends, and thus
>> >    there is
>> > *no 1-1 correspondence between the reconciliation request and the
>> response*.
>> >    For explicit reconciliation, the framework might wait for an
>> extended period
>> >    of time before it receives all status updates; for implicit
>> >    reconciliation, there is no way for
>> >    a framework to tell if it has learned about all of its tasks, which
>> >    could be inconvenient if
>> >    the framework has lost its task bookkeeping.
>> >
>> >    - (2) The "reconciliation response" may be outdated. If an agent
>> >    reregisters after a task
>> >    reconciliation has been responded,
>> > *the framework wouldn't learn about the tasks **from this recovered
>> agent*.
>> >    Mesos relies on the framework to call the `RECONCILE` call
>> >    *periodically* to get up-to-date task statuses.
>> >
>> >
>> >
>> > *Operation Reconciliation: Design & Problems*
>> >
>> > When designing operation reconciliation, we made the
>> `RECONCILE_OPERATIONS`
>> > call
>> > *asynchronous request-response style call* that returns a 200 OK with a
>> > list of operation status
>> > to avoid (1). However, this design does not resolve (2), and also
>> > introduces new problems:
>> >
>> >    - (3) *The synchronous response could race with the event stream* and
>> >    the framework
>> >    does not know which contains the latest operation status.
>> >
>> >    - (4) To ensure scalability, the master does not manage local
>> resource
>> >    providers (LRPs);
>> >    the agents do. So the master cannot tell if an LRP is temporarily
>> >    unreachable/recovering
>> >    or permanently gone. As a result, if the framework explicitly
>> reconciles
>> >    an LRP operation
>> >    that the master does not know about, it can only reply
>> >    `OPERATION_UNKNOWN`, but
>> >    then *the framework would not know if the operation would come back
>> in
>> >    the future*,
>> >    and thus cannot decide if it should reissue another operation, which
>> >    leads to MESOS-9318 <
>> https://issues.apache.org/jira/browse/MESOS-9318>.
>> >
>> >    Note that this is less of a problem for explicit task reconciliation,
>> >    because in most cases
>> >    the master can infer task statuses from agent statuses, and in the
>> rare
>> >    cases that it
>> >    replies `TASK_UNKNOWN`, it is generally safe for the framework to
>> >    relaunch another
>> >    task.
>> >
>> >
>> > *The Open Question*
>> >
>> > Now, the big question here is:
>> > *are the benefits of a synchronous request-responsestyle
>> > `RECONCILE_OPERATIONS` call worth the complexity it introduces* in
>> order to
>> > address (3) and (4) in the code? To explain what the complexity would
>> be,
>> > let me lay out a
>> > couple proposals we've been discussing:
>> >
>> > I. Keep `RECONCILE_OPERATIONS` synchronous
>> >
>> > To address (3), we could add a *timestamp* to every operation status as
>> > well as the
>> > reconciliation response, so the framework can infer which one is the
>> latest
>> > status, and if
>> > it receives a stale operation status update after the reconciliation
>> > response, it can just
>> > ack the status update without updating its bookkeeping. But, the
>> framework
>> > needs to
>> > deal with a corner case:
>> >
>> > *when it receives a reconciliation response containing aterminal
>> operation
>> > status, it may or may not receive one or more status updatesfor that
>> > operation later *because of the race.
>> >
>> >
>> > To address (4), we could either: (a) surface the unreachable/gone LRPs
>> to
>> > the master, or
>> > (b) forward the explicit reconciliation request to the corresponding
>> agent.
>> > The complexity
>> > of (a) is that
>> > *it might not be scalable for the master to maintain the list
>> ofunreachable
>> > and gone LRPs*: imagine that there are 1k nodes and 10 active + 10 gone
>> > LRPs per node, then the master need to maintain 20k entries for LRPs.
>> The
>> > complexity
>> > of (b) is that the response wouldn't be computed based on the master's
>> > state; instead,
>> > *the master needs to wait for the agent's reply to respond to the
>> framework*.
>> > Note
>> > that it's probably not scalable to forward implicit reconciliation
>> requests
>> > to all agents, so
>> > implicit reconciliation might have to still be responded based on the
>> > master's state.
>> >
>> >
>> > II. Make `RECONCILE_OPERATIONS` "semi-synchronous"
>> >
>> > Instead of returning a 200 OK, the master could return a 202 Accepted
>> with
>> > an empty
>> > body, and then
>> > *reply a single event containing the operation status of all
>> > requestedoperations in the event stream asynchronously*. Although the
>> > framework loses the
>> > 1-1 correspondence between the request and the response, there's still a
>> > clear boundary
>> > for a reconciliation response. The advantage of this approach compared
>> to
>> > proposal I is
>> > that we don't have a race between the reconciliation response and the
>> event
>> > stream, so
>> > no timestamp is required. Still, we have to address (4) through either
>> (a)
>> > or (b) described
>> > above, thus the complexity remains. That said, this approach fits with
>> (b)
>> > better since no
>> > synchronous response is needed.
>> >
>> >
>> > III. Make `RECONCILE_OPERATIONS` an asynchronous trigger
>> >
>> > This would be similar to what we have for task reconciliation. The
>> master
>> > would return a
>> > 202 Accepted, and then send
>> > *one or more `UPDATE_OPERATION_STATUS` events*based on its state for an
>> > implicit reconciliation, or
>> > *forward the request to some agent*for an explicit reconciliation. In
>> other
>> > words, this call plays the role of a trigger of the
>> > operation status updates. This approach is the simplest in terms of the
>> > implementation,
>> > but the trade-off is that the framework needs to live with (1).
>> >
>> >
>> > So far we haven't discussed much about (2) for operation
>> reconciliation, so
>> > let's also briefly talk
>> > about it. Potentially (2) can be addressed by making the agent *actively
>> > push *
>> > *operation statusupdates to the framework when an LRP is resubscribed*,
>> so
>> > the framework won't need to do
>> > periodic operation reconciliation. If we do this in the future, it would
>> > also be more aligned with
>> > proposal II or III.
>> >
>> > So the question again: is it worth the complexity to keep
>> > `RECONCILE_OPERATIONS`
>> > synchronous? I'd like to hear the opinions from the community so we can
>> > drive towards a better
>> > API design!
>> >
>> > Best,
>> > Chun-Hung
>> >
>> >
>> > --
>> > Benno Evers
>> > Software Engineer, Mesosphere
>>
>>
>
> --
> James DeFelice
> 585.241.9488 (voice)
> 650.649.6071 (fax)
>

Re: Discussion: Scheduler API for Operation Reconciliation

Posted by Chun-Hung Hsiao <ch...@mesosphere.io>.
I chatted with Jie and Gaston, and here is a brief summary:

1. The ordering issue between the synchronous response and the event stream
would lead to extra complication for a framework, and thus the benefit
doesn't seem to worth the complication.
2. However, we should consider not forwarding the reconciliation requests
to the agents. The status updates doesn't require a trigger, and if the
agent could report gone and unregistered RPs to the master, the master can
respond to the reconciliation request itself.
The only problem I see is that frameworks may see
`OPERATION_GONE_BY_OPERATOR` -> `OPERATION_UNREACHABLE` ->
`OPERATION_GONE_BY_OPERATOR`, since the master does not persist gone RPs.

To address the original problem of MESOS-9318, we could do the following:
(1) Agent is gone => `OPERATION_GONE_BY_OPERATOR`
(2) Agent is unreachable => `OPERATION_UNREACHABLE`
(3) Agent is not registered => `OPERATION_RECOVERING`
(4) Agent is unknown => `OPERATION_UNKNOWN`
(5) Agent is registered, RP is gone => `OPERATION_GONE_BY_OPERATOR`
(6) Agent is registered, RP is not registered => `OPERATION_UNREACHABLE` or
`OPERATION_RECOVERING`
(7) Agent is registered, RP is unknown => `OPERATION_UNKNOWN`
(8) Agent is registered, RP is registered => maybe `OPERATION_UNKNOWN`?

So it seems a number of people agree with going with the asynchronous
responses through the event stream. Please reply if you have other opinions!

On Thu, Jan 24, 2019 at 1:39 PM James DeFelice <ja...@gmail.com>
wrote:

> I've attempted to implement support for operation status reconciliation in
> a framework that I've been building. Option (III) seems most convenient
> from my perspective as well. A single source of updates:
>
> (a) Leads to a cleaner framework design; I've had to poke a few holes in
> the framework's initial design to deal with multiple event sources, leading
> to increased complexity.
>
> (b) Allows frameworks to consume events in the order they arrive (and
> pushes the responsibility for event ordering back to Mesos). Multiple event
> sources that the framework needs to (possibly) reorder based on a timestamp
> would add further complexity that we should avoid pushing onto framework
> writers.
>
> Some other thoughts:
>
> (c) I've implemented a background polling loop for exactly the reason that
> Benno pointed out. An asychronous API call for operation status
> reconciliation would be fine with me.
>
> (d) API consistency is important. Framework devs are used to the way that
> the task status reconciliation API works, and have come up with solutions
> for dealing with the lack of boundaries for streams of explicit
> reconciliation events. The synchronous response defined for the currently
> published operation status reconciliation call isn't consistent with the
> rest of the v1 scheduler API, which generated a bit of extra work (for me)
> in the low-level mesos v1 http client lib. Consistency should be a primary
> goal when extending existing API sets.
>
> (e) There are probably other ways to solve the problem of a "lack of
> boundaries within the event stream" for explicit reconciliation requests.
> If this is this a problem that other framework devs need solved then let's
> address it as a separate issue - and aim to resolve it in a consistent way
> for both task and operation status event streams.
>
> (f) It sounds like option (III) would let Mesos send back smarter
> operation statuses in agent/RP failover cases (UNREACHABLE vs. UNKNOWN).
> Anything to limit the number of scenarios where UNKNOWN is returned to
> frameworks sounds good to me.
>
> -James
>
>
>
> On Wed, Jan 16, 2019 at 4:15 PM Benjamin Bannier <
> benjamin.bannier@mesosphere.io> wrote:
>
>> Hi,
>>
>> have we reached a conclusion here?
>>
>> From the Mesos side of things I would be strongly in favor of proposal
>> (III). This is not only consistent with what we do with task status
>> updates, but also would allow us to provide improved operation status
>> (e.g., `OPERATION_UNREACHABLE` instead of just `OPERATION_UNKNOWN` to
>> better distinguish non-terminal from terminal operation states. To
>> accomplish that we wouldn’t need to introduce extra information leakage
>> (e.g., explicitly keeping master up to date on local resource provider
>> state and associated internal consistency complications).
>>
>> This approach should also simplify framework development as a framework
>> would only need to watch a single channel to see operation status updates
>> (no need to reconcile different information sources). The benefits of
>> better status updates and simpler implementation IMO outweigh any benefits
>> of the current approach (disclaimer: I filed the slightly inflammatory
>> MESOS-9448).
>>
>> What is keeping us from moving forward with (III) at this point?
>>
>>
>> Cheers,
>>
>> Benjamin
>>
>> > On Jan 3, 2019, at 11:30 PM, Benno Evers <be...@mesosphere.com> wrote:
>> >
>> > Hi Chun-Hung,
>> >
>> > > imagine that there are 1k nodes and 10 active + 10 gone LRPs per
>> node, then the master need to maintain 20k entries for LRPs.
>> >
>> > How big would the required additional storage be in this scenario? Even
>> if it's 1KiB per LRP, using 20 MiB of extra memory doesn't sound too bad
>> for such a big custer.
>> >
>> > In general, it seems hard to discuss the trade-offs between your
>> proposals without looking at the users of that API - do you know if there
>> are ayn frameworks out there that already use
>> >  operation reconciliation, and if so what do they do based on the
>> reconciliation response?
>> >
>> > As far as I know, we don't have any formal guarantees on which
>> operations status changes the framework will receive without
>> reconciliation. So putting on my framework-implementer hat it seems like
>> I'd have no choice but to implement a continously polling background loop
>> anyways if I care about knowing the latest operation statuses. If this is
>> indeed the case, having a synchronous `RECONCILE_OPERATIONS` would seem to
>> have little additional benefit.
>> >
>> > Best regards,
>> > Benno
>> >
>> > On Wed, Dec 12, 2018 at 4:07 AM Chun-Hung Hsiao <ch...@apache.org>
>> wrote:
>> > Hi folks,
>> >
>> > Recently I've being discussing the problems of the current design of the
>> > experimental
>> > `RECONCILE_OPERATIONS` scheduler API with a couple people. The
>> discussion
>> > was started
>> > from MESOS-9318 <https://issues.apache.org/jira/browse/MESOS-9318>:
>> when a
>> > framework receives an `OPERATION_UNKNOWN`, it doesn't know
>> > if it should retry the operation or not (further details described
>> below).
>> > As the discussion
>> > evolves, we realize there are more issues to consider, design-wise and
>> > implementation-wise, so
>> > I'd like to reach out to the community to get valuable opinions from you
>> > guys.
>> >
>> > Before I jump right into the issues I'd like to discuss, let me fill you
>> > guys in with some
>> > background of operation reconciliation. Since the design of this feature
>> > was informed by the
>> > pre-existing implementation of task reconciliation, I'll begin there.
>> >
>> > *Task Reconciliation: Design*
>> >
>> > The scheduler API has a `RECONCILE` call for a framework to query the
>> > current statuses
>> > of its tasks. This call supports the following modes:
>> >
>> >    - *Explicit reconciliation*: The framework specifies the list of
>> tasks
>> >    it wants to know
>> >    about, and expects status updates for these tasks.
>> >
>> >    - *Implicit reconciliation*: The framework does not specify a list of
>> >    tasks, and simply
>> >    expects status updates for all tasks the master knows about.
>> >
>> > In both cases, the master looks into its in-memory task bookkeeping and
>> > sends
>> > *one or more`UPDATE` events* to respond to the reconciliation request.
>> >
>> > *Task Reconciliation: Problems*
>> >
>> > This API design of task reconciliation has the following shortcomings:
>> >
>> >    - (1) There is no clear boundary of when the "reconciliation
>> response"
>> >    ends, and thus
>> >    there is
>> > *no 1-1 correspondence between the reconciliation request and the
>> response*.
>> >    For explicit reconciliation, the framework might wait for an
>> extended period
>> >    of time before it receives all status updates; for implicit
>> >    reconciliation, there is no way for
>> >    a framework to tell if it has learned about all of its tasks, which
>> >    could be inconvenient if
>> >    the framework has lost its task bookkeeping.
>> >
>> >    - (2) The "reconciliation response" may be outdated. If an agent
>> >    reregisters after a task
>> >    reconciliation has been responded,
>> > *the framework wouldn't learn about the tasks **from this recovered
>> agent*.
>> >    Mesos relies on the framework to call the `RECONCILE` call
>> >    *periodically* to get up-to-date task statuses.
>> >
>> >
>> >
>> > *Operation Reconciliation: Design & Problems*
>> >
>> > When designing operation reconciliation, we made the
>> `RECONCILE_OPERATIONS`
>> > call
>> > *asynchronous request-response style call* that returns a 200 OK with a
>> > list of operation status
>> > to avoid (1). However, this design does not resolve (2), and also
>> > introduces new problems:
>> >
>> >    - (3) *The synchronous response could race with the event stream* and
>> >    the framework
>> >    does not know which contains the latest operation status.
>> >
>> >    - (4) To ensure scalability, the master does not manage local
>> resource
>> >    providers (LRPs);
>> >    the agents do. So the master cannot tell if an LRP is temporarily
>> >    unreachable/recovering
>> >    or permanently gone. As a result, if the framework explicitly
>> reconciles
>> >    an LRP operation
>> >    that the master does not know about, it can only reply
>> >    `OPERATION_UNKNOWN`, but
>> >    then *the framework would not know if the operation would come back
>> in
>> >    the future*,
>> >    and thus cannot decide if it should reissue another operation, which
>> >    leads to MESOS-9318 <
>> https://issues.apache.org/jira/browse/MESOS-9318>.
>> >
>> >    Note that this is less of a problem for explicit task reconciliation,
>> >    because in most cases
>> >    the master can infer task statuses from agent statuses, and in the
>> rare
>> >    cases that it
>> >    replies `TASK_UNKNOWN`, it is generally safe for the framework to
>> >    relaunch another
>> >    task.
>> >
>> >
>> > *The Open Question*
>> >
>> > Now, the big question here is:
>> > *are the benefits of a synchronous request-responsestyle
>> > `RECONCILE_OPERATIONS` call worth the complexity it introduces* in
>> order to
>> > address (3) and (4) in the code? To explain what the complexity would
>> be,
>> > let me lay out a
>> > couple proposals we've been discussing:
>> >
>> > I. Keep `RECONCILE_OPERATIONS` synchronous
>> >
>> > To address (3), we could add a *timestamp* to every operation status as
>> > well as the
>> > reconciliation response, so the framework can infer which one is the
>> latest
>> > status, and if
>> > it receives a stale operation status update after the reconciliation
>> > response, it can just
>> > ack the status update without updating its bookkeeping. But, the
>> framework
>> > needs to
>> > deal with a corner case:
>> >
>> > *when it receives a reconciliation response containing aterminal
>> operation
>> > status, it may or may not receive one or more status updatesfor that
>> > operation later *because of the race.
>> >
>> >
>> > To address (4), we could either: (a) surface the unreachable/gone LRPs
>> to
>> > the master, or
>> > (b) forward the explicit reconciliation request to the corresponding
>> agent.
>> > The complexity
>> > of (a) is that
>> > *it might not be scalable for the master to maintain the list
>> ofunreachable
>> > and gone LRPs*: imagine that there are 1k nodes and 10 active + 10 gone
>> > LRPs per node, then the master need to maintain 20k entries for LRPs.
>> The
>> > complexity
>> > of (b) is that the response wouldn't be computed based on the master's
>> > state; instead,
>> > *the master needs to wait for the agent's reply to respond to the
>> framework*.
>> > Note
>> > that it's probably not scalable to forward implicit reconciliation
>> requests
>> > to all agents, so
>> > implicit reconciliation might have to still be responded based on the
>> > master's state.
>> >
>> >
>> > II. Make `RECONCILE_OPERATIONS` "semi-synchronous"
>> >
>> > Instead of returning a 200 OK, the master could return a 202 Accepted
>> with
>> > an empty
>> > body, and then
>> > *reply a single event containing the operation status of all
>> > requestedoperations in the event stream asynchronously*. Although the
>> > framework loses the
>> > 1-1 correspondence between the request and the response, there's still a
>> > clear boundary
>> > for a reconciliation response. The advantage of this approach compared
>> to
>> > proposal I is
>> > that we don't have a race between the reconciliation response and the
>> event
>> > stream, so
>> > no timestamp is required. Still, we have to address (4) through either
>> (a)
>> > or (b) described
>> > above, thus the complexity remains. That said, this approach fits with
>> (b)
>> > better since no
>> > synchronous response is needed.
>> >
>> >
>> > III. Make `RECONCILE_OPERATIONS` an asynchronous trigger
>> >
>> > This would be similar to what we have for task reconciliation. The
>> master
>> > would return a
>> > 202 Accepted, and then send
>> > *one or more `UPDATE_OPERATION_STATUS` events*based on its state for an
>> > implicit reconciliation, or
>> > *forward the request to some agent*for an explicit reconciliation. In
>> other
>> > words, this call plays the role of a trigger of the
>> > operation status updates. This approach is the simplest in terms of the
>> > implementation,
>> > but the trade-off is that the framework needs to live with (1).
>> >
>> >
>> > So far we haven't discussed much about (2) for operation
>> reconciliation, so
>> > let's also briefly talk
>> > about it. Potentially (2) can be addressed by making the agent *actively
>> > push *
>> > *operation statusupdates to the framework when an LRP is resubscribed*,
>> so
>> > the framework won't need to do
>> > periodic operation reconciliation. If we do this in the future, it would
>> > also be more aligned with
>> > proposal II or III.
>> >
>> > So the question again: is it worth the complexity to keep
>> > `RECONCILE_OPERATIONS`
>> > synchronous? I'd like to hear the opinions from the community so we can
>> > drive towards a better
>> > API design!
>> >
>> > Best,
>> > Chun-Hung
>> >
>> >
>> > --
>> > Benno Evers
>> > Software Engineer, Mesosphere
>>
>>
>
> --
> James DeFelice
> 585.241.9488 (voice)
> 650.649.6071 (fax)
>

Re: Discussion: Scheduler API for Operation Reconciliation

Posted by James DeFelice <ja...@gmail.com>.
I've attempted to implement support for operation status reconciliation in
a framework that I've been building. Option (III) seems most convenient
from my perspective as well. A single source of updates:

(a) Leads to a cleaner framework design; I've had to poke a few holes in
the framework's initial design to deal with multiple event sources, leading
to increased complexity.

(b) Allows frameworks to consume events in the order they arrive (and
pushes the responsibility for event ordering back to Mesos). Multiple event
sources that the framework needs to (possibly) reorder based on a timestamp
would add further complexity that we should avoid pushing onto framework
writers.

Some other thoughts:

(c) I've implemented a background polling loop for exactly the reason that
Benno pointed out. An asychronous API call for operation status
reconciliation would be fine with me.

(d) API consistency is important. Framework devs are used to the way that
the task status reconciliation API works, and have come up with solutions
for dealing with the lack of boundaries for streams of explicit
reconciliation events. The synchronous response defined for the currently
published operation status reconciliation call isn't consistent with the
rest of the v1 scheduler API, which generated a bit of extra work (for me)
in the low-level mesos v1 http client lib. Consistency should be a primary
goal when extending existing API sets.

(e) There are probably other ways to solve the problem of a "lack of
boundaries within the event stream" for explicit reconciliation requests.
If this is this a problem that other framework devs need solved then let's
address it as a separate issue - and aim to resolve it in a consistent way
for both task and operation status event streams.

(f) It sounds like option (III) would let Mesos send back smarter operation
statuses in agent/RP failover cases (UNREACHABLE vs. UNKNOWN). Anything to
limit the number of scenarios where UNKNOWN is returned to frameworks
sounds good to me.

-James



On Wed, Jan 16, 2019 at 4:15 PM Benjamin Bannier <
benjamin.bannier@mesosphere.io> wrote:

> Hi,
>
> have we reached a conclusion here?
>
> From the Mesos side of things I would be strongly in favor of proposal
> (III). This is not only consistent with what we do with task status
> updates, but also would allow us to provide improved operation status
> (e.g., `OPERATION_UNREACHABLE` instead of just `OPERATION_UNKNOWN` to
> better distinguish non-terminal from terminal operation states. To
> accomplish that we wouldn’t need to introduce extra information leakage
> (e.g., explicitly keeping master up to date on local resource provider
> state and associated internal consistency complications).
>
> This approach should also simplify framework development as a framework
> would only need to watch a single channel to see operation status updates
> (no need to reconcile different information sources). The benefits of
> better status updates and simpler implementation IMO outweigh any benefits
> of the current approach (disclaimer: I filed the slightly inflammatory
> MESOS-9448).
>
> What is keeping us from moving forward with (III) at this point?
>
>
> Cheers,
>
> Benjamin
>
> > On Jan 3, 2019, at 11:30 PM, Benno Evers <be...@mesosphere.com> wrote:
> >
> > Hi Chun-Hung,
> >
> > > imagine that there are 1k nodes and 10 active + 10 gone LRPs per node,
> then the master need to maintain 20k entries for LRPs.
> >
> > How big would the required additional storage be in this scenario? Even
> if it's 1KiB per LRP, using 20 MiB of extra memory doesn't sound too bad
> for such a big custer.
> >
> > In general, it seems hard to discuss the trade-offs between your
> proposals without looking at the users of that API - do you know if there
> are ayn frameworks out there that already use
> >  operation reconciliation, and if so what do they do based on the
> reconciliation response?
> >
> > As far as I know, we don't have any formal guarantees on which
> operations status changes the framework will receive without
> reconciliation. So putting on my framework-implementer hat it seems like
> I'd have no choice but to implement a continously polling background loop
> anyways if I care about knowing the latest operation statuses. If this is
> indeed the case, having a synchronous `RECONCILE_OPERATIONS` would seem to
> have little additional benefit.
> >
> > Best regards,
> > Benno
> >
> > On Wed, Dec 12, 2018 at 4:07 AM Chun-Hung Hsiao <ch...@apache.org>
> wrote:
> > Hi folks,
> >
> > Recently I've being discussing the problems of the current design of the
> > experimental
> > `RECONCILE_OPERATIONS` scheduler API with a couple people. The discussion
> > was started
> > from MESOS-9318 <https://issues.apache.org/jira/browse/MESOS-9318>:
> when a
> > framework receives an `OPERATION_UNKNOWN`, it doesn't know
> > if it should retry the operation or not (further details described
> below).
> > As the discussion
> > evolves, we realize there are more issues to consider, design-wise and
> > implementation-wise, so
> > I'd like to reach out to the community to get valuable opinions from you
> > guys.
> >
> > Before I jump right into the issues I'd like to discuss, let me fill you
> > guys in with some
> > background of operation reconciliation. Since the design of this feature
> > was informed by the
> > pre-existing implementation of task reconciliation, I'll begin there.
> >
> > *Task Reconciliation: Design*
> >
> > The scheduler API has a `RECONCILE` call for a framework to query the
> > current statuses
> > of its tasks. This call supports the following modes:
> >
> >    - *Explicit reconciliation*: The framework specifies the list of tasks
> >    it wants to know
> >    about, and expects status updates for these tasks.
> >
> >    - *Implicit reconciliation*: The framework does not specify a list of
> >    tasks, and simply
> >    expects status updates for all tasks the master knows about.
> >
> > In both cases, the master looks into its in-memory task bookkeeping and
> > sends
> > *one or more`UPDATE` events* to respond to the reconciliation request.
> >
> > *Task Reconciliation: Problems*
> >
> > This API design of task reconciliation has the following shortcomings:
> >
> >    - (1) There is no clear boundary of when the "reconciliation response"
> >    ends, and thus
> >    there is
> > *no 1-1 correspondence between the reconciliation request and the
> response*.
> >    For explicit reconciliation, the framework might wait for an extended
> period
> >    of time before it receives all status updates; for implicit
> >    reconciliation, there is no way for
> >    a framework to tell if it has learned about all of its tasks, which
> >    could be inconvenient if
> >    the framework has lost its task bookkeeping.
> >
> >    - (2) The "reconciliation response" may be outdated. If an agent
> >    reregisters after a task
> >    reconciliation has been responded,
> > *the framework wouldn't learn about the tasks **from this recovered
> agent*.
> >    Mesos relies on the framework to call the `RECONCILE` call
> >    *periodically* to get up-to-date task statuses.
> >
> >
> >
> > *Operation Reconciliation: Design & Problems*
> >
> > When designing operation reconciliation, we made the
> `RECONCILE_OPERATIONS`
> > call
> > *asynchronous request-response style call* that returns a 200 OK with a
> > list of operation status
> > to avoid (1). However, this design does not resolve (2), and also
> > introduces new problems:
> >
> >    - (3) *The synchronous response could race with the event stream* and
> >    the framework
> >    does not know which contains the latest operation status.
> >
> >    - (4) To ensure scalability, the master does not manage local resource
> >    providers (LRPs);
> >    the agents do. So the master cannot tell if an LRP is temporarily
> >    unreachable/recovering
> >    or permanently gone. As a result, if the framework explicitly
> reconciles
> >    an LRP operation
> >    that the master does not know about, it can only reply
> >    `OPERATION_UNKNOWN`, but
> >    then *the framework would not know if the operation would come back in
> >    the future*,
> >    and thus cannot decide if it should reissue another operation, which
> >    leads to MESOS-9318 <https://issues.apache.org/jira/browse/MESOS-9318
> >.
> >
> >    Note that this is less of a problem for explicit task reconciliation,
> >    because in most cases
> >    the master can infer task statuses from agent statuses, and in the
> rare
> >    cases that it
> >    replies `TASK_UNKNOWN`, it is generally safe for the framework to
> >    relaunch another
> >    task.
> >
> >
> > *The Open Question*
> >
> > Now, the big question here is:
> > *are the benefits of a synchronous request-responsestyle
> > `RECONCILE_OPERATIONS` call worth the complexity it introduces* in order
> to
> > address (3) and (4) in the code? To explain what the complexity would be,
> > let me lay out a
> > couple proposals we've been discussing:
> >
> > I. Keep `RECONCILE_OPERATIONS` synchronous
> >
> > To address (3), we could add a *timestamp* to every operation status as
> > well as the
> > reconciliation response, so the framework can infer which one is the
> latest
> > status, and if
> > it receives a stale operation status update after the reconciliation
> > response, it can just
> > ack the status update without updating its bookkeeping. But, the
> framework
> > needs to
> > deal with a corner case:
> >
> > *when it receives a reconciliation response containing aterminal
> operation
> > status, it may or may not receive one or more status updatesfor that
> > operation later *because of the race.
> >
> >
> > To address (4), we could either: (a) surface the unreachable/gone LRPs to
> > the master, or
> > (b) forward the explicit reconciliation request to the corresponding
> agent.
> > The complexity
> > of (a) is that
> > *it might not be scalable for the master to maintain the list
> ofunreachable
> > and gone LRPs*: imagine that there are 1k nodes and 10 active + 10 gone
> > LRPs per node, then the master need to maintain 20k entries for LRPs. The
> > complexity
> > of (b) is that the response wouldn't be computed based on the master's
> > state; instead,
> > *the master needs to wait for the agent's reply to respond to the
> framework*.
> > Note
> > that it's probably not scalable to forward implicit reconciliation
> requests
> > to all agents, so
> > implicit reconciliation might have to still be responded based on the
> > master's state.
> >
> >
> > II. Make `RECONCILE_OPERATIONS` "semi-synchronous"
> >
> > Instead of returning a 200 OK, the master could return a 202 Accepted
> with
> > an empty
> > body, and then
> > *reply a single event containing the operation status of all
> > requestedoperations in the event stream asynchronously*. Although the
> > framework loses the
> > 1-1 correspondence between the request and the response, there's still a
> > clear boundary
> > for a reconciliation response. The advantage of this approach compared to
> > proposal I is
> > that we don't have a race between the reconciliation response and the
> event
> > stream, so
> > no timestamp is required. Still, we have to address (4) through either
> (a)
> > or (b) described
> > above, thus the complexity remains. That said, this approach fits with
> (b)
> > better since no
> > synchronous response is needed.
> >
> >
> > III. Make `RECONCILE_OPERATIONS` an asynchronous trigger
> >
> > This would be similar to what we have for task reconciliation. The master
> > would return a
> > 202 Accepted, and then send
> > *one or more `UPDATE_OPERATION_STATUS` events*based on its state for an
> > implicit reconciliation, or
> > *forward the request to some agent*for an explicit reconciliation. In
> other
> > words, this call plays the role of a trigger of the
> > operation status updates. This approach is the simplest in terms of the
> > implementation,
> > but the trade-off is that the framework needs to live with (1).
> >
> >
> > So far we haven't discussed much about (2) for operation reconciliation,
> so
> > let's also briefly talk
> > about it. Potentially (2) can be addressed by making the agent *actively
> > push *
> > *operation statusupdates to the framework when an LRP is resubscribed*,
> so
> > the framework won't need to do
> > periodic operation reconciliation. If we do this in the future, it would
> > also be more aligned with
> > proposal II or III.
> >
> > So the question again: is it worth the complexity to keep
> > `RECONCILE_OPERATIONS`
> > synchronous? I'd like to hear the opinions from the community so we can
> > drive towards a better
> > API design!
> >
> > Best,
> > Chun-Hung
> >
> >
> > --
> > Benno Evers
> > Software Engineer, Mesosphere
>
>

-- 
James DeFelice
585.241.9488 (voice)
650.649.6071 (fax)

Re: Discussion: Scheduler API for Operation Reconciliation

Posted by James DeFelice <ja...@gmail.com>.
I've attempted to implement support for operation status reconciliation in
a framework that I've been building. Option (III) seems most convenient
from my perspective as well. A single source of updates:

(a) Leads to a cleaner framework design; I've had to poke a few holes in
the framework's initial design to deal with multiple event sources, leading
to increased complexity.

(b) Allows frameworks to consume events in the order they arrive (and
pushes the responsibility for event ordering back to Mesos). Multiple event
sources that the framework needs to (possibly) reorder based on a timestamp
would add further complexity that we should avoid pushing onto framework
writers.

Some other thoughts:

(c) I've implemented a background polling loop for exactly the reason that
Benno pointed out. An asychronous API call for operation status
reconciliation would be fine with me.

(d) API consistency is important. Framework devs are used to the way that
the task status reconciliation API works, and have come up with solutions
for dealing with the lack of boundaries for streams of explicit
reconciliation events. The synchronous response defined for the currently
published operation status reconciliation call isn't consistent with the
rest of the v1 scheduler API, which generated a bit of extra work (for me)
in the low-level mesos v1 http client lib. Consistency should be a primary
goal when extending existing API sets.

(e) There are probably other ways to solve the problem of a "lack of
boundaries within the event stream" for explicit reconciliation requests.
If this is this a problem that other framework devs need solved then let's
address it as a separate issue - and aim to resolve it in a consistent way
for both task and operation status event streams.

(f) It sounds like option (III) would let Mesos send back smarter operation
statuses in agent/RP failover cases (UNREACHABLE vs. UNKNOWN). Anything to
limit the number of scenarios where UNKNOWN is returned to frameworks
sounds good to me.

-James



On Wed, Jan 16, 2019 at 4:15 PM Benjamin Bannier <
benjamin.bannier@mesosphere.io> wrote:

> Hi,
>
> have we reached a conclusion here?
>
> From the Mesos side of things I would be strongly in favor of proposal
> (III). This is not only consistent with what we do with task status
> updates, but also would allow us to provide improved operation status
> (e.g., `OPERATION_UNREACHABLE` instead of just `OPERATION_UNKNOWN` to
> better distinguish non-terminal from terminal operation states. To
> accomplish that we wouldn’t need to introduce extra information leakage
> (e.g., explicitly keeping master up to date on local resource provider
> state and associated internal consistency complications).
>
> This approach should also simplify framework development as a framework
> would only need to watch a single channel to see operation status updates
> (no need to reconcile different information sources). The benefits of
> better status updates and simpler implementation IMO outweigh any benefits
> of the current approach (disclaimer: I filed the slightly inflammatory
> MESOS-9448).
>
> What is keeping us from moving forward with (III) at this point?
>
>
> Cheers,
>
> Benjamin
>
> > On Jan 3, 2019, at 11:30 PM, Benno Evers <be...@mesosphere.com> wrote:
> >
> > Hi Chun-Hung,
> >
> > > imagine that there are 1k nodes and 10 active + 10 gone LRPs per node,
> then the master need to maintain 20k entries for LRPs.
> >
> > How big would the required additional storage be in this scenario? Even
> if it's 1KiB per LRP, using 20 MiB of extra memory doesn't sound too bad
> for such a big custer.
> >
> > In general, it seems hard to discuss the trade-offs between your
> proposals without looking at the users of that API - do you know if there
> are ayn frameworks out there that already use
> >  operation reconciliation, and if so what do they do based on the
> reconciliation response?
> >
> > As far as I know, we don't have any formal guarantees on which
> operations status changes the framework will receive without
> reconciliation. So putting on my framework-implementer hat it seems like
> I'd have no choice but to implement a continously polling background loop
> anyways if I care about knowing the latest operation statuses. If this is
> indeed the case, having a synchronous `RECONCILE_OPERATIONS` would seem to
> have little additional benefit.
> >
> > Best regards,
> > Benno
> >
> > On Wed, Dec 12, 2018 at 4:07 AM Chun-Hung Hsiao <ch...@apache.org>
> wrote:
> > Hi folks,
> >
> > Recently I've being discussing the problems of the current design of the
> > experimental
> > `RECONCILE_OPERATIONS` scheduler API with a couple people. The discussion
> > was started
> > from MESOS-9318 <https://issues.apache.org/jira/browse/MESOS-9318>:
> when a
> > framework receives an `OPERATION_UNKNOWN`, it doesn't know
> > if it should retry the operation or not (further details described
> below).
> > As the discussion
> > evolves, we realize there are more issues to consider, design-wise and
> > implementation-wise, so
> > I'd like to reach out to the community to get valuable opinions from you
> > guys.
> >
> > Before I jump right into the issues I'd like to discuss, let me fill you
> > guys in with some
> > background of operation reconciliation. Since the design of this feature
> > was informed by the
> > pre-existing implementation of task reconciliation, I'll begin there.
> >
> > *Task Reconciliation: Design*
> >
> > The scheduler API has a `RECONCILE` call for a framework to query the
> > current statuses
> > of its tasks. This call supports the following modes:
> >
> >    - *Explicit reconciliation*: The framework specifies the list of tasks
> >    it wants to know
> >    about, and expects status updates for these tasks.
> >
> >    - *Implicit reconciliation*: The framework does not specify a list of
> >    tasks, and simply
> >    expects status updates for all tasks the master knows about.
> >
> > In both cases, the master looks into its in-memory task bookkeeping and
> > sends
> > *one or more`UPDATE` events* to respond to the reconciliation request.
> >
> > *Task Reconciliation: Problems*
> >
> > This API design of task reconciliation has the following shortcomings:
> >
> >    - (1) There is no clear boundary of when the "reconciliation response"
> >    ends, and thus
> >    there is
> > *no 1-1 correspondence between the reconciliation request and the
> response*.
> >    For explicit reconciliation, the framework might wait for an extended
> period
> >    of time before it receives all status updates; for implicit
> >    reconciliation, there is no way for
> >    a framework to tell if it has learned about all of its tasks, which
> >    could be inconvenient if
> >    the framework has lost its task bookkeeping.
> >
> >    - (2) The "reconciliation response" may be outdated. If an agent
> >    reregisters after a task
> >    reconciliation has been responded,
> > *the framework wouldn't learn about the tasks **from this recovered
> agent*.
> >    Mesos relies on the framework to call the `RECONCILE` call
> >    *periodically* to get up-to-date task statuses.
> >
> >
> >
> > *Operation Reconciliation: Design & Problems*
> >
> > When designing operation reconciliation, we made the
> `RECONCILE_OPERATIONS`
> > call
> > *asynchronous request-response style call* that returns a 200 OK with a
> > list of operation status
> > to avoid (1). However, this design does not resolve (2), and also
> > introduces new problems:
> >
> >    - (3) *The synchronous response could race with the event stream* and
> >    the framework
> >    does not know which contains the latest operation status.
> >
> >    - (4) To ensure scalability, the master does not manage local resource
> >    providers (LRPs);
> >    the agents do. So the master cannot tell if an LRP is temporarily
> >    unreachable/recovering
> >    or permanently gone. As a result, if the framework explicitly
> reconciles
> >    an LRP operation
> >    that the master does not know about, it can only reply
> >    `OPERATION_UNKNOWN`, but
> >    then *the framework would not know if the operation would come back in
> >    the future*,
> >    and thus cannot decide if it should reissue another operation, which
> >    leads to MESOS-9318 <https://issues.apache.org/jira/browse/MESOS-9318
> >.
> >
> >    Note that this is less of a problem for explicit task reconciliation,
> >    because in most cases
> >    the master can infer task statuses from agent statuses, and in the
> rare
> >    cases that it
> >    replies `TASK_UNKNOWN`, it is generally safe for the framework to
> >    relaunch another
> >    task.
> >
> >
> > *The Open Question*
> >
> > Now, the big question here is:
> > *are the benefits of a synchronous request-responsestyle
> > `RECONCILE_OPERATIONS` call worth the complexity it introduces* in order
> to
> > address (3) and (4) in the code? To explain what the complexity would be,
> > let me lay out a
> > couple proposals we've been discussing:
> >
> > I. Keep `RECONCILE_OPERATIONS` synchronous
> >
> > To address (3), we could add a *timestamp* to every operation status as
> > well as the
> > reconciliation response, so the framework can infer which one is the
> latest
> > status, and if
> > it receives a stale operation status update after the reconciliation
> > response, it can just
> > ack the status update without updating its bookkeeping. But, the
> framework
> > needs to
> > deal with a corner case:
> >
> > *when it receives a reconciliation response containing aterminal
> operation
> > status, it may or may not receive one or more status updatesfor that
> > operation later *because of the race.
> >
> >
> > To address (4), we could either: (a) surface the unreachable/gone LRPs to
> > the master, or
> > (b) forward the explicit reconciliation request to the corresponding
> agent.
> > The complexity
> > of (a) is that
> > *it might not be scalable for the master to maintain the list
> ofunreachable
> > and gone LRPs*: imagine that there are 1k nodes and 10 active + 10 gone
> > LRPs per node, then the master need to maintain 20k entries for LRPs. The
> > complexity
> > of (b) is that the response wouldn't be computed based on the master's
> > state; instead,
> > *the master needs to wait for the agent's reply to respond to the
> framework*.
> > Note
> > that it's probably not scalable to forward implicit reconciliation
> requests
> > to all agents, so
> > implicit reconciliation might have to still be responded based on the
> > master's state.
> >
> >
> > II. Make `RECONCILE_OPERATIONS` "semi-synchronous"
> >
> > Instead of returning a 200 OK, the master could return a 202 Accepted
> with
> > an empty
> > body, and then
> > *reply a single event containing the operation status of all
> > requestedoperations in the event stream asynchronously*. Although the
> > framework loses the
> > 1-1 correspondence between the request and the response, there's still a
> > clear boundary
> > for a reconciliation response. The advantage of this approach compared to
> > proposal I is
> > that we don't have a race between the reconciliation response and the
> event
> > stream, so
> > no timestamp is required. Still, we have to address (4) through either
> (a)
> > or (b) described
> > above, thus the complexity remains. That said, this approach fits with
> (b)
> > better since no
> > synchronous response is needed.
> >
> >
> > III. Make `RECONCILE_OPERATIONS` an asynchronous trigger
> >
> > This would be similar to what we have for task reconciliation. The master
> > would return a
> > 202 Accepted, and then send
> > *one or more `UPDATE_OPERATION_STATUS` events*based on its state for an
> > implicit reconciliation, or
> > *forward the request to some agent*for an explicit reconciliation. In
> other
> > words, this call plays the role of a trigger of the
> > operation status updates. This approach is the simplest in terms of the
> > implementation,
> > but the trade-off is that the framework needs to live with (1).
> >
> >
> > So far we haven't discussed much about (2) for operation reconciliation,
> so
> > let's also briefly talk
> > about it. Potentially (2) can be addressed by making the agent *actively
> > push *
> > *operation statusupdates to the framework when an LRP is resubscribed*,
> so
> > the framework won't need to do
> > periodic operation reconciliation. If we do this in the future, it would
> > also be more aligned with
> > proposal II or III.
> >
> > So the question again: is it worth the complexity to keep
> > `RECONCILE_OPERATIONS`
> > synchronous? I'd like to hear the opinions from the community so we can
> > drive towards a better
> > API design!
> >
> > Best,
> > Chun-Hung
> >
> >
> > --
> > Benno Evers
> > Software Engineer, Mesosphere
>
>

-- 
James DeFelice
585.241.9488 (voice)
650.649.6071 (fax)

Re: Discussion: Scheduler API for Operation Reconciliation

Posted by Benjamin Bannier <be...@mesosphere.io>.
Hi,

have we reached a conclusion here?

From the Mesos side of things I would be strongly in favor of proposal (III). This is not only consistent with what we do with task status updates, but also would allow us to provide improved operation status (e.g., `OPERATION_UNREACHABLE` instead of just `OPERATION_UNKNOWN` to better distinguish non-terminal from terminal operation states. To accomplish that we wouldn’t need to introduce extra information leakage (e.g., explicitly keeping master up to date on local resource provider state and associated internal consistency complications).

This approach should also simplify framework development as a framework would only need to watch a single channel to see operation status updates (no need to reconcile different information sources). The benefits of better status updates and simpler implementation IMO outweigh any benefits of the current approach (disclaimer: I filed the slightly inflammatory MESOS-9448).

What is keeping us from moving forward with (III) at this point?


Cheers,

Benjamin

> On Jan 3, 2019, at 11:30 PM, Benno Evers <be...@mesosphere.com> wrote:
> 
> Hi Chun-Hung,
> 
> > imagine that there are 1k nodes and 10 active + 10 gone LRPs per node, then the master need to maintain 20k entries for LRPs.
> 
> How big would the required additional storage be in this scenario? Even if it's 1KiB per LRP, using 20 MiB of extra memory doesn't sound too bad for such a big custer.
> 
> In general, it seems hard to discuss the trade-offs between your proposals without looking at the users of that API - do you know if there are ayn frameworks out there that already use
>  operation reconciliation, and if so what do they do based on the reconciliation response?
> 
> As far as I know, we don't have any formal guarantees on which operations status changes the framework will receive without reconciliation. So putting on my framework-implementer hat it seems like I'd have no choice but to implement a continously polling background loop anyways if I care about knowing the latest operation statuses. If this is indeed the case, having a synchronous `RECONCILE_OPERATIONS` would seem to have little additional benefit.
> 
> Best regards,
> Benno
> 
> On Wed, Dec 12, 2018 at 4:07 AM Chun-Hung Hsiao <ch...@apache.org> wrote:
> Hi folks,
> 
> Recently I've being discussing the problems of the current design of the
> experimental
> `RECONCILE_OPERATIONS` scheduler API with a couple people. The discussion
> was started
> from MESOS-9318 <https://issues.apache.org/jira/browse/MESOS-9318>: when a
> framework receives an `OPERATION_UNKNOWN`, it doesn't know
> if it should retry the operation or not (further details described below).
> As the discussion
> evolves, we realize there are more issues to consider, design-wise and
> implementation-wise, so
> I'd like to reach out to the community to get valuable opinions from you
> guys.
> 
> Before I jump right into the issues I'd like to discuss, let me fill you
> guys in with some
> background of operation reconciliation. Since the design of this feature
> was informed by the
> pre-existing implementation of task reconciliation, I'll begin there.
> 
> *Task Reconciliation: Design*
> 
> The scheduler API has a `RECONCILE` call for a framework to query the
> current statuses
> of its tasks. This call supports the following modes:
> 
>    - *Explicit reconciliation*: The framework specifies the list of tasks
>    it wants to know
>    about, and expects status updates for these tasks.
> 
>    - *Implicit reconciliation*: The framework does not specify a list of
>    tasks, and simply
>    expects status updates for all tasks the master knows about.
> 
> In both cases, the master looks into its in-memory task bookkeeping and
> sends
> *one or more`UPDATE` events* to respond to the reconciliation request.
> 
> *Task Reconciliation: Problems*
> 
> This API design of task reconciliation has the following shortcomings:
> 
>    - (1) There is no clear boundary of when the "reconciliation response"
>    ends, and thus
>    there is
> *no 1-1 correspondence between the reconciliation request and the response*.
>    For explicit reconciliation, the framework might wait for an extended period
>    of time before it receives all status updates; for implicit
>    reconciliation, there is no way for
>    a framework to tell if it has learned about all of its tasks, which
>    could be inconvenient if
>    the framework has lost its task bookkeeping.
> 
>    - (2) The "reconciliation response" may be outdated. If an agent
>    reregisters after a task
>    reconciliation has been responded,
> *the framework wouldn't learn about the tasks **from this recovered agent*.
>    Mesos relies on the framework to call the `RECONCILE` call
>    *periodically* to get up-to-date task statuses.
> 
> 
> 
> *Operation Reconciliation: Design & Problems*
> 
> When designing operation reconciliation, we made the `RECONCILE_OPERATIONS`
> call
> *asynchronous request-response style call* that returns a 200 OK with a
> list of operation status
> to avoid (1). However, this design does not resolve (2), and also
> introduces new problems:
> 
>    - (3) *The synchronous response could race with the event stream* and
>    the framework
>    does not know which contains the latest operation status.
> 
>    - (4) To ensure scalability, the master does not manage local resource
>    providers (LRPs);
>    the agents do. So the master cannot tell if an LRP is temporarily
>    unreachable/recovering
>    or permanently gone. As a result, if the framework explicitly reconciles
>    an LRP operation
>    that the master does not know about, it can only reply
>    `OPERATION_UNKNOWN`, but
>    then *the framework would not know if the operation would come back in
>    the future*,
>    and thus cannot decide if it should reissue another operation, which
>    leads to MESOS-9318 <https://issues.apache.org/jira/browse/MESOS-9318>.
> 
>    Note that this is less of a problem for explicit task reconciliation,
>    because in most cases
>    the master can infer task statuses from agent statuses, and in the rare
>    cases that it
>    replies `TASK_UNKNOWN`, it is generally safe for the framework to
>    relaunch another
>    task.
> 
> 
> *The Open Question*
> 
> Now, the big question here is:
> *are the benefits of a synchronous request-responsestyle
> `RECONCILE_OPERATIONS` call worth the complexity it introduces* in order to
> address (3) and (4) in the code? To explain what the complexity would be,
> let me lay out a
> couple proposals we've been discussing:
> 
> I. Keep `RECONCILE_OPERATIONS` synchronous
> 
> To address (3), we could add a *timestamp* to every operation status as
> well as the
> reconciliation response, so the framework can infer which one is the latest
> status, and if
> it receives a stale operation status update after the reconciliation
> response, it can just
> ack the status update without updating its bookkeeping. But, the framework
> needs to
> deal with a corner case:
> 
> *when it receives a reconciliation response containing aterminal operation
> status, it may or may not receive one or more status updatesfor that
> operation later *because of the race.
> 
> 
> To address (4), we could either: (a) surface the unreachable/gone LRPs to
> the master, or
> (b) forward the explicit reconciliation request to the corresponding agent.
> The complexity
> of (a) is that
> *it might not be scalable for the master to maintain the list ofunreachable
> and gone LRPs*: imagine that there are 1k nodes and 10 active + 10 gone
> LRPs per node, then the master need to maintain 20k entries for LRPs. The
> complexity
> of (b) is that the response wouldn't be computed based on the master's
> state; instead,
> *the master needs to wait for the agent's reply to respond to the framework*.
> Note
> that it's probably not scalable to forward implicit reconciliation requests
> to all agents, so
> implicit reconciliation might have to still be responded based on the
> master's state.
> 
> 
> II. Make `RECONCILE_OPERATIONS` "semi-synchronous"
> 
> Instead of returning a 200 OK, the master could return a 202 Accepted with
> an empty
> body, and then
> *reply a single event containing the operation status of all
> requestedoperations in the event stream asynchronously*. Although the
> framework loses the
> 1-1 correspondence between the request and the response, there's still a
> clear boundary
> for a reconciliation response. The advantage of this approach compared to
> proposal I is
> that we don't have a race between the reconciliation response and the event
> stream, so
> no timestamp is required. Still, we have to address (4) through either (a)
> or (b) described
> above, thus the complexity remains. That said, this approach fits with (b)
> better since no
> synchronous response is needed.
> 
> 
> III. Make `RECONCILE_OPERATIONS` an asynchronous trigger
> 
> This would be similar to what we have for task reconciliation. The master
> would return a
> 202 Accepted, and then send
> *one or more `UPDATE_OPERATION_STATUS` events*based on its state for an
> implicit reconciliation, or
> *forward the request to some agent*for an explicit reconciliation. In other
> words, this call plays the role of a trigger of the
> operation status updates. This approach is the simplest in terms of the
> implementation,
> but the trade-off is that the framework needs to live with (1).
> 
> 
> So far we haven't discussed much about (2) for operation reconciliation, so
> let's also briefly talk
> about it. Potentially (2) can be addressed by making the agent *actively
> push *
> *operation statusupdates to the framework when an LRP is resubscribed*, so
> the framework won't need to do
> periodic operation reconciliation. If we do this in the future, it would
> also be more aligned with
> proposal II or III.
> 
> So the question again: is it worth the complexity to keep
> `RECONCILE_OPERATIONS`
> synchronous? I'd like to hear the opinions from the community so we can
> drive towards a better
> API design!
> 
> Best,
> Chun-Hung
> 
> 
> -- 
> Benno Evers
> Software Engineer, Mesosphere


Re: Discussion: Scheduler API for Operation Reconciliation

Posted by Benjamin Bannier <be...@mesosphere.io>.
Hi,

have we reached a conclusion here?

From the Mesos side of things I would be strongly in favor of proposal (III). This is not only consistent with what we do with task status updates, but also would allow us to provide improved operation status (e.g., `OPERATION_UNREACHABLE` instead of just `OPERATION_UNKNOWN` to better distinguish non-terminal from terminal operation states. To accomplish that we wouldn’t need to introduce extra information leakage (e.g., explicitly keeping master up to date on local resource provider state and associated internal consistency complications).

This approach should also simplify framework development as a framework would only need to watch a single channel to see operation status updates (no need to reconcile different information sources). The benefits of better status updates and simpler implementation IMO outweigh any benefits of the current approach (disclaimer: I filed the slightly inflammatory MESOS-9448).

What is keeping us from moving forward with (III) at this point?


Cheers,

Benjamin

> On Jan 3, 2019, at 11:30 PM, Benno Evers <be...@mesosphere.com> wrote:
> 
> Hi Chun-Hung,
> 
> > imagine that there are 1k nodes and 10 active + 10 gone LRPs per node, then the master need to maintain 20k entries for LRPs.
> 
> How big would the required additional storage be in this scenario? Even if it's 1KiB per LRP, using 20 MiB of extra memory doesn't sound too bad for such a big custer.
> 
> In general, it seems hard to discuss the trade-offs between your proposals without looking at the users of that API - do you know if there are ayn frameworks out there that already use
>  operation reconciliation, and if so what do they do based on the reconciliation response?
> 
> As far as I know, we don't have any formal guarantees on which operations status changes the framework will receive without reconciliation. So putting on my framework-implementer hat it seems like I'd have no choice but to implement a continously polling background loop anyways if I care about knowing the latest operation statuses. If this is indeed the case, having a synchronous `RECONCILE_OPERATIONS` would seem to have little additional benefit.
> 
> Best regards,
> Benno
> 
> On Wed, Dec 12, 2018 at 4:07 AM Chun-Hung Hsiao <ch...@apache.org> wrote:
> Hi folks,
> 
> Recently I've being discussing the problems of the current design of the
> experimental
> `RECONCILE_OPERATIONS` scheduler API with a couple people. The discussion
> was started
> from MESOS-9318 <https://issues.apache.org/jira/browse/MESOS-9318>: when a
> framework receives an `OPERATION_UNKNOWN`, it doesn't know
> if it should retry the operation or not (further details described below).
> As the discussion
> evolves, we realize there are more issues to consider, design-wise and
> implementation-wise, so
> I'd like to reach out to the community to get valuable opinions from you
> guys.
> 
> Before I jump right into the issues I'd like to discuss, let me fill you
> guys in with some
> background of operation reconciliation. Since the design of this feature
> was informed by the
> pre-existing implementation of task reconciliation, I'll begin there.
> 
> *Task Reconciliation: Design*
> 
> The scheduler API has a `RECONCILE` call for a framework to query the
> current statuses
> of its tasks. This call supports the following modes:
> 
>    - *Explicit reconciliation*: The framework specifies the list of tasks
>    it wants to know
>    about, and expects status updates for these tasks.
> 
>    - *Implicit reconciliation*: The framework does not specify a list of
>    tasks, and simply
>    expects status updates for all tasks the master knows about.
> 
> In both cases, the master looks into its in-memory task bookkeeping and
> sends
> *one or more`UPDATE` events* to respond to the reconciliation request.
> 
> *Task Reconciliation: Problems*
> 
> This API design of task reconciliation has the following shortcomings:
> 
>    - (1) There is no clear boundary of when the "reconciliation response"
>    ends, and thus
>    there is
> *no 1-1 correspondence between the reconciliation request and the response*.
>    For explicit reconciliation, the framework might wait for an extended period
>    of time before it receives all status updates; for implicit
>    reconciliation, there is no way for
>    a framework to tell if it has learned about all of its tasks, which
>    could be inconvenient if
>    the framework has lost its task bookkeeping.
> 
>    - (2) The "reconciliation response" may be outdated. If an agent
>    reregisters after a task
>    reconciliation has been responded,
> *the framework wouldn't learn about the tasks **from this recovered agent*.
>    Mesos relies on the framework to call the `RECONCILE` call
>    *periodically* to get up-to-date task statuses.
> 
> 
> 
> *Operation Reconciliation: Design & Problems*
> 
> When designing operation reconciliation, we made the `RECONCILE_OPERATIONS`
> call
> *asynchronous request-response style call* that returns a 200 OK with a
> list of operation status
> to avoid (1). However, this design does not resolve (2), and also
> introduces new problems:
> 
>    - (3) *The synchronous response could race with the event stream* and
>    the framework
>    does not know which contains the latest operation status.
> 
>    - (4) To ensure scalability, the master does not manage local resource
>    providers (LRPs);
>    the agents do. So the master cannot tell if an LRP is temporarily
>    unreachable/recovering
>    or permanently gone. As a result, if the framework explicitly reconciles
>    an LRP operation
>    that the master does not know about, it can only reply
>    `OPERATION_UNKNOWN`, but
>    then *the framework would not know if the operation would come back in
>    the future*,
>    and thus cannot decide if it should reissue another operation, which
>    leads to MESOS-9318 <https://issues.apache.org/jira/browse/MESOS-9318>.
> 
>    Note that this is less of a problem for explicit task reconciliation,
>    because in most cases
>    the master can infer task statuses from agent statuses, and in the rare
>    cases that it
>    replies `TASK_UNKNOWN`, it is generally safe for the framework to
>    relaunch another
>    task.
> 
> 
> *The Open Question*
> 
> Now, the big question here is:
> *are the benefits of a synchronous request-responsestyle
> `RECONCILE_OPERATIONS` call worth the complexity it introduces* in order to
> address (3) and (4) in the code? To explain what the complexity would be,
> let me lay out a
> couple proposals we've been discussing:
> 
> I. Keep `RECONCILE_OPERATIONS` synchronous
> 
> To address (3), we could add a *timestamp* to every operation status as
> well as the
> reconciliation response, so the framework can infer which one is the latest
> status, and if
> it receives a stale operation status update after the reconciliation
> response, it can just
> ack the status update without updating its bookkeeping. But, the framework
> needs to
> deal with a corner case:
> 
> *when it receives a reconciliation response containing aterminal operation
> status, it may or may not receive one or more status updatesfor that
> operation later *because of the race.
> 
> 
> To address (4), we could either: (a) surface the unreachable/gone LRPs to
> the master, or
> (b) forward the explicit reconciliation request to the corresponding agent.
> The complexity
> of (a) is that
> *it might not be scalable for the master to maintain the list ofunreachable
> and gone LRPs*: imagine that there are 1k nodes and 10 active + 10 gone
> LRPs per node, then the master need to maintain 20k entries for LRPs. The
> complexity
> of (b) is that the response wouldn't be computed based on the master's
> state; instead,
> *the master needs to wait for the agent's reply to respond to the framework*.
> Note
> that it's probably not scalable to forward implicit reconciliation requests
> to all agents, so
> implicit reconciliation might have to still be responded based on the
> master's state.
> 
> 
> II. Make `RECONCILE_OPERATIONS` "semi-synchronous"
> 
> Instead of returning a 200 OK, the master could return a 202 Accepted with
> an empty
> body, and then
> *reply a single event containing the operation status of all
> requestedoperations in the event stream asynchronously*. Although the
> framework loses the
> 1-1 correspondence between the request and the response, there's still a
> clear boundary
> for a reconciliation response. The advantage of this approach compared to
> proposal I is
> that we don't have a race between the reconciliation response and the event
> stream, so
> no timestamp is required. Still, we have to address (4) through either (a)
> or (b) described
> above, thus the complexity remains. That said, this approach fits with (b)
> better since no
> synchronous response is needed.
> 
> 
> III. Make `RECONCILE_OPERATIONS` an asynchronous trigger
> 
> This would be similar to what we have for task reconciliation. The master
> would return a
> 202 Accepted, and then send
> *one or more `UPDATE_OPERATION_STATUS` events*based on its state for an
> implicit reconciliation, or
> *forward the request to some agent*for an explicit reconciliation. In other
> words, this call plays the role of a trigger of the
> operation status updates. This approach is the simplest in terms of the
> implementation,
> but the trade-off is that the framework needs to live with (1).
> 
> 
> So far we haven't discussed much about (2) for operation reconciliation, so
> let's also briefly talk
> about it. Potentially (2) can be addressed by making the agent *actively
> push *
> *operation statusupdates to the framework when an LRP is resubscribed*, so
> the framework won't need to do
> periodic operation reconciliation. If we do this in the future, it would
> also be more aligned with
> proposal II or III.
> 
> So the question again: is it worth the complexity to keep
> `RECONCILE_OPERATIONS`
> synchronous? I'd like to hear the opinions from the community so we can
> drive towards a better
> API design!
> 
> Best,
> Chun-Hung
> 
> 
> -- 
> Benno Evers
> Software Engineer, Mesosphere