Posted to dev@aurora.apache.org by Hussein Elgridly <hu...@broadinstitute.org> on 2015/02/19 22:27:56 UTC

Making sense of Aurora terminal states

I've just spent the afternoon making a flowchart out of
TaskStateMachine.java in an attempt to figure out what Aurora states
actually mean. Given that all the jobs I submit have unique names and I
don't permit retries, I would like to put together a set of rules that
determine whether a job is _really_ terminal and definitely won't be
rescheduled.

Would one of the Aurora devs be willing to play a game of True or False
with the following statements?

1. If all my job names are unique and I do an aurora job status
--write-json, there will be at most one element in the "active" list.

2. Jobs in the "inactive" list are ordered by last update time, most recent
first.

3. A job's "status" will always equal the status of the last item in its
"taskEvents" list.

4. The full list of terminal states is [LOST, FINISHED, FAILED, KILLED]. A
job that is not in one of these states will undergo more transitions and
will remain in the "active" list until it gets to one of these states.
(Will I ever see DELETED, or do they not show up in aurora job status?)

5. A job in the LOST state will always be rescheduled unless it went
through KILLING first. (What does this represent - killed by user and then
lost connectivity to the slave?)

6. A job will be rescheduled if it goes through one of [RESTARTING,
DRAINING, PREEMPTING].

7. Assuming maxTaskFailures = 1, #5 and #6 are the ONLY situations in which
a job will be rescheduled.

8. These rules are unlikely to change in the future ;)

Finally, I noticed something odd: ASSIGNED -> LOST has followups [KILL,
RESCHEDULE], but STARTING and RUNNING -> LOST only has [RESCHEDULE] as a
followup. Why?

Thanks,
Hussein Elgridly
Senior Software Engineer, DSDE
The Broad Institute of MIT and Harvard
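
Statements 1, 3 and 4 can be checked mechanically against saved status output. The following is a minimal Python sketch, not Aurora code. It assumes the JSON is a single object with "active" and "inactive" lists whose entries carry a "status" string and a "taskEvents" list of objects with a "status" key. Only the names "active", "inactive", "status" and "taskEvents" come from the post; the exact nesting, the summarize helper and the SAMPLE document are hypothetical.

```python
import json

# Terminal states as listed in statement 4 of the post.
TERMINAL_STATES = {"LOST", "FINISHED", "FAILED", "KILLED"}


def summarize(status_json):
    """Return (active_count, rows) for one status document.

    Each row is (status, is_terminal, status_matches_last_event).
    The document layout is an assumption; adjust the lookups to the real output.
    """
    doc = json.loads(status_json)
    active = doc.get("active", [])
    inactive = doc.get("inactive", [])
    rows = []
    for task in active + inactive:
        events = task.get("taskEvents", [])
        last = events[-1]["status"] if events else None
        status = task["status"]
        rows.append((status, status in TERMINAL_STATES, last == status))
    return len(active), rows


# Hypothetical sample, shaped after the field names mentioned in the post.
SAMPLE = json.dumps({
    "active": [],
    "inactive": [
        {
            "status": "FINISHED",
            "taskEvents": [
                {"status": "PENDING"},
                {"status": "RUNNING"},
                {"status": "FINISHED"},
            ],
        }
    ],
})

active_count, rows = summarize(SAMPLE)
print(active_count, rows)
```

A row whose third field is False would contradict statement 3, and an active count above one would contradict statement 1 for a single-instance job.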

Re: Making sense of Aurora terminal states

Posted by Hussein Elgridly <hu...@broadinstitute.org>.

> You seem like you are now sufficiently-equipped to add this doc.  Any
> chance you're game to write the doc you wish you had read? :-)

Possibly. Time constraints aside, my concern is that the questions I've
asked (and the answers I was seeking) were based on the assumption that my
jobs all had unique names and ran a single instance with no retry on fail.
I'd hazard a guess that this is unlike the majority of Aurora use cases.

Given that, it doesn't immediately strike me as obvious how to roll the
above information into the existing user guide page. It's possible that
it's achievable by remapping nomenclature (where I usually say "job",
replace with "task instance")... I'll give it a shot if I can find the time.


Hussein Elgridly
Senior Software Engineer, DSDE
The Broad Institute of MIT and Harvard


On 21 February 2015 at 14:18, Bill Farner <wf...@apache.org> wrote:

> >
> > Might I suggest folding this information into the user guide?
>
>
> You seem like you are now sufficiently-equipped to add this doc.  Any
> chance you're game to write the doc you wish you had read? :-)
>
> > Just to be absolutely clear on this: KILLING -> LOST will _never_ result in
> > a reschedule? What happens if Mesos fails to kill the task and finishes
> > running it - will it pass a success message back to Aurora that then gets
> > thrown away?
>
>
> Correct, it will not be rescheduled.  We count on reconciliation to take
> care of this.  Currently that's the GC executor, and soon it will be direct
> reconciliation with the master.
>
>
> > Also (sorry for repeated messages), what's the deal with KILLING ->
> > [FINISHED, FAILED]? User sends kill request but Mesos reports it's done
> > before it gets through so congratulations, you get to keep it?
>
>
> Correct, this would usually indicate a race between kill and task exit.
>
>
>
>
> -=Bill

Re: Making sense of Aurora terminal states

Posted by Bill Farner <wf...@apache.org>.

>
> Might I suggest folding this information into the user guide?


You seem like you are now sufficiently-equipped to add this doc.  Any
chance you're game to write the doc you wish you had read? :-)

> Just to be absolutely clear on this: KILLING -> LOST will _never_ result in
> a reschedule? What happens if Mesos fails to kill the task and finishes
> running it - will it pass a success message back to Aurora that then gets
> thrown away?


Correct, it will not be rescheduled.  We count on reconciliation to take
care of this.  Currently that's the GC executor, and soon it will be direct
reconciliation with the master.


> Also (sorry for repeated messages), what's the deal with KILLING ->
> [FINISHED, FAILED]? User sends kill request but Mesos reports it's done
> before it gets through so congratulations, you get to keep it?


Correct, this would usually indicate a race between kill and task exit.




-=Bill
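
The outcomes of a kill request described in this message can be sketched as a small classifier over a task's ordered state names. This is a minimal Python sketch under assumptions, not Aurora code: classify_kill_outcome is a hypothetical helper, and the KILLING -> KILLED branch is an assumed normal path that this thread does not discuss.

```python
def classify_kill_outcome(states):
    """Describe how a kill request ended, given ordered state names of one task."""
    if "KILLING" not in states:
        return "no kill requested"
    final = states[-1]
    if final in ("FINISHED", "FAILED"):
        # Race between the kill and the task exiting on its own.
        return "task exited before the kill took effect"
    if final == "LOST":
        # Per the reply above, KILLING -> LOST is not rescheduled.
        return "lost while killing; not rescheduled"
    if final == "KILLED":
        # Assumed normal path; not covered in the thread.
        return "killed"
    return "kill still in progress"


print(classify_kill_outcome(["PENDING", "ASSIGNED", "RUNNING", "KILLING", "FINISHED"]))
print(classify_kill_outcome(["PENDING", "ASSIGNED", "RUNNING", "KILLING", "LOST"]))
```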

On Fri, Feb 20, 2015 at 1:11 PM, Hussein Elgridly <
hussein@broadinstitute.org> wrote:

> Also (sorry for repeated messages), what's the deal with KILLING ->
> [FINISHED, FAILED]? User sends kill request but Mesos reports it's done
> before it gets through so congratulations, you get to keep it?
>
> Hussein Elgridly
> Senior Software Engineer, DSDE
> The Broad Institute of MIT and Harvard

Re: Making sense of Aurora terminal states

Posted by Hussein Elgridly <hu...@broadinstitute.org>.

Also (sorry for repeated messages), what's the deal with KILLING ->
[FINISHED, FAILED]? User sends kill request but Mesos reports it's done
before it gets through so congratulations, you get to keep it?

Hussein Elgridly
Senior Software Engineer, DSDE
The Broad Institute of MIT and Harvard


On 20 February 2015 at 14:18, Hussein Elgridly <hu...@broadinstitute.org>
wrote:

> >> 5. A job in the LOST state will always be rescheduled unless it went
> >> through KILLING first. (What does this represent - killed by user and
> >> then lost connectivity to the slave?)
> >>
>
> > True.  That is one way it could happen, it could also happen if the
> > scheduler times the task out while waiting to hear back from mesos after
> > attempting to kill the task.
>
> Just to be absolutely clear on this: KILLING -> LOST will _never_ result
> in a reschedule? What happens if Mesos fails to kill the task and finishes
> running it - will it pass a success message back to Aurora that then gets
> thrown away?
>
> Hussein Elgridly
> Senior Software Engineer, DSDE
> The Broad Institute of MIT and Harvard

Re: Making sense of Aurora terminal states

Posted by Hussein Elgridly <hu...@broadinstitute.org>.

>> 5. A job in the LOST state will always be rescheduled unless it went
>> through KILLING first. (What does this represent - killed by user and
>> then lost connectivity to the slave?)
>>

> True.  That is one way it could happen, it could also happen if the
> scheduler times the task out while waiting to hear back from mesos after
> attempting to kill the task.

Just to be absolutely clear on this: KILLING -> LOST will _never_ result in
a reschedule? What happens if Mesos fails to kill the task and finishes
running it - will it pass a success message back to Aurora that then gets
thrown away?

Hussein Elgridly
Senior Software Engineer, DSDE
The Broad Institute of MIT and Harvard


On 20 February 2015 at 11:08, Hussein Elgridly <hu...@broadinstitute.org>
wrote:

> This is fantastic (and I'm glad that my understanding was mostly correct)
> - thanks a lot.
>
> Might I suggest folding this information into the user guide? Maybe it's
> only relevant for my use case, but I feel like "tasks in terminal states
> might be cloned and rescheduled; here's when that might happen" isn't
> made as explicit as it could be. I know I'd have had an easier time if
> there had been an explanation of "here's what each state means and what
> might happen next", and I can imagine [weasel words; citation needed] that
> other users might also find this useful.
>
> Hussein Elgridly
> Senior Software Engineer, DSDE
> The Broad Institute of MIT and Harvard

Re: Making sense of Aurora terminal states

Posted by Hussein Elgridly <hu...@broadinstitute.org>.

This is fantastic (and I'm glad that my understanding was mostly correct) -
thanks a lot.

Might I suggest folding this information into the user guide? Maybe it's
only relevant for my use case, but I feel like "tasks in terminal states
might be cloned and rescheduled; here's when that might happen" isn't
made as explicit as it could be. I know I'd have had an easier time if
there had been an explanation of "here's what each state means and what
might happen next", and I can imagine [weasel words; citation needed] that
other users might also find this useful.

Hussein Elgridly
Senior Software Engineer, DSDE
The Broad Institute of MIT and Harvard


On 19 February 2015 at 17:35, Bill Farner <wf...@apache.org> wrote:

> On Thu, Feb 19, 2015 at 1:27 PM, Hussein Elgridly <
> hussein@broadinstitute.org> wrote:
>
> > I've just spent the afternoon making a flowchart out of
> > TaskStateMachine.java in an attempt to figure out what Aurora states
> > actually mean. Given that all the jobs I submit have unique names and I
> > don't permit retries, I would like to put together a set of rules that
> > determine whether a job is _really_ terminal and definitely won't be
> > rescheduled.
> >
> > Would one of the Aurora devs be willing to play a game of True or False
> > with the following statements?
> >
> > 1. If all my job names are unique and I do an aurora job status
> > --write-json, there will be at most one element in the "active" list.
> >
>
> True iff the job has only one instance.
>
>
> > 2. Jobs in the "inactive" list are ordered by last update time, most
> > recent first.
> >
>
> False.  They are sorted by instance ID [1], which doesn't make much sense.
>
> [1]
>
> https://github.com/apache/incubator-aurora/blob/9fe6d5408d4aed113a239f22fa5c43aa4f9ae338/src/main/python/apache/aurora/client/cli/jobs.py#L635-L636
>
>
> > 3. A job's "status" will always equal the status of the last item in its
> > "taskEvents" list.
> >
>
> True.
>
>
> > 4. The full list of terminal states is [LOST, FINISHED, FAILED, KILLED]. A
> > job that is not in one of these states will undergo more transitions and
> > will remain in the "active" list until it gets to one of these states.
> > (Will I ever see DELETED, or do they not show up in aurora job status?)
> >
>
> True.  Source of truth is [1].  We actually don't have a state [2] for
> DELETED.
>
> [1]
>
> https://github.com/apache/incubator-aurora/blob/9fe6d5408d4aed113a239f22fa5c43aa4f9ae338/api/src/main/thrift/org/apache/aurora/gen/api.thrift#L410-L413
> [2]
>
> https://github.com/apache/incubator-aurora/blob/9fe6d5408d4aed113a239f22fa5c43aa4f9ae338/api/src/main/thrift/org/apache/aurora/gen/api.thrift#L348-L380
>
>
> > 5. A job in the LOST state will always be rescheduled unless it went
> > through KILLING first. (What does this represent - killed by user and
> > then lost connectivity to the slave?)
> >
>
> True.  That is one way it could happen, it could also happen if the
> scheduler times the task out while waiting to hear back from mesos after
> attempting to kill the task.
>
>
> > 6. A job will be rescheduled if it goes through one of [RESTARTING,
> > DRAINING, PREEMPTING].
> >
>
> True.
>
>
> > 7. Assuming maxTaskFailures = 1, #5 and #6 are the ONLY situations in which
> > a job will be rescheduled.
> >
>
> True.
>
>
> > 8. These rules are unlikely to change in the future ;)
> >
>
> True, though we could add more states, which would invalidate (4) and (6).
> In practice, we have changed the states and their meanings very little in
> ~5 years.
>
>
> > Finally, I noticed something odd: ASSIGNED -> LOST has followups [KILL,
> > RESCHEDULE], but STARTING and RUNNING -> LOST only has [RESCHEDULE] as a
> > followup. Why?
> >
>
> This is because ASSIGNED -> LOST may mean that there was a race between
> creating the task and Aurora timing out the launch (it may not have heard
> back from Mesos).  To reduce the likelihood of a redundant instance, we
> proactively try to kill the task that may have started.  The RUNNING state
> does not time out, so we do not have the same concern there.
>
>
> > Thanks,
> > Hussein Elgridly
> > Senior Software Engineer, DSDE
> > The Broad Institute of MIT and Harvard
> >
>

Re: Making sense of Aurora terminal states

Posted by Bill Farner <wf...@apache.org>.

On Thu, Feb 19, 2015 at 1:27 PM, Hussein Elgridly <
hussein@broadinstitute.org> wrote:

> I've just spent the afternoon making a flowchart out of
> TaskStateMachine.java in an attempt to figure out what Aurora states
> actually mean. Given that all the jobs I submit have unique names and I
> don't permit retries, I would like to put together a set of rules that
> determine whether a job is _really_ terminal and definitely won't be
> rescheduled.
>
> Would one of the Aurora devs be willing to play a game of True or False
> with the following statements?
>
> 1. If all my job names are unique and I do an aurora job status
> --write-json, there will be at most one element in the "active" list.
>

True iff the job has only one instance.


> 2. Jobs in the "inactive" list are ordered by last update time, most recent
> first.
>

False.  They are sorted by instance ID [1], which doesn't make much sense.

[1]
https://github.com/apache/incubator-aurora/blob/9fe6d5408d4aed113a239f22fa5c43aa4f9ae338/src/main/python/apache/aurora/client/cli/jobs.py#L635-L636
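
If ordering by last update time is what you need, it can be done on the
client side. A minimal sketch in Python, assuming the "inactive" list from
aurora job status --write-json has already been loaded, and assuming each
entry carries a "taskEvents" list whose items have a numeric "timestamp"
field (that field name and the helper name are assumptions, not taken from
the client source):

```python
def sort_by_last_update(tasks):
    """Return tasks ordered by their latest task event, most recent first.

    Assumes each task dict has a "taskEvents" list whose items carry a
    numeric "timestamp" (assumed field name). Tasks with no events sort last.
    """
    def last_update(task):
        events = task.get("taskEvents") or []
        return max((e["timestamp"] for e in events), default=float("-inf"))

    return sorted(tasks, key=last_update, reverse=True)


# Hypothetical sample data, shaped like entries of the "inactive" list.
inactive = [
    {"status": "FINISHED", "taskEvents": [{"status": "FINISHED", "timestamp": 100}]},
    {"status": "KILLED", "taskEvents": [{"status": "KILLED", "timestamp": 300}]},
    {"status": "FAILED", "taskEvents": [{"status": "FAILED", "timestamp": 200}]},
]
ordered = sort_by_last_update(inactive)
```

Sorting on the client keeps the result independent of whatever order the
CLI happens to emit.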


> 3. A job's "status" will always equal the status of the last item in its
> "taskEvents" list.
>

True.


> 4. The full list of terminal states is [LOST, FINISHED, FAILED, KILLED]. A
> job that is not in one of these states will undergo more transitions and
> will remain in the "active" list until it gets to one of these states.
> (Will I ever see DELETED, or do they not show up in aurora job status?)
>

True.  Source of truth is [1].  We actually don't have a state [2] for
DELETED.

[1]
https://github.com/apache/incubator-aurora/blob/9fe6d5408d4aed113a239f22fa5c43aa4f9ae338/api/src/main/thrift/org/apache/aurora/gen/api.thrift#L410-L413
[2]
https://github.com/apache/incubator-aurora/blob/9fe6d5408d4aed113a239f22fa5c43aa4f9ae338/api/src/main/thrift/org/apache/aurora/gen/api.thrift#L348-L380
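
As a rough illustration of statements 3 and 4, here is a sketch in Python.
It assumes a task is a dict with a "status" string and a "taskEvents" list,
as in the JSON output discussed in this thread. The set of states is copied
by hand from statement 4 rather than generated from api.thrift, so it would
need updating if states are ever added. The function names are hypothetical.

```python
# Terminal states as listed in statement 4. api.thrift remains the source
# of truth, so treat this hand-copied set as an assumption.
TERMINAL_STATES = frozenset({"LOST", "FINISHED", "FAILED", "KILLED"})


def is_terminal(task):
    """True if the task's current status is one of the terminal states."""
    return task["status"] in TERMINAL_STATES


def status_matches_last_event(task):
    """Check statement 3: "status" equals the status of the last task event."""
    events = task.get("taskEvents") or []
    return bool(events) and events[-1]["status"] == task["status"]
```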


> 5. A job in the LOST state will always be rescheduled unless it went
> through KILLING first. (What does this represent - killed by user and then
> lost connectivity to the slave?)
>

True.  That is one way it could happen; it could also happen if the
scheduler times the task out while waiting to hear back from Mesos after
attempting to kill the task.


> 6. A job will be rescheduled if it goes through one of [RESTARTING,
> DRAINING, PREEMPTING].
>

True.


> 7. Assuming maxTaskFailures = 1, #5 and #6 are the ONLY situations in which
> a job will be rescheduled.
>

True.
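
Taken together, statements 4 through 7 could be turned into a client-side
check along these lines. This is only a sketch under the assumptions of
this thread (maxTaskFailures = 1, unique job names, tasks represented as
dicts with "status" and "taskEvents"). The function names are hypothetical,
and the rules would need revisiting if new states are introduced.

```python
TERMINAL_STATES = frozenset({"LOST", "FINISHED", "FAILED", "KILLED"})
# States from statement 6: passing through any of these implies a reschedule.
RESCHEDULING_STATES = frozenset({"RESTARTING", "DRAINING", "PREEMPTING"})


def will_be_rescheduled(task):
    """True if a replacement task is expected, per statements 5 and 6.

    Only valid under the thread's assumption that maxTaskFailures = 1.
    """
    history = [event["status"] for event in task.get("taskEvents", [])]
    if RESCHEDULING_STATES.intersection(history):
        return True
    # Statement 5: LOST is rescheduled unless the task went through KILLING.
    return task["status"] == "LOST" and "KILLING" not in history


def is_really_terminal(task):
    """True if the task is in a terminal state and will not be rescheduled."""
    return task["status"] in TERMINAL_STATES and not will_be_rescheduled(task)
```

A task that is KILLED after RESTARTING is in a terminal state but is not
"really terminal" by this check, because a replacement is expected.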


> 8. These rules are unlikely to change in the future ;)
>

True, though we could add more states, which would invalidate (4) and (6).
In practice, we have changed the states and their meanings very little in
~5 years.


> Finally, I noticed something odd: ASSIGNED -> LOST has followups [KILL,
> RESCHEDULE], but STARTING and RUNNING -> LOST only has [RESCHEDULE] as a
> followup. Why?
>

This is because ASSIGNED -> LOST may mean that there was a race between
creating the task and Aurora timing out the launch (it may not have heard
back from Mesos).  To reduce the likelihood of a redundant instance, we
proactively try to kill the task that may have started.  The RUNNING state
does not time out, so we do not have the same concern there.
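
For reference, the follow-up actions mentioned in the question can be
written as a small lookup table. This Python sketch is purely illustrative:
it records only the three transitions named in this thread and is not how
TaskStateMachine.java represents them.

```python
# Follow-up actions for transitions into LOST, as described in this thread.
# Illustrative only; this is not the scheduler's actual data structure.
LOST_FOLLOWUPS = {
    "ASSIGNED": ("KILL", "RESCHEDULE"),  # launch may have raced with the timeout
    "STARTING": ("RESCHEDULE",),
    "RUNNING": ("RESCHEDULE",),
}


def followups_on_lost(previous_state):
    """Return the follow-up actions for a previous_state -> LOST transition.

    Raises KeyError for states this thread does not cover.
    """
    return LOST_FOLLOWUPS[previous_state]
```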


> Thanks,
> Hussein Elgridly
> Senior Software Engineer, DSDE
> The Broad Institute of MIT and Harvard
>