You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@mesos.apache.org by Benjamin Bannier <bb...@apache.org> on 2019/06/24 14:03:13 UTC

Review Request 70936: Adjusted task status updates during draining.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70936/
-----------------------------------------------------------

Review request for mesos, Greg Mann and Joseph Wu.


Bugs: MESOS-9823
    https://issues.apache.org/jira/browse/MESOS-9823


Repository: mesos


Description
-------

When a task is reported as killed to the agent during active agent
draining we now decorate the reported status with
`REASON_AGENT_DRAINING` if no other status was previously present. If
the draining marks the agent as gone via the `mark_gone` draining flag
we additionally report `TASK_GONE_BY_OPERATOR` instead of the original
state.

This patch leaves some ambiguity in what triggered the kill since the
agent-executor protocol does not transport reasons; instead
the reason is here only inferred after the killed task has
been observed. This should usually be fine since due to the inherit race
between e.g., any user- and drain-triggered kill a user cannot
distinguish racy reasons.


Diffs
-----

  src/slave/slave.cpp 30039b0857a4d85b4b96fa95d7f8724d57cdec6e 
  src/tests/slave_tests.cpp c2035976713abb31b3646c0d23771fa40df93271 


Diff: https://reviews.apache.org/r/70936/diff/1/


Testing
-------

`make check`


Thanks,

Benjamin Bannier

Re: Review Request 70936: Adjusted task status updates during draining.

Posted by Joseph Wu <jo...@mesosphere.io>.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70936/#review216571
-----------------------------------------------------------


Ship it!




Ship It!

- Joseph Wu


On June 28, 2019, 2:27 a.m., Benjamin Bannier wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/70936/
> -----------------------------------------------------------
> 
> (Updated June 28, 2019, 2:27 a.m.)
> 
> 
> Review request for mesos, Greg Mann and Joseph Wu.
> 
> 
> Bugs: MESOS-9823
>     https://issues.apache.org/jira/browse/MESOS-9823
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> When a task is reported as killed to the agent during active agent
> draining we now decorate the reported status with
> `REASON_AGENT_DRAINING` unconditionally. If the draining marks the agent
> as gone via the `mark_gone` draining flag we additionally report
> `TASK_GONE_BY_OPERATOR` instead of the original state.
> 
> This patch leaves some ambiguity in what triggered the kill since the
> agent-executor protocol does not transport reasons; instead
> the reason is here only inferred after the killed task has
> been observed. This should usually be fine since due to the inherit race
> between e.g., any user- and drain-triggered kill a user cannot
> distinguish racy reasons.
> 
> 
> Diffs
> -----
> 
>   src/slave/slave.cpp 30039b0857a4d85b4b96fa95d7f8724d57cdec6e 
>   src/tests/slave_tests.cpp 7ba2df9fef9966cd09e0af3f4ef3bb260b6167f6 
> 
> 
> Diff: https://reviews.apache.org/r/70936/diff/4/
> 
> 
> Testing
> -------
> 
> `make check`
> 
> 
> Thanks,
> 
> Benjamin Bannier
> 
>

Re: Review Request 70936: Adjusted task status updates during draining.

Posted by Benjamin Bannier <bb...@apache.org>.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70936/
-----------------------------------------------------------

(Updated June 28, 2019, 11:27 a.m.)


Review request for mesos, Greg Mann and Joseph Wu.


Changes
-------

Fix formatting


Bugs: MESOS-9823
    https://issues.apache.org/jira/browse/MESOS-9823


Repository: mesos


Description
-------

When a task is reported as killed to the agent during active agent
draining we now decorate the reported status with
`REASON_AGENT_DRAINING` unconditionally. If the draining marks the agent
as gone via the `mark_gone` draining flag we additionally report
`TASK_GONE_BY_OPERATOR` instead of the original state.

This patch leaves some ambiguity in what triggered the kill since the
agent-executor protocol does not transport reasons; instead
the reason is here only inferred after the killed task has
been observed. This should usually be fine since due to the inherit race
between e.g., any user- and drain-triggered kill a user cannot
distinguish racy reasons.


Diffs (updated)
-----

  src/slave/slave.cpp 30039b0857a4d85b4b96fa95d7f8724d57cdec6e 
  src/tests/slave_tests.cpp 7ba2df9fef9966cd09e0af3f4ef3bb260b6167f6 


Diff: https://reviews.apache.org/r/70936/diff/4/

Changes: https://reviews.apache.org/r/70936/diff/3-4/


Testing
-------

`make check`


Thanks,

Benjamin Bannier

Re: Review Request 70936: Adjusted task status updates during draining.

Posted by Greg Mann <gr...@mesosphere.io>.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70936/#review216217
-----------------------------------------------------------


Fix it, then Ship it!





src/tests/slave_tests.cpp
Lines 11996 (patched)
<https://reviews.apache.org/r/70936/#comment303324>

    Nit: not indented far enough.



src/tests/slave_tests.cpp
Lines 12224-12225 (patched)
<https://reviews.apache.org/r/70936/#comment303325>

    Nit: indentation.


- Greg Mann


On June 27, 2019, 3:23 p.m., Benjamin Bannier wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/70936/
> -----------------------------------------------------------
> 
> (Updated June 27, 2019, 3:23 p.m.)
> 
> 
> Review request for mesos, Greg Mann and Joseph Wu.
> 
> 
> Bugs: MESOS-9823
>     https://issues.apache.org/jira/browse/MESOS-9823
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> When a task is reported as killed to the agent during active agent
> draining we now decorate the reported status with
> `REASON_AGENT_DRAINING` unconditionally. If the draining marks the agent
> as gone via the `mark_gone` draining flag we additionally report
> `TASK_GONE_BY_OPERATOR` instead of the original state.
> 
> This patch leaves some ambiguity in what triggered the kill since the
> agent-executor protocol does not transport reasons; instead
> the reason is here only inferred after the killed task has
> been observed. This should usually be fine since due to the inherit race
> between e.g., any user- and drain-triggered kill a user cannot
> distinguish racy reasons.
> 
> 
> Diffs
> -----
> 
>   src/slave/slave.cpp 30039b0857a4d85b4b96fa95d7f8724d57cdec6e 
>   src/tests/slave_tests.cpp 7ba2df9fef9966cd09e0af3f4ef3bb260b6167f6 
> 
> 
> Diff: https://reviews.apache.org/r/70936/diff/3/
> 
> 
> Testing
> -------
> 
> `make check`
> 
> 
> Thanks,
> 
> Benjamin Bannier
> 
>

Re: Review Request 70936: Adjusted task status updates during draining.

Posted by Benjamin Bannier <bb...@apache.org>.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70936/
-----------------------------------------------------------

(Updated June 27, 2019, 5:23 p.m.)


Review request for mesos, Greg Mann and Joseph Wu.


Changes
-------

Unconditionally overwrite reason


Bugs: MESOS-9823
    https://issues.apache.org/jira/browse/MESOS-9823


Repository: mesos


Description (updated)
-------

When a task is reported as killed to the agent during active agent
draining we now decorate the reported status with
`REASON_AGENT_DRAINING` unconditionally. If the draining marks the agent
as gone via the `mark_gone` draining flag we additionally report
`TASK_GONE_BY_OPERATOR` instead of the original state.

This patch leaves some ambiguity in what triggered the kill since the
agent-executor protocol does not transport reasons; instead
the reason is here only inferred after the killed task has
been observed. This should usually be fine since due to the inherit race
between e.g., any user- and drain-triggered kill a user cannot
distinguish racy reasons.


Diffs (updated)
-----

  src/slave/slave.cpp 30039b0857a4d85b4b96fa95d7f8724d57cdec6e 
  src/tests/slave_tests.cpp 7ba2df9fef9966cd09e0af3f4ef3bb260b6167f6 


Diff: https://reviews.apache.org/r/70936/diff/3/

Changes: https://reviews.apache.org/r/70936/diff/2-3/


Testing
-------

`make check`


Thanks,

Benjamin Bannier

Re: Review Request 70936: Adjusted task status updates during draining.

Posted by Benjamin Bannier <bb...@apache.org>.


> On June 26, 2019, 10:16 p.m., Benjamin Bannier wrote:
> > src/slave/slave.cpp
> > Lines 5732-5734 (patched)
> > <https://reviews.apache.org/r/70936/diff/2/?file=2152320#file2152320line5732>
> >
> >     I think I got this part incorrect; we should unconditionally set a `TASK_GONE_BY_OPERATOR` state.

Incorrect, we should preserve the original state if the draining does not `mark_gone`. Dropping.


- Benjamin


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70936/#review216160
-----------------------------------------------------------


On June 26, 2019, 2:35 p.m., Benjamin Bannier wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/70936/
> -----------------------------------------------------------
> 
> (Updated June 26, 2019, 2:35 p.m.)
> 
> 
> Review request for mesos, Greg Mann and Joseph Wu.
> 
> 
> Bugs: MESOS-9823
>     https://issues.apache.org/jira/browse/MESOS-9823
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> When a task is reported as killed to the agent during active agent
> draining we now decorate the reported status with
> `REASON_AGENT_DRAINING` if no other status was previously present. If
> the draining marks the agent as gone via the `mark_gone` draining flag
> we additionally report `TASK_GONE_BY_OPERATOR` instead of the original
> state.
> 
> This patch leaves some ambiguity in what triggered the kill since the
> agent-executor protocol does not transport reasons; instead
> the reason is here only inferred after the killed task has
> been observed. This should usually be fine since due to the inherit race
> between e.g., any user- and drain-triggered kill a user cannot
> distinguish racy reasons.
> 
> 
> Diffs
> -----
> 
>   src/slave/slave.cpp 30039b0857a4d85b4b96fa95d7f8724d57cdec6e 
>   src/tests/slave_tests.cpp 7ba2df9fef9966cd09e0af3f4ef3bb260b6167f6 
> 
> 
> Diff: https://reviews.apache.org/r/70936/diff/2/
> 
> 
> Testing
> -------
> 
> `make check`
> 
> 
> Thanks,
> 
> Benjamin Bannier
> 
>

Re: Review Request 70936: Adjusted task status updates during draining.

Posted by Benjamin Bannier <bb...@apache.org>.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70936/#review216160
-----------------------------------------------------------




src/slave/slave.cpp
Lines 5732-5734 (patched)
<https://reviews.apache.org/r/70936/#comment303212>

    I think I got this part incorrect; we should unconditionally set a `TASK_GONE_BY_OPERATOR` state.


- Benjamin Bannier


On June 26, 2019, 2:35 p.m., Benjamin Bannier wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/70936/
> -----------------------------------------------------------
> 
> (Updated June 26, 2019, 2:35 p.m.)
> 
> 
> Review request for mesos, Greg Mann and Joseph Wu.
> 
> 
> Bugs: MESOS-9823
>     https://issues.apache.org/jira/browse/MESOS-9823
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> When a task is reported as killed to the agent during active agent
> draining we now decorate the reported status with
> `REASON_AGENT_DRAINING` if no other status was previously present. If
> the draining marks the agent as gone via the `mark_gone` draining flag
> we additionally report `TASK_GONE_BY_OPERATOR` instead of the original
> state.
> 
> This patch leaves some ambiguity in what triggered the kill since the
> agent-executor protocol does not transport reasons; instead
> the reason is here only inferred after the killed task has
> been observed. This should usually be fine since due to the inherit race
> between e.g., any user- and drain-triggered kill a user cannot
> distinguish racy reasons.
> 
> 
> Diffs
> -----
> 
>   src/slave/slave.cpp 30039b0857a4d85b4b96fa95d7f8724d57cdec6e 
>   src/tests/slave_tests.cpp 7ba2df9fef9966cd09e0af3f4ef3bb260b6167f6 
> 
> 
> Diff: https://reviews.apache.org/r/70936/diff/2/
> 
> 
> Testing
> -------
> 
> `make check`
> 
> 
> Thanks,
> 
> Benjamin Bannier
> 
>

Re: Review Request 70936: Adjusted task status updates during draining.

Posted by Benjamin Bannier <bb...@apache.org>.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70936/
-----------------------------------------------------------

(Updated June 26, 2019, 2:35 p.m.)


Review request for mesos, Greg Mann and Joseph Wu.


Changes
-------

Rebase; address comments from Greg


Bugs: MESOS-9823
    https://issues.apache.org/jira/browse/MESOS-9823


Repository: mesos


Description
-------

When a task is reported as killed to the agent during active agent
draining we now decorate the reported status with
`REASON_AGENT_DRAINING` if no other status was previously present. If
the draining marks the agent as gone via the `mark_gone` draining flag
we additionally report `TASK_GONE_BY_OPERATOR` instead of the original
state.

This patch leaves some ambiguity in what triggered the kill since the
agent-executor protocol does not transport reasons; instead
the reason is here only inferred after the killed task has
been observed. This should usually be fine since due to the inherit race
between e.g., any user- and drain-triggered kill a user cannot
distinguish racy reasons.


Diffs (updated)
-----

  src/slave/slave.cpp 30039b0857a4d85b4b96fa95d7f8724d57cdec6e 
  src/tests/slave_tests.cpp 7ba2df9fef9966cd09e0af3f4ef3bb260b6167f6 


Diff: https://reviews.apache.org/r/70936/diff/2/

Changes: https://reviews.apache.org/r/70936/diff/1-2/


Testing
-------

`make check`


Thanks,

Benjamin Bannier

Re: Review Request 70936: Adjusted task status updates during draining.

Posted by Benjamin Bannier <bb...@apache.org>.


> On June 25, 2019, 11:48 p.m., Greg Mann wrote:
> > src/tests/slave_tests.cpp
> > Lines 12109 (patched)
> > <https://reviews.apache.org/r/70936/diff/1/?file=2151881#file2151881line12109>
> >
> >     Maybe we should unconditionally set the reason to REASON_AGENT_DRAINING? It doesn't make sense to me that the reason received by the scheduler will depend on what stage the task launch was in when the drain message was received?

I was wondering about this as well. The crucial bit to communicate here is that the task terminated unexpectedly which we would provide in any case.

I feel providing frameworks slightly more granular information on where in the task's lifecycle this happened might be interesting as e.g., a task killed while running could have had application-level side effects while one killed during launch wouldn't. Ultimately the framework still wouldn't know where in its application-level lifecycle a running task was killed (and we wouldn't be able to provide that) and the framework would need to deal with this outside of Mesos, but I also do not see a reason to not make the lower level detail on whether the task was launched and of which Mesos is aware visible to the framework.

What do you think?


- Benjamin


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70936/#review216143
-----------------------------------------------------------


On June 26, 2019, 2:35 p.m., Benjamin Bannier wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/70936/
> -----------------------------------------------------------
> 
> (Updated June 26, 2019, 2:35 p.m.)
> 
> 
> Review request for mesos, Greg Mann and Joseph Wu.
> 
> 
> Bugs: MESOS-9823
>     https://issues.apache.org/jira/browse/MESOS-9823
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> When a task is reported as killed to the agent during active agent
> draining we now decorate the reported status with
> `REASON_AGENT_DRAINING` if no other status was previously present. If
> the draining marks the agent as gone via the `mark_gone` draining flag
> we additionally report `TASK_GONE_BY_OPERATOR` instead of the original
> state.
> 
> This patch leaves some ambiguity in what triggered the kill since the
> agent-executor protocol does not transport reasons; instead
> the reason is here only inferred after the killed task has
> been observed. This should usually be fine since due to the inherit race
> between e.g., any user- and drain-triggered kill a user cannot
> distinguish racy reasons.
> 
> 
> Diffs
> -----
> 
>   src/slave/slave.cpp 30039b0857a4d85b4b96fa95d7f8724d57cdec6e 
>   src/tests/slave_tests.cpp 7ba2df9fef9966cd09e0af3f4ef3bb260b6167f6 
> 
> 
> Diff: https://reviews.apache.org/r/70936/diff/2/
> 
> 
> Testing
> -------
> 
> `make check`
> 
> 
> Thanks,
> 
> Benjamin Bannier
> 
>

Re: Review Request 70936: Adjusted task status updates during draining.

Posted by Greg Mann <gr...@mesosphere.io>.


> On June 25, 2019, 9:48 p.m., Greg Mann wrote:
> > src/tests/slave_tests.cpp
> > Lines 12109 (patched)
> > <https://reviews.apache.org/r/70936/diff/1/?file=2151881#file2151881line12109>
> >
> >     Maybe we should unconditionally set the reason to REASON_AGENT_DRAINING? It doesn't make sense to me that the reason received by the scheduler will depend on what stage the task launch was in when the drain message was received?
> 
> Benjamin Bannier wrote:
>     I was wondering about this as well. The crucial bit to communicate here is that the task terminated unexpectedly which we would provide in any case.
>     
>     I feel providing frameworks slightly more granular information on where in the task's lifecycle this happened might be interesting as e.g., a task killed while running could have had application-level side effects while one killed during launch wouldn't. Ultimately the framework still wouldn't know where in its application-level lifecycle a running task was killed (and we wouldn't be able to provide that) and the framework would need to deal with this outside of Mesos, but I also do not see a reason to not make the lower level detail on whether the task was launched and of which Mesos is aware visible to the framework.
>     
>     What do you think?

In the case that the agent is draining, and then the agent sets the REASON_TASK_KILLED_DURING_LAUNCH when killing a task, it seems that we must lose one of the two reasons, so we need to choose which one to retain.

I think that the cost of retaining the REASON_TASK_KILLED_DURING_LAUNCH is that the signal provided to schedulers by REASON_AGENT_DRAINING becomes much less consistent, and thus less useful. When the scheduler can't rely on all killed tasks from a draining agent to contain REASON_AGENT_DRAINING, then it can't always respond correctly to the killed task. In the case of stateful tasks, for example, the scheduler may want to relaunch a new instance immediately if the agent is being drained, rather than waiting for an offer containing persistent volumes which may never come.

My guess is that there are more cases in which REASON_AGENT_DRAINING will help schedulers take meaningful action, rather than REASON_TASK_KILLED_DURING_LAUNCH. The latter could be useful perhaps for schedulers which avoid performing cleanup for tasks which were never successfully launched? Maybe there are other use cases I'm not thinking of, but my intuition is that REASON_AGENT_DRAINED is more useful, so I'd vote for unconditionally setting the reason to REASON_AGENT_DRAINING.


- Greg


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70936/#review216143
-----------------------------------------------------------


On June 26, 2019, 12:35 p.m., Benjamin Bannier wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/70936/
> -----------------------------------------------------------
> 
> (Updated June 26, 2019, 12:35 p.m.)
> 
> 
> Review request for mesos, Greg Mann and Joseph Wu.
> 
> 
> Bugs: MESOS-9823
>     https://issues.apache.org/jira/browse/MESOS-9823
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> When a task is reported as killed to the agent during active agent
> draining we now decorate the reported status with
> `REASON_AGENT_DRAINING` if no other status was previously present. If
> the draining marks the agent as gone via the `mark_gone` draining flag
> we additionally report `TASK_GONE_BY_OPERATOR` instead of the original
> state.
> 
> This patch leaves some ambiguity in what triggered the kill since the
> agent-executor protocol does not transport reasons; instead
> the reason is here only inferred after the killed task has
> been observed. This should usually be fine since due to the inherit race
> between e.g., any user- and drain-triggered kill a user cannot
> distinguish racy reasons.
> 
> 
> Diffs
> -----
> 
>   src/slave/slave.cpp 30039b0857a4d85b4b96fa95d7f8724d57cdec6e 
>   src/tests/slave_tests.cpp 7ba2df9fef9966cd09e0af3f4ef3bb260b6167f6 
> 
> 
> Diff: https://reviews.apache.org/r/70936/diff/2/
> 
> 
> Testing
> -------
> 
> `make check`
> 
> 
> Thanks,
> 
> Benjamin Bannier
> 
>

Re: Review Request 70936: Adjusted task status updates during draining.

Posted by Benjamin Bannier <bb...@apache.org>.


> On June 25, 2019, 11:48 p.m., Greg Mann wrote:
> > src/tests/slave_tests.cpp
> > Lines 12109 (patched)
> > <https://reviews.apache.org/r/70936/diff/1/?file=2151881#file2151881line12109>
> >
> >     Maybe we should unconditionally set the reason to REASON_AGENT_DRAINING? It doesn't make sense to me that the reason received by the scheduler will depend on what stage the task launch was in when the drain message was received?
> 
> Benjamin Bannier wrote:
>     I was wondering about this as well. The crucial bit to communicate here is that the task terminated unexpectedly which we would provide in any case.
>     
>     I feel providing frameworks slightly more granular information on where in the task's lifecycle this happened might be interesting as e.g., a task killed while running could have had application-level side effects while one killed during launch wouldn't. Ultimately the framework still wouldn't know where in its application-level lifecycle a running task was killed (and we wouldn't be able to provide that) and the framework would need to deal with this outside of Mesos, but I also do not see a reason to not make the lower level detail on whether the task was launched and of which Mesos is aware visible to the framework.
>     
>     What do you think?
> 
> Greg Mann wrote:
>     In the case that the agent is draining, and then the agent sets the REASON_TASK_KILLED_DURING_LAUNCH when killing a task, it seems that we must lose one of the two reasons, so we need to choose which one to retain.
>     
>     I think that the cost of retaining the REASON_TASK_KILLED_DURING_LAUNCH is that the signal provided to schedulers by REASON_AGENT_DRAINING becomes much less consistent, and thus less useful. When the scheduler can't rely on all killed tasks from a draining agent to contain REASON_AGENT_DRAINING, then it can't always respond correctly to the killed task. In the case of stateful tasks, for example, the scheduler may want to relaunch a new instance immediately if the agent is being drained, rather than waiting for an offer containing persistent volumes which may never come.
>     
>     My guess is that there are more cases in which REASON_AGENT_DRAINING will help schedulers take meaningful action, rather than REASON_TASK_KILLED_DURING_LAUNCH. The latter could be useful perhaps for schedulers which avoid performing cleanup for tasks which were never successfully launched? Maybe there are other use cases I'm not thinking of, but my intuition is that REASON_AGENT_DRAINED is more useful, so I'd vote for unconditionally setting the reason to REASON_AGENT_DRAINING.

It is done.


- Benjamin


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70936/#review216143
-----------------------------------------------------------


On June 27, 2019, 5:23 p.m., Benjamin Bannier wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/70936/
> -----------------------------------------------------------
> 
> (Updated June 27, 2019, 5:23 p.m.)
> 
> 
> Review request for mesos, Greg Mann and Joseph Wu.
> 
> 
> Bugs: MESOS-9823
>     https://issues.apache.org/jira/browse/MESOS-9823
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> When a task is reported as killed to the agent during active agent
> draining we now decorate the reported status with
> `REASON_AGENT_DRAINING` unconditionally. If the draining marks the agent
> as gone via the `mark_gone` draining flag we additionally report
> `TASK_GONE_BY_OPERATOR` instead of the original state.
> 
> This patch leaves some ambiguity in what triggered the kill since the
> agent-executor protocol does not transport reasons; instead
> the reason is here only inferred after the killed task has
> been observed. This should usually be fine since due to the inherit race
> between e.g., any user- and drain-triggered kill a user cannot
> distinguish racy reasons.
> 
> 
> Diffs
> -----
> 
>   src/slave/slave.cpp 30039b0857a4d85b4b96fa95d7f8724d57cdec6e 
>   src/tests/slave_tests.cpp 7ba2df9fef9966cd09e0af3f4ef3bb260b6167f6 
> 
> 
> Diff: https://reviews.apache.org/r/70936/diff/3/
> 
> 
> Testing
> -------
> 
> `make check`
> 
> 
> Thanks,
> 
> Benjamin Bannier
> 
>

Re: Review Request 70936: Adjusted task status updates during draining.

Posted by Greg Mann <gr...@mesosphere.io>.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70936/#review216143
-----------------------------------------------------------




src/slave/slave.cpp
Lines 5701 (patched)
<https://reviews.apache.org/r/70936/#comment303161>

    s/KILLEd/KILLED/



src/slave/slave.cpp
Lines 5702 (patched)
<https://reviews.apache.org/r/70936/#comment303164>

    s/drainInfo/drainConfig/
    
    Here and elsewhere.



src/slave/slave.cpp
Lines 5721 (patched)
<https://reviews.apache.org/r/70936/#comment303165>

    s/a/the/



src/tests/slave_tests.cpp
Lines 12109 (patched)
<https://reviews.apache.org/r/70936/#comment303166>

    Maybe we should unconditionally set the reason to REASON_AGENT_DRAINING? It doesn't make sense to me that the reason received by the scheduler will depend on what stage the task launch was in when the drain message was received?



src/tests/slave_tests.cpp
Line 12251 (original), 12263 (patched)
<https://reviews.apache.org/r/70936/#comment303167>

    Is the framework ID necessary? Would a `Future<Nothing>` suffice here?


- Greg Mann


On June 24, 2019, 2:03 p.m., Benjamin Bannier wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/70936/
> -----------------------------------------------------------
> 
> (Updated June 24, 2019, 2:03 p.m.)
> 
> 
> Review request for mesos, Greg Mann and Joseph Wu.
> 
> 
> Bugs: MESOS-9823
>     https://issues.apache.org/jira/browse/MESOS-9823
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> When a task is reported as killed to the agent during active agent
> draining we now decorate the reported status with
> `REASON_AGENT_DRAINING` if no other status was previously present. If
> the draining marks the agent as gone via the `mark_gone` draining flag
> we additionally report `TASK_GONE_BY_OPERATOR` instead of the original
> state.
> 
> This patch leaves some ambiguity in what triggered the kill since the
> agent-executor protocol does not transport reasons; instead
> the reason is here only inferred after the killed task has
> been observed. This should usually be fine since due to the inherit race
> between e.g., any user- and drain-triggered kill a user cannot
> distinguish racy reasons.
> 
> 
> Diffs
> -----
> 
>   src/slave/slave.cpp 30039b0857a4d85b4b96fa95d7f8724d57cdec6e 
>   src/tests/slave_tests.cpp c2035976713abb31b3646c0d23771fa40df93271 
> 
> 
> Diff: https://reviews.apache.org/r/70936/diff/1/
> 
> 
> Testing
> -------
> 
> `make check`
> 
> 
> Thanks,
> 
> Benjamin Bannier
> 
>