You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@mesos.apache.org by Benjamin Mahler <bm...@apache.org> on 2017/08/15 07:00:41 UTC

Review Request 61645: Fixed a bug where TASK_KILLED updates can be dropped by the agent.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/61645/
-----------------------------------------------------------

Review request for mesos, Anand Mazumdar and Vinod Kone.


Bugs: MESOS-7783 and MESOS-7863
    https://issues.apache.org/jira/browse/MESOS-7783
    https://issues.apache.org/jira/browse/MESOS-7863


Repository: mesos


Description
-------

Per the description of MESOS-7863, there is currently an assumption
that when a pending task is killed, the framework will be stored in
the agent when the launch proceeds for the killed task. When this
assumption does not hold, the TASK_KILLED update will be dropped
due to the frameowrk being unknown when the launch proceeds. This
assumption doesn't hold in two cases:

  (1) Another pending task was killed and we removed the framework
      in 'Slave::run' thinking it was idle, because pending tasks
      was empty (we remove from pending tasks when processing the
      kill). (MESOS-7783 is an example instance of this).

  (2) The last executor terminated without tasks to send terminal
      updates for, or the last terminated executor received its
      last acknowledgement. At this point, we remove the framework
      thinking there were no pending tasks if the task was killed
      (removed from pending).

The fix applied here is to send the status updates from the kill
path rather than the launch path, to be consistent with how we kill
tasks queued within the Executor struct. We ensure that the task
is removed synchronously within the kill path to prevent its launch.


Diffs
-----

  src/slave/slave.hpp 1fe93dab1b2bef24721cc1bcffebe1b259e96d79 
  src/slave/slave.cpp 7381530515f86faf4c3e8f82bcd9483f6cf0498b 
  src/tests/slave_tests.cpp 1d9d142ed9e801b79535a2c28f5a94ffbf1bf160 


Diff: https://reviews.apache.org/r/61645/diff/1/


Testing
-------

Added a test in a subsequent patch.


Thanks,

Benjamin Mahler


Re: Review Request 61645: Fixed a bug where TASK_KILLED updates can be dropped by the agent.

Posted by Vinod Kone <vi...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/61645/#review183678
-----------------------------------------------------------


Fix it, then Ship it!





src/slave/slave.hpp
Lines 882 (patched)
<https://reviews.apache.org/r/61645/#comment259746>

    s/will be/will also be/ ?



src/slave/slave.cpp
Lines 1820 (patched)
<https://reviews.apache.org/r/61645/#comment259747>

    Can you add a TODO here to track pending tasks on the executor struct instead of framework? We might even be able to leverage queued tasks.


- Vinod Kone


On Aug. 23, 2017, 9:29 p.m., Benjamin Mahler wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/61645/
> -----------------------------------------------------------
> 
> (Updated Aug. 23, 2017, 9:29 p.m.)
> 
> 
> Review request for mesos, Anand Mazumdar and Vinod Kone.
> 
> 
> Bugs: MESOS-7783 and MESOS-7863
>     https://issues.apache.org/jira/browse/MESOS-7783
>     https://issues.apache.org/jira/browse/MESOS-7863
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Per the description of MESOS-7863, there is currently an assumption
> that when a pending task is killed, the framework will be stored in
> the agent when the launch proceeds for the killed task. When this
> assumption does not hold, the TASK_KILLED update will be dropped
> due to the frameowrk being unknown when the launch proceeds. This
> assumption doesn't hold in two cases:
> 
>   (1) Another pending task was killed and we removed the framework
>       in 'Slave::run' thinking it was idle, because pending tasks
>       was empty (we remove from pending tasks when processing the
>       kill). (MESOS-7783 is an example instance of this).
> 
>   (2) The last executor terminated without tasks to send terminal
>       updates for, or the last terminated executor received its
>       last acknowledgement. At this point, we remove the framework
>       thinking there were no pending tasks if the task was killed
>       (removed from pending).
> 
> The fix applied here is to send the status updates from the kill
> path rather than the launch path, to be consistent with how we kill
> tasks queued within the Executor struct. We ensure that the task
> is removed synchronously within the kill path to prevent its launch.
> 
> 
> Diffs
> -----
> 
>   src/slave/slave.hpp 0e07a1af733003bb897cbebb7c1f64437063353d 
>   src/slave/slave.cpp 50d2a10cd68f6611efd4e691e5325e6e0c06f33a 
>   src/tests/slave_tests.cpp a8b46a450c4f3986139390aeed5544e7f9091781 
> 
> 
> Diff: https://reviews.apache.org/r/61645/diff/3/
> 
> 
> Testing
> -------
> 
> Added a test in a subsequent patch.
> 
> 
> Thanks,
> 
> Benjamin Mahler
> 
>


Re: Review Request 61645: Fixed a bug where TASK_KILLED updates can be dropped by the agent.

Posted by Benjamin Mahler <bm...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/61645/#review183690
-----------------------------------------------------------




src/slave/slave.cpp
Lines 2985-2999 (original), 2948-2996 (patched)
<https://reviews.apache.org/r/61645/#comment259755>

    Forgot to add a return statement at the end of this block, will follow up.


- Benjamin Mahler


On Aug. 23, 2017, 9:29 p.m., Benjamin Mahler wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/61645/
> -----------------------------------------------------------
> 
> (Updated Aug. 23, 2017, 9:29 p.m.)
> 
> 
> Review request for mesos, Anand Mazumdar and Vinod Kone.
> 
> 
> Bugs: MESOS-7783 and MESOS-7863
>     https://issues.apache.org/jira/browse/MESOS-7783
>     https://issues.apache.org/jira/browse/MESOS-7863
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Per the description of MESOS-7863, there is currently an assumption
> that when a pending task is killed, the framework will be stored in
> the agent when the launch proceeds for the killed task. When this
> assumption does not hold, the TASK_KILLED update will be dropped
> due to the frameowrk being unknown when the launch proceeds. This
> assumption doesn't hold in two cases:
> 
>   (1) Another pending task was killed and we removed the framework
>       in 'Slave::run' thinking it was idle, because pending tasks
>       was empty (we remove from pending tasks when processing the
>       kill). (MESOS-7783 is an example instance of this).
> 
>   (2) The last executor terminated without tasks to send terminal
>       updates for, or the last terminated executor received its
>       last acknowledgement. At this point, we remove the framework
>       thinking there were no pending tasks if the task was killed
>       (removed from pending).
> 
> The fix applied here is to send the status updates from the kill
> path rather than the launch path, to be consistent with how we kill
> tasks queued within the Executor struct. We ensure that the task
> is removed synchronously within the kill path to prevent its launch.
> 
> 
> Diffs
> -----
> 
>   src/slave/slave.hpp 0e07a1af733003bb897cbebb7c1f64437063353d 
>   src/slave/slave.cpp 50d2a10cd68f6611efd4e691e5325e6e0c06f33a 
>   src/tests/slave_tests.cpp a8b46a450c4f3986139390aeed5544e7f9091781 
> 
> 
> Diff: https://reviews.apache.org/r/61645/diff/3/
> 
> 
> Testing
> -------
> 
> Added a test in a subsequent patch.
> 
> 
> Thanks,
> 
> Benjamin Mahler
> 
>


Re: Review Request 61645: Fixed a bug where TASK_KILLED updates can be dropped by the agent.

Posted by Benjamin Mahler <bm...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/61645/
-----------------------------------------------------------

(Updated Aug. 23, 2017, 9:29 p.m.)


Review request for mesos, Anand Mazumdar and Vinod Kone.


Changes
-------

Addressed review comments.


Bugs: MESOS-7783 and MESOS-7863
    https://issues.apache.org/jira/browse/MESOS-7783
    https://issues.apache.org/jira/browse/MESOS-7863


Repository: mesos


Description
-------

Per the description of MESOS-7863, there is currently an assumption
that when a pending task is killed, the framework will be stored in
the agent when the launch proceeds for the killed task. When this
assumption does not hold, the TASK_KILLED update will be dropped
due to the frameowrk being unknown when the launch proceeds. This
assumption doesn't hold in two cases:

  (1) Another pending task was killed and we removed the framework
      in 'Slave::run' thinking it was idle, because pending tasks
      was empty (we remove from pending tasks when processing the
      kill). (MESOS-7783 is an example instance of this).

  (2) The last executor terminated without tasks to send terminal
      updates for, or the last terminated executor received its
      last acknowledgement. At this point, we remove the framework
      thinking there were no pending tasks if the task was killed
      (removed from pending).

The fix applied here is to send the status updates from the kill
path rather than the launch path, to be consistent with how we kill
tasks queued within the Executor struct. We ensure that the task
is removed synchronously within the kill path to prevent its launch.


Diffs (updated)
-----

  src/slave/slave.hpp 0e07a1af733003bb897cbebb7c1f64437063353d 
  src/slave/slave.cpp 50d2a10cd68f6611efd4e691e5325e6e0c06f33a 
  src/tests/slave_tests.cpp a8b46a450c4f3986139390aeed5544e7f9091781 


Diff: https://reviews.apache.org/r/61645/diff/3/

Changes: https://reviews.apache.org/r/61645/diff/2-3/


Testing
-------

Added a test in a subsequent patch.


Thanks,

Benjamin Mahler


Re: Review Request 61645: Fixed a bug where TASK_KILLED updates can be dropped by the agent.

Posted by Benjamin Mahler <bm...@apache.org>.

> On Aug. 22, 2017, 11:42 p.m., Vinod Kone wrote:
> > src/slave/slave.cpp
> > Lines 4451 (patched)
> > <https://reviews.apache.org/r/61645/diff/2/?file=1801801#file1801801line4501>
> >
> >     Can you add a note/TODO here that if the terminal update for the pending task is retried by the status update manager, it might potentially get dropped if the framework gets removed? Ideally, we should transition the pending task to terminated tasks instead of removing it from the map and forgetting about it.

Added a TODO that we should track it like we do within the executor struct, but didn't mention that issue since `Slave::forward` seems to be ok with a removed framework.


- Benjamin


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/61645/#review183502
-----------------------------------------------------------


On Aug. 23, 2017, 9:29 p.m., Benjamin Mahler wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/61645/
> -----------------------------------------------------------
> 
> (Updated Aug. 23, 2017, 9:29 p.m.)
> 
> 
> Review request for mesos, Anand Mazumdar and Vinod Kone.
> 
> 
> Bugs: MESOS-7783 and MESOS-7863
>     https://issues.apache.org/jira/browse/MESOS-7783
>     https://issues.apache.org/jira/browse/MESOS-7863
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Per the description of MESOS-7863, there is currently an assumption
> that when a pending task is killed, the framework will be stored in
> the agent when the launch proceeds for the killed task. When this
> assumption does not hold, the TASK_KILLED update will be dropped
> due to the frameowrk being unknown when the launch proceeds. This
> assumption doesn't hold in two cases:
> 
>   (1) Another pending task was killed and we removed the framework
>       in 'Slave::run' thinking it was idle, because pending tasks
>       was empty (we remove from pending tasks when processing the
>       kill). (MESOS-7783 is an example instance of this).
> 
>   (2) The last executor terminated without tasks to send terminal
>       updates for, or the last terminated executor received its
>       last acknowledgement. At this point, we remove the framework
>       thinking there were no pending tasks if the task was killed
>       (removed from pending).
> 
> The fix applied here is to send the status updates from the kill
> path rather than the launch path, to be consistent with how we kill
> tasks queued within the Executor struct. We ensure that the task
> is removed synchronously within the kill path to prevent its launch.
> 
> 
> Diffs
> -----
> 
>   src/slave/slave.hpp 0e07a1af733003bb897cbebb7c1f64437063353d 
>   src/slave/slave.cpp 50d2a10cd68f6611efd4e691e5325e6e0c06f33a 
>   src/tests/slave_tests.cpp a8b46a450c4f3986139390aeed5544e7f9091781 
> 
> 
> Diff: https://reviews.apache.org/r/61645/diff/3/
> 
> 
> Testing
> -------
> 
> Added a test in a subsequent patch.
> 
> 
> Thanks,
> 
> Benjamin Mahler
> 
>


Re: Review Request 61645: Fixed a bug where TASK_KILLED updates can be dropped by the agent.

Posted by Vinod Kone <vi...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/61645/#review183502
-----------------------------------------------------------




src/slave/slave.hpp
Lines 883 (patched)
<https://reviews.apache.org/r/61645/#comment259514>

    s/the/a pending/ ?



src/slave/slave.hpp
Lines 884 (patched)
<https://reviews.apache.org/r/61645/#comment259515>

    s/getPendingTaskGroup/getTaskGroupForPendingTask/



src/slave/slave.cpp
Line 1796 (original), 1796 (patched)
<https://reviews.apache.org/r/61645/#comment259516>

    Update the outdated comment.



src/slave/slave.hpp
Lines 879 (patched)
<https://reviews.apache.org/r/61645/#comment259554>

    Can you add a comment here that adding a taskgroup also adds the corresponding tasks into `pendingTasks` ?



src/slave/slave.cpp
Line 1909 (original), 1912 (patched)
<https://reviews.apache.org/r/61645/#comment259557>

    s/Present/Pending/ ?



src/slave/slave.cpp
Lines 1923 (patched)
<https://reviews.apache.org/r/61645/#comment259558>

    can you log the `taskOrTaskgroup` for easier debugging?



src/slave/slave.cpp
Line 2070 (original), 2051 (patched)
<https://reviews.apache.org/r/61645/#comment259556>

    s/Present/Pending/ ?



src/slave/slave.cpp
Lines 2062 (patched)
<https://reviews.apache.org/r/61645/#comment259560>

    can you log the `taskOrTaskgroup` for easier debugging?



src/slave/slave.cpp
Lines 4451 (patched)
<https://reviews.apache.org/r/61645/#comment259567>

    Can you add a note/TODO here that if the terminal update for the pending task is retried by the status update manager, it might potentially get dropped if the framework gets removed? Ideally, we should transition the pending task to terminated tasks instead of removing it from the map and forgetting about it.



src/slave/slave.cpp
Lines 4459 (patched)
<https://reviews.apache.org/r/61645/#comment259568>

    Ideally if the executor is available, you should do a checkpointing update; if not available you can do a non-checkpointing update.



src/slave/slave.cpp
Lines 7512 (patched)
<https://reviews.apache.org/r/61645/#comment259563>

    s/the task/the pending task/



src/slave/slave.cpp
Lines 7545-7570 (patched)
<https://reviews.apache.org/r/61645/#comment259569>

    This is a bit hard to read but am not sure how to improve it yet.


- Vinod Kone


On Aug. 22, 2017, 8:19 p.m., Benjamin Mahler wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/61645/
> -----------------------------------------------------------
> 
> (Updated Aug. 22, 2017, 8:19 p.m.)
> 
> 
> Review request for mesos, Anand Mazumdar and Vinod Kone.
> 
> 
> Bugs: MESOS-7783 and MESOS-7863
>     https://issues.apache.org/jira/browse/MESOS-7783
>     https://issues.apache.org/jira/browse/MESOS-7863
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Per the description of MESOS-7863, there is currently an assumption
> that when a pending task is killed, the framework will be stored in
> the agent when the launch proceeds for the killed task. When this
> assumption does not hold, the TASK_KILLED update will be dropped
> due to the frameowrk being unknown when the launch proceeds. This
> assumption doesn't hold in two cases:
> 
>   (1) Another pending task was killed and we removed the framework
>       in 'Slave::run' thinking it was idle, because pending tasks
>       was empty (we remove from pending tasks when processing the
>       kill). (MESOS-7783 is an example instance of this).
> 
>   (2) The last executor terminated without tasks to send terminal
>       updates for, or the last terminated executor received its
>       last acknowledgement. At this point, we remove the framework
>       thinking there were no pending tasks if the task was killed
>       (removed from pending).
> 
> The fix applied here is to send the status updates from the kill
> path rather than the launch path, to be consistent with how we kill
> tasks queued within the Executor struct. We ensure that the task
> is removed synchronously within the kill path to prevent its launch.
> 
> 
> Diffs
> -----
> 
>   src/slave/slave.hpp 0e07a1af733003bb897cbebb7c1f64437063353d 
>   src/slave/slave.cpp 50d2a10cd68f6611efd4e691e5325e6e0c06f33a 
>   src/tests/slave_tests.cpp a8b46a450c4f3986139390aeed5544e7f9091781 
> 
> 
> Diff: https://reviews.apache.org/r/61645/diff/2/
> 
> 
> Testing
> -------
> 
> Added a test in a subsequent patch.
> 
> 
> Thanks,
> 
> Benjamin Mahler
> 
>


Re: Review Request 61645: Fixed a bug where TASK_KILLED updates can be dropped by the agent.

Posted by Benjamin Mahler <bm...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/61645/
-----------------------------------------------------------

(Updated Aug. 22, 2017, 8:19 p.m.)


Review request for mesos, Anand Mazumdar and Vinod Kone.


Changes
-------

Rebased and updated to remove the old logic from the launch path.


Bugs: MESOS-7783 and MESOS-7863
    https://issues.apache.org/jira/browse/MESOS-7783
    https://issues.apache.org/jira/browse/MESOS-7863


Repository: mesos


Description
-------

Per the description of MESOS-7863, there is currently an assumption
that when a pending task is killed, the framework will be stored in
the agent when the launch proceeds for the killed task. When this
assumption does not hold, the TASK_KILLED update will be dropped
due to the frameowrk being unknown when the launch proceeds. This
assumption doesn't hold in two cases:

  (1) Another pending task was killed and we removed the framework
      in 'Slave::run' thinking it was idle, because pending tasks
      was empty (we remove from pending tasks when processing the
      kill). (MESOS-7783 is an example instance of this).

  (2) The last executor terminated without tasks to send terminal
      updates for, or the last terminated executor received its
      last acknowledgement. At this point, we remove the framework
      thinking there were no pending tasks if the task was killed
      (removed from pending).

The fix applied here is to send the status updates from the kill
path rather than the launch path, to be consistent with how we kill
tasks queued within the Executor struct. We ensure that the task
is removed synchronously within the kill path to prevent its launch.


Diffs (updated)
-----

  src/slave/slave.hpp 0e07a1af733003bb897cbebb7c1f64437063353d 
  src/slave/slave.cpp 50d2a10cd68f6611efd4e691e5325e6e0c06f33a 
  src/tests/slave_tests.cpp a8b46a450c4f3986139390aeed5544e7f9091781 


Diff: https://reviews.apache.org/r/61645/diff/2/

Changes: https://reviews.apache.org/r/61645/diff/1-2/


Testing
-------

Added a test in a subsequent patch.


Thanks,

Benjamin Mahler