Posted to reviews@mesos.apache.org by Benjamin Mahler <bm...@apache.org> on 2017/08/22 19:11:42 UTC

Re: Review Request 61639: Fixed a bug where the agent kills and still launches a task.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/61639/
-----------------------------------------------------------

(Updated Aug. 22, 2017, 7:11 p.m.)


Review request for mesos, Anand Mazumdar and Vinod Kone.


Changes
-------

Rebased and clarified why the task must be transitioned synchronously, per Vinod's suggestion.


Summary (updated)
-----------------

Fixed a bug where the agent kills and still launches a task.


Bugs: MESOS-7744 and MESOS-7865
    https://issues.apache.org/jira/browse/MESOS-7744
    https://issues.apache.org/jira/browse/MESOS-7865


Repository: mesos


Description
-----------

The following race leads to the agent both killing and launching a task:

  (1) Slave::__run completes, task is now within Executor::queuedTasks.
  (2) Slave::killTask locates the executor based on the task ID residing
      in queuedTasks, calls Slave::statusUpdate() with TASK_KILLED.
  (3) Slave::___run assumes that killed tasks have been removed from
      Executor::queuedTasks, but this now occurs asynchronously in
      Slave::_statusUpdate. So the agent still sees the queued task,
      delivers it, and adds it to Executor::launchedTasks.
  (4) Slave::_statusUpdate runs, removes the task from
      Executor::launchedTasks and adds it to Executor::terminatedTasks.

The fix applied here is to synchronously transition queued tasks to
a terminal state when statusUpdate is called. This can be done because
for queued tasks, we do not need to retrieve the container status (the
task never reached the container).
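
The interleaving above, and the effect of the synchronous transition, can
be sketched with a toy model. This is only an illustration under stated
assumptions, not the code in src/slave/slave.cpp: ToyExecutor, ToyAgent,
deliverQueuedTask, drain and launchedDespiteKill are hypothetical names,
and the deque of callbacks merely stands in for work the agent performs
asynchronously.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <set>
#include <string>
#include <utility>

// Hypothetical stand-in for the executor's task bookkeeping.
struct ToyExecutor {
  std::set<std::string> queuedTasks;
  std::set<std::string> launchedTasks;
  std::set<std::string> terminatedTasks;
};

// Hypothetical stand-in for the agent; not the Mesos Slave class.
struct ToyAgent {
  ToyExecutor executor;
  std::deque<std::function<void()>> pending;  // Deferred (asynchronous) work.
  bool synchronousTransition;

  explicit ToyAgent(bool sync) : synchronousTransition(sync) {}

  // Step (1): the task ends up in queuedTasks.
  void queueTask(const std::string& id) {
    assert(executor.queuedTasks.count(id) == 0);  // Queued at most once here.
    executor.queuedTasks.insert(id);
  }

  // Moves the task from queued or launched into terminated.
  void terminate(const std::string& id) {
    const bool wasQueued = executor.queuedTasks.erase(id) > 0;
    const bool wasLaunched = executor.launchedTasks.erase(id) > 0;
    if (wasQueued || wasLaunched) {
      executor.terminatedTasks.insert(id);
    }
  }

  // Step (2): a kill arrives for a task that is still queued.
  void killTask(const std::string& id) {
    if (synchronousTransition && executor.queuedTasks.count(id) > 0) {
      terminate(id);  // Queued task: transition immediately.
      return;
    }
    pending.push_back([this, id]() { terminate(id); });  // Transition later.
  }

  // Step (3): deliver the task if it is still queued.
  // Returns true if the task was handed to the executor.
  bool deliverQueuedTask(const std::string& id) {
    if (executor.queuedTasks.erase(id) == 0) {
      return false;
    }
    executor.launchedTasks.insert(id);
    return true;
  }

  // Step (4): the deferred work finally runs.
  void drain() {
    while (!pending.empty()) {
      std::function<void()> work = std::move(pending.front());
      pending.pop_front();
      work();
    }
  }
};

// Replays steps (1) through (4) and reports whether the killed task
// was still delivered to the executor.
bool launchedDespiteKill(bool synchronousTransition) {
  ToyAgent agent(synchronousTransition);
  agent.queueTask("task-1");
  agent.killTask("task-1");
  const bool launched = agent.deliverQueuedTask("task-1");
  agent.drain();
  return launched;
}
```

In this model, launchedDespiteKill(false) reproduces the race (the killed
task is still delivered), while launchedDespiteKill(true) does not, because
the task has already left queuedTasks by the time delivery is attempted.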


Diffs (updated)
---------------

  src/slave/slave.hpp 0e07a1af733003bb897cbebb7c1f64437063353d 
  src/slave/slave.cpp 50d2a10cd68f6611efd4e691e5325e6e0c06f33a 


Diff: https://reviews.apache.org/r/61639/diff/2/

Changes: https://reviews.apache.org/r/61639/diff/1-2/


Testing
-------

make check

SlaveTest.KillQueuedTaskDuringExecutorRegistration captures this case, but it did not delay retrieving the container status. This test could have been updated previously to delay the container status, but now there is no container status to delay, so I've left the test alone.


Thanks,

Benjamin Mahler