Posted to dev@myriad.apache.org by "John Yost (JIRA)" <ji...@apache.org> on 2016/06/16 19:34:05 UTC

[jira] [Commented] (MYRIAD-220) Improve reliability of kill task messaging

    [ https://issues.apache.org/jira/browse/MYRIAD-220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15334515#comment-15334515 ] 

John Yost commented on MYRIAD-220:
----------------------------------

Myriad interacts with the Mesos scheduler via the Mesos SchedulerDriver class and receives acknowledgement of requests via a callback mechanism defined in the Mesos Scheduler interface.

Myriad Tasks are killed via a process that is ideally one-step but can be two-step if there are any system or network issues at the time the task kill request is submitted to Mesos:

1. The YarnNodeCapacityManager invokes MyriadDriver.kill, which delegates to the Mesos SchedulerDriver.killTask method. At this stage the kill request is sent by the SchedulerDriver to Mesos. A Protos.Status object is returned, but that Status object only indicates whether the SchedulerDriver is in a running state. Until Mesos invokes a callback on the Myriad Scheduler implementation, there is no guarantee that the task kill request succeeded.

2. The TaskTerminator is a daemon that periodically checks the SchedulerState killable tasks queue and invokes MyriadDriverManager.kill, which is a wrapper for the MyriadDriver (and, again, the Mesos SchedulerDriver). In the master branch the TaskTerminator also invokes SchedulerState.removeTask to remove the killable task from the SchedulerState.  This is potentially an issue because, again, the task kill request is not guaranteed to have worked unless the corresponding Mesos callback method is invoked.
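
To illustrate the gap described in the two steps above, here is a minimal sketch. It does not use the real Mesos or Myriad classes: DriverStatus, LossyDriver and EagerTerminator are hypothetical stand-ins, and the "dropped request" flag is only an assumption used to model a system or network issue. It shows that the returned status looks the same whether or not the kill reached the cluster, so removing the task right after the call can lose track of a task that is still running.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical stand-in for the driver status; not the real Protos.Status.
enum DriverStatus { DRIVER_RUNNING, DRIVER_ABORTED }

// Hypothetical driver whose kill requests may be lost before reaching the cluster.
class LossyDriver {
    private final boolean dropRequests;
    final Set<String> runningTasks = new LinkedHashSet<>();

    LossyDriver(boolean dropRequests) {
        this.dropRequests = dropRequests;
    }

    DriverStatus killTask(String taskId) {
        if (!dropRequests) {
            runningTasks.remove(taskId);
        }
        // The status describes the driver only, not the outcome of the kill.
        return DriverStatus.DRIVER_RUNNING;
    }
}

// Hypothetical terminator that forgets a task as soon as the kill was requested.
class EagerTerminator {
    static void runOnce(Set<String> killableTasks, LossyDriver driver) {
        for (String taskId : new ArrayList<>(killableTasks)) {
            DriverStatus status = driver.killTask(taskId);
            if (status == DriverStatus.DRIVER_RUNNING) {
                // Premature: nothing has confirmed that the task was killed.
                killableTasks.remove(taskId);
            }
        }
    }
}
```

With a LossyDriver constructed with dropRequests set to true, one pass of EagerTerminator.runOnce empties the killable set while the task remains in runningTasks, so no later pass will retry the kill.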

The only way to ensure that all killable Myriad tasks are eventually killed is to invoke SchedulerState.removeTask from within Myriad's Mesos lifecycle callback implementation, MyriadScheduler, either directly or within a local Java object.  Specifically, MyriadScheduler implements the org.apache.mesos.Scheduler interface, which is the Mesos callback interface for all Mesos Task lifecycle methods (e.g., registered, statusUpdate, disconnected). When the statusUpdate callback is invoked, one of the task states it can report is that the Myriad task was killed (TASK_KILLED).

When a StatusUpdate is received from Mesos, a Myriad StatusUpdateEvent is created and fired via Disruptor. The message listener is the StatusUpdateEventHandler class. Within its onEvent method, there is logic to decline new resource offers as well as remove the corresponding task from the Myriad SchedulerState.

Consequently, the only code change required is to remove the SchedulerState.removeTask method invocation from TaskTerminator. I am also adding some JUnit tests and Javadoc comments.
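
A rough sketch of the intended behavior after this change, again with hypothetical stand-in types rather than the actual Myriad classes (SketchSchedulerState, RetryingTerminator, SketchStatusUpdateHandler and SketchTaskState are names made up for illustration, and the kill request is modeled as an entry in a list instead of a real driver call). The terminator only re-sends kill requests; the task leaves the killable set only when a killed status update arrives.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for the task state carried by a status update.
enum SketchTaskState { TASK_RUNNING, TASK_KILLED }

// Hypothetical, much-reduced scheduler state: only the killable set.
class SketchSchedulerState {
    private final Set<String> killable = new LinkedHashSet<>();

    void makeTaskKillable(String taskId) { killable.add(taskId); }

    void removeTask(String taskId) { killable.remove(taskId); }

    List<String> killableTasks() { return new ArrayList<>(killable); }
}

// Terminator that keeps requesting kills but never removes tasks itself.
class RetryingTerminator {
    final List<String> killRequests = new ArrayList<>();
    private final SketchSchedulerState state;

    RetryingTerminator(SketchSchedulerState state) { this.state = state; }

    void runOnce() {
        for (String taskId : state.killableTasks()) {
            // Stands in for the driver kill call; no removeTask here.
            killRequests.add(taskId);
        }
    }
}

// Handler that removes a task only once the killed status has been reported.
class SketchStatusUpdateHandler {
    private final SketchSchedulerState state;

    SketchStatusUpdateHandler(SketchSchedulerState state) { this.state = state; }

    void onStatusUpdate(String taskId, SketchTaskState taskState) {
        if (taskState == SketchTaskState.TASK_KILLED) {
            state.removeTask(taskId);
        }
    }
}
```

In this sketch a task that was made killable is retried on every runOnce pass until onStatusUpdate receives TASK_KILLED for it, after which later passes send no further requests.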

--John


> Improve reliability of kill task messaging
> ------------------------------------------
>
>                 Key: MYRIAD-220
>                 URL: https://issues.apache.org/jira/browse/MYRIAD-220
>             Project: Myriad
>          Issue Type: Improvement
>          Components: Scheduler
>            Reporter: John Yost
>            Assignee: John Yost
>
> Currently within the YarnNodeCapacityManager there is a two-step process of killing a YARN task via the following method invocations:
>       state.makeTaskKillable(taskId);
>       myriadDriver.kill(taskId);
> Need to add logic to ensure all killable tasks are killed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)