You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@reef.apache.org by "Markus Weimer (JIRA)" <ji...@apache.org> on 2017/04/17 16:33:41 UTC

[jira] [Commented] (REEF-1778) The IIMRUResultHandler Dispose() method isn't allowed to complete before the job finishes.

    [ https://issues.apache.org/jira/browse/REEF-1778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15971307#comment-15971307 ] 

Markus Weimer commented on REEF-1778:
-------------------------------------

Thanks for reporting this!

I believe your analysis is spot-on. The `Task.Dispose()` method is guaranteed to complete right now. The sequence of events at the end of an IMRU Job is as follows, IIRC:

  * {{UpdateTaskHost}} determines that the job is done and {{return}}s.
  * On a Thread, {{TaskRuntime}} sends the {{ITask}}'s return value to {{TaskStatus.SetResult()}} ({{TaskRuntime.cs}} line 103) which in turn triggers a heartbeat ({{TaskStatus.cs}} line 127. 
  * Receipt of this heartbeat causes the Driver to shutdown the job.
  * Meanwhile, Thread mentioned above is still busy working through the lines after line 103 in {{TaskRuntime.cs}}, which includes calling the {{Dispose}} method on the {{ITask}}, which eventually calls the {{Dispose}} methods in IMRU Tasks.

These last two steps create a race condition. If the roundtrip to the Driver and back is faster than the work done in {{TaskRuntime.cs}} post-line 103, the Thread could be rudely interrupted.

Does this sound like a reasonable explanation for this bug? If so, what ideas do we have to remedy it? My first idea is to restructure {{TaskRuntime}} such that {{Dispose}} is called before a result is send to the Driver. Any other ideas?




> The IIMRUResultHandler Dispose() method isn't allowed to complete before the job finishes.
> ------------------------------------------------------------------------------------------
>
>                 Key: REEF-1778
>                 URL: https://issues.apache.org/jira/browse/REEF-1778
>             Project: REEF
>          Issue Type: Bug
>          Components: IMRU, REEF.NET
>         Environment: YARN. Does not affect the local runtime.
>            Reporter: Rogan Carr
>
> The IIMRUResultHandler Dispose() method isn't allowed to complete before the job finishes when run on YARN. When running locally, this isn't an issue. I have not tested a Mesos runtime.
> My hunch is that YARN is killing the container before Dispose() finishes. If I had to guess, I think that the driver calls Dispose() is on the UpdateTaskHost (which calls it on the IIMRUResultHandler), and then the driver tells YARN that the entire job is done, and then YARN kills all the containers before they have a chance to dispose.
> Whatever the case is, this looks like a race condition. I occasionally do see my Dispose() actions successfully complete.
> -----------------------------------------
> [1] Log showing that IIMRUResultHandler.Dispose() did not finish. Here Dispose() is a for loop that writes a number and waits a second. It is supposed to count to 100, but only counts till 9 before the logs end.
> {noformat}
> ...
> INFO: receive a ReefMessage from Org.Apache.REEF.Common.Protobuf.ReefProtocol.REEFMessage Driver at socket://100.119.200.14:9570.
> Org.Apache.REEF.Common.Runtime.Evaluator.ReefMessageProtoObserver Information: 0 : 2017-04-14T15:48:09.5616766+00:00 0020
> INFO: Control protobuf to remove context MasterRootContext
> Org.Apache.REEF.Common.Runtime.Evaluator.EvaluatorRuntime Information: 0 : 2017-04-14T15:48:09.5616766+00:00 0020
> INFO: Handle Evaluator control message:  done_evaluator = null kill_evaluator = null stop_evaluator = null context_control = not null
> Org.Apache.REEF.Common.Runtime.Evaluator.EvaluatorRuntime Information: 0 : 2017-04-14T15:48:09.5616766+00:00 0020
> INFO: Send task control message to ContextManager
> Org.Apache.REEF.Common.Runtime.Evaluator.Context.ContextManager Information: 0 : 2017-04-14T15:48:09.5616766+00:00 0020
> INFO: RemoveContext with id MasterRootContext
> Org.Apache.REEF.Common.Runtime.Evaluator.Context.ContextRuntime Warning: 0 : 2017-04-14T15:48:09.5616766+00:00 0020
> WARNING: Shutting down an task because the underlying context is being closed.
> Org.Apache.REEF.Common.Runtime.Evaluator.Task.TaskRuntime Information: 0 : 2017-04-14T15:48:09.5616766+00:00 0020
> INFO: Trying to close Task IMRUMaster-0
> Org.Apache.REEF.Common.Runtime.Evaluator.Task.TaskRuntime Warning: 0 : 2017-04-14T15:48:09.5616766+00:00 0020
> WARNING: Trying to close an task that is in Done state. Ignored.
> Org.Apache.REEF.Common.Context.Defaults.DefaultContextStopHandler Information: 0 : 2017-04-14T15:48:09.5616766+00:00 0020
> INFO: DefaultContextStopHandler received for context: MasterRootContext
> Org.Apache.REEF.Common.Runtime.Evaluator.EvaluatorRuntime Information: 0 : 2017-04-14T15:48:09.5616766+00:00 0020
> INFO: Context stack is empty, done
> Org.Apache.REEF.Common.Runtime.Evaluator.HeartBeatManager Information: 0 : 2017-04-14T15:48:09.5616766+00:00 0020
> INFO: Triggered a heartbeat: EvaluatorHeartbeatProto: task_id=[], task_status=[], task_message=[], evaluator_status=[DONE], context_status=[], time
> stamp=[1492184889561], recoveryFlag =[False].
> Org.Apache.REEF.Common.Runtime.Evaluator.ReefMessageProtoObserver Information: 0 : 2017-04-14T15:48:09.5929225+00:00 0005
> INFO: receive a ReefMessage from Org.Apache.REEF.Common.Protobuf.ReefProtocol.REEFMessage Driver at socket://100.119.200.14:9570.
> Org.Apache.REEF.Common.Runtime.Evaluator.EvaluatorRuntime Information: 0 : 2017-04-14T15:48:09.5929225+00:00 0005
> INFO: Handle Evaluator control message:  done_evaluator = not null kill_evaluator = null stop_evaluator = null context_control = null
> Org.Apache.REEF.Common.Runtime.Evaluator.EvaluatorRuntime Information: 0 : 2017-04-14T15:48:09.5929225+00:00 0005
> INFO: Received ACK from Driver, shutting down Evaluator.
> Org.Apache.REEF.Common.Runtime.Evaluator.EvaluatorExitLogger Information: 0 : 2017-04-14T15:48:09.5929225+00:00 0001
> INFO: The Evaluator has successfully exited.
> Org.Apache.REEF.Evaluator.Evaluator Information: 0 : 2017-04-14T15:48:09.5929225+00:00 0001
> INFO: Evaluator is returned from Run()
> [name-of-class-deleted]`1 Information: 0 : 2017-04-14T15:48:10.0460051+00:00 0014
> INFO: IIMRUResultHandler Dispose Countdown: 1
> [name-of-class-deleted]`1 Information: 0 : 2017-04-14T15:48:11.0615866+00:00 0014
> INFO: IIMRUResultHandler Dispose Countdown: 2
> [name-of-class-deleted]`1 Information: 0 : 2017-04-14T15:48:12.0771779+00:00 0014
> INFO: IIMRUResultHandler Dispose Countdown: 3
> [name-of-class-deleted]`1 Information: 0 : 2017-04-14T15:48:13.0927401+00:00 0014
> INFO: IIMRUResultHandler Dispose Countdown: 4
> [name-of-class-deleted]`1 Information: 0 : 2017-04-14T15:48:14.1083243+00:00 0014
> INFO: IIMRUResultHandler Dispose Countdown: 5
> [name-of-class-deleted]`1 Information: 0 : 2017-04-14T15:48:15.1239021+00:00 0014
> INFO: IIMRUResultHandler Dispose Countdown: 6
> [name-of-class-deleted]`1 Information: 0 : 2017-04-14T15:48:16.1394801+00:00 0014
> INFO: IIMRUResultHandler Dispose Countdown: 7
> [name-of-class-deleted]`1 Information: 0 : 2017-04-14T15:48:17.1550601+00:00 0014
> INFO: IIMRUResultHandler Dispose Countdown: 8
> [name-of-class-deleted]`1 Information: 0 : 2017-04-14T15:48:18.1706423+00:00 0014
> INFO: IIMRUResultHandler Dispose Countdown: 9
> End of LogType:evaluator.stdout
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)