You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@reef.apache.org by "Julia (JIRA)" <ji...@apache.org> on 2017/05/03 22:20:04 UTC

[jira] [Resolved] (REEF-1778) The IIMRUResultHandler Dispose() method isn't allowed to complete before the job finishes.

     [ https://issues.apache.org/jira/browse/REEF-1778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julia resolved REEF-1778.
-------------------------
    Resolution: Fixed

Resolved via https://github.com/apache/reef/pull/1293

> The IIMRUResultHandler Dispose() method isn't allowed to complete before the job finishes.
> ------------------------------------------------------------------------------------------
>
>                 Key: REEF-1778
>                 URL: https://issues.apache.org/jira/browse/REEF-1778
>             Project: REEF
>          Issue Type: Bug
>          Components: IMRU, REEF.NET
>         Environment: YARN. Does not affect the local runtime.
>            Reporter: Rogan Carr
>            Assignee: Markus Weimer
>
> The IIMRUResultHandler Dispose() method isn't allowed to complete before the job finishes when run on YARN. When running locally, this isn't an issue. I have not tested a Mesos runtime.
> My hunch is that YARN is killing the container before Dispose() finishes. If I had to guess, I think that the driver calls Dispose() is on the UpdateTaskHost (which calls it on the IIMRUResultHandler), and then the driver tells YARN that the entire job is done, and then YARN kills all the containers before they have a chance to dispose.
> Whatever the case is, this looks like a race condition. I occasionally do see my Dispose() actions successfully complete.
> -----------------------------------------
> [1] Log showing that IIMRUResultHandler.Dispose() did not finish. Here Dispose() is a for loop that writes a number and waits a second. It is supposed to count to 100, but only counts till 9 before the logs end.
> {noformat}
> ...
> INFO: receive a ReefMessage from Org.Apache.REEF.Common.Protobuf.ReefProtocol.REEFMessage Driver at socket://100.119.200.14:9570.
> Org.Apache.REEF.Common.Runtime.Evaluator.ReefMessageProtoObserver Information: 0 : 2017-04-14T15:48:09.5616766+00:00 0020
> INFO: Control protobuf to remove context MasterRootContext
> Org.Apache.REEF.Common.Runtime.Evaluator.EvaluatorRuntime Information: 0 : 2017-04-14T15:48:09.5616766+00:00 0020
> INFO: Handle Evaluator control message:  done_evaluator = null kill_evaluator = null stop_evaluator = null context_control = not null
> Org.Apache.REEF.Common.Runtime.Evaluator.EvaluatorRuntime Information: 0 : 2017-04-14T15:48:09.5616766+00:00 0020
> INFO: Send task control message to ContextManager
> Org.Apache.REEF.Common.Runtime.Evaluator.Context.ContextManager Information: 0 : 2017-04-14T15:48:09.5616766+00:00 0020
> INFO: RemoveContext with id MasterRootContext
> Org.Apache.REEF.Common.Runtime.Evaluator.Context.ContextRuntime Warning: 0 : 2017-04-14T15:48:09.5616766+00:00 0020
> WARNING: Shutting down an task because the underlying context is being closed.
> Org.Apache.REEF.Common.Runtime.Evaluator.Task.TaskRuntime Information: 0 : 2017-04-14T15:48:09.5616766+00:00 0020
> INFO: Trying to close Task IMRUMaster-0
> Org.Apache.REEF.Common.Runtime.Evaluator.Task.TaskRuntime Warning: 0 : 2017-04-14T15:48:09.5616766+00:00 0020
> WARNING: Trying to close an task that is in Done state. Ignored.
> Org.Apache.REEF.Common.Context.Defaults.DefaultContextStopHandler Information: 0 : 2017-04-14T15:48:09.5616766+00:00 0020
> INFO: DefaultContextStopHandler received for context: MasterRootContext
> Org.Apache.REEF.Common.Runtime.Evaluator.EvaluatorRuntime Information: 0 : 2017-04-14T15:48:09.5616766+00:00 0020
> INFO: Context stack is empty, done
> Org.Apache.REEF.Common.Runtime.Evaluator.HeartBeatManager Information: 0 : 2017-04-14T15:48:09.5616766+00:00 0020
> INFO: Triggered a heartbeat: EvaluatorHeartbeatProto: task_id=[], task_status=[], task_message=[], evaluator_status=[DONE], context_status=[], time
> stamp=[1492184889561], recoveryFlag =[False].
> Org.Apache.REEF.Common.Runtime.Evaluator.ReefMessageProtoObserver Information: 0 : 2017-04-14T15:48:09.5929225+00:00 0005
> INFO: receive a ReefMessage from Org.Apache.REEF.Common.Protobuf.ReefProtocol.REEFMessage Driver at socket://100.119.200.14:9570.
> Org.Apache.REEF.Common.Runtime.Evaluator.EvaluatorRuntime Information: 0 : 2017-04-14T15:48:09.5929225+00:00 0005
> INFO: Handle Evaluator control message:  done_evaluator = not null kill_evaluator = null stop_evaluator = null context_control = null
> Org.Apache.REEF.Common.Runtime.Evaluator.EvaluatorRuntime Information: 0 : 2017-04-14T15:48:09.5929225+00:00 0005
> INFO: Received ACK from Driver, shutting down Evaluator.
> Org.Apache.REEF.Common.Runtime.Evaluator.EvaluatorExitLogger Information: 0 : 2017-04-14T15:48:09.5929225+00:00 0001
> INFO: The Evaluator has successfully exited.
> Org.Apache.REEF.Evaluator.Evaluator Information: 0 : 2017-04-14T15:48:09.5929225+00:00 0001
> INFO: Evaluator is returned from Run()
> [name-of-class-deleted]`1 Information: 0 : 2017-04-14T15:48:10.0460051+00:00 0014
> INFO: IIMRUResultHandler Dispose Countdown: 1
> [name-of-class-deleted]`1 Information: 0 : 2017-04-14T15:48:11.0615866+00:00 0014
> INFO: IIMRUResultHandler Dispose Countdown: 2
> [name-of-class-deleted]`1 Information: 0 : 2017-04-14T15:48:12.0771779+00:00 0014
> INFO: IIMRUResultHandler Dispose Countdown: 3
> [name-of-class-deleted]`1 Information: 0 : 2017-04-14T15:48:13.0927401+00:00 0014
> INFO: IIMRUResultHandler Dispose Countdown: 4
> [name-of-class-deleted]`1 Information: 0 : 2017-04-14T15:48:14.1083243+00:00 0014
> INFO: IIMRUResultHandler Dispose Countdown: 5
> [name-of-class-deleted]`1 Information: 0 : 2017-04-14T15:48:15.1239021+00:00 0014
> INFO: IIMRUResultHandler Dispose Countdown: 6
> [name-of-class-deleted]`1 Information: 0 : 2017-04-14T15:48:16.1394801+00:00 0014
> INFO: IIMRUResultHandler Dispose Countdown: 7
> [name-of-class-deleted]`1 Information: 0 : 2017-04-14T15:48:17.1550601+00:00 0014
> INFO: IIMRUResultHandler Dispose Countdown: 8
> [name-of-class-deleted]`1 Information: 0 : 2017-04-14T15:48:18.1706423+00:00 0014
> INFO: IIMRUResultHandler Dispose Countdown: 9
> End of LogType:evaluator.stdout
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)