Posted to dev@reef.apache.org by "Mariia Mykhailova (JIRA)" <ji...@apache.org> on 2016/09/20 18:58:20 UTC

[jira] [Commented] (REEF-1511) timeout for Task Shutdown during IMRU recovery

    [ https://issues.apache.org/jira/browse/REEF-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15507447#comment-15507447 ] 

Mariia Mykhailova commented on REEF-1511:
-----------------------------------------

We should consider a time-based timeout (wait for X seconds, then kill the evaluators of all tasks that haven't reported back by then) versus a task-percentage-based timeout (wait until Y% of tasks have reported, then kill the rest). In the first case, we introduce real time units into a distributed system. In the second case, we are not bound to real time, but we always lose a certain percentage of evaluators on restart, which is exactly what we're trying to minimize with our fault-tolerance work.
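
To make the trade-off concrete, here is a rough sketch of the two stopping rules. It is written in Python for brevity and is not the REEF driver code; the function names are hypothetical, and X and Y are passed in as plain arguments.

```python
def timed_out_by_time(elapsed_seconds, x_seconds):
    """Time-based rule: stop waiting once X seconds have passed."""
    return elapsed_seconds >= x_seconds


def timed_out_by_percentage(reported, total, y_percent):
    """Percentage-based rule: stop waiting once Y% of tasks have reported.

    Integer arithmetic avoids floating-point rounding at the boundary.
    """
    return reported * 100 >= total * y_percent


def evaluators_lost(reported, total):
    """Evaluators killed if the driver stops waiting right now."""
    return total - reported
```

With 100 tasks and Y = 95, the percentage rule fires as soon as the 95th task reports, so 5 evaluators are killed on every restart even if they would have reported a moment later. The time-based rule kills none when every task reports within X seconds, but it depends on a wall-clock value.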

> timeout for Task Shutdown during IMRU recovery
> ----------------------------------------------
>
>                 Key: REEF-1511
>                 URL: https://issues.apache.org/jira/browse/REEF-1511
>             Project: REEF
>          Issue Type: Improvement
>          Components: IMRU
>            Reporter: Andrey
>              Labels: FT
>
> This is related to the fault tolerance implementation in PR-1251.
> Currently, the recovery logic in the IMRU driver waits for all tasks to move to a final state (failed or completed) before restarting the job (see the AreAllTasksInFinalState() check in the TryRecovery() method).
> We've seen the driver hang for a long time waiting for the last few tasks to finalize.
> Aborting tasks should be quick, so there is a bug there, but we can also add logic to the driver so that it does not wait for all tasks to complete.
> For instance: if 5% of tasks have not reported a final state within the expected period, release the corresponding evaluators and proceed with a new job retry.
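
One possible reading of the rule proposed in the description, sketched in Python rather than in the driver's own code. Everything here is an assumption for illustration: the function name, the state strings, and in particular the choice to keep waiting when more than 5% of tasks are still outstanding, which the description does not specify.

```python
FINAL_STATES = {"failed", "completed"}


def tasks_to_release(task_states, elapsed_seconds, expected_seconds,
                     max_stragglers_percent=5):
    """Return the ids of tasks to give up on, or [] to keep waiting.

    task_states maps a task id to its last reported state (hypothetical
    representation, not the driver's actual data structure).
    """
    pending = [task_id for task_id, state in task_states.items()
               if state not in FINAL_STATES]
    if not pending:
        return []  # all tasks are final; nothing to release
    if elapsed_seconds < expected_seconds:
        return []  # still inside the expected period
    if len(pending) * 100 > len(task_states) * max_stragglers_percent:
        return []  # assumption: too many stragglers, keep waiting
    return pending  # release these evaluators and retry the job
```

For example, with 20 tasks of which one is still running after the expected period, the single straggler (5% of the tasks) is returned and the retry can proceed without it.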



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)