You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@reef.apache.org by "Dhruv Mahajan (JIRA)" <ji...@apache.org> on 2016/08/31 17:45:20 UTC
[jira] [Comment Edited] (REEF-1251) IMRU Driver handlers for Fault Tolerant

    [ https://issues.apache.org/jira/browse/REEF-1251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15452860#comment-15452860 ] 

Dhruv Mahajan edited comment on REEF-1251 at 8/31/16 5:44 PM:
--------------------------------------------------------------

[~juliaw] [~MariiaMykhailova][~markus.weimer] [~andreym] Looking at the current PR it seems there are separate multiple PRs it can be divided in to. However, I see the concern that testing by putting these as seprate JIRas will become a daunting task. Hence, it might be ok to merge it as one off big PR. However, I would like to lay down concerns and shortcoming and that we are on the same page that these need to be addressed to make IMRU Ft practically usable.

# Currently, if the evaluator fails while we are still in the phase of task submission phase, we will have an issue where the newly created tasks will wait for a long time in {{WaitForRegistration}} in Group communication initialization before getting cancelled. Two ways this can be handled:
#* A proper let's get it over with way would be to let driver handle this registration and synchronization mechanism. Having it at central place will solve lots of issues for us. The {{GroupCommDriver}} service would need to be extended and addition event handler would need to be binded so that when driver received the messages from context that GroupComm. service is ready it can start tasks. This also means having an additional context later in IMRU FT.
#* A less optimal and less preferrable way would be to pass the {{WaitForRegistration}} some sort of cancellation token or bool sort of variable that it can check after every retry that whether it needs to come out. The task on receiving the close signal can then simply set this boolean to true. Will this work? One question: if we are in the constructor of the task, i.e. driver still has not got an {{IRunningTask}}, is there a way to send close signal and act on it, I guess not. If not, then I would seriously suggest working on the first above.
# The whole exception handling in {{*TaskHost}} look very convoluted to me. It seems they were put after testing a lot on cluster and observing exceptions we encountered. What if we encounter anew sort of exception? I understand this is a trickier problem in general and I propose to simplify it. There can be multiple sources of exceptions or failures : a) Bug in base REEF, b) Bug in IMRU, c) Bug in user's map and update tasks, and d) Bug in group communication since codecs provided by user are buggy or he forgot to provide one. Can't we have a simple logic where all failures are recoverable? a), b) should not happen in any case since those are REEF bugs and we should not run IMRU FT while they exist. For c) and d) responsibility lies with user and it's ok to do weird things there. Infact, Hadoop Map-reduce also does that. There can be another issue where cluster itself is doing weird things beyond a)-d) although I can not thing what. Then in any case we can't do much.

Thoughts?




was (Author: dkm2110):
[~juliaw] [~MariiaMykhailova][~markus.weimer] [~andreym] Looking at the current PR it seems there are separate multiple PRs it can be divided in to. However, I see the concern that testing by putting these as seprate JIRas will become a daunting task. Hence, it might be ok to merge it as one off big PR. However, I would like to lay down concerns and shortcoming and that we are on the same page that these need to be addressed to make IMRU Ft practically usable.

# Currently, if the evaluator fails while we are still in the phase of task submission phase, we will have an issue where the newly created tasks will wait for a long time in {{WaitForRegistration}} in Group communication initialization before getting cancelled. Two ways this can be handled:
#* A proper let's get it over with way would be to let driver handle this registration and synchronization mechanism. Having it at central place will solve lots of issues for us. The {{GroupCommDriver}} service would need to be extended and addition event handler would need to be binded so that when driver received the messages from context that GroupComm. service is ready it can start tasks. This also means having an additional context later in IMRU FT.
#* A less optimal and less preferrable way would be to pass the {{WaitForRegistration}} some sort of cancellation token or bool sort of variable that it can check after every retry that whether it needs to come out. The task on receiving the close signal can then simply set this boolean to true. Will this work? One question: if we are in the constructor of the task, i.e. driver still has not got an {{IRunningTask}}, is there a way to send close signal and act on it, I guess not. If not, then I would seriously suggest working on the first above.

# The whole exception handling in {{*TaskHost}} look very convoluted to me. It seems they were put after testing a lot on cluster and observing exceptions we encountered. What if we encounter anew sort of exception? I understand this is a trickier problem in general and I propose to simplify it. There can be multiple sources of exceptions or failures : a) Bug in base REEF, b) Bug in IMRU, c) Bug in user's map and update tasks, and d) Bug in group communication since codecs provided by user are buggy or he forgot to provide one. Can't we have a simple logic where all failures are recoverable? a), b) should not happen in any case since those are REEF bugs and we should not run IMRU FT while they exist. For c) and d) responsibility lies with user and it's ok to do weird things there. Infact, Hadoop Map-reduce also does that. There can be another issue where cluster itself is doing weird things beyond a)-d) although I can not thing what. Then in any case we can't do much.

Thoughts?



> IMRU Driver handlers for Fault Tolerant
> ---------------------------------------
>
>                 Key: REEF-1251
>                 URL: https://issues.apache.org/jira/browse/REEF-1251
>             Project: REEF
>          Issue Type: Task
>          Components: REEF.NET, REEF.NET Evaluator
>            Reporter: Julia
>            Assignee: Julia
>              Labels: FT
>
> Handles communications between driver and evaluators for evaluator and task recovery when some evaluators fail. The following describe a flow for an example:
> Here is the control flow in normal scenario:
> a.	All the task, context and task status information is maintained in Task Manager when tasks are created at the first time
> b.	Task1, task2, Task3 s are queued in Task Starter 
> c.	When all tasks in a group is ready, tasks are submitted
> d.	When tasks start running, task status is updated in Task Manager
> e.	Evaluator 3 failed 
> f.	Driver received failed evaluator event and report it to Evaluator Manager
> g.	Task Manager update task status to set task3 as failed
> h.	Driver send message to task1 and task2 to stop them and update task status in Task Manager
> i.	Driver request a new evaluator3’ for failed evaluator and submit a new context3’ for it and add a new task3’ to the queue
> j.	Driver recreate task1’ and task2’ with existing context1 and context2 add them to the queue
> k.	When all the new tasks in the communication group are ready, start tasks as in step c.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)