Posted to dev@reef.apache.org by "Julia (JIRA)" <ji...@apache.org> on 2016/12/09 03:01:07 UTC

[jira] [Commented] (REEF-1492) On IMRU recovery: if ResultHandler.Dispose() throws exception, IMRU Driver hangs.

    [ https://issues.apache.org/jira/browse/REEF-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15734115#comment-15734115 ] 

Julia commented on REEF-1492:
-----------------------------

Currently we call ResultHandler.Dispose() in the finally block of TaskHost, and the current TLCPlusPlus implementation of ResultHandler copies the local file to remote in its Dispose() method. Because an exception in the task can happen at any time, and a close event can be sent at any stage, there may be no result yet when ResultHandler.Dispose() runs, and the local file may not have been created. So it is quite possible that an exception will be thrown.

This exception should be caught by TaskRuntime and eventually sent back to the driver. However, the Dispose() call comes before SignalTaskStopped in the TaskHost base class. So when an exception is thrown in ResultHandler.Dispose(), the call to SignalTaskStopped is skipped, which may cause something to hang.
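
To illustrate the ordering problem, here is a minimal sketch in Python. It is not the REEF.NET code: the class, the function, and the event strings are hypothetical stand-ins, and it assumes only what is described above, namely that Dispose() runs before SignalTaskStopped inside the same finally block.

```python
class ThrowingResultHandler:
    """Hypothetical stand-in for a result handler whose dispose step fails."""

    def dispose(self):
        # Assumed failure mode: there is no local result file to copy.
        raise OSError("no result file to copy")


def run_task(handler, events):
    """Mimics the ordering described above: dispose first, then signal."""
    try:
        events.append("task body")
    finally:
        handler.dispose()                   # raises here...
        events.append("SignalTaskStopped")  # ...so this line is never reached


events = []
try:
    run_task(ThrowingResultHandler(), events)
except OSError:
    events.append("exception propagated")
```

After this runs, events holds "task body" and "exception propagated" but never "SignalTaskStopped", which is the skipped call described above.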

What I would suggest is:
1. Copying the local result data file to remote should be done in the ResultHandler.HandleResult() method. This method is called only when there is a result. I assume it is called only once, at the end of the iteration. [~dkm2110] please let me know if that is not the case. We should not put a lot of logic in the Dispose() method; it should only release resources.
2. We should catch exceptions when calling FinallyBlock(), which calls Dispose(), in the TaskHost. If there is no complex logic in the Dispose() method, the chance of failure should be low. If we really cannot release some resource in the Dispose() method, it should result in a FailedEvaluator. As this is the master, there is no recovery.
3. Add another layer of finally around FinallyBlock() to call SignalTaskStopped in TaskHostBase, to ensure that the task close event handler returns.
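
The three suggestions can be sketched together as follows. This is a Python illustration of the control flow only, not the REEF.NET implementation: every class and function name is a hypothetical stand-in, and returning the caught exception is just an assumed way of letting the caller escalate it (for example to a FailedEvaluator).

```python
class ResultHandler:
    """Hypothetical handler: the copy lives in handle_result, not dispose."""

    def __init__(self):
        self.copied = False
        self.released = False

    def handle_result(self, result):
        # Suggestion 1: copy to remote here, only when a result exists.
        self.copied = True

    def dispose(self):
        # Dispose only releases resources.
        self.released = True


class FailingDisposeHandler(ResultHandler):
    """Hypothetical handler whose dispose step still fails."""

    def dispose(self):
        raise OSError("cannot release resource")


def run_task(handler, result, events):
    """Runs the task so that the stop signal is sent on every path."""
    dispose_failure = None
    try:
        try:
            if result is not None:
                handler.handle_result(result)
        finally:
            # Suggestion 2: catch exceptions thrown by the dispose step.
            try:
                handler.dispose()
            except Exception as exc:
                dispose_failure = exc
                events.append("dispose failed")
    finally:
        # Suggestion 3: an outer finally guarantees the stop signal.
        events.append("SignalTaskStopped")
    return dispose_failure


failed_events = []
failure = run_task(FailingDisposeHandler(), None, failed_events)
```

In this sketch "SignalTaskStopped" is recorded even though dispose() raised, and the caller still receives the failure so it can decide how to escalate it.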



> On IMRU recovery: if ResultHandler.Dispose() throws exception, IMRU Driver hangs.
> ---------------------------------------------------------------------------------
>
>                 Key: REEF-1492
>                 URL: https://issues.apache.org/jira/browse/REEF-1492
>             Project: REEF
>          Issue Type: Bug
>          Components: REEF
>            Reporter: Andrey
>              Labels: FT
>
> IMRU scenario:
> - one of the map tasks fails
> - Driver triggers shutdown on all tasks 
> - UpdateTaskHost on shutdown is calling ResultHandler.Dispose()
> - the result handler (in my case WriteResultHandler) throws an exception because there are no results (the Update function was never executed)
> There are a couple of questions here:
> - WriteResultHandler should handle the [no results] situation more gracefully, especially on Dispose().
> The file copy logic should probably be moved from Dispose() to the HandleResult() function.
> - UpdateTaskHost should handle exceptions from the Dispose() call. The result handler can be provided by the client, so its code can throw.
> In case of a Dispose() failure, UpdateTaskHost should probably trigger a non-recoverable failure, which in turn triggers a Driver failure (right now the driver hangs).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)