Posted to issues@tez.apache.org by "Jeff Zhang (JIRA)" <ji...@apache.org> on 2014/09/10 17:19:34 UTC

[jira] [Commented] (TEZ-1559) Add system tests for AM recovery

    [ https://issues.apache.org/jira/browse/TEZ-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14128583#comment-14128583 ] 

Jeff Zhang commented on TEZ-1559:
---------------------------------

Attached the patch.

* Add a unit test covering the case in the description (it reads the recovery log to verify the behavior).
* Add vertexName to TaskAttemptFinishedProto for the unit test (and do the same for other events, for use in future unit tests).
* Handle the HistoryEvents remaining in the event queue in RecoveryService to ensure that all events are written when the AM is killed. Add a configuration option to enable this feature.
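
To illustrate the last point, here is a minimal sketch of draining a pending event queue on shutdown, gated by a flag. It is not the code from the patch: the class DrainOnStopSketch, the drainOnStop flag, and the use of plain strings as events are all hypothetical stand-ins, and the real RecoveryService writes from a separate thread to a file rather than to an in-memory list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical stand-in for a recovery writer; NOT the actual Tez RecoveryService.
public class DrainOnStopSketch {
    // Events that have been accepted but not yet written.
    private final BlockingQueue<String> eventQueue = new LinkedBlockingQueue<>();
    // Stands in for the recovery log file.
    private final List<String> recoveryLog = new ArrayList<>();
    // Hypothetical config flag: write out queued events when stopping.
    private final boolean drainOnStop;

    public DrainOnStopSketch(boolean drainOnStop) {
        this.drainOnStop = drainOnStop;
    }

    // Accept an event; in this sketch nothing is written until stop().
    public void handle(String event) {
        eventQueue.add(event);
    }

    // On stop, optionally write every event still in the queue, in order.
    public List<String> stop() {
        if (drainOnStop) {
            String event;
            while ((event = eventQueue.poll()) != null) {
                recoveryLog.add(event);
            }
        }
        return recoveryLog;
    }

    public static void main(String[] args) {
        DrainOnStopSketch writer = new DrainOnStopSketch(true);
        writer.handle("TASK_ATTEMPT_FINISHED task_0");
        writer.handle("TASK_ATTEMPT_STARTED task_1");
        System.out.println(writer.stop().size()); // prints 2
    }
}
```

With the flag off, stop() returns an empty log in this sketch, which mirrors the case where queued events are lost when the AM is killed.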

> Add system tests for AM recovery
> --------------------------------
>
>                 Key: TEZ-1559
>                 URL: https://issues.apache.org/jira/browse/TEZ-1559
>             Project: Apache Tez
>          Issue Type: Sub-task
>            Reporter: Jeff Zhang
>            Assignee: Jeff Zhang
>         Attachments: Tez-1559.patch
>
>
> * [Fine-grained recovery task-level] In a vertex, task 0 is done and task 1 is running. A history flush happens. The AM dies. Once the AM is recovered, task 0 is not re-run; task 1 is re-run.
> * [Data movement types] Test AM recovery with all data movement types, including 1-1, broadcast, and scatter-gather with/without shuffle. The AM should die in 2 scenarios: the first-vertex tasks finish completely, and the first-vertex tasks finish only partially.
> * [Kill AM many times] Set the AM max attempts to a high number. Kill many attempts. The last AM can still be recovered with the latest AM history data.
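
The expected outcome of the fine-grained scenario can be sketched as a simple rule: a task is re-run after recovery unless the recovery log records it as finished. The class and method names below are hypothetical and task IDs are plain strings; this only models the expectation a test would check, not how Tez implements recovery.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model of the expected recovery outcome; not Tez code.
public class TaskRecoverySketch {
    // Return the tasks that must be re-run: those not recorded as finished.
    static List<String> tasksToRerun(List<String> allTasks, Set<String> finishedInLog) {
        List<String> rerun = new ArrayList<>();
        for (String task : allTasks) {
            if (!finishedInLog.contains(task)) {
                rerun.add(task);
            }
        }
        return rerun;
    }

    public static void main(String[] args) {
        // task_0 finished before the AM died; task_1 was still running.
        List<String> all = Arrays.asList("task_0", "task_1");
        Set<String> finished = new HashSet<>(Arrays.asList("task_0"));
        System.out.println(tasksToRerun(all, finished)); // prints [task_1]
    }
}
```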



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)