Posted to issues@tez.apache.org by "Jeff Zhang (JIRA)" <ji...@apache.org> on 2015/05/19 06:52:00 UTC

[jira] [Comment Edited] (TEZ-2456) Refactor recovery event logging to ensure it meet the recovery event spec

    [ https://issues.apache.org/jira/browse/TEZ-2456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14549781#comment-14549781 ] 

Jeff Zhang edited comment on TEZ-2456 at 5/19/15 4:51 AM:
----------------------------------------------------------

[~hitesh] Thanks for the review. 

bq. TaskFinishedEvent must be logged before VertexFinishedEvent   ( Retrospective failures? )
Rephrased: a VertexFinishedEvent (SUCCEEDED) must be preceded by at least n TaskFinishedEvent (SUCCEEDED) entries.

bq. TaskAttemptFinishedEvent should be logged before TaskFinishedEvent ( Retrospective failures? )
Rephrased: a TaskFinishedEvent (SUCCEEDED) must be preceded by at least one TaskAttemptFinishedEvent (SUCCEEDED).
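
As a rough illustration only (this is not Tez code), the two rephrased rules could be checked against a recorded log along these lines. The class RecoveryOrderSketch, its Entry type, and the numTasks parameter are hypothetical stand-ins, the log is assumed to cover a single vertex, and "n" is assumed here to mean the number of distinct tasks:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: a simplified model of a recovery log, not the Tez history event API.
final class RecoveryOrderSketch {
    enum Kind { TASK_ATTEMPT_FINISHED, TASK_FINISHED, VERTEX_FINISHED }

    // Simplified log entry; taskId is ignored for VERTEX_FINISHED entries.
    static final class Entry {
        final Kind kind;
        final int taskId;
        final boolean succeeded;

        Entry(Kind kind, int taskId, boolean succeeded) {
            this.kind = kind;
            this.taskId = taskId;
            this.succeeded = succeeded;
        }
    }

    // Walks the log of one vertex in order and checks the two rephrased rules.
    static boolean isValid(List<Entry> log, int numTasks) {
        Set<Integer> tasksWithSucceededAttempt = new HashSet<>();
        Set<Integer> succeededTasks = new HashSet<>();
        for (Entry e : log) {
            if (!e.succeeded) {
                continue; // the rules only constrain SUCCEEDED events
            }
            switch (e.kind) {
                case TASK_ATTEMPT_FINISHED:
                    tasksWithSucceededAttempt.add(e.taskId);
                    break;
                case TASK_FINISHED:
                    // Needs at least one earlier succeeded attempt of the same task.
                    if (!tasksWithSucceededAttempt.contains(e.taskId)) {
                        return false;
                    }
                    succeededTasks.add(e.taskId);
                    break;
                case VERTEX_FINISHED:
                    // Needs at least numTasks earlier succeeded tasks.
                    if (succeededTasks.size() < numTasks) {
                        return false;
                    }
                    break;
            }
        }
        return true;
    }
}
```

Because only SUCCEEDED entries are constrained, a later FAILED entry for the same task (a retrospective failure) does not invalidate the log in this sketch.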

bq. Also, nothing in the list related to speculated attempts. How should those be handled?
After TEZ-2249, all task attempts should be finished before the task is finished.

bq. There are multiple places where this is called out. Is there an issue if it gets logged twice? Will something break? Should there be checks to ensure it is logged only once or can the recovery handle it if the event is logged twice? What kind of problems do you see if it happens twice? 
Logging the same event multiple times causes two issues:
* The metrics will be incorrect, especially start_time and finished_time.
* If the AM is killed again, the next recovery will handle the same recovery event multiple times, which may cause problems. I think the restoreFromEvent method assumes every event is logged once.
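
One possible shape for a "log only once" check, sketched under assumptions rather than taken from the Tez code base: remember which (entity, event type) pairs have already been written and skip repeats. The class LogOnceSketch and its string keys are hypothetical, and such a guard would only apply to events that the spec says must be logged once; the finished events that may legitimately be logged multiple times would bypass it.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of a duplicate guard; not how Tez recovery logging is implemented.
final class LogOnceSketch {
    private final Set<String> seen = new HashSet<>();
    private final List<String> written = new ArrayList<>();

    // Writes the event only if this (entityId, eventType) pair was not written before.
    // Returns true if the event was written, false if it was skipped as a duplicate.
    boolean logOnce(String entityId, String eventType) {
        String key = entityId + "/" + eventType;
        if (!seen.add(key)) {
            return false; // already logged: skip so that recovery replays it only once
        }
        written.add(key); // stand-in for appending to the real recovery log
        return true;
    }

    // Events that were actually written, in order.
    List<String> written() {
        return written;
    }
}
```

With a guard like this, a second attempt to log the same started event for the same task is dropped, so a restoreFromEvent-style replay would see it exactly once.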




was (Author: zjffdu):
[~hitesh] Thanks for the review. 

bq. TaskFinishedEvent must be logged before VertexFinishedEvent   ( Retrospective failures? )
Rephrased: for a VertexFinishedEvent (SUCCEEDED), there must be at least n TaskFinishedEvent (SUCCEEDED) entries.

bq. TaskAttemptFinishedEvent should be logged before TaskFinishedEvent ( Retrospective failures? )
Rephrased: for a TaskFinishedEvent (SUCCEEDED), there must be at least one TaskAttemptFinishedEvent (SUCCEEDED).

bq. Also, nothing in the list related to speculated attempts. How should those be handled?
After TEZ-2249, all task attempts should be finished before the task is finished.

bq. There are multiple places where this is called out. Is there an issue if it gets logged twice? Will something break? Should there be checks to ensure it is logged only once or can the recovery handle it if the event is logged twice? What kind of problems do you see if it happens twice? 
Logging the same event multiple times causes two issues:
* The metrics will be incorrect, especially start_time and finished_time.
* If the AM is killed again, the next recovery will handle the same recovery event multiple times, which may cause problems. I think the restoreFromEvent method assumes every event is logged once.



> Refactor recovery event logging to ensure it meet the recovery event spec
> -------------------------------------------------------------------------
>
>                 Key: TEZ-2456
>                 URL: https://issues.apache.org/jira/browse/TEZ-2456
>             Project: Apache Tez
>          Issue Type: Sub-task
>            Reporter: Jeff Zhang
>            Assignee: Jeff Zhang
>
> Currently we don't have a spec for recovery event logging, so recovery is fragile to code changes. This jira tries to define the spec and refactor the recovery event logging to ensure it meets the spec. [~hitesh] Please help review the following spec I drafted.
> *DAG*
> * DAGSubmitted/DAGInitializedEvent/DAGStartedEvent must be logged once. They should not be logged again on recovery.
> * DAGFinishedEvent may be logged multiple times. (DAG moves from SUCCEEDED to ERROR? Should we ignore this?)
> * VertexFinishedEvent should be logged before DAGFinishedEvent
> *Vertex* 
> * RootInputDataInformation must be logged before VertexInitializedEvent
> * DataMovement must be logged before TaskFinishedEvent
> * TaskFinishedEvent must be logged before VertexFinishedEvent
> * VertexInitializedEvent / VertexStartedEvent should only be logged once. They should not be logged again on recovery.
> * VertexFinishedEvent may be logged multiple times. (e.g. Vertex moves from SUCCEEDED to FAILED)
> * VertexParallelismUpdatedEvent must be logged before TaskStartedEvent
> * TaskFinishedEvent should be logged before VertexFinishedEvent
> *Task*
> * If there’s no TaskStartedEvent, TaskFinishedEvent may still be logged (e.g. Task is killed in NEW). The current behavior is that TaskFinishedEvent won’t be logged if there’s no TaskStartedEvent.
> * TaskStartedEvent should only be logged once. It should not be logged again on recovery.
> * TaskFinishedEvent may be logged multiple times. (e.g. Task moves from SUCCEEDED to FAILED)
> * TaskAttemptFinishedEvent should be logged before TaskFinishedEvent
> 	
> *TaskAttempt*
> * If there’s no TaskAttemptStartedEvent, TaskAttemptFinishedEvent may still be logged (e.g. TaskAttempt is killed in NEW). The current behavior is that TaskAttemptFinishedEvent won’t be logged if there’s no TaskAttemptStartedEvent.
> * TaskAttemptStartedEvent should only be logged once. It should not be logged again on recovery.
> * TaskAttemptFinishedEvent may be logged multiple times. (e.g. TaskAttempt moves from SUCCEEDED to FAILED)
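
The "must be logged before" items in the drafted spec above can be read as precedence pairs. A minimal sketch of checking such pairs follows; it assumes a log flattened to event type names for a single vertex, and it assumes a pair is only violated when both events are present and the later one comes first. The class PrecedenceSketch and its method are hypothetical, not part of Tez.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: checks "X must be logged before Y" pairs on a flattened log.
final class PrecedenceSketch {
    // Each rule is {before, after}. A rule is violated when both events occur
    // and the first occurrence of `after` precedes the first occurrence of `before`.
    static List<String> violations(List<String> log, String[][] rules) {
        List<String> out = new ArrayList<>();
        for (String[] rule : rules) {
            int beforeIdx = log.indexOf(rule[0]);
            int afterIdx = log.indexOf(rule[1]);
            if (beforeIdx >= 0 && afterIdx >= 0 && afterIdx < beforeIdx) {
                out.add(rule[0] + " must precede " + rule[1]);
            }
        }
        return out;
    }
}
```

For example, the rules {"RootInputDataInformation", "VertexInitializedEvent"} and {"VertexParallelismUpdatedEvent", "TaskStartedEvent"} could be passed as the rules array; an empty result means no pair was found out of order.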



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)