Posted to issues@tez.apache.org by "Hitesh Shah (JIRA)" <ji...@apache.org> on 2015/05/18 22:51:01 UTC

[jira] [Commented] (TEZ-2456) Refactor recovery event logging to ensure it meet the recovery event spec

    [ https://issues.apache.org/jira/browse/TEZ-2456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14549198#comment-14549198 ] 

Hitesh Shah commented on TEZ-2456:
----------------------------------

bq. DAGFinishedEvent may be logged multiple times. ( DAG move from SUCCEEDED from ERROR ? Should we ignore this ? )

No - as this affects final state. 

bq. VertexFinishedEvent should be logged before DAGFinishedEvent

A vertex should never change state after its DAG has finished. We should make sure the state machine guarantees that this scenario cannot occur. \cc [~bikassaha]. 

bq. RootInputDataInformation must be logged before VertexInitializedEvent

This depends on what criteria we choose to detect whether the root input initializer has completed running. If that flag is the vertex init event, that is fine. 

bq. DataMovement must be logged before TaskFinishedEvent

Rephrase this as: all events generated by all task attempts of a given task should be logged before the task finished event. This becomes tricky for retrospective failures. We need to look at what happens when we have events followed by a task finished event, then a task re-run, more events, and a final finished event. A crash could occur at any stage in that sequence.
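
To make the crash question concrete, here is a rough sketch (not Tez code) that walks every possible crash point of the sequence "events, task finished, re-run, more events, final finished". The class, the enum, and the rule that a logged finish is only trusted when it is the last surviving entry are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch, not Tez code: what a crash at each point of one task's log leaves behind. */
final class TaskLogPrefixSketch {

    /** Simplified event kinds for a single task (assumed names, not Tez classes). */
    enum Kind { ATTEMPT_EVENT, TASK_FINISHED }

    /**
     * Assumed rule: a logged finish is only taken at face value if nothing
     * from a later re-run was logged after it, i.e. it is the last entry.
     */
    static boolean endsWithFinished(List<Kind> prefix) {
        return !prefix.isEmpty() && prefix.get(prefix.size() - 1) == Kind.TASK_FINISHED;
    }

    public static void main(String[] args) {
        // events, task finished, re-run with more events, final finished
        List<Kind> log = Arrays.asList(
            Kind.ATTEMPT_EVENT, Kind.TASK_FINISHED,
            Kind.ATTEMPT_EVENT, Kind.ATTEMPT_EVENT, Kind.TASK_FINISHED);
        // A crash can leave any prefix of the log on disk.
        for (int crashAt = 0; crashAt <= log.size(); crashAt++) {
            System.out.println(crashAt + " entries survive, finished? "
                + endsWithFinished(log.subList(0, crashAt)));
        }
    }
}
```

Under this assumed rule, only the prefixes that end right after a finished event look "finished"; a crash in the middle of the re-run leaves an earlier finished event followed by newer attempt events, which is exactly the case the spec needs to address.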

bq. TaskFinishedEvent must be logged before VertexFinishedEvent

Task re-runs? Retrospective failures? 

bq. VertexParallelismUpdatedEvent must be logged before TaskStartedEvent

This depends on the state machine. If the framework supports it, then recovery also should. 

bq. TaskAttemptFinishedEvent should be logged before TaskFinishedEvent

Retrospective failures? 
Also, nothing in the list relates to speculative attempts. How should those be handled?

bq. should only be logged once. Should not log again when it’s recovered.

There are multiple places where this is called out. Is there an issue if it gets logged twice? Will something break? Should there be checks to ensure it is logged only once or can the recovery handle it if the event is logged twice? What kind of problems do you see if it happens twice? 
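
One way to frame the "can recovery handle it?" option is a replay that is idempotent for started events and lets the last finished event decide the final state. The sketch below is only an illustration of that idea; `LoggedEvent`, `Recovered`, and `replay` are hypothetical names, not Tez classes, and the "last finished event wins" policy is an assumption.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch, not Tez's recovery code. */
final class RecoveryReplaySketch {

    /** Assumed shape of a logged event: entity id, kind, and (for FINISHED) a state. */
    static final class LoggedEvent {
        final String entityId;
        final String kind;   // "STARTED" or "FINISHED"
        final String state;  // e.g. "SUCCEEDED" or "FAILED"; null for STARTED
        LoggedEvent(String entityId, String kind, String state) {
            this.entityId = entityId;
            this.kind = kind;
            this.state = state;
        }
    }

    /** What replay reconstructs for one entity. */
    static final class Recovered {
        boolean started;
        int duplicateStarts;
        String finalState;
    }

    /** Replays the log in order; duplicates are tolerated and counted, not fatal. */
    static Map<String, Recovered> replay(List<LoggedEvent> log) {
        Map<String, Recovered> out = new LinkedHashMap<>();
        for (LoggedEvent e : log) {
            Recovered r = out.computeIfAbsent(e.entityId, k -> new Recovered());
            if ("STARTED".equals(e.kind)) {
                if (r.started) {
                    r.duplicateStarts++; // second STARTED changes nothing
                }
                r.started = true;
            } else if ("FINISHED".equals(e.kind)) {
                r.finalState = e.state;  // assumed policy: last FINISHED wins
            }
        }
        return out;
    }
}
```

If replay looks like this, a twice-logged started event is harmless, while a repeated finished event still matters because it changes the final state. Whether the real recovery code can be made this tolerant is the open question.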






> Refactor recovery event logging to ensure it meet the recovery event spec
> -------------------------------------------------------------------------
>
>                 Key: TEZ-2456
>                 URL: https://issues.apache.org/jira/browse/TEZ-2456
>             Project: Apache Tez
>          Issue Type: Sub-task
>            Reporter: Jeff Zhang
>            Assignee: Jeff Zhang
>
> Currently we don't have a spec for recovery event logging, so recovery is fragile to code changes. This jira tries to define the spec and refactor the recovery event logging to ensure it meets the spec. [~hitesh] Please help review the following spec I drafted.
> *DAG*
> * DAGSubmitted/DAGInitializedEvent/DAGStartedEvent must be logged only once. They should not be logged again when the DAG is recovered.
> * DAGFinishedEvent may be logged multiple times.  ( DAG move from SUCCEEDED from ERROR ? Should we ignore this ? )
> * VertexFinishedEvent should be logged before DAGFinishedEvent
> *Vertex* 
> * RootInputDataInformation must be logged before VertexInitializedEvent
> * DataMovement must be logged before TaskFinishedEvent
> * TaskFinishedEvent must be logged before VertexFinishedEvent
> * VertexInitializedEvent / VertexStartedEvent should only be logged once, should not log again when it’s recovered.
> * VertexFinishedEvent may be logged multiple times. (e.g. Vertex move from SUCCEEDED to FAILED)
> * VertexParallelismUpdatedEvent must be logged before TaskStartedEvent
> * TaskFinishedEvent should be logged before VertexFinishedEvent
> *Task*
> * If there’s no TaskStartedEvent, TaskFinishedEvent may still be logged (e.g. Task is killed in NEW). The current behavior is that TaskFinishedEvent won’t be logged if there’s no TaskStartedEvent.
> * TaskStartedEvent should only be logged once.  Should not log again when it’s recovered.
> * TaskFinishedEvent may be logged multiple times (e.g. Task move from SUCCEEDED to FAILED)
> * TaskAttemptFinishedEvent should be logged before TaskFinishedEvent
> 	
> *TaskAttempt*
> * If there’s no TaskAttemptStartedEvent, TaskAttemptFinishedEvent may still be logged (e.g. TaskAttempt is killed in NEW). The current behavior is that TaskAttemptFinishedEvent won’t be logged if there’s no TaskAttemptStartedEvent.
> * TaskAttemptStartedEvent should only be logged once.  Should not log again when it’s recovered.
> * TaskAttemptFinishedEvent may be logged multiple times. (e.g. TaskAttempt move from SUCCEEDED to FAILED)
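
The "must be logged before" rules in the draft spec could be checked mechanically against a recorded log. The sketch below is a hypothetical checker, not part of Tez: it treats the log as a flat list of event type names for a single task in a single vertex, and it assumes the strictest reading of each rule, namely that no "before" event may appear after the first "after" event.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical ordering checker for the draft spec; not Tez code. */
final class OrderingSpecSketch {

    /** One rule of the form "before must be logged before after". */
    static final class Rule {
        final String before;
        final String after;
        Rule(String before, String after) {
            this.before = before;
            this.after = after;
        }
    }

    /** A subset of the rules from the draft spec, using its event names. */
    static final List<Rule> RULES = Arrays.asList(
        new Rule("RootInputDataInformation", "VertexInitializedEvent"),
        new Rule("TaskAttemptFinishedEvent", "TaskFinishedEvent"),
        new Rule("TaskFinishedEvent", "VertexFinishedEvent"),
        new Rule("VertexFinishedEvent", "DAGFinishedEvent"));

    /** Reports every "before" event found after the first "after" event (strict reading). */
    static List<String> violations(List<String> log) {
        List<String> out = new ArrayList<>();
        for (Rule r : RULES) {
            int firstAfter = log.indexOf(r.after);
            if (firstAfter < 0) {
                continue; // rule not triggered by this log
            }
            for (int i = firstAfter + 1; i < log.size(); i++) {
                if (log.get(i).equals(r.before)) {
                    out.add(r.before + " at index " + i + " follows "
                        + r.after + " at index " + firstAfter);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // A task re-run: attempt finished, task finished, then both again.
        List<String> rerun = Arrays.asList(
            "TaskAttemptFinishedEvent", "TaskFinishedEvent",
            "TaskAttemptFinishedEvent", "TaskFinishedEvent");
        violations(rerun).forEach(System.out::println);
    }
}
```

Under this strict reading a task re-run is reported as a violation, so the spec would need to say explicitly how re-runs and retrospective failures relax the ordering.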



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)