Posted to issues@tez.apache.org by "Jeff Zhang (JIRA)" <ji...@apache.org> on 2015/05/25 08:29:17 UTC
[jira] [Comment Edited] (TEZ-2304) InvalidStateTransitonException TA_SCHEDULE at START_WAIT during recovery

    [ https://issues.apache.org/jira/browse/TEZ-2304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14557994#comment-14557994 ] 

Jeff Zhang edited comment on TEZ-2304 at 5/25/15 6:28 AM:
----------------------------------------------------------

In this log, there are recovery events only for attempt_1428329756093_168563_1_00_006728_1 (attempt_1) and none for attempt_1428329756093_168563_1_00_006728_0 (attempt_0).
It is possible that attempt_0 was killed before it started, so there are no recovery events for it. However, we should log the TaskAttemptFinishedEvent even when there is no TaskAttemptStartedEvent (link this with TEZ-2456).
Otherwise, as in this case, attempt_0 is not recovered while attempt_1 is, and when a new attempt is scheduled its task attempt id is the same as that of attempt_1, because we create the task attempt id from attempts.size():
{code}
TaskAttempt attempt = createAttempt(attempts.size());
{code}

That's why we see the following weird transitions (from NEW to KILLED, and then from NEW to START_WAIT). These are actually 2 different task attempts with the same attempt id, so their state machines get mixed up.
{noformat}
2015-04-09 20:05:42,055 INFO [AsyncDispatcher event handler] impl.TaskAttemptImpl: attempt_1428329756093_168563_1_00_006728_1 TaskAttempt Transitioned from NEW to KILLED due to event TA_RECOVER
{noformat}
{noformat}
2015-04-09 20:05:45,748 INFO [AsyncDispatcher event handler] impl.TaskAttemptImpl: attempt_1428329756093_168563_1_00_006728_1 TaskAttempt Transitioned from NEW to START_WAIT due to event TA_SCHEDULE
{noformat}
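
The id collision described above can be reproduced with a minimal sketch. It is not the Tez code: the class AttemptIdCollisionSketch, the helpers nextAttemptId and collides, and the map from attempt number to state are all hypothetical simplifications. The only assumption carried over is that the next attempt id is taken from attempts.size().

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a task's attempt bookkeeping; not the Tez implementation.
public class AttemptIdCollisionSketch {

    /** Mirrors createAttempt(attempts.size()): the next id is the current attempt count. */
    static int nextAttemptId(Map<Integer, String> attempts) {
        return attempts.size();
    }

    /** Returns true if the id chosen for a newly scheduled attempt is already in use. */
    static boolean collides(Map<Integer, String> attempts) {
        return attempts.containsKey(nextAttemptId(attempts));
    }

    public static void main(String[] args) {
        // Recovery log with no event for attempt_0: only attempt_1 is restored.
        Map<Integer, String> partial = new LinkedHashMap<>();
        partial.put(1, "KILLED");
        System.out.println("partial recovery:  next id = " + nextAttemptId(partial)
                + ", collides = " + collides(partial));

        // Recovery log that also records attempt_0, for example through a finished event.
        Map<Integer, String> complete = new LinkedHashMap<>();
        complete.put(0, "KILLED");
        complete.put(1, "KILLED");
        System.out.println("complete recovery: next id = " + nextAttemptId(complete)
                + ", collides = " + collides(complete));
    }
}
```

With only attempt_1 recovered, the size-based id is 1 and clashes with the recovered attempt. Once attempt_0 is also recovered, the next id is 2 and no clash occurs.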


was (Author: zjffdu):
In this log, there are recovery events only for attempt_1428329756093_168563_1_00_006728_1 (attempt_1) and none for attempt_1428329756093_168563_1_00_006728_0 (attempt_0).
It is possible that attempt_0 was killed before it started, so there are no recovery events for it. We should log the TaskAttemptFinishedEvent even when there is no TaskAttemptStartedEvent (link this with TEZ-2456).
In this case, attempt_0 is not recovered while attempt_1 is, and when a new attempt is scheduled its task attempt id is the same as that of attempt_1, because we create the task attempt id from attempts.size():
{code}
TaskAttempt attempt = createAttempt(attempts.size());
{code}

That's why we see the following weird transitions (from NEW to KILLED, and then from NEW to START_WAIT). These are actually 2 different task attempts with the same attempt id, so their state machines get mixed up.
{noformat}
2015-04-09 20:05:42,055 INFO [AsyncDispatcher event handler] impl.TaskAttemptImpl: attempt_1428329756093_168563_1_00_006728_1 TaskAttempt Transitioned from NEW to KILLED due to event TA_RECOVER
{noformat}
{noformat}
2015-04-09 20:05:45,748 INFO [AsyncDispatcher event handler] impl.TaskAttemptImpl: attempt_1428329756093_168563_1_00_006728_1 TaskAttempt Transitioned from NEW to START_WAIT due to event TA_SCHEDULE
{noformat}

> InvalidStateTransitonException TA_SCHEDULE at START_WAIT during recovery
> ------------------------------------------------------------------------
>
>                 Key: TEZ-2304
>                 URL: https://issues.apache.org/jira/browse/TEZ-2304
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.6.0
>            Reporter: Jason Lowe
>         Attachments: 168563_recovery.gz
>
>
> I saw a Tez AM throw a few InvalidStateTransitonException (sic) instances during recovery complaining about TA_SCHEDULE arriving at the START_WAIT state.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)