Posted to issues@tez.apache.org by "Jeff Zhang (JIRA)" <ji...@apache.org> on 2015/05/25 08:29:17 UTC
[jira] [Comment Edited] (TEZ-2304) InvalidStateTransitonException TA_SCHEDULE at START_WAIT during recovery
[ https://issues.apache.org/jira/browse/TEZ-2304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14557994#comment-14557994 ]
Jeff Zhang edited comment on TEZ-2304 at 5/25/15 6:28 AM:
----------------------------------------------------------
In this log, there are only recovery events for attempt_1428329756093_168563_1_00_006728_1 (attempt_1) and none for attempt_1428329756093_168563_1_00_006728_0 (attempt_0).
It is possible that attempt_0 was killed before it started, so there are no recovery events for it. But we should log the TaskAttemptFinishedEvent even when there is no TaskAttemptStartedEvent (link this with TEZ-2456).
Otherwise, as in this case, attempt_0 is not recovered while attempt_1 is, and when a new attempt is scheduled its task attempt id is the same as attempt_1's, because we create the task attempt id from attempts.size():
{code}
TaskAttempt attempt = createAttempt(attempts.size());
{code}
That's why we see the following weird transitions (from NEW to KILLED, and then from NEW to START_WAIT). These are actually 2 different task attempts with the same attempt id, so their state machines get mixed up.
{noformat}
2015-04-09 20:05:42,055 INFO [AsyncDispatcher event handler] impl.TaskAttemptImpl: attempt_1428329756093_168563_1_00_006728_1 TaskAttempt Transitioned from NEW to KILLED due to event TA_RECOVER
{noformat}
{noformat}
2015-04-09 20:05:45,748 INFO [AsyncDispatcher event handler] impl.TaskAttemptImpl: attempt_1428329756093_168563_1_00_006728_1 TaskAttempt Transitioned from NEW to START_WAIT due to event TA_SCHEDULE
{noformat}
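The id collision described above can be reproduced with a minimal sketch. This is not Tez code: the class AttemptIdSketch and its methods are hypothetical stand-ins, and the only thing taken from the source is the idea that the next attempt id comes from attempts.size().

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a task's attempt bookkeeping; NOT Tez's TaskImpl.
class AttemptIdSketch {
    // attempt number -> state, loosely mimicking the task's "attempts" collection
    private final Map<Integer, String> attempts = new LinkedHashMap<>();

    // Recovery re-creates only the attempts that have events in the recovery log.
    void recoverAttempt(int attemptNumber, String state) {
        attempts.put(attemptNumber, state);
    }

    // Mirrors createAttempt(attempts.size()): the next id is the current size.
    int nextIdFromSize() {
        return attempts.size();
    }

    // True if an attempt with this number already exists.
    boolean isTaken(int attemptNumber) {
        return attempts.containsKey(attemptNumber);
    }

    public static void main(String[] args) {
        // Case 1: attempt_0 left no recovery events, so only attempt_1 is recovered.
        AttemptIdSketch missing = new AttemptIdSketch();
        missing.recoverAttempt(1, "KILLED");
        int id = missing.nextIdFromSize();
        System.out.println("only attempt_1 recovered: next id = " + id
                + ", already taken = " + missing.isTaken(id));

        // Case 2: attempt_0 was also logged (e.g. via a finished event), so both are recovered.
        AttemptIdSketch complete = new AttemptIdSketch();
        complete.recoverAttempt(0, "KILLED");
        complete.recoverAttempt(1, "KILLED");
        id = complete.nextIdFromSize();
        System.out.println("both attempts recovered: next id = " + id
                + ", already taken = " + complete.isTaken(id));
    }
}
```

In the first case the size-based id lands on the recovered attempt_1, which matches the two log lines above; in the second case the new attempt gets a fresh id, which is why logging the TaskAttemptFinishedEvent for attempt_0 should avoid the clash.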
was (Author: zjffdu):
In this log, there are only recovery events for attempt_1428329756093_168563_1_00_006728_1 (attempt_1) and none for attempt_1428329756093_168563_1_00_006728_0 (attempt_0).
It is possible that attempt_0 was killed before it started, so there are no recovery events for it. We should log the TaskAttemptFinishedEvent even when there is no TaskAttemptStartedEvent (link this with TEZ-2456).
In this case, attempt_0 is not recovered while attempt_1 is, and when a new attempt is scheduled its task attempt id is the same as attempt_1's, because we create the task attempt id from attempts.size():
{code}
TaskAttempt attempt = createAttempt(attempts.size());
{code}
That's why we see the following weird transitions (from NEW to KILLED, and then from NEW to START_WAIT). These are actually 2 different task attempts with the same attempt id, so their state machines get mixed up.
{noformat}
2015-04-09 20:05:42,055 INFO [AsyncDispatcher event handler] impl.TaskAttemptImpl: attempt_1428329756093_168563_1_00_006728_1 TaskAttempt Transitioned from NEW to KILLED due to event TA_RECOVER
{noformat}
{noformat}
2015-04-09 20:05:45,748 INFO [AsyncDispatcher event handler] impl.TaskAttemptImpl: attempt_1428329756093_168563_1_00_006728_1 TaskAttempt Transitioned from NEW to START_WAIT due to event TA_SCHEDULE
{noformat}
> InvalidStateTransitonException TA_SCHEDULE at START_WAIT during recovery
> ------------------------------------------------------------------------
>
> Key: TEZ-2304
> URL: https://issues.apache.org/jira/browse/TEZ-2304
> Project: Apache Tez
> Issue Type: Bug
> Affects Versions: 0.6.0
> Reporter: Jason Lowe
> Attachments: 168563_recovery.gz
>
>
> I saw a Tez AM throw a few InvalidStateTransitonException (sic) instances during recovery complaining about TA_SCHEDULE arriving at the START_WAIT state.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)