Posted to common-dev@hadoop.apache.org by "Owen O'Malley (JIRA)" <ji...@apache.org> on 2008/09/05 00:25:44 UTC

[jira] Commented: (HADOOP-4068) JobTracker might wrongly log a tip as failed

    [ https://issues.apache.org/jira/browse/HADOOP-4068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628488#action_12628488 ] 

Owen O'Malley commented on HADOOP-4068:
---------------------------------------

There used to be code that prevented this. TIPs should not fail unless all of the instances have failed. At some point, we really should redesign the state tracking code in the JobTracker.
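
As a rough illustration of that invariant, here is a minimal, self-contained Java sketch that replays the scenario quoted below. It is a toy model, not the JobTracker code: the class TipStateSketch, its Tip type, the method names, and the maxAttempts limit of 4 are all assumptions made for this example. The point is only the guard in attemptFailed(): a TIP that is already complete is never logged as failed, and a repeated failure report for the same attempt changes nothing.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical toy model; none of these names are Hadoop's real classes.
public class TipStateSketch {
    enum AttemptState { RUNNING, SUCCEEDED, FAILED }

    static class Tip {
        final String id;
        final int maxAttempts; // assumed per-TIP attempt limit
        final Map<String, AttemptState> attempts = new LinkedHashMap<>();
        final List<String> history = new ArrayList<>();
        boolean complete = false;
        boolean failed = false;

        Tip(String id, int maxAttempts) {
            this.id = id;
            this.maxAttempts = maxAttempts;
            history.add(id + " start");
        }

        void startAttempt(String attemptId) {
            attempts.put(attemptId, AttemptState.RUNNING);
            history.add(attemptId + " start");
        }

        void attemptSucceeded(String attemptId) {
            attempts.put(attemptId, AttemptState.SUCCEEDED);
            history.add(attemptId + " finished");
            if (!complete) {
                complete = true;
                history.add(id + " finished");
            }
        }

        void attemptFailed(String attemptId) {
            // Only a running attempt can transition to FAILED, so a second
            // report (for example from a lost-tracker sweep) is a no-op.
            if (attempts.get(attemptId) == AttemptState.RUNNING) {
                attempts.put(attemptId, AttemptState.FAILED);
                history.add(attemptId + " failed");
            }
            // Guard: a completed (or already failed) TIP never changes state here.
            if (complete || failed) {
                return;
            }
            long failedCount = attempts.values().stream()
                    .filter(s -> s == AttemptState.FAILED)
                    .count();
            // The TIP fails only once every allowed attempt has failed.
            if (failedCount >= maxAttempts) {
                failed = true;
                history.add(id + " failed");
            }
        }
    }

    // Replays steps 1-7 of the quoted scenario and returns the history log.
    static List<String> replayScenario() {
        Tip tip = new Tip("tip_1", 4);
        tip.startAttempt("attempt_1_0");
        tip.attemptFailed("attempt_1_0");    // step 1: fails on tracker_1
        tip.startAttempt("attempt_1_1");
        tip.attemptSucceeded("attempt_1_1"); // step 5: succeeds on tracker_2
        tip.attemptFailed("attempt_1_0");    // steps 6-7: tracker_1 lost, attempt failed again
        return tip.history;
    }

    public static void main(String[] args) {
        replayScenario().forEach(System.out::println);
    }
}
```

With this guard the replayed history ends at "tip_1 finished" and no "tip_1 failed" line is written. Whether the real fix belongs in the attempt bookkeeping, in the lost-tracker path, or in the history logging is a separate question that this sketch does not settle.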

> JobTracker might wrongly log a tip as failed
> --------------------------------------------
>
>                 Key: HADOOP-4068
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4068
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Amar Kamat
>
> Consider the following case:
> 1) attempt _attempt_1_0_ from tip _tip_1_ that ran on tracker _tracker_1_ failed
> 2) jobtracker will mark _attempt_1_0_ for removal under _tracker_1_. Marking basically means removal of the mapping _tracker_1_->_attempt_1_0_
> 3) Marked attempts are removed only on the next heartbeat from _tracker_1_ or when _tracker_1_ is lost.
> 4) Consider a case where _tracker_1_ goes down.
> 5) In the meantime, attempt _attempt_1_1_ succeeds on _tracker_2_ and the jobtracker marks the tip _tip_1_ as complete
> 6) Now the expiry-tracker thread detects that _tracker_1_ is lost and fails all the attempts under _tracker_1_.
> 7) Here the jobtracker will kill _attempt_1_0_ *again* and log tip _tip_1_ as failed in the history although tip _tip_1_ is really complete/succeeded.
> The events in the history file would be something like
> {noformat}
> tip_1 start
> ---------
> attempt_1_0 start
> attempt_1_0 failed
> ---------
> attempt_1_1 start
> attempt_1_1 finished
> tip_1 finished
> ---------
> tip_1 failed
> {noformat}
> Note that this is true even for tasks that expire. Tasks that are scheduled and never come back are killed by the {{ExpireLaunchingTasks}} thread. It also calls {{JobInProgress.failedTask()}}, which fails the attempt and logs the TIP as failed.
