Posted to common-dev@hadoop.apache.org by "Amar Kamat (JIRA)" <ji...@apache.org> on 2008/09/04 11:43:44 UTC

[jira] Created: (HADOOP-4068) JobTracker might wrongly log a tip as failed

JobTracker might wrongly log a tip as failed
--------------------------------------------

Key: HADOOP-4068
URL: https://issues.apache.org/jira/browse/HADOOP-4068
Project: Hadoop Core
Issue Type: Bug
Components: mapred
Reporter: Amar Kamat

Consider the following case:
1) Attempt _attempt_1_0_ from tip _tip_1_, which ran on tracker _tracker_1_, fails.
2) The jobtracker marks _attempt_1_0_ for removal under _tracker_1_. Marking basically means that the mapping _tracker_1_->_attempt_1_0_ is flagged for removal.
3) Marked attempts are removed only on the next heartbeat from _tracker_1_ or when _tracker_1_ is lost.
4) Suppose _tracker_1_ now goes down.
5) In the meantime, attempt _attempt_1_1_ succeeds on _tracker_2_ and the jobtracker marks tip _tip_1_ as complete.
6) The expiry-tracker thread then detects that _tracker_1_ is lost and fails all the attempts under _tracker_1_.
7) Here the jobtracker will kill _attempt_1_0_ *again* and log tip _tip_1_ as failed in the history, although _tip_1_ has really completed/succeeded.

The events in the history file would be something like
{noformat}
tip_1 start
---------
attempt_1_0 start
attempt_1_0 failed
---------
attempt_1_1 start
attempt_1_1 finished
tip_1 finished
---------
tip_1 failed
{noformat}

Note that this is true even for tasks that expire. Tasks that are scheduled and never come back are killed by the {{ExpireLaunchingTasks}} thread. It also calls {{JobInProgress.failedTask()}}, which fails the attempt and logs the TIP as failed.
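
To make the sequence concrete, here is a minimal toy replay of the steps above. It is only a sketch: the class and method names ({{TipHistoryRepro}}, {{reportFailure}}, {{trackerLost}}, ...) are hypothetical and are not JobTracker code, and it assumes that the lost-tracker path writes the TIP-level failure record without checking the current state.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model of the reported sequence. Every name here is hypothetical;
// this is not JobTracker code.
class TipHistoryRepro {
    final List<String> history = new ArrayList<>();
    // tracker -> attempts still mapped to it (marked attempts stay here
    // until the next heartbeat or until the tracker is lost)
    final Map<String, List<String>> attemptsByTracker = new HashMap<>();
    final Set<String> failedAttempts = new HashSet<>();

    void startTip() {
        history.add("tip_1 start");
    }

    void launch(String tracker, String attempt) {
        attemptsByTracker.computeIfAbsent(tracker, k -> new ArrayList<>()).add(attempt);
        history.add(attempt + " start");
    }

    // Failure of an attempt. The attempt-level event is written only once,
    // and the tracker -> attempt mapping is deliberately left in place.
    void reportFailure(String attempt) {
        if (failedAttempts.add(attempt)) {
            history.add(attempt + " failed");
        }
    }

    void reportSuccess(String attempt) {
        history.add(attempt + " finished");
        history.add("tip_1 finished");
    }

    // Assumed behaviour of the lost-tracker path: fail every attempt still
    // mapped to the tracker and log the TIP as failed without any guard.
    void trackerLost(String tracker) {
        List<String> attempts = attemptsByTracker.remove(tracker);
        if (attempts == null) {
            return;
        }
        for (String attempt : attempts) {
            reportFailure(attempt);       // no new attempt event if already failed
            history.add("tip_1 failed");  // unguarded TIP-level record
        }
    }

    // Replays steps 1-7 of the description and returns the resulting history.
    static List<String> replay() {
        TipHistoryRepro jt = new TipHistoryRepro();
        jt.startTip();
        jt.launch("tracker_1", "attempt_1_0");
        jt.reportFailure("attempt_1_0");
        jt.launch("tracker_2", "attempt_1_1");
        jt.reportSuccess("attempt_1_1");
        jt.trackerLost("tracker_1");
        return jt.history;
    }

    public static void main(String[] args) {
        replay().forEach(System.out::println);
    }
}
```

In this model the last two entries of the returned history are "tip_1 finished" followed by "tip_1 failed", which is the inconsistency described in this report.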

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-4068) JobTracker might wrongly log a tip as failed

Posted by "Amar Kamat (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628537#action_12628537 ] 

Amar Kamat commented on HADOOP-4068:
------------------------------------

+1


[jira] Commented: (HADOOP-4068) JobTracker might wrongly log a tip as failed

Posted by "Amar Kamat (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628552#action_12628552 ] 

Amar Kamat commented on HADOOP-4068:
------------------------------------

I suggest we do something like 
{code:title=JobInProgress#failedTask()|borderStyle=solid}
public void failedTask(TaskInProgress tip, TaskAttemptID taskid, String reason,
                       TaskStatus.Phase phase, TaskStatus.State state,
                       String trackerName, JobTrackerInstrumentation metrics) {
    TaskStatus status =  ....
    TaskStatus.State oldState = tip.getTaskStatus(taskid).getRunState();
    updateTaskStatus(tip, status, metrics);
    TaskStatus.State newState = tip.getTaskStatus(taskid).getRunState();

    // Make sure that the tip fails only if the state changes for the attempt
    // that fails it
    if (oldState == newState) {
        return;
    }
    JobHistory.Task.logFailed(...);
}
{code}
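
The idea of the guard can be tried out in isolation with a small stand-alone model. This is only a sketch under stated assumptions: {{GuardedFailureSketch}}, its {{State}} enum and {{updateState()}} are hypothetical stand-ins, and the model assumes that a status update never overwrites a terminal attempt state.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-alone model of the proposed guard; not Hadoop code.
class GuardedFailureSketch {
    enum State { RUNNING, SUCCEEDED, FAILED }

    final Map<String, State> attemptStates = new HashMap<>();
    final List<String> history = new ArrayList<>();

    // Stand-in for the status update. Assumption: once an attempt is in a
    // terminal state (SUCCEEDED or FAILED), later updates are ignored.
    void updateState(String attempt, State requested) {
        State current = attemptStates.get(attempt);
        if (current == null || current == State.RUNNING) {
            attemptStates.put(attempt, requested);
        }
    }

    // Fails the attempt and logs the TIP as failed only if this call
    // actually changed the attempt's state. Returns true if it logged.
    boolean failedTask(String tip, String attempt) {
        State oldState = attemptStates.get(attempt);
        updateState(attempt, State.FAILED);
        State newState = attemptStates.get(attempt);
        if (oldState == newState) {
            return false; // nothing changed, so nothing is logged
        }
        history.add(tip + " failed");
        return true;
    }

    public static void main(String[] args) {
        GuardedFailureSketch jip = new GuardedFailureSketch();
        // attempt_1_0 has already failed through a regular status update
        jip.updateState("attempt_1_0", State.RUNNING);
        jip.updateState("attempt_1_0", State.FAILED);
        // the lost-tracker path now tries to fail it again
        boolean logged = jip.failedTask("tip_1", "attempt_1_0");
        System.out.println("logged=" + logged + " history=" + jip.history);
    }
}
```

With the guard in place, the second kill of an already failed attempt leaves the history untouched, while the first failure of a running attempt is still recorded.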


[jira] Issue Comment Edited: (HADOOP-4068) JobTracker might wrongly log a tip as failed

Posted by "Amar Kamat (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628552#action_12628552 ] 

amar_kamat edited comment on HADOOP-4068 at 9/4/08 10:21 PM:
-------------------------------------------------------------

I suggest we do something like 
{code:title=JobInProgress|borderStyle=solid}
public void failedTask(TaskInProgress tip, TaskAttemptID taskid, String reason,
                       TaskStatus.Phase phase, TaskStatus.State state,
                       String trackerName, JobTrackerInstrumentation metrics) {
    TaskStatus status =  ....
    TaskStatus.State oldState = tip.getTaskStatus(taskid).getRunState();
    updateTaskStatus(tip, status, metrics);
    TaskStatus.State newState = tip.getTaskStatus(taskid).getRunState();

    // Make sure that the tip fails only if the state changes for the attempt
    // that fails it
    if (oldState == newState) {
        return;
    }
    JobHistory.Task.logFailed(...);
}
{code}


[jira] Commented: (HADOOP-4068) JobTracker might wrongly log a tip as failed

Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628488#action_12628488 ] 

Owen O'Malley commented on HADOOP-4068:
---------------------------------------

There used to be code that prevented this. TIPs should not fail unless all of the instances have failed. At some point, we really should redesign the state tracking code in the JobTracker.
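
As a rough illustration of that rule, the check could look like the sketch below. It is a literal reading of the sentence above, not the JobTracker's actual logic; {{TipFailureRule}}, its {{State}} enum and {{tipHasFailed()}} are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the rule "a TIP fails only if all of its
// attempts have failed"; not Hadoop code.
class TipFailureRule {
    enum State { RUNNING, SUCCEEDED, FAILED }

    // Returns true only when there is at least one attempt and every
    // attempt is in the FAILED state.
    static boolean tipHasFailed(List<State> attempts) {
        if (attempts.isEmpty()) {
            return false;
        }
        for (State s : attempts) {
            if (s != State.FAILED) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // attempt_1_0 failed, attempt_1_1 succeeded: the TIP must not fail
        System.out.println(tipHasFailed(Arrays.asList(State.FAILED, State.SUCCEEDED)));
        // both attempts failed: the TIP may be considered failed
        System.out.println(tipHasFailed(Arrays.asList(State.FAILED, State.FAILED)));
    }
}
```

Under this rule the scenario from the report could never log _tip_1_ as failed, because _attempt_1_1_ succeeded.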
