Posted to common-dev@hadoop.apache.org by "Nathan Marz (JIRA)" <ji...@apache.org> on 2009/03/20 18:20:50 UTC

[jira] Created: (HADOOP-5547) One bad node can cause whole job to fail

One bad node can cause whole job to fail
----------------------------------------

                 Key: HADOOP-5547
                 URL: https://issues.apache.org/jira/browse/HADOOP-5547
             Project: Hadoop Core
          Issue Type: Bug
            Reporter: Nathan Marz


This happened on the 0.19.2 branch. One of the nodes in our cluster was having disk problems, and every task run on it was failing. Normally the node would get blacklisted and jobs would complete fine regardless. However, for one job, the "Job setup" task ran on this bad node. When the task failed, it was retried on the same bad node 3 more times until the job failed. Hadoop should be able to handle this situation better.
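The fix the report implies is that the scheduler should remember, per task, which nodes previous attempts failed on, and prefer any other node with a free slot for the retry. A minimal sketch of that idea (this is illustrative only, not Hadoop's actual JobTracker code; `RetryPlacement`, `recordFailure`, and `pickNode` are hypothetical names):

```java
import java.util.*;

// Hypothetical sketch: track, per task, the nodes where attempts have
// already failed, and consult that record before scheduling a retry so
// the same bad node is not chosen again while healthy nodes are free.
public class RetryPlacement {
    // taskId -> set of nodes where this task has already failed
    private final Map<String, Set<String>> failedOn = new HashMap<>();

    public void recordFailure(String taskId, String node) {
        failedOn.computeIfAbsent(taskId, k -> new HashSet<>()).add(node);
    }

    // Prefer a free-slot node the task has not failed on; fall back to a
    // previously-failed node only when no alternative exists.
    public String pickNode(String taskId, List<String> nodesWithFreeSlots) {
        Set<String> bad = failedOn.getOrDefault(taskId, Collections.emptySet());
        for (String node : nodesWithFreeSlots) {
            if (!bad.contains(node)) {
                return node;
            }
        }
        return nodesWithFreeSlots.isEmpty() ? null : nodesWithFreeSlots.get(0);
    }
}
```

With this policy, the scenario in the report (setup task fails on the bad node, 39 other nodes idle) would place every retry on a different node, and the bad node would be reused only if it were the last node standing.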

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-5547) One bad node can cause whole job to fail

Posted by "Amareshwari Sriramadasu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-5547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12688212#action_12688212 ] 

Amareshwari Sriramadasu commented on HADOOP-5547:
-------------------------------------------------

This should not happen unless there are no other nodes in the cluster to run the task. Did you have other nodes with free slots in your cluster?



[jira] Commented: (HADOOP-5547) One bad node can cause whole job to fail

Posted by "Nathan Marz (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-5547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12688428#action_12688428 ] 

Nathan Marz commented on HADOOP-5547:
-------------------------------------

Yes, the rest of the cluster was free. There are 40 nodes in our cluster.
