Posted to common-dev@hadoop.apache.org by "Christian Kunz (JIRA)" <ji...@apache.org> on 2008/05/01 19:26:55 UTC

[jira] Created: (HADOOP-3333) job failing because of reassigning same tasktracker to failing tasks

job failing because of reassigning same tasktracker to failing tasks
--------------------------------------------------------------------

                 Key: HADOOP-3333
                 URL: https://issues.apache.org/jira/browse/HADOOP-3333
             Project: Hadoop Core
          Issue Type: Bug
          Components: mapred
    Affects Versions: 0.16.3
            Reporter: Christian Kunz
            Priority: Blocker


We are running a long job in a 2nd attempt. The previous job failed, and the current job risks failing as well, because reduce tasks that fail on marginal TaskTrackers are repeatedly assigned to the same TaskTrackers (probably because those are the only available slots), eventually running out of attempts.
Reduce tasks should be assigned to the same TaskTracker at most twice, or TaskTrackers need better smarts to detect failing hardware.
BTW, mapred.reduce.max.attempts=12, which is high, but does not help in this case.



[jira] Commented: (HADOOP-3333) job failing because of reassigning same tasktracker to failing tasks

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12593690#action_12593690 ] 

Arun C Murthy commented on HADOOP-3333:
---------------------------------------

bq. Christian, do you know how many 'blacklisted' TaskTrackers were present when you noticed this?

The reason I ask:
A TaskTracker is blacklisted once it has more than 4 failed tasks for the job; however, once more than 25% of the machines in your cluster are 'blacklisted', the blacklist is ignored and the job is allowed to execute anywhere, to ensure it does not fail for lack of available trackers.
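
To make the rule concrete, here is a minimal sketch of that check; the class, method, and map names are made up for illustration, and the thresholds come from this comment rather than from the actual JobInProgress code:

{code:title=BlacklistRuleSketch.java|borderStyle=solid}
import java.util.Map;

public class BlacklistRuleSketch {
  // Thresholds as described in the comment above (illustrative constants,
  // not read from the Hadoop source).
  private static final int MAX_FAILED_TASKS_PER_TRACKER = 4;
  private static final double BLACKLIST_IGNORE_FRACTION = 0.25;

  /**
   * Decide whether a task of this job may run on the given tracker.
   * failuresPerTracker maps trackerName to the number of this job's tasks
   * that failed there; clusterSize is the number of trackers.
   */
  public static boolean shouldRunOnTracker(String trackerName,
                                           Map<String, Integer> failuresPerTracker,
                                           int clusterSize) {
    int blacklisted = 0;
    for (int failures : failuresPerTracker.values()) {
      if (failures > MAX_FAILED_TASKS_PER_TRACKER) {
        blacklisted++;
      }
    }
    // Once more than 25% of the trackers are blacklisted, ignore the
    // blacklist so the job still has somewhere to run.
    if (blacklisted > BLACKLIST_IGNORE_FRACTION * clusterSize) {
      return true;
    }
    Integer failures = failuresPerTracker.get(trackerName);
    return failures == null || failures <= MAX_FAILED_TASKS_PER_TRACKER;
  }
}
{code}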



[jira] Commented: (HADOOP-3333) job failing because of reassigning same tasktracker to failing tasks

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12594381#action_12594381 ] 

Arun C Murthy commented on HADOOP-3333:
---------------------------------------

bq. I think the ant tests depend on the fact that the trackernames can have same hostname but different port. You should fix that too.

Amar, this patch only fiddles with the semantics of TaskInProgress.machinesWhereFailed to get it to track hostnames rather than trackerNames; I don't see any tests depending on that - or am I missing something?



[jira] Updated: (HADOOP-3333) job failing because of reassigning same tasktracker to failing tasks

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arun C Murthy updated HADOOP-3333:
----------------------------------

    Attachment: HADOOP-3333_1_20080505.patch

Updated patch.

I had to fix MiniMRCluster to specify a default rack, forcing it to generate host-names, and to use the StaticMapping so that test cases aren't affected by failures on the TTs; otherwise the tests would stall, since we now use 'hostname' rather than 'trackername'...

I felt this was the simplest solution, rather than hacking the MR framework for the tests or making more extensive changes. Thoughts?



[jira] Commented: (HADOOP-3333) job failing because of reassigning same tasktracker to failing tasks

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12594738#action_12594738 ] 

Hadoop QA commented on HADOOP-3333:
-----------------------------------

+1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12381540/HADOOP-3333_2_20080506.patch
  against trunk revision 653906.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 3 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2414/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2414/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2414/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2414/console

This message is automatically generated.



[jira] Updated: (HADOOP-3333) job failing because of reassigning same tasktracker to failing tasks

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arun C Murthy updated HADOOP-3333:
----------------------------------

    Fix Version/s: 0.16.4



[jira] Commented: (HADOOP-3333) job failing because of reassigning same tasktracker to failing tasks

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12606731#action_12606731 ] 

Hudson commented on HADOOP-3333:
--------------------------------

Integrated in Hadoop-trunk #524 (See [http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/524/])



[jira] Updated: (HADOOP-3333) job failing because of reassigning same tasktracker to failing tasks

Posted by "Jothi Padmanabhan (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jothi Padmanabhan updated HADOOP-3333:
--------------------------------------

    Status: Patch Available  (was: Open)

This patch is built on top of Arun's patch. It keeps track of the number of unique hosts that run task trackers and propagates it from the JobTracker to the TaskInProgress.
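
A minimal sketch of the resulting schedulability check, with illustrative names; it mirrors the hasFailedOnMachine/getNumberOfFailedMachines calls quoted later in this issue, but it is not the committed patch:

{code:title=ScheduleCheckSketch.java|borderStyle=solid}
public class ScheduleCheckSketch {
  /** Minimal stand-in for TaskInProgress, so the sketch is self-contained. */
  interface Tip {
    boolean isRunnable();
    boolean isRunning();
    boolean hasFailedOnMachine(String hostName);
    int getNumberOfFailedMachines();
  }

  /**
   * Schedule the TIP on hostName only if it has not already failed there,
   * falling back to "run anywhere" once it has failed on every unique
   * tasktracker host (numUniqueHosts) rather than on clusterSize trackers.
   */
  static boolean shouldSchedule(Tip tip, String hostName, int numUniqueHosts) {
    if (!tip.isRunnable() || tip.isRunning()) {
      return false;
    }
    return !tip.hasFailedOnMachine(hostName)
        || tip.getNumberOfFailedMachines() >= numUniqueHosts;
  }
}
{code}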



[jira] Updated: (HADOOP-3333) job failing because of reassigning same tasktracker to failing tasks

Posted by "Jothi Padmanabhan (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jothi Padmanabhan updated HADOOP-3333:
--------------------------------------

    Attachment: hadoop-3333-v2.patch



[jira] Commented: (HADOOP-3333) job failing because of reassigning same tasktracker to failing tasks

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12593860#action_12593860 ] 

Devaraj Das commented on HADOOP-3333:
-------------------------------------

Amar, note that this is not a case of multiple trackers per machine. If a tracker is lost and later rejoins, it reinitializes, and upon reinitialization it might bind to a different port.



[jira] Commented: (HADOOP-3333) job failing because of reassigning same tasktracker to failing tasks

Posted by "Amar Kamat (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12593832#action_12593832 ] 

Amar Kamat commented on HADOOP-3333:
------------------------------------

Yes, I saw these things in the logs too.
bq. However a lost tasktracker leads to tasks being marked KILLED.
As this is different from FAILED, we should probably keep it as it is. A TaskTracker can be lost because of transient issues, and blacklisting such trackers for the TIPs that are local to them might not be good. But a tracker that gets lost too frequently could be considered for blacklisting.
- Can we do something about machines where the trackers (on different ports) are failing or getting lost too often?

bq. We also have to track hostnames rather than 'trackernames', trackername includes the host:port... (#2)
+1. The TIP could first be scheduled to trackers on different machines; only after that should we consider scheduling it to TaskTrackers sharing a machine.
----
One thing I felt was that the system was heavily loaded, which could be a possible reason for the job failures. I wonder under what conditions running multiple TaskTrackers per machine is better than running one tracker.



[jira] Updated: (HADOOP-3333) job failing because of reassigning same tasktracker to failing tasks

Posted by "Nigel Daley (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nigel Daley updated HADOOP-3333:
--------------------------------

         Priority: Critical  (was: Blocker)
    Fix Version/s:     (was: 0.16.4)
                   0.18.0

This is neither a regression nor an incompatibility. Moving this to 0.18.



[jira] Updated: (HADOOP-3333) job failing because of reassigning same tasktracker to failing tasks

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arun C Murthy updated HADOOP-3333:
----------------------------------

    Status: Patch Available  (was: Open)

Devaraj, I don't see any reason why it would affect the JSPs either... I saw no problems while testing this patch on clusters.



[jira] Commented: (HADOOP-3333) job failing because of reassigning same tasktracker to failing tasks

Posted by "Amareshwari Sriramadasu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12606321#action_12606321 ] 

Amareshwari Sriramadasu commented on HADOOP-3333:
-------------------------------------------------

+1
Patch looks good.



[jira] Updated: (HADOOP-3333) job failing because of reassigning same tasktracker to failing tasks

Posted by "Mukund Madhugiri (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mukund Madhugiri updated HADOOP-3333:
-------------------------------------

    Fix Version/s:     (was: 0.18.0)



[jira] Assigned: (HADOOP-3333) job failing because of reassigning same tasktracker to failing tasks

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arun C Murthy reassigned HADOOP-3333:
-------------------------------------

    Assignee: Arun C Murthy



[jira] Commented: (HADOOP-3333) job failing because of reassigning same tasktracker to failing tasks

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12593807#action_12593807 ] 

Arun C Murthy commented on HADOOP-3333:
---------------------------------------

Here are the symptoms and possible remedies...

1. The same TIP FAILED on a previously 'lost' tasktracker.
2. The same TIP FAILED on the same machine, but the tasktracker had a different 'port', i.e. it failed on x.y.z:30342 and on x.y.z:34223.

So, a couple of thoughts:
1. We might have to rework the logic to work around task FAILURES; currently the JT only schedules around nodes where the task FAILED, whereas a lost tasktracker leads to its tasks being marked KILLED.
2. We also have to track hostnames rather than 'trackernames', since a trackername includes the host:port (see symptom #2 above); a rough sketch follows.
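
A minimal sketch of what host-keyed failure tracking could look like. The trackerName layout and the helper below are assumptions for illustration, not the actual patch; only the field name machinesWhereFailed comes from this discussion:

{code:title=HostFailureSketch.java|borderStyle=solid}
import java.util.HashSet;
import java.util.Set;

public class HostFailureSketch {
  // Hosts on which this TIP has already FAILED, keyed by hostname so that a
  // tracker that bounces and comes back on a different port maps to the same
  // entry. The field name mirrors TaskInProgress.machinesWhereFailed.
  private final Set<String> machinesWhereFailed = new HashSet<String>();

  /** Assumed "host:port" layout; the real trackerName format may differ. */
  static String hostFromTracker(String trackerName) {
    int colon = trackerName.lastIndexOf(':');
    return colon < 0 ? trackerName : trackerName.substring(0, colon);
  }

  public void addFailure(String trackerName) {
    machinesWhereFailed.add(hostFromTracker(trackerName));
  }

  public boolean hasFailedOnMachine(String trackerName) {
    return machinesWhereFailed.contains(hostFromTracker(trackerName));
  }
}
{code}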

Thoughts?



[jira] Updated: (HADOOP-3333) job failing because of reassigning same tasktracker to failing tasks

Posted by "Robert Chansler (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Chansler updated HADOOP-3333:
------------------------------------

         Priority: Blocker  (was: Critical)
    Fix Version/s: 0.18.0



[jira] Updated: (HADOOP-3333) job failing because of reassigning same tasktracker to failing tasks

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arun C Murthy updated HADOOP-3333:
----------------------------------

    Status: Patch Available  (was: Open)



[jira] Commented: (HADOOP-3333) job failing because of reassigning same tasktracker to failing tasks

Posted by "Christian Kunz (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12594338#action_12594338 ] 

Christian Kunz commented on HADOOP-3333:
----------------------------------------

Nigel, to keep our job from failing, we had to continuously monitor it and make sure that any bouncing TaskTracker did not come back before a reduce task ran out of attempts and failed the whole job. Isn't that worth 'blocker' status, independent of whether it is a 'regression' or an 'incompatibility'?



[jira] Updated: (HADOOP-3333) job failing because of reassigning same tasktracker to failing tasks

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arun C Murthy updated HADOOP-3333:
----------------------------------

    Attachment: HADOOP-3333_2_20080506.patch

Updated patch to reflect recent commits and fixed MiniMRCluster to use NetworkTopology.DEFAULT_RACK ...



[jira] Commented: (HADOOP-3333) job failing because of reassigning same tasktracker to failing tasks

Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12593879#action_12593879 ] 

Owen O'Malley commented on HADOOP-3333:
---------------------------------------

+1 for not blocking tasks that were previously killed on a node.

+1 for using the host name rather than the tracker name for avoiding nodes.



[jira] Updated: (HADOOP-3333) job failing because of reassigning same tasktracker to failing tasks

Posted by "Jothi Padmanabhan (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jothi Padmanabhan updated HADOOP-3333:
--------------------------------------

    Status: Patch Available  (was: Open)

Patch after incorporating the review comments



[jira] Updated: (HADOOP-3333) job failing because of reassigning same tasktracker to failing tasks

Posted by "Jothi Padmanabhan (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jothi Padmanabhan updated HADOOP-3333:
--------------------------------------

    Attachment: hadoop-3333-v3.patch



[jira] Commented: (HADOOP-3333) job failing because of reassigning same tasktracker to failing tasks

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12594374#action_12594374 ] 

Arun C Murthy commented on HADOOP-3333:
---------------------------------------

Christian, one way to mitigate this is to configure *mapred.task.tracker.report.address*, which ensures that the TaskTracker uses the same port when it bounces, thus avoiding the problem.

You could set mapred.task.tracker.report.address to 0.0.0.0:<fixed_port> and that should get the job done... thoughts?
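
For example, a minimal sketch of pinning the report address; 50050 is only a placeholder for whatever fixed, unused port you choose, and in practice the property would be set in the tasktracker's hadoop-site.xml:

{code:title=FixedReportAddressSketch.java|borderStyle=solid}
import org.apache.hadoop.conf.Configuration;

public class FixedReportAddressSketch {
  public static void main(String[] args) {
    // Programmatic equivalent of putting the property into the tasktracker's
    // hadoop-site.xml; the port is illustrative only.
    Configuration conf = new Configuration();
    conf.set("mapred.task.tracker.report.address", "0.0.0.0:50050");
    System.out.println(conf.get("mapred.task.tracker.report.address"));
  }
}
{code}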



[jira] Commented: (HADOOP-3333) job failing because of reassigning same tasktracker to failing tasks

Posted by "Amar Kamat (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12594784#action_12594784 ] 

Amar Kamat commented on HADOOP-3333:
------------------------------------

bq. Why dont we do something like .... 
Consider the following:
1) A case where _num-machines_ < _num-trackers_:
||Node||Trackers||
|N1|T1, T2|
|N2|T3, T4|
2) Let's assume a corner case where the TIP fails on at least one tracker on each node.
Say TIP t1 fails on trackers T1 and T3.
3) As per the scheduling logic (see lines 18/19):
{code:title=JobInProgress.java|borderStyle=solid}
1  private synchronized TaskInProgress findTaskFromList(
2     Collection<TaskInProgress> tips, String taskTracker, boolean removeFailedTip) {
3   Iterator<TaskInProgress> iter = tips.iterator();
4   while (iter.hasNext()) {
5     TaskInProgress tip = iter.next();
6
7      // Select a tip if
8      //   1. runnable   : still needs to be run and is not completed
9      //   2. ~running   : no other node is running it
10      //   3. earlier attempt failed : has not failed on this host
11     //                               and has failed on all the other hosts
12      // A TIP is removed from the list if 
13      // (1) this tip is scheduled
14      // (2) if the passed list is a level 0 (host) cache
15      // (3) when the TIP is non-schedulable (running, killed, complete)
16      if (tip.isRunnable() && !tip.isRunning()) {
17       // check if the tip has failed on this host
18        if (!tip.hasFailedOnMachine(taskTracker) || 
19             tip.getNumberOfFailedMachines() >= clusterSize) {
{code}
The TIP _t1_ has failed on 2 machines, but the clusterSize (# of trackers) is 4, and hence the job will be stuck. With this patch, {{total-failures-per-tip}} is upper-bounded by {{num-nodes}}, while the parameter {{cluster-size}} is upper-bounded by {{num-trackers}}.



[jira] Updated: (HADOOP-3333) job failing because of reassigning same tasktracker to failing tasks

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Devaraj Das updated HADOOP-3333:
--------------------------------

      Resolution: Fixed
    Hadoop Flags: [Reviewed]
          Status: Resolved  (was: Patch Available)

I just committed this. Thanks Arun and Jothi!



[jira] Updated: (HADOOP-3333) job failing because of reassigning same tasktracker to failing tasks

Posted by "Jothi Padmanabhan (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jothi Padmanabhan updated HADOOP-3333:
--------------------------------------

    Status: Open  (was: Patch Available)



[jira] Commented: (HADOOP-3333) job failing because of reassigning same tasktracker to failing tasks

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12605529#action_12605529 ] 

Hadoop QA commented on HADOOP-3333:
-----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12384106/hadoop-3333-v1.patch
  against trunk revision 668483.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 3 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    -1 findbugs.  The patch appears to introduce 1 new Findbugs warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2669/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2669/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2669/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2669/console

This message is automatically generated.



[jira] Commented: (HADOOP-3333) job failing because of reassigning same tasktracker to failing tasks

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12594559#action_12594559 ] 

Devaraj Das commented on HADOOP-3333:
-------------------------------------

+1. Pls check whether the changes affect the JSPs in any way.

> job failing because of reassigning same tasktracker to failing tasks
> --------------------------------------------------------------------
>
>                 Key: HADOOP-3333
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3333
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.16.3
>            Reporter: Christian Kunz
>            Assignee: Arun C Murthy
>            Priority: Critical
>             Fix For: 0.18.0
>
>         Attachments: HADOOP-3333_0_20080503.patch, HADOOP-3333_1_20080505.patch
>
>
> We have a long-running job in its 2nd attempt. The previous job failed and the current job risks failing as well, because reduce tasks failing on marginal TaskTrackers are assigned repeatedly to the same TaskTrackers (probably because it is the only available slot), eventually running out of attempts.
> Reduce tasks should be assigned to the same TaskTracker at most twice, or TaskTrackers need to get some better smarts to detect failing hardware.
> BTW, mapred.reduce.max.attempts=12, which is high, but does not help in this case.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Assigned: (HADOOP-3333) job failing because of reassigning same tasktracker to failing tasks

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Devaraj Das reassigned HADOOP-3333:
-----------------------------------

    Assignee: Jothi Padmanabhan  (was: Arun C Murthy)

> job failing because of reassigning same tasktracker to failing tasks
> --------------------------------------------------------------------
>
>                 Key: HADOOP-3333
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3333
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.16.3
>            Reporter: Christian Kunz
>            Assignee: Jothi Padmanabhan
>            Priority: Blocker
>             Fix For: 0.18.0
>
>         Attachments: HADOOP-3333_0_20080503.patch, HADOOP-3333_1_20080505.patch, HADOOP-3333_2_20080506.patch
>
>
> We have a long-running job in its 2nd attempt. The previous job failed and the current job risks failing as well, because reduce tasks failing on marginal TaskTrackers are assigned repeatedly to the same TaskTrackers (probably because it is the only available slot), eventually running out of attempts.
> Reduce tasks should be assigned to the same TaskTracker at most twice, or TaskTrackers need to get some better smarts to detect failing hardware.
> BTW, mapred.reduce.max.attempts=12, which is high, but does not help in this case.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-3333) job failing because of reassigning same tasktracker to failing tasks

Posted by "Jothi Padmanabhan (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12604804#action_12604804 ] 

Jothi Padmanabhan commented on HADOOP-3333:
-------------------------------------------

It appears that the only way to handle the case of multiple TTs per node is to trickle the list of unique hosts (that run TTs) down from the JobTracker (hostsReader.getHosts().size()). This information needs to pass through several layers before it can be used in findTaskFromList. We need to evaluate whether we want to make this change to handle this special case, or just go ahead with the existing patch and document this case as a limitation.
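
Purely to illustrate the idea (all names below are hypothetical, not the actual JobTracker/JobInProgress API), a minimal sketch of counting unique TaskTracker hosts on the JobTracker side and using that count in the task-selection check:

import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch only -- not the actual Hadoop classes or method names.
public class UniqueHostSketch {

  // JobTracker side: derive the number of unique hosts from the host list
  // (analogous in spirit to hostsReader.getHosts().size()).
  static int countUniqueHosts(List<String> trackerHostNames) {
    return new HashSet<String>(trackerHostNames).size();
  }

  // Stand-in for the check deep inside findTaskFromList: a TIP goes back to a
  // host it already failed on only once it has failed on every unique host.
  static boolean mayScheduleOn(Set<String> hostsTipFailedOn,
                               String candidateHost,
                               int numUniqueHosts) {
    if (!hostsTipFailedOn.contains(candidateHost)) {
      return true;
    }
    return hostsTipFailedOn.size() >= numUniqueHosts;
  }
}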

> job failing because of reassigning same tasktracker to failing tasks
> --------------------------------------------------------------------
>
>                 Key: HADOOP-3333
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3333
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.16.3
>            Reporter: Christian Kunz
>            Assignee: Jothi Padmanabhan
>            Priority: Blocker
>             Fix For: 0.18.0
>
>         Attachments: HADOOP-3333_0_20080503.patch, HADOOP-3333_1_20080505.patch, HADOOP-3333_2_20080506.patch
>
>
> We have a long-running job in its 2nd attempt. The previous job failed and the current job risks failing as well, because reduce tasks failing on marginal TaskTrackers are assigned repeatedly to the same TaskTrackers (probably because it is the only available slot), eventually running out of attempts.
> Reduce tasks should be assigned to the same TaskTracker at most twice, or TaskTrackers need to get some better smarts to detect failing hardware.
> BTW, mapred.reduce.max.attempts=12, which is high, but does not help in this case.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-3333) job failing because of reassigning same tasktracker to failing tasks

Posted by "Christian Kunz (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Christian Kunz updated HADOOP-3333:
-----------------------------------

    Description: 
We have a long-running job in its 2nd attempt. The previous job failed and the current job risks failing as well, because reduce tasks failing on marginal TaskTrackers are assigned repeatedly to the same TaskTrackers (probably because it is the only available slot), eventually running out of attempts.
Reduce tasks should be assigned to the same TaskTracker at most twice, or TaskTrackers need to get some better smarts to detect failing hardware.
BTW, mapred.reduce.max.attempts=12, which is high, but does not help in this case.

  was:
We are long running a job in a 2nd atttempt. Previous job was failing and current jobs risks to fail as well, because  reduce tasks failing on marginal TaskTrackers are assigned repeatedly to the same TaskTrackers (probably because it is the only available slot), eventually running out of attempts.
Reduce tasks should be assigned to the same TaskTrackers at most twice, or TaskTrackers need to get some better smarts to find  failing hardware.
BTW, mapred.reduce.max.attempts=12, which is high, but does not help in this case.


> job failing because of reassigning same tasktracker to failing tasks
> --------------------------------------------------------------------
>
>                 Key: HADOOP-3333
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3333
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.16.3
>            Reporter: Christian Kunz
>            Priority: Blocker
>
> We have a long-running job in its 2nd attempt. The previous job failed and the current job risks failing as well, because reduce tasks failing on marginal TaskTrackers are assigned repeatedly to the same TaskTrackers (probably because it is the only available slot), eventually running out of attempts.
> Reduce tasks should be assigned to the same TaskTracker at most twice, or TaskTrackers need to get some better smarts to detect failing hardware.
> BTW, mapred.reduce.max.attempts=12, which is high, but does not help in this case.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-3333) job failing because of reassigning same tasktracker to failing tasks

Posted by "Christian Kunz (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12593793#action_12593793 ] 

Christian Kunz commented on HADOOP-3333:
----------------------------------------

Arun,
Number of blacklisted TaskTrackers is low (less than 1%), because we have a high threshold (100 failures) for TaskTrackers to be declared blacklisted. In the past, with the default setting, we lost too many TaskTrackers too fast even when there were no hardware issues -- but this might have been fixed and we might want to change this back to a more reasonable value. On the other hand, we did not have any problems using the high value till 0.16.3.

Amar,
By a 'marginal' TaskTracker I mean a TaskTracker running on a node with hardware failures that still runs most short tasks successfully, but with a higher chance of failing long-running tasks (e.g. reduce tasks shuffling the map outputs from many waves of short map tasks).
Concerning 'the same tasks repeatedly assigned to the same TaskTracker', I can point you offline to a running job exhibiting the problem.

> job failing because of reassigning same tasktracker to failing tasks
> --------------------------------------------------------------------
>
>                 Key: HADOOP-3333
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3333
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.16.3
>            Reporter: Christian Kunz
>            Priority: Blocker
>
> We have a long-running job in its 2nd attempt. The previous job failed and the current job risks failing as well, because reduce tasks failing on marginal TaskTrackers are assigned repeatedly to the same TaskTrackers (probably because it is the only available slot), eventually running out of attempts.
> Reduce tasks should be assigned to the same TaskTracker at most twice, or TaskTrackers need to get some better smarts to detect failing hardware.
> BTW, mapred.reduce.max.attempts=12, which is high, but does not help in this case.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-3333) job failing because of reassigning same tasktracker to failing tasks

Posted by "Amareshwari Sriramadasu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12597391#action_12597391 ] 

Amareshwari Sriramadasu commented on HADOOP-3333:
-------------------------------------------------

The change in the order of calls to lostTaskTracker and updateTaskTrackerStatus in JobTracker.ExpireTrackers fixes HADOOP-3403.

+1 for the change. 


> job failing because of reassigning same tasktracker to failing tasks
> --------------------------------------------------------------------
>
>                 Key: HADOOP-3333
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3333
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.16.3
>            Reporter: Christian Kunz
>            Assignee: Arun C Murthy
>            Priority: Critical
>             Fix For: 0.18.0
>
>         Attachments: HADOOP-3333_0_20080503.patch, HADOOP-3333_1_20080505.patch, HADOOP-3333_2_20080506.patch
>
>
> We have a long-running job in its 2nd attempt. The previous job failed and the current job risks failing as well, because reduce tasks failing on marginal TaskTrackers are assigned repeatedly to the same TaskTrackers (probably because it is the only available slot), eventually running out of attempts.
> Reduce tasks should be assigned to the same TaskTracker at most twice, or TaskTrackers need to get some better smarts to detect failing hardware.
> BTW, mapred.reduce.max.attempts=12, which is high, but does not help in this case.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-3333) job failing because of reassigning same tasktracker to failing tasks

Posted by "Amar Kamat (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12593768#action_12593768 ] 

Amar Kamat commented on HADOOP-3333:
------------------------------------

bq. reduce tasks failing on marginal TaskTrackers
What do you mean by this?
bq. repeatedly to the same TaskTrackers (probably because it is the only available slot)
No. A task will be assigned to the same tasktracker only under the conditions Arun mentioned, or only after the TIP has been tried on all the machines.
Can you provide more details on how to reproduce this problem, how many nodes there were in the cluster, and what the overall behaviour of the job was in terms of failures?

> job failing because of reassigning same tasktracker to failing tasks
> --------------------------------------------------------------------
>
>                 Key: HADOOP-3333
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3333
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.16.3
>            Reporter: Christian Kunz
>            Priority: Blocker
>
> We have a long-running job in its 2nd attempt. The previous job failed and the current job risks failing as well, because reduce tasks failing on marginal TaskTrackers are assigned repeatedly to the same TaskTrackers (probably because it is the only available slot), eventually running out of attempts.
> Reduce tasks should be assigned to the same TaskTracker at most twice, or TaskTrackers need to get some better smarts to detect failing hardware.
> BTW, mapred.reduce.max.attempts=12, which is high, but does not help in this case.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-3333) job failing because of reassigning same tasktracker to failing tasks

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arun C Murthy updated HADOOP-3333:
----------------------------------

    Attachment: HADOOP-3333_0_20080503.patch

Straightforward patch that changes TaskInProgress to track the hostnames of the TaskTrackers rather than the 'tracker name'. I'm currently running tests on this...
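
As a rough illustration of what tracking hostnames instead of tracker names means for the per-task bookkeeping (field and method names below are made up for the sketch, not the actual TaskInProgress members):

import java.util.Set;
import java.util.TreeSet;

// Hypothetical per-TIP failure record keyed by hostname rather than by the
// full tracker name (which embeds a port that can change across restarts).
class TaskFailureRecord {
  private final Set<String> failedHosts = new TreeSet<String>();

  // Record the failure against the machine, not the tracker incarnation.
  void noteFailureOn(String hostName) {
    failedHosts.add(hostName);
  }

  // A rejoined tracker on the same box still counts as "already failed here".
  boolean hasFailedOn(String hostName) {
    return failedHosts.contains(hostName);
  }
}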

> job failing because of reassigning same tasktracker to failing tasks
> --------------------------------------------------------------------
>
>                 Key: HADOOP-3333
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3333
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.16.3
>            Reporter: Christian Kunz
>            Assignee: Arun C Murthy
>            Priority: Blocker
>             Fix For: 0.16.4
>
>         Attachments: HADOOP-3333_0_20080503.patch
>
>
> We have a long-running job in its 2nd attempt. The previous job failed and the current job risks failing as well, because reduce tasks failing on marginal TaskTrackers are assigned repeatedly to the same TaskTrackers (probably because it is the only available slot), eventually running out of attempts.
> Reduce tasks should be assigned to the same TaskTracker at most twice, or TaskTrackers need to get some better smarts to detect failing hardware.
> BTW, mapred.reduce.max.attempts=12, which is high, but does not help in this case.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-3333) job failing because of reassigning same tasktracker to failing tasks

Posted by "Jothi Padmanabhan (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jothi Padmanabhan updated HADOOP-3333:
--------------------------------------

    Attachment: hadoop-3333-v1.patch

> job failing because of reassigning same tasktracker to failing tasks
> --------------------------------------------------------------------
>
>                 Key: HADOOP-3333
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3333
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.16.3
>            Reporter: Christian Kunz
>            Assignee: Jothi Padmanabhan
>            Priority: Blocker
>             Fix For: 0.18.0
>
>         Attachments: hadoop-3333-v1.patch, hadoop-3333.patch, HADOOP-3333_0_20080503.patch, HADOOP-3333_1_20080505.patch, HADOOP-3333_2_20080506.patch
>
>
> We have a long-running job in its 2nd attempt. The previous job failed and the current job risks failing as well, because reduce tasks failing on marginal TaskTrackers are assigned repeatedly to the same TaskTrackers (probably because it is the only available slot), eventually running out of attempts.
> Reduce tasks should be assigned to the same TaskTracker at most twice, or TaskTrackers need to get some better smarts to detect failing hardware.
> BTW, mapred.reduce.max.attempts=12, which is high, but does not help in this case.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-3333) job failing because of reassigning same tasktracker to failing tasks

Posted by "Christian Kunz (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12593823#action_12593823 ] 

Christian Kunz commented on HADOOP-3333:
----------------------------------------

Just noticed HADOOP-2770, a seemingly related issue making a case for not rescheduling tasks on nodes where they have been killed (not just failed).

> job failing because of reassigning same tasktracker to failing tasks
> --------------------------------------------------------------------
>
>                 Key: HADOOP-3333
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3333
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.16.3
>            Reporter: Christian Kunz
>            Assignee: Arun C Murthy
>            Priority: Blocker
>
> We have a long-running job in its 2nd attempt. The previous job failed and the current job risks failing as well, because reduce tasks failing on marginal TaskTrackers are assigned repeatedly to the same TaskTrackers (probably because it is the only available slot), eventually running out of attempts.
> Reduce tasks should be assigned to the same TaskTracker at most twice, or TaskTrackers need to get some better smarts to detect failing hardware.
> BTW, mapred.reduce.max.attempts=12, which is high, but does not help in this case.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-3333) job failing because of reassigning same tasktracker to failing tasks

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arun C Murthy updated HADOOP-3333:
----------------------------------

    Attachment:     (was: HADOOP-3333_1_20080505.patch)

> job failing because of reassigning same tasktracker to failing tasks
> --------------------------------------------------------------------
>
>                 Key: HADOOP-3333
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3333
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.16.3
>            Reporter: Christian Kunz
>            Assignee: Arun C Murthy
>            Priority: Critical
>             Fix For: 0.18.0
>
>         Attachments: HADOOP-3333_0_20080503.patch, HADOOP-3333_1_20080505.patch
>
>
> We have a long-running job in its 2nd attempt. The previous job failed and the current job risks failing as well, because reduce tasks failing on marginal TaskTrackers are assigned repeatedly to the same TaskTrackers (probably because it is the only available slot), eventually running out of attempts.
> Reduce tasks should be assigned to the same TaskTracker at most twice, or TaskTrackers need to get some better smarts to detect failing hardware.
> BTW, mapred.reduce.max.attempts=12, which is high, but does not help in this case.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-3333) job failing because of reassigning same tasktracker to failing tasks

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12593825#action_12593825 ] 

Devaraj Das commented on HADOOP-3333:
-------------------------------------

bq. 1. We might have to rework the logic to work around task FAILURES; currently the JT only schedules around nodes where the task FAILED. However a lost tasktracker leads to tasks being marked KILLED.

IMO we should leave this logic unchanged. The re-execution on this lost TT, if it fails, will make the JT schedule the task around it.

bq. 2. We also have to track hostnames rather than 'trackernames', trackername includes the host:port... (#2)

This makes sense (as long as we don't depend on host:port, esp. in the unit tests).
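
To make the host-vs-tracker distinction concrete: the tracker name embeds an ephemeral port, so two incarnations of the same machine look different while the hostname does not. A tiny illustrative helper, assuming (only for this sketch) a tracker name shaped like tracker_<host>:<port>:

// Illustrative only; the assumed "tracker_<host>:<port>" shape is a
// simplification, not a statement about the real tracker-name format.
final class TrackerNames {
  private TrackerNames() {}

  static String hostOf(String trackerName) {
    String s = trackerName.startsWith("tracker_")
        ? trackerName.substring("tracker_".length())
        : trackerName;
    int colon = s.indexOf(':');
    return colon < 0 ? s : s.substring(0, colon);
  }

  public static void main(String[] args) {
    // Same machine, different ports -> same host key.
    System.out.println(hostOf("tracker_node1.example.com:50060"));
    System.out.println(hostOf("tracker_node1.example.com:51234"));
  }
}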

> job failing because of reassigning same tasktracker to failing tasks
> --------------------------------------------------------------------
>
>                 Key: HADOOP-3333
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3333
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.16.3
>            Reporter: Christian Kunz
>            Assignee: Arun C Murthy
>            Priority: Blocker
>
> We have a long-running job in its 2nd attempt. The previous job failed and the current job risks failing as well, because reduce tasks failing on marginal TaskTrackers are assigned repeatedly to the same TaskTrackers (probably because it is the only available slot), eventually running out of attempts.
> Reduce tasks should be assigned to the same TaskTracker at most twice, or TaskTrackers need to get some better smarts to detect failing hardware.
> BTW, mapred.reduce.max.attempts=12, which is high, but does not help in this case.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-3333) job failing because of reassigning same tasktracker to failing tasks

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12593797#action_12593797 ] 

Arun C Murthy commented on HADOOP-3333:
---------------------------------------

bq. Number of blacklisted TaskTrackers is low (less than 1%), because we have a high threshold (100 failures) for TaskTrackers to be declared blacklisted.

Christian, the notion of blacklisting is useful to ensure that tasks aren't scheduled on the marginal tasktrackers. That said, the scheduling code in the JobTracker tries not to allocate a task on nodes on which it has previously failed, unless it has failed on all machines. What is your configuration for mapred.reduce.max.attempts?

bq. I can point you to a running job offline exhibiting the problem.
Yes, that will help. Thanks!

> job failing because of reassigning same tasktracker to failing tasks
> --------------------------------------------------------------------
>
>                 Key: HADOOP-3333
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3333
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.16.3
>            Reporter: Christian Kunz
>            Priority: Blocker
>
> We have a long-running job in its 2nd attempt. The previous job failed and the current job risks failing as well, because reduce tasks failing on marginal TaskTrackers are assigned repeatedly to the same TaskTrackers (probably because it is the only available slot), eventually running out of attempts.
> Reduce tasks should be assigned to the same TaskTracker at most twice, or TaskTrackers need to get some better smarts to detect failing hardware.
> BTW, mapred.reduce.max.attempts=12, which is high, but does not help in this case.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-3333) job failing because of reassigning same tasktracker to failing tasks

Posted by "Christian Kunz (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12594379#action_12594379 ] 

Christian Kunz commented on HADOOP-3333:
----------------------------------------

This is a good enough work-around, i.e. it is okay with me to wait for 0.18.0 for the official fix.

BTW, just for the record, the tasktrackers were not actually bouncing as in restarting, but they got lost and rejoined.

> job failing because of reassigning same tasktracker to failing tasks
> --------------------------------------------------------------------
>
>                 Key: HADOOP-3333
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3333
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.16.3
>            Reporter: Christian Kunz
>            Assignee: Arun C Murthy
>            Priority: Critical
>             Fix For: 0.18.0
>
>         Attachments: HADOOP-3333_0_20080503.patch, HADOOP-3333_1_20080505.patch
>
>
> We have a long-running job in its 2nd attempt. The previous job failed and the current job risks failing as well, because reduce tasks failing on marginal TaskTrackers are assigned repeatedly to the same TaskTrackers (probably because it is the only available slot), eventually running out of attempts.
> Reduce tasks should be assigned to the same TaskTracker at most twice, or TaskTrackers need to get some better smarts to detect failing hardware.
> BTW, mapred.reduce.max.attempts=12, which is high, but does not help in this case.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-3333) job failing because of reassigning same tasktracker to failing tasks

Posted by "Jothi Padmanabhan (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jothi Padmanabhan updated HADOOP-3333:
--------------------------------------

    Status: Patch Available  (was: Open)

Resubmitting the patch after fixing the findbugs warning

> job failing because of reassigning same tasktracker to failing tasks
> --------------------------------------------------------------------
>
>                 Key: HADOOP-3333
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3333
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.16.3
>            Reporter: Christian Kunz
>            Assignee: Jothi Padmanabhan
>            Priority: Blocker
>             Fix For: 0.18.0
>
>         Attachments: hadoop-3333-v1.patch, hadoop-3333.patch, HADOOP-3333_0_20080503.patch, HADOOP-3333_1_20080505.patch, HADOOP-3333_2_20080506.patch
>
>
> We have a long-running job in its 2nd attempt. The previous job failed and the current job risks failing as well, because reduce tasks failing on marginal TaskTrackers are assigned repeatedly to the same TaskTrackers (probably because it is the only available slot), eventually running out of attempts.
> Reduce tasks should be assigned to the same TaskTracker at most twice, or TaskTrackers need to get some better smarts to detect failing hardware.
> BTW, mapred.reduce.max.attempts=12, which is high, but does not help in this case.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-3333) job failing because of reassigning same tasktracker to failing tasks

Posted by "Jothi Padmanabhan (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jothi Padmanabhan updated HADOOP-3333:
--------------------------------------

    Attachment: hadoop-3333.patch

> job failing because of reassigning same tasktracker to failing tasks
> --------------------------------------------------------------------
>
>                 Key: HADOOP-3333
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3333
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.16.3
>            Reporter: Christian Kunz
>            Assignee: Jothi Padmanabhan
>            Priority: Blocker
>             Fix For: 0.18.0
>
>         Attachments: hadoop-3333.patch, HADOOP-3333_0_20080503.patch, HADOOP-3333_1_20080505.patch, HADOOP-3333_2_20080506.patch
>
>
> We have a long-running job in its 2nd attempt. The previous job failed and the current job risks failing as well, because reduce tasks failing on marginal TaskTrackers are assigned repeatedly to the same TaskTrackers (probably because it is the only available slot), eventually running out of attempts.
> Reduce tasks should be assigned to the same TaskTracker at most twice, or TaskTrackers need to get some better smarts to detect failing hardware.
> BTW, mapred.reduce.max.attempts=12, which is high, but does not help in this case.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-3333) job failing because of reassigning same tasktracker to failing tasks

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12605898#action_12605898 ] 

Hadoop QA commented on HADOOP-3333:
-----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12384180/hadoop-3333-v3.patch
  against trunk revision 669088.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 3 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    -1 findbugs.  The patch appears to introduce 1 new Findbugs warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2680/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2680/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2680/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2680/console

This message is automatically generated.

> job failing because of reassigning same tasktracker to failing tasks
> --------------------------------------------------------------------
>
>                 Key: HADOOP-3333
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3333
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.16.3
>            Reporter: Christian Kunz
>            Assignee: Jothi Padmanabhan
>            Priority: Blocker
>             Fix For: 0.18.0
>
>         Attachments: hadoop-3333-v1.patch, hadoop-3333-v2.patch, hadoop-3333-v3.patch, hadoop-3333.patch, HADOOP-3333_0_20080503.patch, HADOOP-3333_1_20080505.patch, HADOOP-3333_2_20080506.patch
>
>
> We have a long-running job in its 2nd attempt. The previous job failed and the current job risks failing as well, because reduce tasks failing on marginal TaskTrackers are assigned repeatedly to the same TaskTrackers (probably because it is the only available slot), eventually running out of attempts.
> Reduce tasks should be assigned to the same TaskTracker at most twice, or TaskTrackers need to get some better smarts to detect failing hardware.
> BTW, mapred.reduce.max.attempts=12, which is high, but does not help in this case.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-3333) job failing because of reassigning same tasktracker to failing tasks

Posted by "Amar Kamat (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12594774#action_12594774 ] 

Amar Kamat commented on HADOOP-3333:
------------------------------------

Why don't we do something like:
1) If {{(num-unique-nodes-amongst-the-trackers / total-trackers-registered-with-jt) > K}}, then blacklist the node for that TIP, i.e. do as discussed above (see the sketch below).
2) Else avoid blacklisting the host for that TIP (similar to the current blacklisting of trackers).
Where K = 0.25.
This will overcome the corner case where the cluster is running on a small number of nodes and the TIP has failed on at least one tracker on each node. Also, the test can be kept as it is.
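
Taken literally, the proposal reduces to a single threshold check, sketched below (K and the two counts are placeholders; nothing here is taken from an actual patch):

// Hypothetical sketch of the thresholding idea above: only blacklist a host
// for a TIP when unique hosts are plentiful relative to registered trackers;
// otherwise fall back to tracker-level blacklisting.
class HostBlacklistPolicy {
  static final double K = 0.25;

  static boolean blacklistHostForTip(int numUniqueNodesAmongTrackers,
                                     int totalTrackersRegisteredWithJt) {
    double ratio = (double) numUniqueNodesAmongTrackers
                   / totalTrackersRegisteredWithJt;
    return ratio > K;
  }
}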

> job failing because of reassigning same tasktracker to failing tasks
> --------------------------------------------------------------------
>
>                 Key: HADOOP-3333
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3333
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.16.3
>            Reporter: Christian Kunz
>            Assignee: Arun C Murthy
>            Priority: Critical
>             Fix For: 0.18.0
>
>         Attachments: HADOOP-3333_0_20080503.patch, HADOOP-3333_1_20080505.patch, HADOOP-3333_2_20080506.patch
>
>
> We have a long-running job in its 2nd attempt. The previous job failed and the current job risks failing as well, because reduce tasks failing on marginal TaskTrackers are assigned repeatedly to the same TaskTrackers (probably because it is the only available slot), eventually running out of attempts.
> Reduce tasks should be assigned to the same TaskTracker at most twice, or TaskTrackers need to get some better smarts to detect failing hardware.
> BTW, mapred.reduce.max.attempts=12, which is high, but does not help in this case.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-3333) job failing because of reassigning same tasktracker to failing tasks

Posted by "Christian Kunz (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12593801#action_12593801 ] 

Christian Kunz commented on HADOOP-3333:
----------------------------------------

Arun, as mentioned in the description, mapred.reduce.max.attempts=12.
I pointed Amar to the running job.
E.g. a reduce task xxx_r_000309 had attempts 4-7 on the same TaskTracker with 2 failures and 2 kills.

> job failing because of reassigning same tasktracker to failing tasks
> --------------------------------------------------------------------
>
>                 Key: HADOOP-3333
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3333
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.16.3
>            Reporter: Christian Kunz
>            Priority: Blocker
>
> We have a long-running job in its 2nd attempt. The previous job failed and the current job risks failing as well, because reduce tasks failing on marginal TaskTrackers are assigned repeatedly to the same TaskTrackers (probably because it is the only available slot), eventually running out of attempts.
> Reduce tasks should be assigned to the same TaskTracker at most twice, or TaskTrackers need to get some better smarts to detect failing hardware.
> BTW, mapred.reduce.max.attempts=12, which is high, but does not help in this case.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-3333) job failing because of reassigning same tasktracker to failing tasks

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arun C Murthy updated HADOOP-3333:
----------------------------------

    Status: Open  (was: Patch Available)

After a discussion, Owen and I agree that we need to use NetworkTopology.DEFAULT_RACK as the default rack in MiniMRCluster...

> job failing because of reassigning same tasktracker to failing tasks
> --------------------------------------------------------------------
>
>                 Key: HADOOP-3333
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3333
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.16.3
>            Reporter: Christian Kunz
>            Assignee: Arun C Murthy
>            Priority: Critical
>             Fix For: 0.18.0
>
>         Attachments: HADOOP-3333_0_20080503.patch, HADOOP-3333_1_20080505.patch
>
>
> We have a long-running job in its 2nd attempt. The previous job failed and the current job risks failing as well, because reduce tasks failing on marginal TaskTrackers are assigned repeatedly to the same TaskTrackers (probably because it is the only available slot), eventually running out of attempts.
> Reduce tasks should be assigned to the same TaskTracker at most twice, or TaskTrackers need to get some better smarts to detect failing hardware.
> BTW, mapred.reduce.max.attempts=12, which is high, but does not help in this case.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-3333) job failing because of reassigning same tasktracker to failing tasks

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12593688#action_12593688 ] 

Arun C Murthy commented on HADOOP-3333:
---------------------------------------

Christian, do you know how many 'blacklisted' TaskTrackers were present when you noticed this?

> job failing because of reassigning same tasktracker to failing tasks
> --------------------------------------------------------------------
>
>                 Key: HADOOP-3333
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3333
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.16.3
>            Reporter: Christian Kunz
>            Priority: Blocker
>
> We have a long-running job in its 2nd attempt. The previous job failed and the current job risks failing as well, because reduce tasks failing on marginal TaskTrackers are assigned repeatedly to the same TaskTrackers (probably because it is the only available slot), eventually running out of attempts.
> Reduce tasks should be assigned to the same TaskTracker at most twice, or TaskTrackers need to get some better smarts to detect failing hardware.
> BTW, mapred.reduce.max.attempts=12, which is high, but does not help in this case.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-3333) job failing because of reassigning same tasktracker to failing tasks

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12594387#action_12594387 ] 

Arun C Murthy commented on HADOOP-3333:
---------------------------------------

bq. BTW, just for the record, the tasktrackers were not actually bouncing as in restarting, but they got lost and rejoined.

Yes, I'm guessing you aren't fiddling with the default value of *mapred.task.tracker.report.address*, which is 127.0.0.1:0. When task-trackers 'reinitialize', they bind to port zero again and hence get a new 'trackername'...

> job failing because of reassigning same tasktracker to failing tasks
> --------------------------------------------------------------------
>
>                 Key: HADOOP-3333
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3333
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.16.3
>            Reporter: Christian Kunz
>            Assignee: Arun C Murthy
>            Priority: Critical
>             Fix For: 0.18.0
>
>         Attachments: HADOOP-3333_0_20080503.patch, HADOOP-3333_1_20080505.patch
>
>
> We have a long-running job in its 2nd attempt. The previous job failed and the current job risks failing as well, because reduce tasks failing on marginal TaskTrackers are assigned repeatedly to the same TaskTrackers (probably because it is the only available slot), eventually running out of attempts.
> Reduce tasks should be assigned to the same TaskTracker at most twice, or TaskTrackers need to get some better smarts to detect failing hardware.
> BTW, mapred.reduce.max.attempts=12, which is high, but does not help in this case.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-3333) job failing because of reassigning same tasktracker to failing tasks

Posted by "Amareshwari Sriramadasu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12605542#action_12605542 ] 

Amareshwari Sriramadasu commented on HADOOP-3333:
-------------------------------------------------

Some minor comments:
1. Indentation/line length needs to be fixed (to stay within 80 columns) in a couple of places in JobTracker and JIP.
2. Since taskTrackerStatus is passed as a parameter to JIP.failedTask(), the earlier call to jobtracker.getTaskTracker(status.getTaskTracker()) for getting the taskTrackerStatus can be replaced with the parameter passed in (see the sketch below).
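
Comment 2 in a nutshell, as a stand-in sketch (the real JobInProgress/JobTracker signatures may differ):

// Illustrative only: prefer the argument the caller already passed in over
// re-fetching the same object through another lookup.
class FailedTaskSketch {
  void failedTask(Object taskTrackerStatus /*, other args */) {
    // Before (as described): Object status = jobtracker.getTaskTracker(name);
    Object status = taskTrackerStatus;  // after: just use the parameter
    handle(status);
  }

  private void handle(Object status) {
    // ... failure handling would use 'status' here ...
  }
}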


> job failing because of reassigning same tasktracker to failing tasks
> --------------------------------------------------------------------
>
>                 Key: HADOOP-3333
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3333
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.16.3
>            Reporter: Christian Kunz
>            Assignee: Jothi Padmanabhan
>            Priority: Blocker
>             Fix For: 0.18.0
>
>         Attachments: hadoop-3333-v1.patch, hadoop-3333.patch, HADOOP-3333_0_20080503.patch, HADOOP-3333_1_20080505.patch, HADOOP-3333_2_20080506.patch
>
>
> We have a long-running job in its 2nd attempt. The previous job failed and the current job risks failing as well, because reduce tasks failing on marginal TaskTrackers are assigned repeatedly to the same TaskTrackers (probably because it is the only available slot), eventually running out of attempts.
> Reduce tasks should be assigned to the same TaskTracker at most twice, or TaskTrackers need to get some better smarts to detect failing hardware.
> BTW, mapred.reduce.max.attempts=12, which is high, but does not help in this case.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-3333) job failing because of reassigning same tasktracker to failing tasks

Posted by "Jothi Padmanabhan (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12605541#action_12605541 ] 

Jothi Padmanabhan commented on HADOOP-3333:
-------------------------------------------

Findbugs is complaining about this piece of code:

{
  Integer numTaskTrackersInHost;
  ...
  numTaskTrackersInHost++;
  uniqueHosts.put(host, numTaskTrackersInHost);
}

It appears that while the compiler handles applying primitive integer operations to the Integer object, findbugs is confused by it.
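
The likely culprit is auto-(un)boxing: numTaskTrackersInHost++ unboxes the Integer, increments the primitive and boxes a new object, which is the kind of pattern findbugs flags even though javac accepts it. One way to write the same counter update with the unboxing made explicit (map name and types assumed for the sketch):

import java.util.HashMap;
import java.util.Map;

// Sketch of the increment-a-counter-in-a-map pattern from the snippet above,
// written so no boxed Integer is incremented in place. (The exact findbugs
// rule being tripped is not identified here.)
class HostCounter {
  private final Map<String, Integer> uniqueHosts =
      new HashMap<String, Integer>();

  void addTrackerOn(String host) {
    Integer current = uniqueHosts.get(host);
    int next = (current == null) ? 1 : current.intValue() + 1;
    uniqueHosts.put(host, Integer.valueOf(next));
  }

  int trackersOn(String host) {
    Integer n = uniqueHosts.get(host);
    return (n == null) ? 0 : n.intValue();
  }
}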

> job failing because of reassigning same tasktracker to failing tasks
> --------------------------------------------------------------------
>
>                 Key: HADOOP-3333
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3333
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.16.3
>            Reporter: Christian Kunz
>            Assignee: Jothi Padmanabhan
>            Priority: Blocker
>             Fix For: 0.18.0
>
>         Attachments: hadoop-3333-v1.patch, hadoop-3333.patch, HADOOP-3333_0_20080503.patch, HADOOP-3333_1_20080505.patch, HADOOP-3333_2_20080506.patch
>
>
> We have a long-running job in its 2nd attempt. The previous job failed and the current job risks failing as well, because reduce tasks failing on marginal TaskTrackers are assigned repeatedly to the same TaskTrackers (probably because it is the only available slot), eventually running out of attempts.
> Reduce tasks should be assigned to the same TaskTracker at most twice, or TaskTrackers need to get some better smarts to detect failing hardware.
> BTW, mapred.reduce.max.attempts=12, which is high, but does not help in this case.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-3333) job failing because of reassigning same tasktracker to failing tasks

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12605658#action_12605658 ] 

Hadoop QA commented on HADOOP-3333:
-----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12384117/hadoop-3333-v2.patch
  against trunk revision 668612.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 3 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    -1 findbugs.  The patch appears to introduce 1 new Findbugs warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2672/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2672/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2672/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2672/console

This message is automatically generated.

> job failing because of reassigning same tasktracker to failing tasks
> --------------------------------------------------------------------
>
>                 Key: HADOOP-3333
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3333
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.16.3
>            Reporter: Christian Kunz
>            Assignee: Jothi Padmanabhan
>            Priority: Blocker
>             Fix For: 0.18.0
>
>         Attachments: hadoop-3333-v1.patch, hadoop-3333-v2.patch, hadoop-3333.patch, HADOOP-3333_0_20080503.patch, HADOOP-3333_1_20080505.patch, HADOOP-3333_2_20080506.patch
>
>
> We have a long-running job in its 2nd attempt. The previous job failed and the current job risks failing as well, because reduce tasks failing on marginal TaskTrackers are assigned repeatedly to the same TaskTrackers (probably because it is the only available slot), eventually running out of attempts.
> Reduce tasks should be assigned to the same TaskTracker at most twice, or TaskTrackers need to get some better smarts to detect failing hardware.
> BTW, mapred.reduce.max.attempts=12, which is high, but does not help in this case.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-3333) job failing because of reassigning same tasktracker to failing tasks

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arun C Murthy updated HADOOP-3333:
----------------------------------

    Comment: was deleted

> job failing because of reassigning same tasktracker to failing tasks
> --------------------------------------------------------------------
>
>                 Key: HADOOP-3333
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3333
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.16.3
>            Reporter: Christian Kunz
>            Assignee: Arun C Murthy
>            Priority: Critical
>             Fix For: 0.18.0
>
>         Attachments: HADOOP-3333_0_20080503.patch, HADOOP-3333_1_20080505.patch
>
>
> We have a long-running job in its 2nd attempt. The previous job failed and the current job risks failing as well, because reduce tasks failing on marginal TaskTrackers are assigned repeatedly to the same TaskTrackers (probably because it is the only available slot), eventually running out of attempts.
> Reduce tasks should be assigned to the same TaskTracker at most twice, or TaskTrackers need to get some better smarts to detect failing hardware.
> BTW, mapred.reduce.max.attempts=12, which is high, but does not help in this case.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-3333) job failing because of reassigning same tasktracker to failing tasks

Posted by "Amar Kamat (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12605596#action_12605596 ] 

Amar Kamat commented on HADOOP-3333:
------------------------------------

The approach looks fine. The code also looks fine. Looks like this addition also addresses the corner case raised earlier. One super minor comment for JobTracker.java: the patch removes and adds LinkedHashMap. We can probably avoid that.

> job failing because of reassigning same tasktracker to failing tasks
> --------------------------------------------------------------------
>
>                 Key: HADOOP-3333
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3333
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.16.3
>            Reporter: Christian Kunz
>            Assignee: Jothi Padmanabhan
>            Priority: Blocker
>             Fix For: 0.18.0
>
>         Attachments: hadoop-3333-v1.patch, hadoop-3333-v2.patch, hadoop-3333.patch, HADOOP-3333_0_20080503.patch, HADOOP-3333_1_20080505.patch, HADOOP-3333_2_20080506.patch
>
>
> We have a long-running job in its 2nd attempt. The previous job failed and the current job risks failing as well, because reduce tasks failing on marginal TaskTrackers are assigned repeatedly to the same TaskTrackers (probably because it is the only available slot), eventually running out of attempts.
> Reduce tasks should be assigned to the same TaskTracker at most twice, or TaskTrackers need to get some better smarts to detect failing hardware.
> BTW, mapred.reduce.max.attempts=12, which is high, but does not help in this case.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-3333) job failing because of reassigning same tasktracker to failing tasks

Posted by "Jothi Padmanabhan (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jothi Padmanabhan updated HADOOP-3333:
--------------------------------------

    Status: Open  (was: Patch Available)

> job failing because of reassigning same tasktracker to failing tasks
> --------------------------------------------------------------------
>
>                 Key: HADOOP-3333
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3333
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.16.3
>            Reporter: Christian Kunz
>            Assignee: Jothi Padmanabhan
>            Priority: Blocker
>             Fix For: 0.18.0
>
>         Attachments: hadoop-3333-v1.patch, hadoop-3333-v2.patch, hadoop-3333.patch, HADOOP-3333_0_20080503.patch, HADOOP-3333_1_20080505.patch, HADOOP-3333_2_20080506.patch
>
>
> We have a long-running job in its 2nd attempt. The previous job failed and the current job risks failing as well, because reduce tasks failing on marginal TaskTrackers are assigned repeatedly to the same TaskTrackers (probably because it is the only available slot), eventually running out of attempts.
> Reduce tasks should be assigned to the same TaskTracker at most twice, or TaskTrackers need to get some better smarts to detect failing hardware.
> BTW, mapred.reduce.max.attempts=12, which is high, but does not help in this case.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-3333) job failing because of reassigning same tasktracker to failing tasks

Posted by "Amar Kamat (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12594059#action_12594059 ] 

Amar Kamat commented on HADOOP-3333:
------------------------------------

Arun,
I think the ant tests depend on the fact that trackernames can have the same hostname but different ports. You should fix that too.

> job failing because of reassigning same tasktracker to failing tasks
> --------------------------------------------------------------------
>
>                 Key: HADOOP-3333
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3333
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.16.3
>            Reporter: Christian Kunz
>            Assignee: Arun C Murthy
>            Priority: Blocker
>             Fix For: 0.16.4
>
>         Attachments: HADOOP-3333_0_20080503.patch
>
>
> We have a long-running job in its 2nd attempt. The previous job failed and the current job risks failing as well, because reduce tasks failing on marginal TaskTrackers are assigned repeatedly to the same TaskTrackers (probably because it is the only available slot), eventually running out of attempts.
> Reduce tasks should be assigned to the same TaskTracker at most twice, or TaskTrackers need to get some better smarts to detect failing hardware.
> BTW, mapred.reduce.max.attempts=12, which is high, but does not help in this case.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-3333) job failing because of reassigning same tasktracker to failing tasks

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12594457#action_12594457 ] 

Hadoop QA commented on HADOOP-3333:
-----------------------------------

+1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12381469/HADOOP-3333_1_20080505.patch
  against trunk revision 653638.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 3 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2403/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2403/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2403/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2403/console

This message is automatically generated.

> job failing because of reassigning same tasktracker to failing tasks
> --------------------------------------------------------------------
>
>                 Key: HADOOP-3333
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3333
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.16.3
>            Reporter: Christian Kunz
>            Assignee: Arun C Murthy
>            Priority: Critical
>             Fix For: 0.18.0
>
>         Attachments: HADOOP-3333_0_20080503.patch, HADOOP-3333_1_20080505.patch
>
>
> We have a long-running job in a 2nd attempt. The previous job failed, and the current job risks failing as well, because reduce tasks failing on marginal TaskTrackers are assigned repeatedly to the same TaskTrackers (probably because those are the only available slots), eventually running out of attempts.
> Reduce tasks should be assigned to the same TaskTracker at most twice, or TaskTrackers need to get some better smarts to detect failing hardware.
> BTW, mapred.reduce.max.attempts=12, which is high, but does not help in this case.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-3333) job failing because of reassigning same tasktracker to failing tasks

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arun C Murthy updated HADOOP-3333:
----------------------------------

    Attachment: HADOOP-3333_1_20080505.patch

Updated patch: I had to fix JobTracker.ExpireTrackers.run to call JobTracker.lostTaskTracker first, before removing all knowledge of the tracker's existence in JobTracker.updateTaskTrackerStatus.
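
A minimal sketch of that ordering, with simplified stand-ins for the JobTracker state (the real methods take TaskTrackerStatus objects and more; names below are assumptions, not the actual code):

{code:java}
// Simplified sketch of the expiry ordering described above -- not the actual
// JobTracker implementation.
import java.util.HashMap;
import java.util.Map;

class ExpireTrackersSketch {
    // tracker name -> last reported status (value type simplified to Object here)
    private final Map<String, Object> taskTrackers = new HashMap<String, Object>();

    // called for a tracker that has not heartbeated within the expiry interval
    void expire(String trackerName) {
        // 1) first let the JobTracker fail and reschedule the tasks that were
        //    running on the lost tracker, while its state is still known
        lostTaskTracker(trackerName);
        // 2) only then drop the record of the tracker; doing this first would
        //    leave lostTaskTracker without the state it needs
        taskTrackers.remove(trackerName);
    }

    private void lostTaskTracker(String trackerName) {
        // mark the tracker's running tasks as failed so they can be re-run elsewhere
    }
}
{code}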

> job failing because of reassigning same tasktracker to failing tasks
> --------------------------------------------------------------------
>
>                 Key: HADOOP-3333
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3333
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.16.3
>            Reporter: Christian Kunz
>            Assignee: Arun C Murthy
>            Priority: Critical
>             Fix For: 0.18.0
>
>         Attachments: HADOOP-3333_0_20080503.patch, HADOOP-3333_1_20080505.patch
>
>
> We have a long-running job in a 2nd attempt. The previous job failed, and the current job risks failing as well, because reduce tasks failing on marginal TaskTrackers are assigned repeatedly to the same TaskTrackers (probably because those are the only available slots), eventually running out of attempts.
> Reduce tasks should be assigned to the same TaskTracker at most twice, or TaskTrackers need to get some better smarts to detect failing hardware.
> BTW, mapred.reduce.max.attempts=12, which is high, but does not help in this case.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-3333) job failing because of reassigning same tasktracker to failing tasks

Posted by "Jothi Padmanabhan (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jothi Padmanabhan updated HADOOP-3333:
--------------------------------------

    Status: Patch Available  (was: Open)

New patch with the following changes:
Fixed Amar's micro-review comment, corrected a couple of Javadoc comments, and modified the taskTrackers map to use a HashMap instead of a TreeMap for efficiency.
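
For the TreeMap-to-HashMap change, a hedged sketch of the idea (field name and value type are placeholders, not the actual JobTracker declarations): the map is keyed by tracker name and never iterated in sorted order, so a HashMap's expected O(1) lookups beat a TreeMap's O(log n) on every heartbeat.

{code:java}
// Illustrative only -- actual field name and value type in JobTracker may differ.
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

class TrackerMapChoice {
    // before: sorted map, O(log n) per lookup; the sorted order was never used
    Map<String, Object> taskTrackersTree = new TreeMap<String, Object>();

    // after: hash map, expected O(1) lookup/insert on every tracker heartbeat
    Map<String, Object> taskTrackersHash = new HashMap<String, Object>();
}
{code}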

> job failing because of reassigning same tasktracker to failing tasks
> --------------------------------------------------------------------
>
>                 Key: HADOOP-3333
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3333
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.16.3
>            Reporter: Christian Kunz
>            Assignee: Jothi Padmanabhan
>            Priority: Blocker
>             Fix For: 0.18.0
>
>         Attachments: hadoop-3333-v1.patch, hadoop-3333-v2.patch, hadoop-3333-v3.patch, hadoop-3333.patch, HADOOP-3333_0_20080503.patch, HADOOP-3333_1_20080505.patch, HADOOP-3333_2_20080506.patch
>
>
> We have a long-running job in a 2nd attempt. The previous job failed, and the current job risks failing as well, because reduce tasks failing on marginal TaskTrackers are assigned repeatedly to the same TaskTrackers (probably because those are the only available slots), eventually running out of attempts.
> Reduce tasks should be assigned to the same TaskTracker at most twice, or TaskTrackers need to get some better smarts to detect failing hardware.
> BTW, mapred.reduce.max.attempts=12, which is high, but does not help in this case.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-3333) job failing because of reassigning same tasktracker to failing tasks

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arun C Murthy updated HADOOP-3333:
----------------------------------

    Status: Open  (was: Patch Available)

Cancelling patch based on Amar's comments...

> job failing because of reassigning same tasktracker to failing tasks
> --------------------------------------------------------------------
>
>                 Key: HADOOP-3333
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3333
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.16.3
>            Reporter: Christian Kunz
>            Assignee: Arun C Murthy
>            Priority: Critical
>             Fix For: 0.18.0
>
>         Attachments: HADOOP-3333_0_20080503.patch, HADOOP-3333_1_20080505.patch, HADOOP-3333_2_20080506.patch
>
>
> We have a long-running job in a 2nd attempt. The previous job failed, and the current job risks failing as well, because reduce tasks failing on marginal TaskTrackers are assigned repeatedly to the same TaskTrackers (probably because those are the only available slots), eventually running out of attempts.
> Reduce tasks should be assigned to the same TaskTracker at most twice, or TaskTrackers need to get some better smarts to detect failing hardware.
> BTW, mapred.reduce.max.attempts=12, which is high, but does not help in this case.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-3333) job failing because of reassigning same tasktracker to failing tasks

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12605375#action_12605375 ] 

Hadoop QA commented on HADOOP-3333:
-----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12384074/hadoop-3333.patch
  against trunk revision 667706.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 3 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    -1 findbugs.  The patch appears to introduce 1 new Findbugs warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2664/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2664/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2664/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2664/console

This message is automatically generated.

> job failing because of reassigning same tasktracker to failing tasks
> --------------------------------------------------------------------
>
>                 Key: HADOOP-3333
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3333
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.16.3
>            Reporter: Christian Kunz
>            Assignee: Jothi Padmanabhan
>            Priority: Blocker
>             Fix For: 0.18.0
>
>         Attachments: hadoop-3333.patch, HADOOP-3333_0_20080503.patch, HADOOP-3333_1_20080505.patch, HADOOP-3333_2_20080506.patch
>
>
> We have a long-running job in a 2nd attempt. The previous job failed, and the current job risks failing as well, because reduce tasks failing on marginal TaskTrackers are assigned repeatedly to the same TaskTrackers (probably because those are the only available slots), eventually running out of attempts.
> Reduce tasks should be assigned to the same TaskTracker at most twice, or TaskTrackers need to get some better smarts to detect failing hardware.
> BTW, mapred.reduce.max.attempts=12, which is high, but does not help in this case.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-3333) job failing because of reassigning same tasktracker to failing tasks

Posted by "Jothi Padmanabhan (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jothi Padmanabhan updated HADOOP-3333:
--------------------------------------

    Status: Open  (was: Patch Available)

> job failing because of reassigning same tasktracker to failing tasks
> --------------------------------------------------------------------
>
>                 Key: HADOOP-3333
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3333
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.16.3
>            Reporter: Christian Kunz
>            Assignee: Jothi Padmanabhan
>            Priority: Blocker
>             Fix For: 0.18.0
>
>         Attachments: hadoop-3333-v1.patch, hadoop-3333-v2.patch, hadoop-3333-v3.patch, hadoop-3333.patch, HADOOP-3333_0_20080503.patch, HADOOP-3333_1_20080505.patch, HADOOP-3333_2_20080506.patch
>
>
> We have a long-running job in a 2nd attempt. The previous job failed, and the current job risks failing as well, because reduce tasks failing on marginal TaskTrackers are assigned repeatedly to the same TaskTrackers (probably because those are the only available slots), eventually running out of attempts.
> Reduce tasks should be assigned to the same TaskTracker at most twice, or TaskTrackers need to get some better smarts to detect failing hardware.
> BTW, mapred.reduce.max.attempts=12, which is high, but does not help in this case.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.