Posted to common-dev@hadoop.apache.org by "Christian Kunz (JIRA)" <ji...@apache.org> on 2008/09/29 17:03:44 UTC

[jira] Created: (HADOOP-4305) repeatedly blacklisted tasktrackers should get declared dead

repeatedly blacklisted tasktrackers should get declared dead
------------------------------------------------------------

                 Key: HADOOP-4305
                 URL: https://issues.apache.org/jira/browse/HADOOP-4305
             Project: Hadoop Core
          Issue Type: Improvement
          Components: mapred
            Reporter: Christian Kunz


When running a batch of jobs it often happens that the same tasktrackers are blacklisted again and again. This can slow job execution considerably, in particular, when tasks fail because of timeout.
It would make sense to no longer assign any tasks to such tasktrackers and to declare them dead.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4305) repeatedly blacklisted tasktrackers should get declared dead

Posted by "Christian Kunz (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653392#action_12653392 ] 

Christian Kunz commented on HADOOP-4305:
----------------------------------------

We applied the patch to hadoop-0.18 and verified that two repeatedly blacklisted TaskTrackers disappeared from the list of active TaskTrackers:

2008-12-04 02:46:04,251 INFO org.apache.hadoop.mapred.JobTracker: Adding tracker_xxx to the blackList
2008-12-04 08:51:09,957 INFO org.apache.hadoop.mapred.JobTracker: Adding tracker_yyy to the blackList

These two TaskTrackers indeed had disk problems, and the corresponding DataNodes were not active either.

Great Job!


> repeatedly blacklisted tasktrackers should get declared dead
> ------------------------------------------------------------
>
>                 Key: HADOOP-4305
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4305
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Christian Kunz
>            Assignee: Amareshwari Sriramadasu
>             Fix For: 0.20.0
>
>         Attachments: patch-4305-0.18.txt, patch-4305-1.txt, patch-4305-2.txt, patch-4305-3.txt
>
>
> When running a batch of jobs it often happens that the same tasktrackers are blacklisted again and again. This can slow job execution considerably, in particular, when tasks fail because of timeout.
> It would make sense to no longer assign any tasks to such tasktrackers and to declare them dead.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-4305) repeatedly blacklisted tasktrackers should get declared dead

Posted by "Robert Chansler (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Chansler updated HADOOP-4305:
------------------------------------

    Release Note: Improved the TaskTracker blacklisting strategy to better exclude faulty trackers from executing tasks.  (was: Improves the blacklisting strategy whereby blacklisted tasktrackers are not given tasks to run from other jobs, subject to the following conditions (all must be met): 
1) The TaskTracker has been blacklisted by at least 4 jobs (configurable via mapred.max.tasktracker.blacklists). 
2) The TaskTracker has been blacklisted 50% more often than the average. 
3) Less than 50% of the cluster's trackers are blacklisted. 
Once in 24 hours, a TaskTracker blacklisted for all jobs is given a chance. 
Restarting the TaskTracker moves it out of the blacklist.)

Edit release note for publication.


> repeatedly blacklisted tasktrackers should get declared dead
> ------------------------------------------------------------
>
>                 Key: HADOOP-4305
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4305
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Christian Kunz
>            Assignee: Amareshwari Sriramadasu
>             Fix For: 0.20.0
>
>         Attachments: patch-4305-0.18.txt, patch-4305-1.txt, patch-4305-2.txt, patch-4305-3.txt, patch-4305-4.txt
>
>
> When running a batch of jobs it often happens that the same tasktrackers are blacklisted again and again. This can slow job execution considerably, in particular, when tasks fail because of timeout.
> It would make sense to no longer assign any tasks to such tasktrackers and to declare them dead.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4305) repeatedly blacklisted tasktrackers should get declared dead

Posted by "Runping Qi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12644218#action_12644218 ] 

Runping Qi commented on HADOOP-4305:
------------------------------------


The number of slots per node is guesswork at best; it just means that in the normal case the TT can run that many tasks of typical jobs concurrently.
However, tasks of different jobs may need different resources. Thus, when that many resource-hungry tasks run concurrently, some tasks may fail.
However, the TT may work fine if fewer tasks run concurrently.
So in general, you should reduce the max allowed concurrent tasks when some tasks fail.
If the failing continues, that number may eventually reach zero. That is effectively equivalent to being blacklisted.

After a certain period of time without failures, that number should be incremented gradually.
This way, the TT will adapt to various load/resource situations autonomously.
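A minimal Java sketch of the adaptive limit described above; the class and all names are hypothetical, not TaskTracker code:

{noformat}
// Hypothetical sketch: shrink the allowed concurrency on each task
// failure, and grow it back gradually after a failure-free quiet period.
public class AdaptiveSlotLimit {
  private final int configuredMaxSlots;  // the static slots-per-node guess
  private int currentMaxSlots;
  private long lastFailureTime;
  private final long quietPeriodMs;      // assumed, e.g. one hour

  public AdaptiveSlotLimit(int configuredMaxSlots, long quietPeriodMs) {
    this.configuredMaxSlots = configuredMaxSlots;
    this.currentMaxSlots = configuredMaxSlots;
    this.quietPeriodMs = quietPeriodMs;
  }

  // On a task failure, reduce the limit; at zero the tracker asks for no
  // new tasks, which is effectively equivalent to being blacklisted.
  public synchronized void taskFailed(long now) {
    lastFailureTime = now;
    if (currentMaxSlots > 0) {
      currentMaxSlots--;
    }
  }

  // After a quiet period without failures, increment the limit gradually.
  public synchronized int allowedSlots(long now) {
    if (currentMaxSlots < configuredMaxSlots
        && now - lastFailureTime > quietPeriodMs) {
      currentMaxSlots++;
      lastFailureTime = now; // a new quiet period before the next bump
    }
    return currentMaxSlots;
  }
}
{noformat}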
 

> repeatedly blacklisted tasktrackers should get declared dead
> ------------------------------------------------------------
>
>                 Key: HADOOP-4305
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4305
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Christian Kunz
>            Assignee: Amareshwari Sriramadasu
>             Fix For: 0.20.0
>
>
> When running a batch of jobs it often happens that the same tasktrackers are blacklisted again and again. This can slow job execution considerably, in particular, when tasks fail because of timeout.
> It would make sense to no longer assign any tasks to such tasktrackers and to declare them dead.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4305) repeatedly blacklisted tasktrackers should get declared dead

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12647631#action_12647631 ] 

Hadoop QA commented on HADOOP-4305:
-----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12393865/patch-4305-1.txt
  against trunk revision 713893.

    +1 @author.  The patch does not contain any @author tags.

    -1 tests included.  The patch doesn't appear to include any new or modified tests.
                        Please justify why no tests are needed for this patch.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

    -1 core tests.  The patch failed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3592/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3592/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3592/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3592/console

This message is automatically generated.

> repeatedly blacklisted tasktrackers should get declared dead
> ------------------------------------------------------------
>
>                 Key: HADOOP-4305
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4305
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Christian Kunz
>            Assignee: Amareshwari Sriramadasu
>             Fix For: 0.20.0
>
>         Attachments: patch-4305-1.txt
>
>
> When running a batch of jobs it often happens that the same tasktrackers are blacklisted again and again. This can slow job execution considerably, in particular, when tasks fail because of timeout.
> It would make sense to no longer assign any tasks to such tasktrackers and to declare them dead.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4305) repeatedly blacklisted tasktrackers should get declared dead

Posted by "Amareshwari Sriramadasu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12641786#action_12641786 ] 

Amareshwari Sriramadasu commented on HADOOP-4305:
-------------------------------------------------

bq. Can we count this only from successful jobs?
 +1. 

bq. If we adopt a proposal that if a TT fails consecutive tasks from N different jobs, then the tasktracker is shutdown. Any successful execution of any task from any job resets the counter associated with this tasktracker.
This involves a lot of bookkeeping at the JobTracker, since a counter would have to be maintained for every tracker, and every task completion and failure would have to update the counters.



> repeatedly blacklisted tasktrackers should get declared dead
> ------------------------------------------------------------
>
>                 Key: HADOOP-4305
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4305
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Christian Kunz
>            Assignee: Amareshwari Sriramadasu
>             Fix For: 0.20.0
>
>
> When running a batch of jobs it often happens that the same tasktrackers are blacklisted again and again. This can slow job execution considerably, in particular, when tasks fail because of timeout.
> It would make sense to no longer assign any tasks to such tasktrackers and to declare them dead.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4305) repeatedly blacklisted tasktrackers should get declared dead

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12641794#action_12641794 ] 

Devaraj Das commented on HADOOP-4305:
-------------------------------------

I think Dhruba's suggestion makes sense. +1 for that. Vinod's suggestion seems to require users to know about this knob, which I think is not called for at this point.

> repeatedly blacklisted tasktrackers should get declared dead
> ------------------------------------------------------------
>
>                 Key: HADOOP-4305
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4305
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Christian Kunz
>            Assignee: Amareshwari Sriramadasu
>             Fix For: 0.20.0
>
>
> When running a batch of jobs it often happens that the same tasktrackers are blacklisted again and again. This can slow job execution considerably, in particular, when tasks fail because of timeout.
> It would make sense to no longer assign any tasks to such tasktrackers and to declare them dead.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-4305) repeatedly blacklisted tasktrackers should get declared dead

Posted by "Amareshwari Sriramadasu (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amareshwari Sriramadasu updated HADOOP-4305:
--------------------------------------------

    Attachment: patch-4305-1.txt

Here is a patch with the proposed fix.
The patch does the following (a rough sketch of the decision logic follows the list):
*  Adds the blacklisted trackers of the job to the potentially faulty list, in JobTracker.finalizeJob()
*  The tracker is moved from the potentially faulty list to the blacklisted trackers (across jobs) iff
   ** #blacklists exceeds mapred.max.tracker.blacklists (default value is 4),
   ** #blacklists is 50% above the average #blacklists over the active and potentially faulty trackers,
   ** 50% of the cluster is not blacklisted yet
* Restarting the tracker makes it an active tracker
* After a day, the tracker is given a chance again to run tasks
* Adds #blacklisted_trackers to ClusterStatus
* Updates the web UI to show the blacklisted trackers.
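A hedged Java sketch of that decision; the method and parameter names are illustrative, not the patch's actual code:

{noformat}
import java.util.Map;

class BlacklistPolicySketch {
  // All three conditions from the list above must hold for a tracker to
  // be blacklisted across jobs.
  static boolean shouldBlacklistAcrossJobs(
      String tracker,
      Map<String, Integer> blacklistsPerTracker, // #blacklists per tracker
      double averageBlacklists,  // over active + potentially faulty trackers
      int maxTrackerBlacklists,  // mapred.max.tracker.blacklists, default 4
      int blacklistedTrackers,
      int totalTrackers) {
    Integer c = blacklistsPerTracker.get(tracker);
    int count = (c == null) ? 0 : c;
    boolean overLimit = count > maxTrackerBlacklists;
    boolean aboveAverage = count > 1.5 * averageBlacklists;  // 50% above
    boolean clusterNotSaturated = blacklistedTrackers < 0.5 * totalTrackers;
    return overLimit && aboveAverage && clusterNotSaturated;
  }
}
{noformat}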


> repeatedly blacklisted tasktrackers should get declared dead
> ------------------------------------------------------------
>
>                 Key: HADOOP-4305
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4305
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Christian Kunz
>            Assignee: Amareshwari Sriramadasu
>             Fix For: 0.20.0
>
>         Attachments: patch-4305-1.txt
>
>
> When running a batch of jobs it often happens that the same tasktrackers are blacklisted again and again. This can slow job execution considerably, in particular, when tasks fail because of timeout.
> It would make sense to no longer assign any tasks to such tasktrackers and to declare them dead.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4305) repeatedly blacklisted tasktrackers should get declared dead

Posted by "Vinod K V (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12641714#action_12641714 ] 

Vinod K V commented on HADOOP-4305:
-----------------------------------

bq. Can we count this only from successful jobs?
Even that wouldn't be enough. Blacklisting of a TT by a successful job would only mean that this TT is not suitable for running this job. We can't generalize it to say that this TT is not fit for running any job. The latter can be concluded only by monitoring TT health, which should be done independently of job failures.

The proposal here doesn't seem to be the right fix. If we are concerned about batch jobs (similar jobs) and the same jobs being repeatedly submitted, we can address the issue by introducing the concept of a batch and linking batch jobs with something like a 'batch-id'. By default all jobs would belong to the default batch. We could then consider this batch-id when blacklisting TTs. Thoughts?

> repeatedly blacklisted tasktrackers should get declared dead
> ------------------------------------------------------------
>
>                 Key: HADOOP-4305
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4305
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Christian Kunz
>            Assignee: Amareshwari Sriramadasu
>             Fix For: 0.20.0
>
>
> When running a batch of jobs it often happens that the same tasktrackers are blacklisted again and again. This can slow job execution considerably, in particular, when tasks fail because of timeout.
> It would make sense to no longer assign any tasks to such tasktrackers and to declare them dead.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4305) repeatedly blacklisted tasktrackers should get declared dead

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12644209#action_12644209 ] 

Devaraj Das commented on HADOOP-4305:
-------------------------------------

What Owen proposed can be done in addition to what Amareshwari proposed. Specifically, disabling a TT based on the average number of failures helps since 
1) we prevent massive disablement, and
2) we penalize only those TTs that are performing badly relative to others in the cluster.
Makes sense?

> repeatedly blacklisted tasktrackers should get declared dead
> ------------------------------------------------------------
>
>                 Key: HADOOP-4305
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4305
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Christian Kunz
>            Assignee: Amareshwari Sriramadasu
>             Fix For: 0.20.0
>
>
> When running a batch of jobs it often happens that the same tasktrackers are blacklisted again and again. This can slow job execution considerably, in particular, when tasks fail because of timeout.
> It would make sense to no longer assign any tasks to such tasktrackers and to declare them dead.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4305) repeatedly blacklisted tasktrackers should get declared dead

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12652368#action_12652368 ] 

Devaraj Das commented on HADOOP-4305:
-------------------------------------

Some comments:
1. Format the if-condition brackets properly in the incrementFaults method.
2. You should be able to use the same data structure for both the potentiallyFaulty and blacklisted trackers.
3. Add a comment for mapred.cluster.average.blacklist.threshold saying that it is there solely for tuning purposes, and that once this feature has been tested on real clusters and an appropriate value for the threshold has been found, this config might be taken out.
4. Check whether you can remove the initialContact flag and use only the restarted flag in the heartbeat method. This is a more serious change but might be worthwhile in simplifying the state machine.

> repeatedly blacklisted tasktrackers should get declared dead
> ------------------------------------------------------------
>
>                 Key: HADOOP-4305
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4305
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Christian Kunz
>            Assignee: Amareshwari Sriramadasu
>             Fix For: 0.20.0
>
>         Attachments: patch-4305-0.18.txt, patch-4305-1.txt, patch-4305-2.txt
>
>
> When running a batch of jobs it often happens that the same tasktrackers are blacklisted again and again. This can slow job execution considerably, in particular, when tasks fail because of timeout.
> It would make sense to no longer assign any tasks to such tasktrackers and to declare them dead.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4305) repeatedly blacklisted tasktrackers should get declared dead

Posted by "dhruba borthakur (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12641520#action_12641520 ] 

dhruba borthakur commented on HADOOP-4305:
------------------------------------------

I vote for Option 1. If a TaskTracker is found faulty, shut it down. This is similar to the behaviour of a Datanode.

If an admin restarts a task-tracker, then it should probably join the cluster as a "healthy" task tracker.

> repeatedly blacklisted tasktrackers should get declared dead
> ------------------------------------------------------------
>
>                 Key: HADOOP-4305
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4305
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Christian Kunz
>            Assignee: Amareshwari Sriramadasu
>             Fix For: 0.20.0
>
>
> When running a batch of jobs it often happens that the same tasktrackers are blacklisted again and again. This can slow job execution considerably, in particular, when tasks fail because of timeout.
> It would make sense to no longer assign any tasks to such tasktrackers and to declare them dead.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4305) repeatedly blacklisted tasktrackers should get declared dead

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12650817#action_12650817 ] 

Hadoop QA commented on HADOOP-4305:
-----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12394532/patch-4305-2.txt
  against trunk revision 720632.

    +1 @author.  The patch does not contain any @author tags.

    -1 tests included.  The patch doesn't appear to include any new or modified tests.
                        Please justify why no tests are needed for this patch.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

    -1 core tests.  The patch failed core unit tests.

    -1 contrib tests.  The patch failed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3646/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3646/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3646/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3646/console

This message is automatically generated.

> repeatedly blacklisted tasktrackers should get declared dead
> ------------------------------------------------------------
>
>                 Key: HADOOP-4305
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4305
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Christian Kunz
>            Assignee: Amareshwari Sriramadasu
>             Fix For: 0.20.0
>
>         Attachments: patch-4305-0.18.txt, patch-4305-1.txt, patch-4305-2.txt
>
>
> When running a batch of jobs it often happens that the same tasktrackers are blacklisted again and again. This can slow job execution considerably, in particular, when tasks fail because of timeout.
> It would make sense to no longer assign any tasks to such tasktrackers and to declare them dead.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4305) repeatedly blacklisted tasktrackers should get declared dead

Posted by "Amareshwari Sriramadasu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12643944#action_12643944 ] 

Amareshwari Sriramadasu commented on HADOOP-4305:
-------------------------------------------------

The problem with doing it at the TaskTracker is that it does not know whether the application is buggy. So buggy code could bring down all the tasktrackers.

Another approach is:
* The JT blacklists TTs on a per-job basis, as it does today.
* When a successful job blacklists a tracker, we add the tracker to a potentially-faulty list. For each tracker, the number of jobs that blacklisted it (#blacklists) will be maintained.
* The tracker is blacklisted across all jobs if its #blacklists is X% above the average #blacklists over all the trackers.
For example, in a cluster with 100 trackers:
||Tracker||#BlackLists||
|TT1|10|
|TT2|10|
|TT3|10|
|TT4|7|
|TT5|5|
With X=25%, TT1, TT2 and TT3 will be blacklisted across all the jobs.
* We can reconsider the tracker after some time T, or when it restarts.
* No more than 50% of the trackers can get blacklisted on the cluster.

Thoughts?


> repeatedly blacklisted tasktrackers should get declared dead
> ------------------------------------------------------------
>
>                 Key: HADOOP-4305
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4305
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Christian Kunz
>            Assignee: Amareshwari Sriramadasu
>             Fix For: 0.20.0
>
>
> When running a batch of jobs it often happens that the same tasktrackers are blacklisted again and again. This can slow job execution considerably, in particular, when tasks fail because of timeout.
> It would make sense to no longer assign any tasks to such tasktrackers and to declare them dead.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4305) repeatedly blacklisted tasktrackers should get declared dead

Posted by "dhruba borthakur (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12641722#action_12641722 ] 

dhruba borthakur commented on HADOOP-4305:
------------------------------------------

Suppose we adopt a proposal that if a TT fails *consecutive* tasks from N different jobs, then the tasktracker is shut down. Any successful execution of any task from any job resets the counter associated with this tasktracker. Vinod's proposal seems to be a generalization of this approach.

Will this simple algorithm work?
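A minimal Java sketch of this simple algorithm, with hypothetical names (not actual JobTracker code); it approximates "consecutive" by resetting on any success:

{noformat}
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class ConsecutiveJobFailures {
  private final int maxFailedJobs; // N
  // Distinct jobs with task failures on each tracker since its last success.
  private final Map<String, Set<String>> failedJobs = new HashMap<>();

  ConsecutiveJobFailures(int maxFailedJobs) {
    this.maxFailedJobs = maxFailedJobs;
  }

  // Record a failure; returns true when the tracker should be shut down.
  synchronized boolean taskFailed(String tracker, String jobId) {
    Set<String> jobs = failedJobs.get(tracker);
    if (jobs == null) {
      jobs = new HashSet<>();
      failedJobs.put(tracker, jobs);
    }
    jobs.add(jobId);
    return jobs.size() >= maxFailedJobs;
  }

  // Any successful task from any job resets the tracker's counter.
  synchronized void taskSucceeded(String tracker) {
    failedJobs.remove(tracker);
  }
}
{noformat}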



> repeatedly blacklisted tasktrackers should get declared dead
> ------------------------------------------------------------
>
>                 Key: HADOOP-4305
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4305
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Christian Kunz
>            Assignee: Amareshwari Sriramadasu
>             Fix For: 0.20.0
>
>
> When running a batch of jobs it often happens that the same tasktrackers are blacklisted again and again. This can slow job execution considerably, in particular, when tasks fail because of timeout.
> It would make sense to no longer assign any tasks to such tasktrackers and to declare them dead.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4305) repeatedly blacklisted tasktrackers should get declared dead

Posted by "Runping Qi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12642807#action_12642807 ] 

Runping Qi commented on HADOOP-4305:
------------------------------------


I think the TaskTracker is in a better position to decide whether it can accept more tasks or not;
if not, whether and when to shut itself down; if yes, what kinds of tasks it can accept.

Many reasons may cause tasks to fail on a node.
The most common one is resource limits.
A task tracker may be OK running one task, but may fail when running more tasks simultaneously.
A task tracker may be OK running 4 concurrent tasks of one job, but may fail running 3 concurrent tasks of other jobs.
The task tracker should accumulate some stats and make decisions based on them.

A simple heuristic may run like this: when a task fails, the task tracker records the number of tasks running concurrently on the tracker at that time.
If multiple tasks fail at the same concurrency level, the TT should stop asking for new tasks until the concurrency level drops lower.
After running for a while without task failures, it can bump up the concurrency threshold again.
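A hedged Java sketch of this heuristic; the class and its bookkeeping are illustrative, not TaskTracker internals:

{noformat}
import java.util.HashMap;
import java.util.Map;

class ConcurrencyLevelHeuristic {
  // How many failures have been seen at each concurrency level.
  private final Map<Integer, Integer> failuresAtLevel = new HashMap<>();
  private int ceiling = Integer.MAX_VALUE; // current concurrency threshold
  private long lastFailureTime;

  // On a task failure, record the concurrency level at the time; repeated
  // failures at one level lower the ceiling below that level.
  synchronized void taskFailed(int runningTasks, long now) {
    Integer seen = failuresAtLevel.get(runningTasks);
    int failures = (seen == null) ? 1 : seen + 1;
    failuresAtLevel.put(runningTasks, failures);
    lastFailureTime = now;
    if (failures >= 2 && runningTasks <= ceiling) {
      ceiling = runningTasks - 1; // stop asking for tasks until below this
    }
  }

  // The TT asks for a new task only while under the ceiling.
  synchronized boolean mayRequestTask(int runningTasks) {
    return runningTasks < ceiling;
  }

  // After a failure-free stretch, bump the threshold up again.
  synchronized void maybeRaiseCeiling(long now, long quietMs) {
    if (ceiling != Integer.MAX_VALUE && now - lastFailureTime > quietMs) {
      ceiling++;
      lastFailureTime = now;
    }
  }
}
{noformat}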



> repeatedly blacklisted tasktrackers should get declared dead
> ------------------------------------------------------------
>
>                 Key: HADOOP-4305
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4305
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Christian Kunz
>            Assignee: Amareshwari Sriramadasu
>             Fix For: 0.20.0
>
>
> When running a batch of jobs it often happens that the same tasktrackers are blacklisted again and again. This can slow job execution considerably, in particular, when tasks fail because of timeout.
> It would make sense to no longer assign any tasks to such tasktrackers and to declare them dead.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-4305) repeatedly blacklisted tasktrackers should get declared dead

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Devaraj Das updated HADOOP-4305:
--------------------------------

    Fix Version/s: 0.20.0
         Assignee: Amareshwari Sriramadasu

> repeatedly blacklisted tasktrackers should get declared dead
> ------------------------------------------------------------
>
>                 Key: HADOOP-4305
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4305
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Christian Kunz
>            Assignee: Amareshwari Sriramadasu
>             Fix For: 0.20.0
>
>
> When running a batch of jobs it often happens that the same tasktrackers are blacklisted again and again. This can slow job execution considerably, in particular, when tasks fail because of timeout.
> It would make sense to no longer assign any tasks to such tasktrackers and to declare them dead.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4305) repeatedly blacklisted tasktrackers should get declared dead

Posted by "Runping Qi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12644213#action_12644213 ] 

Runping Qi commented on HADOOP-4305:
------------------------------------


I still think it is not enough to simply count the number of failed tasks.
We need to know under what conditions the task failed.
Consider two cases. In one, a task of a job failed when it was the only task running on a TT.
In the other, a task of a job failed while another 10 tasks were running concurrently.
Case one provides strong evidence that this TT may not be appropriate for that job's tasks.
Case two, however, is much weaker evidence: the task might have succeeded if there had been fewer concurrent tasks.


> repeatedly blacklisted tasktrackers should get declared dead
> ------------------------------------------------------------
>
>                 Key: HADOOP-4305
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4305
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Christian Kunz
>            Assignee: Amareshwari Sriramadasu
>             Fix For: 0.20.0
>
>
> When running a batch of jobs it often happens that the same tasktrackers are blacklisted again and again. This can slow job execution considerably, in particular, when tasks fail because of timeout.
> It would make sense to no longer assign any tasks to such tasktrackers and to declare them dead.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4305) repeatedly blacklisted tasktrackers should get declared dead

Posted by "Amareshwari Sriramadasu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12641340#action_12641340 ] 

Amareshwari Sriramadasu commented on HADOOP-4305:
-------------------------------------------------

Another question: does the restart of a task tracker say it is healthy? Or should we make the admin explicitly say that the task tracker is healthy?

bq. Irrespective of all this, Map/Reduce should have a utility similar to dfsadmin -refreshNodes, to add and delete trackers to the cluster at any time.
This could be a separate JIRA, if it is really required.

> repeatedly blacklisted tasktrackers should get declared dead
> ------------------------------------------------------------
>
>                 Key: HADOOP-4305
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4305
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Christian Kunz
>            Assignee: Amareshwari Sriramadasu
>             Fix For: 0.20.0
>
>
> When running a batch of jobs it often happens that the same tasktrackers are blacklisted again and again. This can slow job execution considerably, in particular, when tasks fail because of timeout.
> It would make sense to no longer assign any tasks to such tasktrackers and to declare them dead.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4305) repeatedly blacklisted tasktrackers should get declared dead

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12644049#action_12644049 ] 

Devaraj Das commented on HADOOP-4305:
-------------------------------------

Runping, yes, the TT can remember that. But the JT has overall knowledge of the cluster (TTs and jobs) and is in a better position to decide whether or not to give a certain task to a TT, right?

> repeatedly blacklisted tasktrackers should get declared dead
> ------------------------------------------------------------
>
>                 Key: HADOOP-4305
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4305
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Christian Kunz
>            Assignee: Amareshwari Sriramadasu
>             Fix For: 0.20.0
>
>
> When running a batch of jobs it often happens that the same tasktrackers are blacklisted again and again. This can slow job execution considerably, in particular, when tasks fail because of timeout.
> It would make sense to no longer assign any tasks to such tasktrackers and to declare them dead.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4305) repeatedly blacklisted tasktrackers should get declared dead

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12644217#action_12644217 ] 

Devaraj Das commented on HADOOP-4305:
-------------------------------------

Runping, I think taking the average blacklist count on a per-tracker basis, and penalizing only those TTs way above the average, should help even in this scenario. For example, if a TT is really faulty, its blacklist count should be way above the average blacklist count per tracker, and it would be penalized. The other case is where only certain tasks fail due to resource limitations and the TT gets blacklisted through no fault of its own; but IMO in a practical setup this problem would affect many other TTs as well, and hence the average blacklist count would be a bit higher... Makes sense?

> repeatedly blacklisted tasktrackers should get declared dead
> ------------------------------------------------------------
>
>                 Key: HADOOP-4305
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4305
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Christian Kunz
>            Assignee: Amareshwari Sriramadasu
>             Fix For: 0.20.0
>
>
> When running a batch of jobs it often happens that the same tasktrackers are blacklisted again and again. This can slow job execution considerably, in particular, when tasks fail because of timeout.
> It would make sense to no longer assign any tasks to such tasktrackers and to declare them dead.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4305) repeatedly blacklisted tasktrackers should get declared dead

Posted by "dhruba borthakur (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12635623#action_12635623 ] 

dhruba borthakur commented on HADOOP-4305:
------------------------------------------

+1. I have seen that scenario occur a lot in our cluster. Some details in HADOOP-2676.

> repeatedly blacklisted tasktrackers should get declared dead
> ------------------------------------------------------------
>
>                 Key: HADOOP-4305
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4305
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Christian Kunz
>
> When running a batch of jobs it often happens that the same tasktrackers are blacklisted again and again. This can slow job execution considerably, in particular, when tasks fail because of timeout.
> It would make sense to no longer assign any tasks to such tasktrackers and to declare them dead.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-4305) repeatedly blacklisted tasktrackers should get declared dead

Posted by "Amareshwari Sriramadasu (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amareshwari Sriramadasu updated HADOOP-4305:
--------------------------------------------

    Attachment: patch-4305-4.txt

Added a JUnit testcase to test the blacklisting strategy.

test-patch result:
{noformat}
     [exec] +1 overall.
     [exec]
     [exec]     +1 @author.  The patch does not contain any @author tags.
     [exec]
     [exec]     +1 tests included.  The patch appears to include 7 new or modified tests.
     [exec]
     [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
     [exec]
     [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
     [exec]
     [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.
     [exec]
     [exec]     +1 Eclipse classpath. The patch retains Eclipse classpath integrity.
{noformat}

> repeatedly blacklisted tasktrackers should get declared dead
> ------------------------------------------------------------
>
>                 Key: HADOOP-4305
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4305
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Christian Kunz
>            Assignee: Amareshwari Sriramadasu
>             Fix For: 0.20.0
>
>         Attachments: patch-4305-0.18.txt, patch-4305-1.txt, patch-4305-2.txt, patch-4305-3.txt, patch-4305-4.txt
>
>
> When running a batch of jobs it often happens that the same tasktrackers are blacklisted again and again. This can slow job execution considerably, in particular, when tasks fail because of timeout.
> It would make sense to no longer assign any tasks to such tasktrackers and to declare them dead.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4305) repeatedly blacklisted tasktrackers should get declared dead

Posted by "Sharad Agarwal (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653628#action_12653628 ] 

Sharad Agarwal commented on HADOOP-4305:
----------------------------------------

+1 for the heartbeat-related changes in the JT and TT.

> repeatedly blacklisted tasktrackers should get declared dead
> ------------------------------------------------------------
>
>                 Key: HADOOP-4305
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4305
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Christian Kunz
>            Assignee: Amareshwari Sriramadasu
>             Fix For: 0.20.0
>
>         Attachments: patch-4305-0.18.txt, patch-4305-1.txt, patch-4305-2.txt, patch-4305-3.txt
>
>
> When running a batch of jobs it often happens that the same tasktrackers are blacklisted again and again. This can slow job execution considerably, in particular, when tasks fail because of timeout.
> It would make sense to no longer assign any tasks to such tasktrackers and to declare them dead.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4305) repeatedly blacklisted tasktrackers should get declared dead

Posted by "Steve Loughran (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12644263#action_12644263 ] 

Steve Loughran commented on HADOOP-4305:
----------------------------------------

I'd be happiest if there were some way of reporting this to a policy component that made the right decision, because the action you take on a managed-VM cluster is different from Hadoop on physical hardware. On physical hardware, you blacklist and maybe trigger a reboot, or you start running well-known health tasks to see which parts of the system appear healthy. On a VM cluster you just delete that node and create a new one - no need to faff around with the state of the VM if it is a task-only VM; if it's also a datanode you have to decommission it first.

> repeatedly blacklisted tasktrackers should get declared dead
> ------------------------------------------------------------
>
>                 Key: HADOOP-4305
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4305
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Christian Kunz
>            Assignee: Amareshwari Sriramadasu
>             Fix For: 0.20.0
>
>
> When running a batch of jobs it often happens that the same tasktrackers are blacklisted again and again. This can slow job execution considerably, in particular, when tasks fail because of timeout.
> It would make sense to no longer assign any tasks to such tasktrackers and to declare them dead.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4305) repeatedly blacklisted tasktrackers should get declared dead

Posted by "Amareshwari Sriramadasu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12644215#action_12644215 ] 

Amareshwari Sriramadasu commented on HADOOP-4305:
-------------------------------------------------

If the TaskTracker is configured to run 10 tasks at a time, and a task fails because too many tasks are running, then the problem is with the TaskTracker (its configuration). Right?

> repeatedly blacklisted tasktrackers should get declared dead
> ------------------------------------------------------------
>
>                 Key: HADOOP-4305
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4305
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Christian Kunz
>            Assignee: Amareshwari Sriramadasu
>             Fix For: 0.20.0
>
>
> When running a batch of jobs it often happens that the same tasktrackers are blacklisted again and again. This can slow job execution considerably, in particular, when tasks fail because of timeout.
> It would make sense to no longer assign any tasks to such tasktrackers and to declare them dead.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-4305) repeatedly blacklisted tasktrackers should get declared dead

Posted by "Amareshwari Sriramadasu (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amareshwari Sriramadasu updated HADOOP-4305:
--------------------------------------------

    Status: Open  (was: Patch Available)

> repeatedly blacklisted tasktrackers should get declared dead
> ------------------------------------------------------------
>
>                 Key: HADOOP-4305
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4305
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Christian Kunz
>            Assignee: Amareshwari Sriramadasu
>             Fix For: 0.20.0
>
>         Attachments: patch-4305-0.18.txt, patch-4305-1.txt, patch-4305-2.txt
>
>
> When running a batch of jobs it often happens that the same tasktrackers are blacklisted again and again. This can slow job execution considerably, in particular, when tasks fail because of timeout.
> It would make sense to no longer assign any tasks to such tasktrackers and to declare them dead.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-4305) repeatedly blacklisted tasktrackers should get declared dead

Posted by "Amareshwari Sriramadasu (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amareshwari Sriramadasu updated HADOOP-4305:
--------------------------------------------

    Release Note: 
Improves the blacklisting strategy whereby blacklisted tasktrackers are not given tasks to run from other jobs, subject to the following conditions (all must be met): 
1) The TaskTracker has been blacklisted by at least 4 jobs (configurable via mapred.max.tasktracker.blacklists). 
2) The TaskTracker has been blacklisted 50% more often than the average. 
3) Less than 50% of the cluster's trackers are blacklisted. 
Once in 24 hours, a TaskTracker blacklisted for all jobs is given a chance. 
Restarting the TaskTracker moves it out of the blacklist.

  was:Added the configuration property "mapred.max.tasktracker.blacklists" to specify the number of times a task tracker can be blacklisted by various jobs, after which the task tracker is blacklisted across all jobs; defaults to 4.


> repeatedly blacklisted tasktrackers should get declared dead
> ------------------------------------------------------------
>
>                 Key: HADOOP-4305
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4305
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Christian Kunz
>            Assignee: Amareshwari Sriramadasu
>             Fix For: 0.20.0
>
>         Attachments: patch-4305-0.18.txt, patch-4305-1.txt, patch-4305-2.txt, patch-4305-3.txt, patch-4305-4.txt
>
>
> When running a batch of jobs it often happens that the same tasktrackers are blacklisted again and again. This can slow job execution considerably, in particular, when tasks fail because of timeout.
> It would make sense to no longer assign any tasks to such tasktrackers and to declare them dead.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4305) repeatedly blacklisted tasktrackers should get declared dead

Posted by "Amareshwari Sriramadasu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12641330#action_12641330 ] 

Amareshwari Sriramadasu commented on HADOOP-4305:
-------------------------------------------------

I propose the following for declaring tasktrackers dead:
    * When to declare a task tracker dead?
       If the number of times the task tracker got blacklisted is greater than or equal to *mapred.max.tasktracker.blacklists*, the job tracker declares the task tracker dead.
    * What to do with the dead task tracker?
         ** Option 1:
           *** Send a DisallowedTaskTrackerException to the task tracker in the heartbeat; the task tracker then shuts down.
           *** Kill the tasks running on the dead tracker and re-execute them (similar behavior to a lost task tracker).
        ** Option 2:
           *** Honor the tasks already running on the dead tracker, but do not schedule any more jobs or tasks on it (same behavior as a blacklisted tracker, but across all jobs).
    * How to bring back the dead tracker?
         ** With Option 1, restarting the task tracker adds it back to the cluster.
         ** With Option 2, there should be an admin command that moves a tracker from the dead list to the live list.

I think we should go with *Option 1*, because it actually makes the tracker die and come back up (see the sketch below). Thoughts?

Irrespective of all this, Map/Reduce should have a utility similar to *dfsadmin -refreshNodes*, to add and delete trackers to the cluster at any time.
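A minimal Java sketch of Option 1, with hypothetical names (the real JobTracker heartbeat API and DisallowedTaskTrackerException differ in detail):

{noformat}
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

class DeadTrackerPolicy {
  private final int maxBlacklists; // mapred.max.tasktracker.blacklists
  private final Map<String, Integer> blacklistCount = new HashMap<>();

  DeadTrackerPolicy(int maxBlacklists) {
    this.maxBlacklists = maxBlacklists;
  }

  // Called each time a job blacklists this tracker.
  synchronized void blacklistedByJob(String tracker) {
    Integer c = blacklistCount.get(tracker);
    blacklistCount.put(tracker, (c == null) ? 1 : c + 1);
  }

  // Called at the start of heartbeat processing. Throwing here plays the
  // role of the DisallowedTaskTrackerException: the tracker shuts down,
  // and its tasks are re-executed as with a lost task tracker.
  synchronized void checkAllowed(String tracker) throws IOException {
    Integer c = blacklistCount.get(tracker);
    if (c != null && c >= maxBlacklists) {
      throw new IOException("Tracker " + tracker
          + " is declared dead: blacklisted " + c + " times");
    }
  }

  // Restarting the tracker adds it back to the cluster with a clean slate.
  synchronized void trackerRestarted(String tracker) {
    blacklistCount.remove(tracker);
  }
}
{noformat}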
    

> repeatedly blacklisted tasktrackers should get declared dead
> ------------------------------------------------------------
>
>                 Key: HADOOP-4305
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4305
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Christian Kunz
>            Assignee: Amareshwari Sriramadasu
>             Fix For: 0.20.0
>
>
> When running a batch of jobs it often happens that the same tasktrackers are blacklisted again and again. This can slow job execution considerably, in particular, when tasks fail because of timeout.
> It would make sense to no longer assign any tasks to such tasktrackers and to declare them dead.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4305) repeatedly blacklisted tasktrackers should get declared dead

Posted by "Christian Kunz (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12644324#action_12644324 ] 

Christian Kunz commented on HADOOP-4305:
----------------------------------------

I would be happy with Amareshwari's proposal together with Owen's suggestion. I believe this would help us a lot in our environment.
Counting blacklisted TaskTrackers only from successful jobs seems necessary to avoid false positives due to application issues (although we just had a case where enough bad-behaving TaskTrackers generated enough individual task failures to fail jobs repeatedly). Solutions that cover 80% of issues with 20% of the development effort are better than perfect solutions that are never implemented.


> repeatedly blacklisted tasktrackers should get declared dead
> ------------------------------------------------------------
>
>                 Key: HADOOP-4305
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4305
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Christian Kunz
>            Assignee: Amareshwari Sriramadasu
>             Fix For: 0.20.0
>
>
> When running a batch of jobs it often happens that the same tasktrackers are blacklisted again and again. This can slow job execution considerably, in particular, when tasks fail because of timeout.
> It would make sense to no longer assign any tasks to such tasktrackers and to declare them dead.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-4305) repeatedly blacklisted tasktrackers should get declared dead

Posted by "Amareshwari Sriramadasu (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amareshwari Sriramadasu updated HADOOP-4305:
--------------------------------------------

    Status: Patch Available  (was: Open)

> repeatedly blacklisted tasktrackers should get declared dead
> ------------------------------------------------------------
>
>                 Key: HADOOP-4305
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4305
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Christian Kunz
>            Assignee: Amareshwari Sriramadasu
>             Fix For: 0.20.0
>
>         Attachments: patch-4305-0.18.txt, patch-4305-1.txt, patch-4305-2.txt
>
>
> When running a batch of jobs it often happens that the same tasktrackers are blacklisted again and again. This can slow job execution considerably, in particular, when tasks fail because of timeout.
> It would make sense to no longer assign any tasks to such tasktrackers and to declare them dead.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-4305) repeatedly blacklisted tasktrackers should get declared dead

Posted by "Amareshwari Sriramadasu (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amareshwari Sriramadasu updated HADOOP-4305:
--------------------------------------------

    Attachment: patch-4305-3.txt

Attaching a patch incorporating the review comments.



[jira] Commented: (HADOOP-4305) repeatedly blacklisted tasktrackers should get declared dead

Posted by "dhruba borthakur (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12644417#action_12644417 ] 

dhruba borthakur commented on HADOOP-4305:
------------------------------------------

I like Amareshwari's proposal because it is simple. Owen's extension adds an aging factor to the counter, and Runping's proposal can be folded into Amareshwari's as well: reflect the state of the TT (how many jobs it was running simultaneously) by incrementing the blacklist counter with an appropriate weight.



[jira] Commented: (HADOOP-4305) repeatedly blacklisted tasktrackers should get declared dead

Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12644151#action_12644151 ] 

Owen O'Malley commented on HADOOP-4305:
---------------------------------------

One thing that concerns me is that there needs to be a policy for degrading the information over time. How about a policy where we define the maximum number of blacklists (from successful jobs) a tracker can be on and still get new tasks? Furthermore, once each day the counters are decremented by 1. (To avoid a massive re-enablement, I'd probably use hostname.hashCode() % 24 as the hour to decrement the count for each host.) So if you set the threshold to 4, it would give you:
  1. Any TT blacklisted by 4 (or more) jobs would not get new tasks.
  2. Each day, the JT would forgive one blacklist and likely re-enable the TT.
  3. A re-enabled TT would only get one chance for that day.

Thoughts?
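
To make that concrete, a minimal sketch in Java (class, method, and constant names are illustrative only, not from any patch):
{noformat}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the decay policy: one blacklist per host is forgiven
// each day, at an hour derived from the hostname's hash, so trackers come back
// gradually instead of all being re-enabled at once.
public class BlacklistDecay {
  private static final int MAX_BLACKLISTS = 4; // the threshold from the comment
  private final Map<String, Integer> counts = new ConcurrentHashMap<String, Integer>();

  /** A tracker keeps getting new tasks only while its count is below the threshold. */
  public boolean canGetNewTasks(String hostname) {
    Integer c = counts.get(hostname);
    return c == null || c < MAX_BLACKLISTS;
  }

  /** Record one more blacklisting (by a successful job) for this host. */
  public void recordBlacklisting(String hostname) {
    Integer c = counts.get(hostname);
    counts.put(hostname, c == null ? 1 : c + 1);
  }

  /** Called once per hour; forgives one blacklist for hosts whose hour this is. */
  public void hourlyTick(int currentHourOfDay) {
    for (Map.Entry<String, Integer> e : counts.entrySet()) {
      // Math.abs guards against negative hashCode values.
      if (Math.abs(e.getKey().hashCode()) % 24 == currentHourOfDay && e.getValue() > 0) {
        counts.put(e.getKey(), e.getValue() - 1);
      }
    }
  }
}
{noformat}
With a threshold of 4, a host blacklisted by four successful jobs stops receiving tasks until the daily tick brings it back under the limit.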



[jira] Issue Comment Edited: (HADOOP-4305) repeatedly blacklisted tasktrackers should get declared dead

Posted by "Amareshwari Sriramadasu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12641330#action_12641330 ] 

amareshwari edited comment on HADOOP-4305 at 10/21/08 3:26 AM:
---------------------------------------------------------------------------

I propose the following for declaring tasktrackers dead:
    * When to declare a task tracker dead? :
       If the number of times the task tracker got blacklisted is greater than or equal to *mapred.max.tasktracker.blacklists*, the job tracker declares the task tracker dead.
    * What to do with the dead task tracker? :
         ** Option 1:
           *** Send DisallowedTaskTrackerException to the task tracker in the heartbeat; the task tracker then shuts down.
           *** Kill the tasks running on the dead tracker and re-execute them (similar behavior to a lost task tracker).
        ** Option 2:
           *** Honor the tasks already running on the dead tracker, but do not schedule any more jobs or tasks on it (same behavior as a blacklisted tracker, but across all jobs).
    * How to bring back the dead tracker? :
         ** With Option 1, restarting the task tracker adds it back to the cluster.
         ** With Option 2, there should be an admin command that removes the tracker from the deadList.

I think we should go with *Option 1*, because it actually makes the tracker die and come back up (a rough sketch follows after this comment). Thoughts?

Irrespective of all this, Map/Reduce should have a utility similar to *dfsadmin -refreshNodes* to add and delete trackers from the cluster at any time.
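
A rough sketch of Option 1 (illustrative names only; the exception stand-in below mirrors the one mentioned above, but this is not a working patch):
{noformat}
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of Option 1: once a tracker's blacklist count reaches the
// configured limit, its heartbeat is refused, which makes the TaskTracker shut
// down; its running tasks are re-executed elsewhere, as for a lost tracker.
public class DeadTrackerPolicy {
  /** Stand-in for the exception sent back in the heartbeat response. */
  public static class DisallowedTaskTrackerException extends Exception {
    public DisallowedTaskTrackerException(String tracker) {
      super("Tracker " + tracker + " has been declared dead");
    }
  }

  private final int maxBlacklists; // mapred.max.tasktracker.blacklists
  private final Map<String, Integer> blacklistCounts = new HashMap<String, Integer>();

  public DeadTrackerPolicy(int maxBlacklists) {
    this.maxBlacklists = maxBlacklists;
  }

  /** Invoked on each heartbeat before any tasks are handed out. */
  public void checkAlive(String trackerName) throws DisallowedTaskTrackerException {
    Integer count = blacklistCounts.get(trackerName);
    if (count != null && count >= maxBlacklists) {
      throw new DisallowedTaskTrackerException(trackerName);
    }
  }
}
{noformat}
Under this option a restarted TaskTracker simply re-registers and is admitted back, which is why no separate admin command is needed.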
    



[jira] Updated: (HADOOP-4305) repeatedly blacklisted tasktrackers should get declared dead

Posted by "Amareshwari Sriramadasu (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amareshwari Sriramadasu updated HADOOP-4305:
--------------------------------------------

    Attachment: patch-4305-0.18.txt

Christian, here is a patch for 0.18. Can you validate the patch on your cluster?



[jira] Updated: (HADOOP-4305) repeatedly blacklisted tasktrackers should get declared dead

Posted by "Amareshwari Sriramadasu (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amareshwari Sriramadasu updated HADOOP-4305:
--------------------------------------------

    Status: Open  (was: Patch Available)



[jira] Commented: (HADOOP-4305) repeatedly blacklisted tasktrackers should get declared dead

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12649594#action_12649594 ] 

Devaraj Das commented on HADOOP-4305:
-------------------------------------

1. Change convertTrackerNameToHostName not to include "tracker_", and remove the extra list introduced in JobInProgress.
2. Faults in PotentiallyFaultyList should also get decremented if there are no faults on that tracker for 24 hours (a rough sketch follows below).
3. Make AVERAGE_BLACKLST_THRESHOLD configurable, but do not expose it externally (this gives us a way to tune it until we reach the correct number).
4. Put more comments in the code explaining the algorithm.
5. Change the name of addFaultyTracker to incrementFaults.
6. ReinitAction on a lost task tracker should not erase its faults.
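
For item 2, something along these lines (a hedged sketch; the class and fields are made up for illustration, and the "one fault forgiven per fault-free day" semantics is an assumption):
{noformat}
import java.util.HashMap;
import java.util.Map;

// Hypothetical bookkeeping for review item 2: whenever a new fault arrives,
// first forgive one fault for every full fault-free day since the last one.
public class FaultDecay {
  private static final long DAY_MILLIS = 24L * 60 * 60 * 1000;

  private static class FaultInfo { // illustrative, not the patch's data structure
    int numFaults;
    long lastFaultTime;
  }

  private final Map<String, FaultInfo> potentiallyFaulty = new HashMap<String, FaultInfo>();

  /** Named per review item 5: increments, after applying any idle-time decay. */
  public void incrementFaults(String hostname, long now) {
    FaultInfo fi = potentiallyFaulty.get(hostname);
    if (fi == null) {
      fi = new FaultInfo();
      fi.lastFaultTime = now;
      potentiallyFaulty.put(hostname, fi);
    }
    long idleDays = (now - fi.lastFaultTime) / DAY_MILLIS;
    fi.numFaults = (int) Math.max(0, fi.numFaults - idleDays);
    fi.numFaults++;
    fi.lastFaultTime = now;
  }
}
{noformat}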




[jira] Commented: (HADOOP-4305) repeatedly blacklisted tasktrackers should get declared dead

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12644225#action_12644225 ] 

Devaraj Das commented on HADOOP-4305:
-------------------------------------

Runping, the case we need to consider is faulty tasks (as Koji pointed out here - https://issues.apache.org/jira/browse/HADOOP-4305?focusedCommentId=12641556#action_12641556). A tasktracker should not stop asking for tasks if the failures were due to task faults (e.g. buggy user code). That is why the proposal is to increment the blacklist count for a TT only for successful jobs: if a job fails, the count is not incremented even if the TT got blacklisted for that job.
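
A minimal sketch of that rule (names and the counter interface are assumptions for illustration only):
{noformat}
import java.util.Set;

// Sketch: per-tracker blacklist counts move only when a job finishes
// successfully, so a buggy job that fails everywhere charges no tracker.
public class SuccessOnlyCounting {
  public interface FaultCounter {
    void incrementFaults(String tracker);
  }

  public void jobCompleted(boolean jobSucceeded,
                           Set<String> trackersBlacklistedByJob,
                           FaultCounter counter) {
    if (!jobSucceeded) {
      return; // failed job: likely bad user code, not a bad tracker
    }
    for (String tracker : trackersBlacklistedByJob) {
      counter.incrementFaults(tracker);
    }
  }
}
{noformat}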




[jira] Updated: (HADOOP-4305) repeatedly blacklisted tasktrackers should get declared dead

Posted by "Amareshwari Sriramadasu (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amareshwari Sriramadasu updated HADOOP-4305:
--------------------------------------------

    Status: Patch Available  (was: Open)



[jira] Commented: (HADOOP-4305) repeatedly blacklisted tasktrackers should get declared dead

Posted by "Koji Noguchi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12641556#action_12641556 ] 

Koji Noguchi commented on HADOOP-4305:
--------------------------------------

I'm afraid of what would happen when an application has a bug and is submitted many times to the cluster.
Could such a job blacklist the healthy nodes and eventually take the TaskTrackers down?

bq. If the number of times the task tracker got blacklisted is greater than or equal to mapred.max.tasktracker.blacklists, then the job tracker declares the task tracker as dead.

Can we count this *only* from successful jobs?




[jira] Commented: (HADOOP-4305) repeatedly blacklisted tasktrackers should get declared dead

Posted by "Amareshwari Sriramadasu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12645718#action_12645718 ] 

Amareshwari Sriramadasu commented on HADOOP-4305:
-------------------------------------------------

bq. Runping's proposal can be encompassed into Amareshwari's proposal too.....reflect the state of the TT (how many jobs was it running simultaneously) by incrementing the blacklist counter with an appropriate weight.

I think you meant how many tasks the tracker was running simultaneously at the time of failure. But in the steady state all the slots of a tracker are occupied, so the blacklist weight would be the same for all trackers.

bq. The tracker is blacklisted across all jobs if #blacklists is X% above the average #blacklists, over all the trackers.

Here, the average value may be very skewed, since very few trackers would be faulty (in my previous example, X would have to be 2500%). To avoid this skew, the tracker can be blacklisted across all jobs only if
1. #blacklists is greater than *mapred.max.tasktracker.blacklists*, and
2. #blacklists is 50% above the average #blacklists over all the trackers.
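
A rough sketch to make the two conditions concrete (hypothetical code, not the patch):
{noformat}
// Sketch: a tracker is blacklisted across all jobs only if its count is over
// the absolute limit AND at least 50% above the cluster-wide average, so a
// skewed average alone can neither condemn nor excuse a tracker.
public class CrossJobBlacklistCheck {
  public static boolean shouldBlacklist(int trackerBlacklists,
                                        double avgBlacklists,
                                        int maxTrackerBlacklists) {
    return trackerBlacklists > maxTrackerBlacklists
        && trackerBlacklists > 1.5 * avgBlacklists;
  }

  public static void main(String[] args) {
    // With 100 trackers and one faulty tracker holding most of the blacklists,
    // the average stays tiny, so the absolute threshold does the real work.
    System.out.println(shouldBlacklist(50, 0.5, 4)); // true
    System.out.println(shouldBlacklist(3, 0.5, 4));  // false: under the limit
  }
}
{noformat}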





[jira] Commented: (HADOOP-4305) repeatedly blacklisted tasktrackers should get declared dead

Posted by "dhruba borthakur (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12644053#action_12644053 ] 

dhruba borthakur commented on HADOOP-4305:
------------------------------------------

Amareshwari's proposal looks good to me. We should make the TT come back to life after T seconds; perhaps we should issue a re-initialize command to the TT after T seconds and remove it from the blacklist.



[jira] Commented: (HADOOP-4305) repeatedly blacklisted tasktrackers should get declared dead

Posted by "Amareshwari Sriramadasu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12650853#action_12650853 ] 

Amareshwari Sriramadasu commented on HADOOP-4305:
-------------------------------------------------

The only test failure, org.apache.hadoop.hdfsproxy.TestHdfsProxy.testHdfsProxyInterface, is not related to the patch. All the core and contrib tests passed on my machine.



[jira] Commented: (HADOOP-4305) repeatedly blacklisted tasktrackers should get declared dead

Posted by "Runping Qi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12644034#action_12644034 ] 

Runping Qi commented on HADOOP-4305:
------------------------------------


Certainly the TT can remember the jobs with failed tasks and decide whether to refuse tasks of those jobs only or to refuse tasks of all jobs.
 



[jira] Updated: (HADOOP-4305) repeatedly blacklisted tasktrackers should get declared dead

Posted by "Amareshwari Sriramadasu (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amareshwari Sriramadasu updated HADOOP-4305:
--------------------------------------------

    Release Note: Added the configuration property "mapred.max.tasktracker.blacklists" to specify the number of times a task tracker can be blacklisted by various jobs, after which it can be blacklisted across all jobs; defaults to 4.
          Status: Patch Available  (was: Open)

test-patch result on trunk:
{noformat}
     [exec]
     [exec] -1 overall.
     [exec]
     [exec]     +1 @author.  The patch does not contain any @author tags.
     [exec]
     [exec]     -1 tests included.  The patch doesn't appear to include any new or modified tests.
     [exec]                         Please justify why no tests are needed for this patch.
     [exec]
     [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
     [exec]
     [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
     [exec]
     [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.
     [exec]
     [exec]     +1 Eclipse classpath. The patch retains Eclipse classpath integrity.
     [exec]
{noformat}
All core and contrib tests passed on my machine.

It is very difficult to write a junit test for this, so I did manual tests.
Manual tests include:
#  Verified that a tracker gets blacklisted across all jobs iff
  ** #blacklists exceeds mapred.max.tracker.blacklists,
  ** #blacklists is 50% above the average #blacklists over the active and potentially faulty trackers, and
  ** 50% of the cluster is not blacklisted yet (a sketch of this cluster cap follows below).
#  Verified that restarting a tracker makes it a healthy tracker again; verified restart on a healthy tracker, a blacklisted tracker, and a lost tracker.
# Verified that a lost task tracker that bounces back accepts tasks again.
# Verified that a tracker which was blacklisted, then lost, and then comes back is still not healthy.
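
A tiny sketch of the cluster cap in the first test (assumed semantics: a new cross-job blacklisting is allowed only while fewer than half the trackers are already blacklisted):
{noformat}
// Illustrative guard: never let cross-job blacklisting take out half the cluster.
public class ClusterCapGuard {
  public static boolean mayBlacklist(int currentlyBlacklisted, int clusterSize) {
    return currentlyBlacklisted * 2 < clusterSize;
  }

  public static void main(String[] args) {
    System.out.println(mayBlacklist(4, 10)); // true: only 40% is blacklisted
    System.out.println(mayBlacklist(5, 10)); // false: half the cluster is out
  }
}
{noformat}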





[jira] Updated: (HADOOP-4305) repeatedly blacklisted tasktrackers should get declared dead

Posted by "Amareshwari Sriramadasu (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amareshwari Sriramadasu updated HADOOP-4305:
--------------------------------------------

    Attachment: patch-4305-2.txt

Patch incorporating review comments.
Can somebody please review this?



[jira] Commented: (HADOOP-4305) repeatedly blacklisted tasktrackers should get declared dead

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12654084#action_12654084 ] 

Hudson commented on HADOOP-4305:
--------------------------------

Integrated in Hadoop-trunk #680 (See [http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/680/])
    



[jira] Updated: (HADOOP-4305) repeatedly blacklisted tasktrackers should get declared dead

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Devaraj Das updated HADOOP-4305:
--------------------------------

      Resolution: Fixed
    Hadoop Flags: [Reviewed]
          Status: Resolved  (was: Patch Available)

I just committed this. Thanks Amareshwari!
