You are viewing a plain text version of this content. The canonical link for it is here.

Posted to common-dev@hadoop.apache.org by "Runping Qi (JIRA)" <ji...@apache.org> on 2007/10/02 15:32:51 UTC

[jira] Created: (HADOOP-1984) some reducer stuck at copy phase and progress extremely slowly

some reducer stuck at copy phase and progress extremely slowly
--------------------------------------------------------------

                 Key: HADOOP-1984
                 URL: https://issues.apache.org/jira/browse/HADOOP-1984
             Project: Hadoop
          Issue Type: Bug
          Components: mapred
    Affects Versions: 0.14.0
            Reporter: Runping Qi
            Priority: Critical



In many cases, some reducers got stuck at copy phase, progressing extremely slowly.
The entire cluster seems doing nothing. This causes a very bad long tails of otherwise well tuned map/red jobs.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HADOOP-1984) some reducer stuck at copy phase and progress extremely slowly

Posted by "Amar Kamat (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amar Kamat updated HADOOP-1984:
-------------------------------

        Fix Version/s: 0.16.0
    Affects Version/s:     (was: 0.14.0)
                       0.16.0
               Status: Patch Available  (was: Open)

> some reducer stuck at copy phase and progress extremely slowly
> --------------------------------------------------------------
>
>                 Key: HADOOP-1984
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1984
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.16.0
>            Reporter: Runping Qi
>            Priority: Critical
>             Fix For: 0.16.0
>
>         Attachments: HADOOP-1984.patch
>
>
> In many cases, some reducers got stuck at copy phase, progressing extremely slowly.
> The entire cluster seems doing nothing. This causes a very bad long tails of otherwise well tuned map/red jobs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-1984) some reducer stuck at copy phase and progress extremely slowly

Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12540863 ] 

Owen O'Malley commented on HADOOP-1984:
---------------------------------------

oops, -1 it doesn't add the current time.

> some reducer stuck at copy phase and progress extremely slowly
> --------------------------------------------------------------
>
>                 Key: HADOOP-1984
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1984
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.16.0
>            Reporter: Runping Qi
>            Assignee: Amar Kamat
>            Priority: Critical
>             Fix For: 0.16.0
>
>         Attachments: HADOOP-1984.patch
>
>
> In many cases, some reducers got stuck at copy phase, progressing extremely slowly.
> The entire cluster seems doing nothing. This causes a very bad long tails of otherwise well tuned map/red jobs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-1984) some reducer stuck at copy phase and progress extremely slowly

Posted by "Amar Kamat (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12543057 ] 

Amar Kamat commented on HADOOP-1984:
------------------------------------

Consider the following strategy :
{code}
backoff (t) = backoff-init * backoff-base^{t-1}
   t = number of retries
   backoff-init = 4 sec
   backoff-base = 2
   max-t = 5
{code}
So now the backoff intervals are {{4, 8, 16, 32, 64, 128}},  {{sum = 252 = 4.2 min}} after which the notification is sent to the jobtracker from the reduce. So 6 retries are made before sending the notification.

> some reducer stuck at copy phase and progress extremely slowly
> --------------------------------------------------------------
>
>                 Key: HADOOP-1984
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1984
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.16.0
>            Reporter: Runping Qi
>            Assignee: Amar Kamat
>            Priority: Critical
>             Fix For: 0.16.0
>
>         Attachments: HADOOP-1984.patch
>
>
> In many cases, some reducers got stuck at copy phase, progressing extremely slowly.
> The entire cluster seems doing nothing. This causes a very bad long tails of otherwise well tuned map/red jobs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HADOOP-1984) some reducer stuck at copy phase and progress extremely slowly

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arun C Murthy updated HADOOP-1984:
----------------------------------

    Status: Open  (was: Patch Available)

Amar, some comments:

-  Please specify the 'unit' (seconds) for {{mapred.reduce.copy.backoff}} in hadoop-default.xml.

- I suggest we use integer arithmetic where possible:
{noformat}
+                long currentBackOff = 
+                        BACKOFF_INIT 
+                        * (long)Math.pow(BACKOFF_EXPONENTIAL_BASE, 
+                                         noFailedFetches.intValue() - 1);
{noformat}

is actually:

{noformat}
+                long currentBackOff = (1 << (noFailedFetches.intValue() + 1));
{noformat}

given that the base is hard-coded as 2. It keeps things more readable and easier to maintain.

I'm pretty sure we can do this to calculate {{maxFetchRetriesPerMap}} too...


> some reducer stuck at copy phase and progress extremely slowly
> --------------------------------------------------------------
>
>                 Key: HADOOP-1984
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1984
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.16.0
>            Reporter: Runping Qi
>            Assignee: Amar Kamat
>            Priority: Critical
>             Fix For: 0.16.0
>
>         Attachments: HADOOP-1984-simple.patch, HADOOP-1984-simple.patch, HADOOP-1984.patch
>
>
> In many cases, some reducers got stuck at copy phase, progressing extremely slowly.
> The entire cluster seems doing nothing. This causes a very bad long tails of otherwise well tuned map/red jobs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-1984) some reducer stuck at copy phase and progress extremely slowly

Posted by "Amar Kamat (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12539081 ] 

Amar Kamat commented on HADOOP-1984:
------------------------------------

Ant test passed on my system. This error is due to the hase test  {{TestRegionServerExit}}. 
----
The change here is the backoff function used for retrying when the map output fetch fails. Currently we are using {{60 + random(0,300)}} sec as the backoff interval. By using exponential backoff the penalty for first few backoffs is not much but then for the later ones the penalty is huge. The initial backoff is 2 sec and the function is
{code}
backoff (n) = init_value * base^(n-1)
n = no of retries
base is set to 2
init_value is set to 2 sec
{code}
Any suggestions on the formulation of the backoff algorithm and the initial values ?

> some reducer stuck at copy phase and progress extremely slowly
> --------------------------------------------------------------
>
>                 Key: HADOOP-1984
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1984
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.16.0
>            Reporter: Runping Qi
>            Priority: Critical
>             Fix For: 0.16.0
>
>         Attachments: HADOOP-1984.patch
>
>
> In many cases, some reducers got stuck at copy phase, progressing extremely slowly.
> The entire cluster seems doing nothing. This causes a very bad long tails of otherwise well tuned map/red jobs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-1984) some reducer stuck at copy phase and progress extremely slowly

Posted by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12540872 ] 

Tsz Wo (Nicholas), SZE commented on HADOOP-1984:
------------------------------------------------

Do we need to check overflow?  It is relatively easy to get overflow error in exponential functions.

> some reducer stuck at copy phase and progress extremely slowly
> --------------------------------------------------------------
>
>                 Key: HADOOP-1984
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1984
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.16.0
>            Reporter: Runping Qi
>            Assignee: Amar Kamat
>            Priority: Critical
>             Fix For: 0.16.0
>
>         Attachments: HADOOP-1984.patch
>
>
> In many cases, some reducers got stuck at copy phase, progressing extremely slowly.
> The entire cluster seems doing nothing. This causes a very bad long tails of otherwise well tuned map/red jobs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-1984) some reducer stuck at copy phase and progress extremely slowly

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12541832 ] 

Arun C Murthy commented on HADOOP-1984:
---------------------------------------

Sorry, I should have been more clear: the proposal I had in mind takes into account the total number of maps for the job and the number of complete, yet fetch-pending maps... does that help?

> some reducer stuck at copy phase and progress extremely slowly
> --------------------------------------------------------------
>
>                 Key: HADOOP-1984
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1984
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.16.0
>            Reporter: Runping Qi
>            Assignee: Amar Kamat
>            Priority: Critical
>             Fix For: 0.16.0
>
>         Attachments: HADOOP-1984.patch
>
>
> In many cases, some reducers got stuck at copy phase, progressing extremely slowly.
> The entire cluster seems doing nothing. This causes a very bad long tails of otherwise well tuned map/red jobs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-1984) some reducer stuck at copy phase and progress extremely slowly

Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12542794 ] 

Owen O'Malley commented on HADOOP-1984:
---------------------------------------

Runping, it doesn't get to 10 minutes until it has failed 5 times. And it can easily take 10 minutes to clear a backlog off of a task tracker that is getting slammed. I certainly have seen jobs that longer than that to work off the backlog. I still maintain that a simple exponential back off is the right approach, because there are a lot of things that could have caused the slow down. 

Devaraj, please don't change the failure notification policy in this same bug. If it needs to be changed, it should be a different issue. Just changing the default number of retries in this issue is ok, but I don't think we should change the policy for *that* in this issue. Furthermore, if we do change the policy, I'd argue for something much more direct and say that if a tracker is black listed for a job, the number of retries should be cut in half or something.

> some reducer stuck at copy phase and progress extremely slowly
> --------------------------------------------------------------
>
>                 Key: HADOOP-1984
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1984
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.16.0
>            Reporter: Runping Qi
>            Assignee: Amar Kamat
>            Priority: Critical
>             Fix For: 0.16.0
>
>         Attachments: HADOOP-1984.patch
>
>
> In many cases, some reducers got stuck at copy phase, progressing extremely slowly.
> The entire cluster seems doing nothing. This causes a very bad long tails of otherwise well tuned map/red jobs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-1984) some reducer stuck at copy phase and progress extremely slowly

Posted by "Runping Qi (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12541817 ] 

Runping Qi commented on HADOOP-1984:
------------------------------------


Ten minutes waiting interval seems too much.
When the interval reach a few minutes, it does not 
need to be doubled for the next try.
If one has to wait for 10+ minutes for a map-output,
it may be better off to re-execute the mapper on other nodes
Re-executing a mapper usually takes less than 10 minutes.

 


> some reducer stuck at copy phase and progress extremely slowly
> --------------------------------------------------------------
>
>                 Key: HADOOP-1984
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1984
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.16.0
>            Reporter: Runping Qi
>            Assignee: Amar Kamat
>            Priority: Critical
>             Fix For: 0.16.0
>
>         Attachments: HADOOP-1984.patch
>
>
> In many cases, some reducers got stuck at copy phase, progressing extremely slowly.
> The entire cluster seems doing nothing. This causes a very bad long tails of otherwise well tuned map/red jobs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-1984) some reducer stuck at copy phase and progress extremely slowly

Posted by "Runping Qi (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12531791 ] 

Runping Qi commented on HADOOP-1984:
------------------------------------

The status of the stucked  reducer:
reduce > copy (778 of 779 at 0.20 MB/s) > 

> some reducer stuck at copy phase and progress extremely slowly
> --------------------------------------------------------------
>
>                 Key: HADOOP-1984
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1984
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.14.0
>            Reporter: Runping Qi
>            Priority: Critical
>
> In many cases, some reducers got stuck at copy phase, progressing extremely slowly.
> The entire cluster seems doing nothing. This causes a very bad long tails of otherwise well tuned map/red jobs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-1984) some reducer stuck at copy phase and progress extremely slowly

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12543934 ] 

Hadoop QA commented on HADOOP-1984:
-----------------------------------

+1 overall.  Here are the results of testing the latest attachment 
http://issues.apache.org/jira/secure/attachment/12369866/HADOOP-1984-simple.patch
against trunk revision r596495.

    @author +1.  The patch does not contain any @author tags.

    javadoc +1.  The javadoc tool did not generate any warning messages.

    javac +1.  The applied patch does not generate any new compiler warnings.

    findbugs +1.  The patch does not introduce any new Findbugs warnings.

    core tests +1.  The patch passed core unit tests.

    contrib tests +1.  The patch passed contrib unit tests.

Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1126/testReport/
Findbugs warnings: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1126/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1126/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1126/console

This message is automatically generated.

> some reducer stuck at copy phase and progress extremely slowly
> --------------------------------------------------------------
>
>                 Key: HADOOP-1984
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1984
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.16.0
>            Reporter: Runping Qi
>            Assignee: Amar Kamat
>            Priority: Critical
>             Fix For: 0.16.0
>
>         Attachments: HADOOP-1984-simple.patch, HADOOP-1984-simple.patch, HADOOP-1984.patch
>
>
> In many cases, some reducers got stuck at copy phase, progressing extremely slowly.
> The entire cluster seems doing nothing. This causes a very bad long tails of otherwise well tuned map/red jobs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Assigned: (HADOOP-1984) some reducer stuck at copy phase and progress extremely slowly

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arun C Murthy reassigned HADOOP-1984:
-------------------------------------

    Assignee: Amar Kamat

> some reducer stuck at copy phase and progress extremely slowly
> --------------------------------------------------------------
>
>                 Key: HADOOP-1984
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1984
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.16.0
>            Reporter: Runping Qi
>            Assignee: Amar Kamat
>            Priority: Critical
>             Fix For: 0.16.0
>
>         Attachments: HADOOP-1984.patch
>
>
> In many cases, some reducers got stuck at copy phase, progressing extremely slowly.
> The entire cluster seems doing nothing. This causes a very bad long tails of otherwise well tuned map/red jobs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HADOOP-1984) some reducer stuck at copy phase and progress extremely slowly

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arun C Murthy updated HADOOP-1984:
----------------------------------

    Status: Open  (was: Patch Available)

Oh, Owen is right of course... Amar could you please fix it and submit a new patch? 

> some reducer stuck at copy phase and progress extremely slowly
> --------------------------------------------------------------
>
>                 Key: HADOOP-1984
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1984
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.16.0
>            Reporter: Runping Qi
>            Assignee: Amar Kamat
>            Priority: Critical
>             Fix For: 0.16.0
>
>         Attachments: HADOOP-1984.patch
>
>
> In many cases, some reducers got stuck at copy phase, progressing extremely slowly.
> The entire cluster seems doing nothing. This causes a very bad long tails of otherwise well tuned map/red jobs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-1984) some reducer stuck at copy phase and progress extremely slowly

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12540331 ] 

Devaraj Das commented on HADOOP-1984:
-------------------------------------

+1

> some reducer stuck at copy phase and progress extremely slowly
> --------------------------------------------------------------
>
>                 Key: HADOOP-1984
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1984
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.16.0
>            Reporter: Runping Qi
>            Priority: Critical
>             Fix For: 0.16.0
>
>         Attachments: HADOOP-1984.patch
>
>
> In many cases, some reducers got stuck at copy phase, progressing extremely slowly.
> The entire cluster seems doing nothing. This causes a very bad long tails of otherwise well tuned map/red jobs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HADOOP-1984) some reducer stuck at copy phase and progress extremely slowly

Posted by "Amar Kamat (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amar Kamat updated HADOOP-1984:
-------------------------------

    Attachment: HADOOP-1984-simple.patch

> some reducer stuck at copy phase and progress extremely slowly
> --------------------------------------------------------------
>
>                 Key: HADOOP-1984
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1984
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.16.0
>            Reporter: Runping Qi
>            Assignee: Amar Kamat
>            Priority: Critical
>             Fix For: 0.16.0
>
>         Attachments: HADOOP-1984-simple.patch, HADOOP-1984-simple.patch, HADOOP-1984-simple.patch, HADOOP-1984.patch
>
>
> In many cases, some reducers got stuck at copy phase, progressing extremely slowly.
> The entire cluster seems doing nothing. This causes a very bad long tails of otherwise well tuned map/red jobs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HADOOP-1984) some reducer stuck at copy phase and progress extremely slowly

Posted by "Amar Kamat (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amar Kamat updated HADOOP-1984:
-------------------------------

    Attachment: HADOOP-1984.patch

> some reducer stuck at copy phase and progress extremely slowly
> --------------------------------------------------------------
>
>                 Key: HADOOP-1984
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1984
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.14.0
>            Reporter: Runping Qi
>            Priority: Critical
>         Attachments: HADOOP-1984.patch
>
>
> In many cases, some reducers got stuck at copy phase, progressing extremely slowly.
> The entire cluster seems doing nothing. This causes a very bad long tails of otherwise well tuned map/red jobs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-1984) some reducer stuck at copy phase and progress extremely slowly

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12540868 ] 

Devaraj Das commented on HADOOP-1984:
-------------------------------------

Oops ! Even i missed that.

> some reducer stuck at copy phase and progress extremely slowly
> --------------------------------------------------------------
>
>                 Key: HADOOP-1984
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1984
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.16.0
>            Reporter: Runping Qi
>            Assignee: Amar Kamat
>            Priority: Critical
>             Fix For: 0.16.0
>
>         Attachments: HADOOP-1984.patch
>
>
> In many cases, some reducers got stuck at copy phase, progressing extremely slowly.
> The entire cluster seems doing nothing. This causes a very bad long tails of otherwise well tuned map/red jobs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HADOOP-1984) some reducer stuck at copy phase and progress extremely slowly

Posted by "Amar Kamat (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amar Kamat updated HADOOP-1984:
-------------------------------

    Status: Patch Available  (was: Open)

> some reducer stuck at copy phase and progress extremely slowly
> --------------------------------------------------------------
>
>                 Key: HADOOP-1984
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1984
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.16.0
>            Reporter: Runping Qi
>            Assignee: Amar Kamat
>            Priority: Critical
>             Fix For: 0.16.0
>
>         Attachments: HADOOP-1984-simple.patch, HADOOP-1984-simple.patch, HADOOP-1984.patch
>
>
> In many cases, some reducers got stuck at copy phase, progressing extremely slowly.
> The entire cluster seems doing nothing. This causes a very bad long tails of otherwise well tuned map/red jobs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-1984) some reducer stuck at copy phase and progress extremely slowly

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12545661 ] 

Hadoop QA commented on HADOOP-1984:
-----------------------------------

+1 overall.  Here are the results of testing the latest attachment 
http://issues.apache.org/jira/secure/attachment/12370217/HADOOP-1984-simple.patch
against trunk revision r598152.

    @author +1.  The patch does not contain any @author tags.

    javadoc +1.  The javadoc tool did not generate any warning messages.

    javac +1.  The applied patch does not generate any new compiler warnings.

    findbugs +1.  The patch does not introduce any new Findbugs warnings.

    core tests +1.  The patch passed core unit tests.

    contrib tests +1.  The patch passed contrib unit tests.

Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1164/testReport/
Findbugs warnings: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1164/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1164/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1164/console

This message is automatically generated.

> some reducer stuck at copy phase and progress extremely slowly
> --------------------------------------------------------------
>
>                 Key: HADOOP-1984
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1984
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.16.0
>            Reporter: Runping Qi
>            Assignee: Amar Kamat
>            Priority: Critical
>             Fix For: 0.16.0
>
>         Attachments: HADOOP-1984-simple.patch, HADOOP-1984-simple.patch, HADOOP-1984-simple.patch, HADOOP-1984.patch
>
>
> In many cases, some reducers got stuck at copy phase, progressing extremely slowly.
> The entire cluster seems doing nothing. This causes a very bad long tails of otherwise well tuned map/red jobs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-1984) some reducer stuck at copy phase and progress extremely slowly

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12541731 ] 

Arun C Murthy commented on HADOOP-1984:
---------------------------------------

Ok, after some more thought...

Initial back-off of 2s is too aggressive - it means that it only takes 5 failures in 1minute (2 + 4 + 8 + 16 + 32 = 62s) to send a fetch-failure notification to the JT (see HADOOP-1158) and thus doesn't sufficiently account for transient problems faced by the tasktracker's jetty (tt on which the map ran).

I propose we peg the starting back-off at 8s which means that it takes approximately 4mins for one notification to be sent to the JT i.e. (8 + 16 + 32 + 64 + 128 = 249s).

Also, we need to maintain a per-host fetch-failure count since the penaltyBox is also maintained on a per-host basis.

> some reducer stuck at copy phase and progress extremely slowly
> --------------------------------------------------------------
>
>                 Key: HADOOP-1984
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1984
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.16.0
>            Reporter: Runping Qi
>            Assignee: Amar Kamat
>            Priority: Critical
>             Fix For: 0.16.0
>
>         Attachments: HADOOP-1984.patch
>
>
> In many cases, some reducers got stuck at copy phase, progressing extremely slowly.
> The entire cluster seems doing nothing. This causes a very bad long tails of otherwise well tuned map/red jobs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-1984) some reducer stuck at copy phase and progress extremely slowly

Posted by "Amar Kamat (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12543790 ] 

Amar Kamat commented on HADOOP-1984:
------------------------------------

Submitting a patch which implements the strategy discussed above. The way it works is as follows :
1. On the first failed attempt the copy is backed off by {{backoff_init}},  currently set to *4 sec*
2. For subsequent failures the map output copy is backed off by {{backoff_init}} * {{backoff_base}}^{{num_retries-1}}^,  {{backoff_base}} currently set to *2*.
3. Backoff continues until the total time spent on the copy is less than {{max_backoff}} which is user configurable using {{mapred.reduce.copy.backoff}}.
     i.e  {{backoff(1) + backoff(2) + .... backoff(max_retries) ~ max_backoff}},  default {{max_backoff}} is set to *300 (5 min)*
4. After a total of {{max_backoff}} time the job tracker is notified.
5. This cycle continues for every new map on the host since {{num_retries}} is for a particular map.


> some reducer stuck at copy phase and progress extremely slowly
> --------------------------------------------------------------
>
>                 Key: HADOOP-1984
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1984
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.16.0
>            Reporter: Runping Qi
>            Assignee: Amar Kamat
>            Priority: Critical
>             Fix For: 0.16.0
>
>         Attachments: HADOOP-1984-simple.patch, HADOOP-1984.patch
>
>
> In many cases, some reducers got stuck at copy phase, progressing extremely slowly.
> The entire cluster seems doing nothing. This causes a very bad long tails of otherwise well tuned map/red jobs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-1984) some reducer stuck at copy phase and progress extremely slowly

Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12540861 ] 

Owen O'Malley commented on HADOOP-1984:
---------------------------------------

+1

> some reducer stuck at copy phase and progress extremely slowly
> --------------------------------------------------------------
>
>                 Key: HADOOP-1984
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1984
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.16.0
>            Reporter: Runping Qi
>            Assignee: Amar Kamat
>            Priority: Critical
>             Fix For: 0.16.0
>
>         Attachments: HADOOP-1984.patch
>
>
> In many cases, some reducers got stuck at copy phase, progressing extremely slowly.
> The entire cluster seems doing nothing. This causes a very bad long tails of otherwise well tuned map/red jobs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HADOOP-1984) some reducer stuck at copy phase and progress extremely slowly

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Devaraj Das updated HADOOP-1984:
--------------------------------

    Resolution: Fixed
        Status: Resolved  (was: Patch Available)

I just committed this. Thanks, Amar!

> some reducer stuck at copy phase and progress extremely slowly
> --------------------------------------------------------------
>
>                 Key: HADOOP-1984
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1984
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.16.0
>            Reporter: Runping Qi
>            Assignee: Amar Kamat
>            Priority: Critical
>             Fix For: 0.16.0
>
>         Attachments: HADOOP-1984-simple.patch, HADOOP-1984-simple.patch, HADOOP-1984-simple.patch, HADOOP-1984.patch
>
>
> In many cases, some reducers got stuck at copy phase, progressing extremely slowly.
> The entire cluster seems doing nothing. This causes a very bad long tails of otherwise well tuned map/red jobs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-1984) some reducer stuck at copy phase and progress extremely slowly

Posted by "Runping Qi (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12541828 ] 

Runping Qi commented on HADOOP-1984:
------------------------------------


It will be really helpful if we know the overall job progress status, 
and health state of the node hosting the map output.

If there are still a lot of pending mappers (and thus, 
each reducer gets many map outputs copied in each pass in shuffling), 
it is OK to wait for a long time for some map outputs
However, if the concerned map outputs are the only ones that the reducer
is trying to get, then it makes more sense to re-execute the mappers 
than waiting for 10 minutes.

Also, if we know the node holding the concernec map outputs has some other tasks 
and those tasks are making healthy progress, it should be OK for waiting for a little long.
However, if the node does not have any tasks running on it, or the tasks on the node
do not make progress, it is better to start re-execute earlier than waiting for a long period.

In particular, in the case I ran into, the concerned node was KNOWN to be in bad shape,
and was BLACKLISTED, thus, it did not have any tasks running on it. In this case,
it is clear that it is much better to re-execute the mappers earlier than waiting for the map
outputs from that node. 

> some reducer stuck at copy phase and progress extremely slowly
> --------------------------------------------------------------
>
>                 Key: HADOOP-1984
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1984
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.16.0
>            Reporter: Runping Qi
>            Assignee: Amar Kamat
>            Priority: Critical
>             Fix For: 0.16.0
>
>         Attachments: HADOOP-1984.patch
>
>
> In many cases, some reducers got stuck at copy phase, progressing extremely slowly.
> The entire cluster seems doing nothing. This causes a very bad long tails of otherwise well tuned map/red jobs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-1984) some reducer stuck at copy phase and progress extremely slowly

Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12541814 ] 

Owen O'Malley commented on HADOOP-1984:
---------------------------------------

how about using:

4 ** n = 4, 16, 64, 256, 1024
sum = 1364s = ~22.5 minutes

That will allow a task tracker that is being pounded by a large cluster to catch up before the reduce is killed.

Although those jumps are big enough that it probably makes sense to add enough randomness to make the intervals overlap, so how about?

4**n * rand(0.5, 2.0)

which would let each round of backoffs meet the round before and after it.

> some reducer stuck at copy phase and progress extremely slowly
> --------------------------------------------------------------
>
>                 Key: HADOOP-1984
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1984
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.16.0
>            Reporter: Runping Qi
>            Assignee: Amar Kamat
>            Priority: Critical
>             Fix For: 0.16.0
>
>         Attachments: HADOOP-1984.patch
>
>
> In many cases, some reducers got stuck at copy phase, progressing extremely slowly.
> The entire cluster seems doing nothing. This causes a very bad long tails of otherwise well tuned map/red jobs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-1984) some reducer stuck at copy phase and progress extremely slowly

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12542552 ] 

Devaraj Das commented on HADOOP-1984:
-------------------------------------

After some corridor discussion between me, Arun and Amar, here are some thoughts:
1) The task completion event has an additional field that says how much time a given map took to complete.

2) The longer the map completion time, the more we delay the feedback about it (in case reducers fail to fetch) to the JobTracker. That is, the killing of maps is not based on the number of times the attempt to fetch failed, but instead dependent on the time the map will take to run if reexecuted.

3) For example, if a map takes 30 minutes to run, and the fetch for the corresponding output fails, a reduce postpones giving the feedback to the JobTracker until it has tried for 15 minutes or so (exponential backoff within this time interval). In other words, we increase the number of retries for long running maps just so that we might be successful in fetching probabilistically. The time a reduce spends retrying is directly proportional to the time the map took to complete.

> some reducer stuck at copy phase and progress extremely slowly
> --------------------------------------------------------------
>
>                 Key: HADOOP-1984
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1984
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.16.0
>            Reporter: Runping Qi
>            Assignee: Amar Kamat
>            Priority: Critical
>             Fix For: 0.16.0
>
>         Attachments: HADOOP-1984.patch
>
>
> In many cases, some reducers got stuck at copy phase, progressing extremely slowly.
> The entire cluster seems doing nothing. This causes a very bad long tails of otherwise well tuned map/red jobs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HADOOP-1984) some reducer stuck at copy phase and progress extremely slowly

Posted by "Amar Kamat (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amar Kamat updated HADOOP-1984:
-------------------------------

    Attachment: HADOOP-1984-simple.patch

> some reducer stuck at copy phase and progress extremely slowly
> --------------------------------------------------------------
>
>                 Key: HADOOP-1984
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1984
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.16.0
>            Reporter: Runping Qi
>            Assignee: Amar Kamat
>            Priority: Critical
>             Fix For: 0.16.0
>
>         Attachments: HADOOP-1984-simple.patch, HADOOP-1984.patch
>
>
> In many cases, some reducers got stuck at copy phase, progressing extremely slowly.
> The entire cluster seems doing nothing. This causes a very bad long tails of otherwise well tuned map/red jobs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-1984) some reducer stuck at copy phase and progress extremely slowly

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12541820 ] 

Arun C Murthy commented on HADOOP-1984:
---------------------------------------

bq.That will allow a task tracker that is being pounded by a large cluster to catch up before the reduce is killed.
and
bq. Ten minutes waiting interval seems too much.

are conflicting requirements, basically they depend on the job-characteristics.

To get around this, I propose we factor in the number of maps in the job into the back-off calculations... thoughts?


> some reducer stuck at copy phase and progress extremely slowly
> --------------------------------------------------------------
>
>                 Key: HADOOP-1984
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1984
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.16.0
>            Reporter: Runping Qi
>            Assignee: Amar Kamat
>            Priority: Critical
>             Fix For: 0.16.0
>
>         Attachments: HADOOP-1984.patch
>
>
> In many cases, some reducers got stuck at copy phase, progressing extremely slowly.
> The entire cluster seems doing nothing. This causes a very bad long tails of otherwise well tuned map/red jobs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-1984) some reducer stuck at copy phase and progress extremely slowly

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12538732 ] 

Hadoop QA commented on HADOOP-1984:
-----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
http://issues.apache.org/jira/secure/attachment/12368662/HADOOP-1984.patch
against trunk revision r589879.

    @author +1.  The patch does not contain any @author tags.

    javadoc +1.  The javadoc tool did not generate any warning messages.

    javac +1.  The applied patch does not generate any new compiler warnings.

    findbugs +1.  The patch does not introduce any new Findbugs warnings.

    core tests +1.  The patch passed core unit tests.

    contrib tests -1.  The patch failed contrib unit tests.

Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1027/testReport/
Findbugs warnings: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1027/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1027/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1027/console

This message is automatically generated.

> some reducer stuck at copy phase and progress extremely slowly
> --------------------------------------------------------------
>
>                 Key: HADOOP-1984
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1984
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.16.0
>            Reporter: Runping Qi
>            Priority: Critical
>             Fix For: 0.16.0
>
>         Attachments: HADOOP-1984.patch
>
>
> In many cases, some reducers got stuck at copy phase, progressing extremely slowly.
> The entire cluster seems doing nothing. This causes a very bad long tails of otherwise well tuned map/red jobs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-1984) some reducer stuck at copy phase and progress extremely slowly

Posted by "Hudson (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12546679 ] 

Hudson commented on HADOOP-1984:
--------------------------------

Integrated in Hadoop-Nightly #317 (See [http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Nightly/317/])

> some reducer stuck at copy phase and progress extremely slowly
> --------------------------------------------------------------
>
>                 Key: HADOOP-1984
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1984
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.16.0
>            Reporter: Runping Qi
>            Assignee: Amar Kamat
>            Priority: Critical
>             Fix For: 0.16.0
>
>         Attachments: HADOOP-1984-simple.patch, HADOOP-1984-simple.patch, HADOOP-1984-simple.patch, HADOOP-1984.patch
>
>
> In many cases, some reducers got stuck at copy phase, progressing extremely slowly.
> The entire cluster seems doing nothing. This causes a very bad long tails of otherwise well tuned map/red jobs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HADOOP-1984) some reducer stuck at copy phase and progress extremely slowly

Posted by "Amar Kamat (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amar Kamat updated HADOOP-1984:
-------------------------------

    Status: Patch Available  (was: Open)

Taking into consideration Arun's comment, hard coding the computations for base=2. The way it works now is as follows
1. the first notification is sent to the jobtracker after {{max-backoff}} time. The backoff values are ({{4, 8, 16, 32 ..... k}}) such that {{4 + 8 + 16 + 32 .... k ~ max-backoff}},  {{max-backoff}} can be set using {{mapred.reduce.copy.backoff}} [default is 300 sec].
2. subsequent notifications are also sent after {{max-backoff}} time but the backoff values now are ({{max-backoff/2, max-backoff/2}}) , i.e 2 retries
3. at the max the reducer waits for {{3 * max-backoff}} before the maps get re-executed. 
Comments ?


> some reducer stuck at copy phase and progress extremely slowly
> --------------------------------------------------------------
>
>                 Key: HADOOP-1984
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1984
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.16.0
>            Reporter: Runping Qi
>            Assignee: Amar Kamat
>            Priority: Critical
>             Fix For: 0.16.0
>
>         Attachments: HADOOP-1984-simple.patch, HADOOP-1984-simple.patch, HADOOP-1984-simple.patch, HADOOP-1984.patch
>
>
> In many cases, some reducers got stuck at copy phase, progressing extremely slowly.
> The entire cluster seems doing nothing. This causes a very bad long tails of otherwise well tuned map/red jobs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HADOOP-1984) some reducer stuck at copy phase and progress extremely slowly

Posted by "Amar Kamat (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amar Kamat updated HADOOP-1984:
-------------------------------

    Attachment: HADOOP-1984-simple.patch

Taking into consideration Devaraj's comment, {{mapred.reduce.copy.backoff}} is now added to {{hadoop-default.xml}}. 

> some reducer stuck at copy phase and progress extremely slowly
> --------------------------------------------------------------
>
>                 Key: HADOOP-1984
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1984
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.16.0
>            Reporter: Runping Qi
>            Assignee: Amar Kamat
>            Priority: Critical
>             Fix For: 0.16.0
>
>         Attachments: HADOOP-1984-simple.patch, HADOOP-1984-simple.patch, HADOOP-1984.patch
>
>
> In many cases, some reducers got stuck at copy phase, progressing extremely slowly.
> The entire cluster seems doing nothing. This causes a very bad long tails of otherwise well tuned map/red jobs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.