Posted to common-dev@hadoop.apache.org by "Amareshwari Sriramadasu (JIRA)" <ji...@apache.org> on 2008/09/23 05:48:44 UTC

[jira] Created: (HADOOP-4246) Reduce task copy errors may not kill it eventually

Reduce task copy errors may not kill it eventually
--------------------------------------------------

                 Key: HADOOP-4246
                 URL: https://issues.apache.org/jira/browse/HADOOP-4246
             Project: Hadoop Core
          Issue Type: Bug
          Components: mapred
            Reporter: Amareshwari Sriramadasu
            Assignee: Amareshwari Sriramadasu
            Priority: Critical
             Fix For: 0.19.0


maxFetchRetriesPerMap in the reduce task can sometimes be zero (when maxMapRunTime is less than 4 seconds or mapred.reduce.copy.backoff is less than 4). When it is, copy errors are not counted against the reduce task, so they never add up to kill it.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4246) Reduce task copy errors may not kill it eventually

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12635720#action_12635720 ] 

Devaraj Das commented on HADOOP-4246:
-------------------------------------

{code}
if ((fetchFailedMaps.size() >= maxFailedUniqueFetches)
    && !reducerHealthy
    && (!reducerProgressedEnough || reducerStalled)) {
  LOG.fatal("Shuffle failed with too many fetch failures " +
{code}

The expression above should include (fetchFailedMaps.size() == numPendingFetches) to take care of cases where a reducer node becomes faulty towards the end of the shuffle.
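
Concretely, the suggested check might read as follows; how the new clause combines with the existing ones is one reading of the comment above, not the committed code:
{code}
// Sketch only. Also bail out when every map output still pending has already
// failed to fetch, which catches a reducer node going bad near the end of
// the shuffle even before maxFailedUniqueFetches is reached.
boolean allPendingFetchesFailed = (fetchFailedMaps.size() == numPendingFetches);
if (((fetchFailedMaps.size() >= maxFailedUniqueFetches) || allPendingFetchesFailed)
    && !reducerHealthy
    && (!reducerProgressedEnough || reducerStalled)) {
  LOG.fatal("Shuffle failed with too many fetch failures");
}
{code}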

> Reduce task copy errors may not kill it eventually
> --------------------------------------------------
>
>                 Key: HADOOP-4246
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4246
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.19.0
>            Reporter: Amareshwari Sriramadasu
>            Assignee: Amareshwari Sriramadasu
>            Priority: Blocker
>             Fix For: 0.19.0
>
>         Attachments: patch-4246.txt, patch-4246.txt, patch-4246.txt
>
>
> maxFetchRetriesPerMap in the reduce task can sometimes be zero (when maxMapRunTime is less than 4 seconds or mapred.reduce.copy.backoff is less than 4). When it is, copy errors are not counted against the reduce task, so they never add up to kill it.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-4246) Reduce task copy errors may not kill it eventually

Posted by "Amareshwari Sriramadasu (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amareshwari Sriramadasu updated HADOOP-4246:
--------------------------------------------

    Status: Patch Available  (was: Open)

test-patch result:
{noformat}
     [exec] -1 overall.
     [exec]
     [exec]     +1 @author.  The patch does not contain any @author tags.
     [exec]
     [exec]     -1 tests included.  The patch doesn't appear to include any new or modified tests.
     [exec]                         Please justify why no tests are needed for this patch.
     [exec]
     [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
     [exec]
     [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
     [exec]
     [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.
     [exec]
     [exec]     +1 Eclipse classpath. The patch retains Eclipse classpath integrity.
     [exec]
     [exec]
{noformat}
All core and contrib tests passed on my machine.

It is difficult to write a unit test that simulates reduce-copy errors.
The patch was tested manually, and I also ran the Sort benchmark with it.

> Reduce task copy errors may not kill it eventually
> --------------------------------------------------
>
>                 Key: HADOOP-4246
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4246
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.19.0
>            Reporter: Amareshwari Sriramadasu
>            Assignee: Amareshwari Sriramadasu
>            Priority: Blocker
>             Fix For: 0.19.0
>
>         Attachments: patch-4246.txt, patch-4246.txt, patch-4246.txt, patch-4246.txt
>
>
> maxFetchRetriesPerMap in the reduce task can sometimes be zero (when maxMapRunTime is less than 4 seconds or mapred.reduce.copy.backoff is less than 4). When it is, copy errors are not counted against the reduce task, so they never add up to kill it.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-4246) Reduce task copy errors may not kill it eventually

Posted by "Amareshwari Sriramadasu (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amareshwari Sriramadasu updated HADOOP-4246:
--------------------------------------------

    Attachment: patch-4246.txt

Here is a patch that does the following (sketched below):
1. maxFetchRetriesPerMap is set to 1 if it is zero.
2. maxFailedUniqueFetches is set to numMaps if numMaps is less than 5.
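
A minimal sketch of those two adjustments, using the field names from this discussion (the exact code is in the attached patch):
{code}
// Retry each map output at least once, even when the computed budget is 0,
// so that copy errors are actually counted.
if (maxFetchRetriesPerMap == 0) {
  maxFetchRetriesPerMap = 1;
}
// For small jobs, require a failed fetch from every map before giving up.
if (numMaps < 5) {
  maxFailedUniqueFetches = numMaps;
}
{code}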

I tested the patch by throwing an FSError from copyOutput, with numMaps = 3 and mapRunTime = 2 seconds.

> Reduce task copy errors may not kill it eventually
> --------------------------------------------------
>
>                 Key: HADOOP-4246
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4246
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.19.0
>            Reporter: Amareshwari Sriramadasu
>            Assignee: Amareshwari Sriramadasu
>            Priority: Blocker
>             Fix For: 0.19.0
>
>         Attachments: patch-4246.txt, patch-4246.txt
>
>
> maxFetchRetriesPerMap in the reduce task can sometimes be zero (when maxMapRunTime is less than 4 seconds or mapred.reduce.copy.backoff is less than 4). When it is, copy errors are not counted against the reduce task, so they never add up to kill it.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4246) Reduce task copy errors may not kill it eventually

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12636823#action_12636823 ] 

Hudson commented on HADOOP-4246:
--------------------------------

Integrated in Hadoop-trunk #623 (See [http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/623/])
    HADOOP-4246. Ensure we have the correct lower bound on the number of retries for fetching map-outputs; also fixed the case, for small jobs, where the reducer should automatically kill itself when too many unique map-outputs cannot be fetched. Contributed by Amareshwari Sri Ramadasu.


> Reduce task copy errors may not kill it eventually
> --------------------------------------------------
>
>                 Key: HADOOP-4246
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4246
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.19.0
>            Reporter: Amareshwari Sriramadasu
>            Assignee: Amareshwari Sriramadasu
>            Priority: Blocker
>             Fix For: 0.19.0
>
>         Attachments: patch-4246.txt, patch-4246.txt, patch-4246.txt, patch-4246.txt
>
>
> maxFetchRetriesPerMap in the reduce task can sometimes be zero (when maxMapRunTime is less than 4 seconds or mapred.reduce.copy.backoff is less than 4). When it is, copy errors are not counted against the reduce task, so they never add up to kill it.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-4246) Reduce task copy errors may not kill it eventually

Posted by "Amareshwari Sriramadasu (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amareshwari Sriramadasu updated HADOOP-4246:
--------------------------------------------

    Attachment: patch-4246.txt

Patch incorporating Devaraj's and Arun's comments.

> Reduce task copy errors may not kill it eventually
> --------------------------------------------------
>
>                 Key: HADOOP-4246
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4246
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.19.0
>            Reporter: Amareshwari Sriramadasu
>            Assignee: Amareshwari Sriramadasu
>            Priority: Blocker
>             Fix For: 0.19.0
>
>         Attachments: patch-4246.txt, patch-4246.txt, patch-4246.txt, patch-4246.txt
>
>
> maxFetchRetriesPerMap in the reduce task can sometimes be zero (when maxMapRunTime is less than 4 seconds or mapred.reduce.copy.backoff is less than 4). When it is, copy errors are not counted against the reduce task, so they never add up to kill it.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-4246) Reduce task copy errors may not kill it eventually

Posted by "Amareshwari Sriramadasu (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amareshwari Sriramadasu updated HADOOP-4246:
--------------------------------------------

    Attachment: patch-4246.txt

The attached patch adds GENERIC_ERROR to copyOutputErrorType, covering Exceptions/Errors other than connect and read failures.
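
For context, a hypothetical sketch of the resulting classification; only GENERIC_ERROR is named in this comment, and the other constant names are placeholders for the existing read/connect categories:
{code}
// Placeholder names except GENERIC_ERROR, which the patch introduces.
enum CopyOutputErrorType {
  NO_ERROR,         // copy succeeded
  READ_ERROR,       // failure while reading the map output stream
  CONNECTION_ERROR, // failure while connecting to the serving tasktracker
  GENERIC_ERROR     // any other Exception/Error, e.g. local disk full
}
{code}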

> Reduce task copy errors may not kill it eventually
> --------------------------------------------------
>
>                 Key: HADOOP-4246
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4246
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.19.0
>            Reporter: Amareshwari Sriramadasu
>            Assignee: Amareshwari Sriramadasu
>            Priority: Blocker
>             Fix For: 0.19.0
>
>         Attachments: patch-4246.txt
>
>
> maxFetchRetriesPerMap in the reduce task can sometimes be zero (when maxMapRunTime is less than 4 seconds or mapred.reduce.copy.backoff is less than 4). When it is, copy errors are not counted against the reduce task, so they never add up to kill it.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4246) Reduce task copy errors may not kill it eventually

Posted by "Jothi Padmanabhan (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12635698#action_12635698 ] 

Jothi Padmanabhan commented on HADOOP-4246:
-------------------------------------------

+1, patch looks good

> Reduce task copy errors may not kill it eventually
> --------------------------------------------------
>
>                 Key: HADOOP-4246
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4246
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.19.0
>            Reporter: Amareshwari Sriramadasu
>            Assignee: Amareshwari Sriramadasu
>            Priority: Blocker
>             Fix For: 0.19.0
>
>         Attachments: patch-4246.txt, patch-4246.txt, patch-4246.txt
>
>
> maxFetchRetriesPerMap in the reduce task can sometimes be zero (when maxMapRunTime is less than 4 seconds or mapred.reduce.copy.backoff is less than 4). When it is, copy errors are not counted against the reduce task, so they never add up to kill it.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4246) Reduce task copy errors may not kill it eventually

Posted by "Jothi Padmanabhan (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12635684#action_12635684 ] 

Jothi Padmanabhan commented on HADOOP-4246:
-------------------------------------------

Having a lower bound of 1 on map-fetch-retries might not be efficient, as it opens the possibility of map re-execution on transient errors: three different reducers each reporting an error while the serving tasktracker had a transient problem could lead to re-execution of the map. We should probably try at least twice.

> Reduce task copy errors may not kill it eventually
> --------------------------------------------------
>
>                 Key: HADOOP-4246
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4246
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.19.0
>            Reporter: Amareshwari Sriramadasu
>            Assignee: Amareshwari Sriramadasu
>            Priority: Blocker
>             Fix For: 0.19.0
>
>         Attachments: patch-4246.txt, patch-4246.txt
>
>
> maxFetchRetriesPerMap in the reduce task can sometimes be zero (when maxMapRunTime is less than 4 seconds or mapred.reduce.copy.backoff is less than 4). When it is, copy errors are not counted against the reduce task, so they never add up to kill it.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-4246) Reduce task copy errors may not kill it eventually

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arun C Murthy updated HADOOP-4246:
----------------------------------

    Resolution: Fixed
        Status: Resolved  (was: Patch Available)

I just committed this. Thanks, Amareshwari!

> Reduce task copy errors may not kill it eventually
> --------------------------------------------------
>
>                 Key: HADOOP-4246
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4246
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.19.0
>            Reporter: Amareshwari Sriramadasu
>            Assignee: Amareshwari Sriramadasu
>            Priority: Blocker
>             Fix For: 0.19.0
>
>         Attachments: patch-4246.txt, patch-4246.txt, patch-4246.txt, patch-4246.txt
>
>
> maxFetchRetriesPerMap in the reduce task can sometimes be zero (when maxMapRunTime is less than 4 seconds or mapred.reduce.copy.backoff is less than 4). When it is, copy errors are not counted against the reduce task, so they never add up to kill it.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-4246) Reduce task copy errors may not kill it eventually

Posted by "Amareshwari Sriramadasu (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amareshwari Sriramadasu updated HADOOP-4246:
--------------------------------------------

    Status: Open  (was: Patch Available)

> Reduce task copy errors may not kill it eventually
> --------------------------------------------------
>
>                 Key: HADOOP-4246
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4246
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.19.0
>            Reporter: Amareshwari Sriramadasu
>            Assignee: Amareshwari Sriramadasu
>            Priority: Blocker
>             Fix For: 0.19.0
>
>         Attachments: patch-4246.txt, patch-4246.txt, patch-4246.txt
>
>
> maxFetchRetriesPerMap in the reduce task can sometimes be zero (when maxMapRunTime is less than 4 seconds or mapred.reduce.copy.backoff is less than 4). When it is, copy errors are not counted against the reduce task, so they never add up to kill it.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-4246) Reduce task copy errors may not kill it eventually

Posted by "Amareshwari Sriramadasu (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amareshwari Sriramadasu updated HADOOP-4246:
--------------------------------------------

             Priority: Blocker  (was: Critical)
    Affects Version/s: 0.19.0

Currently, only READ and CONNECT errors in the reducer's copyOutput are counted against failed fetches; other errors, such as running out of disk space, are not taken into consideration. Those errors can simply hang the reducer.

> Reduce task copy errors may not kill it eventually
> --------------------------------------------------
>
>                 Key: HADOOP-4246
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4246
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.19.0
>            Reporter: Amareshwari Sriramadasu
>            Assignee: Amareshwari Sriramadasu
>            Priority: Blocker
>             Fix For: 0.19.0
>
>
> maxFetchRetriesPerMap in the reduce task can sometimes be zero (when maxMapRunTime is less than 4 seconds or mapred.reduce.copy.backoff is less than 4). When it is, copy errors are not counted against the reduce task, so they never add up to kill it.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-4246) Reduce task copy errors may not kill it eventually

Posted by "Amareshwari Sriramadasu (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amareshwari Sriramadasu updated HADOOP-4246:
--------------------------------------------

    Attachment: patch-4246.txt

This patch sets maxFetchRetriesPerMap to MIN_FETCH_RETRIES_PER_MAP (a constant with value 2) if it is less than 2.

> Reduce task copy errors may not kill it eventually
> --------------------------------------------------
>
>                 Key: HADOOP-4246
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4246
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.19.0
>            Reporter: Amareshwari Sriramadasu
>            Assignee: Amareshwari Sriramadasu
>            Priority: Blocker
>             Fix For: 0.19.0
>
>         Attachments: patch-4246.txt, patch-4246.txt, patch-4246.txt
>
>
> maxFetchRetriesPerMap in the reduce task can sometimes be zero (when maxMapRunTime is less than 4 seconds or mapred.reduce.copy.backoff is less than 4). When it is, copy errors are not counted against the reduce task, so they never add up to kill it.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-4246) Reduce task copy errors may not kill it eventually

Posted by "Amareshwari Sriramadasu (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amareshwari Sriramadasu updated HADOOP-4246:
--------------------------------------------

    Status: Patch Available  (was: Open)

> Reduce task copy errors may not kill it eventually
> --------------------------------------------------
>
>                 Key: HADOOP-4246
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4246
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.19.0
>            Reporter: Amareshwari Sriramadasu
>            Assignee: Amareshwari Sriramadasu
>            Priority: Blocker
>             Fix For: 0.19.0
>
>         Attachments: patch-4246.txt, patch-4246.txt
>
>
> maxFetchRetriesPerMap in the reduce task can sometimes be zero (when maxMapRunTime is less than 4 seconds or mapred.reduce.copy.backoff is less than 4). When it is, copy errors are not counted against the reduce task, so they never add up to kill it.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4246) Reduce task copy errors may not kill it eventually

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12635788#action_12635788 ] 

Arun C Murthy commented on HADOOP-4246:
---------------------------------------

The "if (a < b) a = b;" pieces should use Math.max instead...
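
For instance, the lower-bound assignment could be written as follows, using MIN_FETCH_RETRIES_PER_MAP from the current patch:
{code}
// Equivalent to: if (maxFetchRetriesPerMap < MIN_FETCH_RETRIES_PER_MAP)
//                  maxFetchRetriesPerMap = MIN_FETCH_RETRIES_PER_MAP;
maxFetchRetriesPerMap = Math.max(MIN_FETCH_RETRIES_PER_MAP, maxFetchRetriesPerMap);
{code}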

> Reduce task copy errors may not kill it eventually
> --------------------------------------------------
>
>                 Key: HADOOP-4246
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4246
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.19.0
>            Reporter: Amareshwari Sriramadasu
>            Assignee: Amareshwari Sriramadasu
>            Priority: Blocker
>             Fix For: 0.19.0
>
>         Attachments: patch-4246.txt, patch-4246.txt, patch-4246.txt
>
>
> maxFetchRetriesPerMap in the reduce task can sometimes be zero (when maxMapRunTime is less than 4 seconds or mapred.reduce.copy.backoff is less than 4). When it is, copy errors are not counted against the reduce task, so they never add up to kill it.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-4246) Reduce task copy errors may not kill it eventually

Posted by "Amareshwari Sriramadasu (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amareshwari Sriramadasu updated HADOOP-4246:
--------------------------------------------

    Status: Patch Available  (was: Open)

> Reduce task copy errors may not kill it eventually
> --------------------------------------------------
>
>                 Key: HADOOP-4246
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4246
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.19.0
>            Reporter: Amareshwari Sriramadasu
>            Assignee: Amareshwari Sriramadasu
>            Priority: Blocker
>             Fix For: 0.19.0
>
>         Attachments: patch-4246.txt
>
>
> maxFetchRetriesPerMap in the reduce task can sometimes be zero (when maxMapRunTime is less than 4 seconds or mapred.reduce.copy.backoff is less than 4). When it is, copy errors are not counted against the reduce task, so they never add up to kill it.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4246) Reduce task copy errors may not kill it eventually

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12635923#action_12635923 ] 

Hadoop QA commented on HADOOP-4246:
-----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12391193/patch-4246.txt
  against trunk revision 700589.

    +1 @author.  The patch does not contain any @author tags.

    -1 tests included.  The patch doesn't appear to include any new or modified tests.
                        Please justify why no tests are needed for this patch.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

    +1 core tests.  The patch passed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3406/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3406/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3406/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3406/console

This message is automatically generated.

> Reduce task copy errors may not kill it eventually
> --------------------------------------------------
>
>                 Key: HADOOP-4246
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4246
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.19.0
>            Reporter: Amareshwari Sriramadasu
>            Assignee: Amareshwari Sriramadasu
>            Priority: Blocker
>             Fix For: 0.19.0
>
>         Attachments: patch-4246.txt, patch-4246.txt, patch-4246.txt
>
>
> maxFetchRetriesPerMap in the reduce task can sometimes be zero (when maxMapRunTime is less than 4 seconds or mapred.reduce.copy.backoff is less than 4). When it is, copy errors are not counted against the reduce task, so they never add up to kill it.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4246) Reduce task copy errors may not kill it eventually

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12634172#action_12634172 ] 

Hadoop QA commented on HADOOP-4246:
-----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12390825/patch-4246.txt
  against trunk revision 698385.

    +1 @author.  The patch does not contain any @author tags.

    -1 tests included.  The patch doesn't appear to include any new or modified tests.
                        Please justify why no tests are needed for this patch.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    -1 core tests.  The patch failed core unit tests.

    -1 contrib tests.  The patch failed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3361/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3361/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3361/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3361/console

This message is automatically generated.

> Reduce task copy errors may not kill it eventually
> --------------------------------------------------
>
>                 Key: HADOOP-4246
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4246
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.19.0
>            Reporter: Amareshwari Sriramadasu
>            Assignee: Amareshwari Sriramadasu
>            Priority: Blocker
>             Fix For: 0.19.0
>
>         Attachments: patch-4246.txt
>
>
> maxFetchRetriesPerMap in the reduce task can sometimes be zero (when maxMapRunTime is less than 4 seconds or mapred.reduce.copy.backoff is less than 4). When it is, copy errors are not counted against the reduce task, so they never add up to kill it.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4246) Reduce task copy errors may not kill it eventually

Posted by "Amareshwari Sriramadasu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12636517#action_12636517 ] 

Amareshwari Sriramadasu commented on HADOOP-4246:
-------------------------------------------------

The test failure for TestDatanodeDeath.testDatanodeDeath is not related to the patch.

> Reduce task copy errors may not kill it eventually
> --------------------------------------------------
>
>                 Key: HADOOP-4246
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4246
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.19.0
>            Reporter: Amareshwari Sriramadasu
>            Assignee: Amareshwari Sriramadasu
>            Priority: Blocker
>             Fix For: 0.19.0
>
>         Attachments: patch-4246.txt, patch-4246.txt, patch-4246.txt, patch-4246.txt
>
>
> maxFetchRetriesPerMap in the reduce task can sometimes be zero (when maxMapRunTime is less than 4 seconds or mapred.reduce.copy.backoff is less than 4). When it is, copy errors are not counted against the reduce task, so they never add up to kill it.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-4246) Reduce task copy errors may not kill it eventually

Posted by "Amareshwari Sriramadasu (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amareshwari Sriramadasu updated HADOOP-4246:
--------------------------------------------

    Status: Open  (was: Patch Available)

GENERIC_ERROR doesn't make sense after reverting HADOOP-3327. 

I will upload a new patch addressing the case where maxFetchRetriesPerMap is zero.

> Reduce task copy errors may not kill it eventually
> --------------------------------------------------
>
>                 Key: HADOOP-4246
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4246
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.19.0
>            Reporter: Amareshwari Sriramadasu
>            Assignee: Amareshwari Sriramadasu
>            Priority: Blocker
>             Fix For: 0.19.0
>
>         Attachments: patch-4246.txt
>
>
> maxFetchRetriesPerMap in the reduce task can sometimes be zero (when maxMapRunTime is less than 4 seconds or mapred.reduce.copy.backoff is less than 4). When it is, copy errors are not counted against the reduce task, so they never add up to kill it.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4246) Reduce task copy errors may not kill it eventually

Posted by "Jothi Padmanabhan (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12634369#action_12634369 ] 

Jothi Padmanabhan commented on HADOOP-4246:
-------------------------------------------

The patch looks good. A few minor comments:

* Since MAX_FAILED_UNIQUE_FETCHES is no longer a constant, it should be named maxFailedUniqueFetches

* getClosestPowerOf2 will not return negative numbers. So, this piece of code 
{code}  
if (this.maxFetchRetriesPerMap < 1) {
  this.maxFetchRetriesPerMap = 1;
}
{code}
should be modified to
{code}
if (this.maxFetchRetriesPerMap == 0) {
  this.maxFetchRetriesPerMap = 1;
}
{code}

for better clarity
* For the backoff value for a GENERIC_ERROR, should we just back off by a fixed amount and retry? The concern here is that if we are hitting a disk-out-of-space exception, we are better off identifying it early rather than late. If the map_run_time is high, we might actually spend a lot of time before the jobtracker gets notified. Thoughts?
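
To illustrate, one way the fixed backoff could look; the constant and the exponential branch here are invented for the sketch, not taken from the patch:
{code}
// Hypothetical sketch: use a fixed, short backoff for GENERIC_ERROR so that
// persistent local problems (e.g. disk full) surface quickly, while keeping
// an exponential backoff for transient read/connect failures.
long backoffMillis = (errorType == CopyOutputErrorType.GENERIC_ERROR)
    ? 5000L                                // fixed amount, assumed value
    : initialBackoffMillis << numRetries;  // assumed exponential scheme
{code}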


> Reduce task copy errors may not kill it eventually
> --------------------------------------------------
>
>                 Key: HADOOP-4246
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4246
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.19.0
>            Reporter: Amareshwari Sriramadasu
>            Assignee: Amareshwari Sriramadasu
>            Priority: Blocker
>             Fix For: 0.19.0
>
>         Attachments: patch-4246.txt
>
>
> maxFetchRetriesPerMap in the reduce task can sometimes be zero (when maxMapRunTime is less than 4 seconds or mapred.reduce.copy.backoff is less than 4). When it is, copy errors are not counted against the reduce task, so they never add up to kill it.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4246) Reduce task copy errors may not kill it eventually

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12636264#action_12636264 ] 

Hadoop QA commented on HADOOP-4246:
-----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12391263/patch-4246.txt
  against trunk revision 700923.

    +1 @author.  The patch does not contain any @author tags.

    -1 tests included.  The patch doesn't appear to include any new or modified tests.
                        Please justify why no tests are needed for this patch.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

    -1 core tests.  The patch failed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3417/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3417/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3417/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3417/console

This message is automatically generated.

> Reduce task copy errors may not kill it eventually
> --------------------------------------------------
>
>                 Key: HADOOP-4246
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4246
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.19.0
>            Reporter: Amareshwari Sriramadasu
>            Assignee: Amareshwari Sriramadasu
>            Priority: Blocker
>             Fix For: 0.19.0
>
>         Attachments: patch-4246.txt, patch-4246.txt, patch-4246.txt, patch-4246.txt
>
>
> maxFetchRetriesPerMap in the reduce task can sometimes be zero (when maxMapRunTime is less than 4 seconds or mapred.reduce.copy.backoff is less than 4). When it is, copy errors are not counted against the reduce task, so they never add up to kill it.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.