You are viewing a plain text version of this content. The canonical link for it is here.

Posted to mapreduce-issues@hadoop.apache.org by "Robert Joseph Evans (JIRA)" <ji...@apache.org> on 2012/11/05 21:20:12 UTC

[jira] [Created] (MAPREDUCE-4772) Fetch failures can take way too long for a map to be restarted

Robert Joseph Evans created MAPREDUCE-4772:
----------------------------------------------

             Summary: Fetch failures can take way too long for a map to be restarted
                 Key: MAPREDUCE-4772
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4772
             Project: Hadoop Map/Reduce
          Issue Type: Bug
          Components: mrv2
    Affects Versions: 0.23.4
            Reporter: Robert Joseph Evans
            Assignee: Robert Joseph Evans
            Priority: Critical


In one particular case we saw a NM go down at just the right time, that most of the reducers got the output of the map tasks, but not all of them.

The ones that failed to get the output reported to the AM rather quickly that they could not fetch from the NM, but because the other reducers were still running the AM would not relaunch the map task because there weren't more than 50% of the running reducers that had reported fetch failures.  Then because of the exponential back-off for fetches on the reducers it took until 1 hour 45 min for the reduce tasks to hit another 10 fetch failures and report in again. At that point the other reducers had finished and the job relaunched the map task.  If the reducers had still been running at 1:45 I have no idea how long it would have taken for each of the tasks to get to 30 fetch failures.

We need to trigger the map based off of percentage of reducers shuffling, not percentage of reducers running, we also need to have a maximum limit of the back off, so that we don't ever have the reducer waiting for days to try and fetch map output.  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (MAPREDUCE-4772) Fetch failures can take way too long for a map to be restarted

Posted by "Robert Joseph Evans (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAPREDUCE-4772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Joseph Evans updated MAPREDUCE-4772:
-------------------------------------------

    Priority: Critical  (was: Major)
    
> Fetch failures can take way too long for a map to be restarted
> --------------------------------------------------------------
>
>                 Key: MAPREDUCE-4772
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4772
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.4
>            Reporter: Robert Joseph Evans
>            Assignee: Robert Joseph Evans
>            Priority: Critical
>
> In one particular case we saw a NM go down at just the right time, that most of the reducers got the output of the map tasks, but not all of them.
> The ones that failed to get the output reported to the AM rather quickly that they could not fetch from the NM, but because the other reducers were still running the AM would not relaunch the map task because there weren't more than 50% of the running reducers that had reported fetch failures.  Then because of the exponential back-off for fetches on the reducers it took until 1 hour 45 min for the reduce tasks to hit another 10 fetch failures and report in again. At that point the other reducers had finished and the job relaunched the map task.  If the reducers had still been running at 1:45 I have no idea how long it would have taken for each of the tasks to get to 30 fetch failures.
> We need to trigger the map based off of percentage of reducers shuffling, not percentage of reducers running, we also need to have a maximum limit of the back off, so that we don't ever have the reducer waiting for days to try and fetch map output.  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAPREDUCE-4772) Fetch failures can take way too long for a map to be restarted

Posted by "Hudson (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-4772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13493899#comment-13493899 ] 

Hudson commented on MAPREDUCE-4772:
-----------------------------------

Integrated in Hadoop-Yarn-trunk #31 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/31/])
    MAPREDUCE-4772. Fetch failures can take way too long for a map to be restarted (bobby) (Revision 1407118)

     Result = SUCCESS
bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1407118
Files : 
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/JobImpl.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestFetchFailure.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/MRJobConfig.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/task/reduce/Fetcher.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/task/reduce/ShuffleScheduler.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/resources/mapred-default.xml
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/test/java/org/apache/hadoop/mapreduce/task/reduce/TestFetcher.java

                
> Fetch failures can take way too long for a map to be restarted
> --------------------------------------------------------------
>
>                 Key: MAPREDUCE-4772
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4772
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.4
>            Reporter: Robert Joseph Evans
>            Assignee: Robert Joseph Evans
>            Priority: Critical
>             Fix For: 3.0.0, 2.0.3-alpha, 0.23.5
>
>         Attachments: MR-4772-0.23.txt, MR-4772-trunk.txt
>
>
> In one particular case we saw a NM go down at just the right time, that most of the reducers got the output of the map tasks, but not all of them.
> The ones that failed to get the output reported to the AM rather quickly that they could not fetch from the NM, but because the other reducers were still running the AM would not relaunch the map task because there weren't more than 50% of the running reducers that had reported fetch failures.  Then because of the exponential back-off for fetches on the reducers it took until 1 hour 45 min for the reduce tasks to hit another 10 fetch failures and report in again. At that point the other reducers had finished and the job relaunched the map task.  If the reducers had still been running at 1:45 I have no idea how long it would have taken for each of the tasks to get to 30 fetch failures.
> We need to trigger the map based off of percentage of reducers shuffling, not percentage of reducers running, we also need to have a maximum limit of the back off, so that we don't ever have the reducer waiting for days to try and fetch map output.  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAPREDUCE-4772) Fetch failures can take way too long for a map to be restarted

Posted by "Hudson (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-4772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13493253#comment-13493253 ] 

Hudson commented on MAPREDUCE-4772:
-----------------------------------

Integrated in Hadoop-trunk-Commit #2979 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/2979/])
    MAPREDUCE-4772. Fetch failures can take way too long for a map to be restarted (bobby) (Revision 1407118)

     Result = SUCCESS
bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1407118
Files : 
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/JobImpl.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestFetchFailure.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/MRJobConfig.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/task/reduce/Fetcher.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/task/reduce/ShuffleScheduler.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/resources/mapred-default.xml
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/test/java/org/apache/hadoop/mapreduce/task/reduce/TestFetcher.java

                
> Fetch failures can take way too long for a map to be restarted
> --------------------------------------------------------------
>
>                 Key: MAPREDUCE-4772
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4772
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.4
>            Reporter: Robert Joseph Evans
>            Assignee: Robert Joseph Evans
>            Priority: Critical
>         Attachments: MR-4772-0.23.txt, MR-4772-trunk.txt
>
>
> In one particular case we saw a NM go down at just the right time, that most of the reducers got the output of the map tasks, but not all of them.
> The ones that failed to get the output reported to the AM rather quickly that they could not fetch from the NM, but because the other reducers were still running the AM would not relaunch the map task because there weren't more than 50% of the running reducers that had reported fetch failures.  Then because of the exponential back-off for fetches on the reducers it took until 1 hour 45 min for the reduce tasks to hit another 10 fetch failures and report in again. At that point the other reducers had finished and the job relaunched the map task.  If the reducers had still been running at 1:45 I have no idea how long it would have taken for each of the tasks to get to 30 fetch failures.
> We need to trigger the map based off of percentage of reducers shuffling, not percentage of reducers running, we also need to have a maximum limit of the back off, so that we don't ever have the reducer waiting for days to try and fetch map output.  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAPREDUCE-4772) Fetch failures can take way too long for a map to be restarted

Posted by "Jonathan Eagles (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-4772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13492919#comment-13492919 ] 

Jonathan Eagles commented on MAPREDUCE-4772:
--------------------------------------------

+1
                
> Fetch failures can take way too long for a map to be restarted
> --------------------------------------------------------------
>
>                 Key: MAPREDUCE-4772
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4772
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.4
>            Reporter: Robert Joseph Evans
>            Assignee: Robert Joseph Evans
>            Priority: Critical
>         Attachments: MR-4772-0.23.txt, MR-4772-trunk.txt
>
>
> In one particular case we saw a NM go down at just the right time, that most of the reducers got the output of the map tasks, but not all of them.
> The ones that failed to get the output reported to the AM rather quickly that they could not fetch from the NM, but because the other reducers were still running the AM would not relaunch the map task because there weren't more than 50% of the running reducers that had reported fetch failures.  Then because of the exponential back-off for fetches on the reducers it took until 1 hour 45 min for the reduce tasks to hit another 10 fetch failures and report in again. At that point the other reducers had finished and the job relaunched the map task.  If the reducers had still been running at 1:45 I have no idea how long it would have taken for each of the tasks to get to 30 fetch failures.
> We need to trigger the map based off of percentage of reducers shuffling, not percentage of reducers running, we also need to have a maximum limit of the back off, so that we don't ever have the reducer waiting for days to try and fetch map output.  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAPREDUCE-4772) Fetch failures can take way too long for a map to be restarted

Posted by "Robert Joseph Evans (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-4772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13491784#comment-13491784 ] 

Robert Joseph Evans commented on MAPREDUCE-4772:
------------------------------------------------

OK so I missed some of the code in shuffleScheduler.checkReducerHealth().  The stall check is in there, but the previous check for a single map attempt is completely useless at this point.  Dropping the severity accordingly.
                
> Fetch failures can take way too long for a map to be restarted
> --------------------------------------------------------------
>
>                 Key: MAPREDUCE-4772
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4772
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.4
>            Reporter: Robert Joseph Evans
>            Assignee: Robert Joseph Evans
>
> In one particular case we saw a NM go down at just the right time, that most of the reducers got the output of the map tasks, but not all of them.
> The ones that failed to get the output reported to the AM rather quickly that they could not fetch from the NM, but because the other reducers were still running the AM would not relaunch the map task because there weren't more than 50% of the running reducers that had reported fetch failures.  Then because of the exponential back-off for fetches on the reducers it took until 1 hour 45 min for the reduce tasks to hit another 10 fetch failures and report in again. At that point the other reducers had finished and the job relaunched the map task.  If the reducers had still been running at 1:45 I have no idea how long it would have taken for each of the tasks to get to 30 fetch failures.
> We need to trigger the map based off of percentage of reducers shuffling, not percentage of reducers running, we also need to have a maximum limit of the back off, so that we don't ever have the reducer waiting for days to try and fetch map output.  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAPREDUCE-4772) Fetch failures can take way too long for a map to be restarted

Posted by "Hudson (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-4772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13493981#comment-13493981 ] 

Hudson commented on MAPREDUCE-4772:
-----------------------------------

Integrated in Hadoop-Hdfs-trunk #1221 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1221/])
    MAPREDUCE-4772. Fetch failures can take way too long for a map to be restarted (bobby) (Revision 1407118)

     Result = SUCCESS
bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1407118
Files : 
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/JobImpl.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestFetchFailure.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/MRJobConfig.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/task/reduce/Fetcher.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/task/reduce/ShuffleScheduler.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/resources/mapred-default.xml
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/test/java/org/apache/hadoop/mapreduce/task/reduce/TestFetcher.java

                
> Fetch failures can take way too long for a map to be restarted
> --------------------------------------------------------------
>
>                 Key: MAPREDUCE-4772
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4772
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.4
>            Reporter: Robert Joseph Evans
>            Assignee: Robert Joseph Evans
>            Priority: Critical
>             Fix For: 3.0.0, 2.0.3-alpha, 0.23.5
>
>         Attachments: MR-4772-0.23.txt, MR-4772-trunk.txt
>
>
> In one particular case we saw a NM go down at just the right time, that most of the reducers got the output of the map tasks, but not all of them.
> The ones that failed to get the output reported to the AM rather quickly that they could not fetch from the NM, but because the other reducers were still running the AM would not relaunch the map task because there weren't more than 50% of the running reducers that had reported fetch failures.  Then because of the exponential back-off for fetches on the reducers it took until 1 hour 45 min for the reduce tasks to hit another 10 fetch failures and report in again. At that point the other reducers had finished and the job relaunched the map task.  If the reducers had still been running at 1:45 I have no idea how long it would have taken for each of the tasks to get to 30 fetch failures.
> We need to trigger the map based off of percentage of reducers shuffling, not percentage of reducers running, we also need to have a maximum limit of the back off, so that we don't ever have the reducer waiting for days to try and fetch map output.  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (MAPREDUCE-4772) Fetch failures can take way too long for a map to be restarted

Posted by "Robert Joseph Evans (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAPREDUCE-4772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Joseph Evans updated MAPREDUCE-4772:
-------------------------------------------

       Resolution: Fixed
    Fix Version/s: 0.23.5
                   2.0.3-alpha
                   3.0.0
           Status: Resolved  (was: Patch Available)

Thanks for the review Jon,

I checked the code into trunk, branch-2, and branch-0.23
                
> Fetch failures can take way too long for a map to be restarted
> --------------------------------------------------------------
>
>                 Key: MAPREDUCE-4772
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4772
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.4
>            Reporter: Robert Joseph Evans
>            Assignee: Robert Joseph Evans
>            Priority: Critical
>             Fix For: 3.0.0, 2.0.3-alpha, 0.23.5
>
>         Attachments: MR-4772-0.23.txt, MR-4772-trunk.txt
>
>
> In one particular case we saw a NM go down at just the right time, that most of the reducers got the output of the map tasks, but not all of them.
> The ones that failed to get the output reported to the AM rather quickly that they could not fetch from the NM, but because the other reducers were still running the AM would not relaunch the map task because there weren't more than 50% of the running reducers that had reported fetch failures.  Then because of the exponential back-off for fetches on the reducers it took until 1 hour 45 min for the reduce tasks to hit another 10 fetch failures and report in again. At that point the other reducers had finished and the job relaunched the map task.  If the reducers had still been running at 1:45 I have no idea how long it would have taken for each of the tasks to get to 30 fetch failures.
> We need to trigger the map based off of percentage of reducers shuffling, not percentage of reducers running, we also need to have a maximum limit of the back off, so that we don't ever have the reducer waiting for days to try and fetch map output.  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAPREDUCE-4772) Fetch failures can take way too long for a map to be restarted

Posted by "Robert Joseph Evans (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-4772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13492799#comment-13492799 ] 

Robert Joseph Evans commented on MAPREDUCE-4772:
------------------------------------------------

OK So all of the javac warnings relate to the fact that the event dispatcher is a generic but is not really fully typed.  I could put in @IgnoreWarning, but that just masks the known problem.  Sorry I don't really see any way without some serious changes to the event dispatcher API.  
                
> Fetch failures can take way too long for a map to be restarted
> --------------------------------------------------------------
>
>                 Key: MAPREDUCE-4772
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4772
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.4
>            Reporter: Robert Joseph Evans
>            Assignee: Robert Joseph Evans
>            Priority: Critical
>         Attachments: MR-4772-0.23.txt, MR-4772-trunk.txt
>
>
> In one particular case we saw a NM go down at just the right time, that most of the reducers got the output of the map tasks, but not all of them.
> The ones that failed to get the output reported to the AM rather quickly that they could not fetch from the NM, but because the other reducers were still running the AM would not relaunch the map task because there weren't more than 50% of the running reducers that had reported fetch failures.  Then because of the exponential back-off for fetches on the reducers it took until 1 hour 45 min for the reduce tasks to hit another 10 fetch failures and report in again. At that point the other reducers had finished and the job relaunched the map task.  If the reducers had still been running at 1:45 I have no idea how long it would have taken for each of the tasks to get to 30 fetch failures.
> We need to trigger the map based off of percentage of reducers shuffling, not percentage of reducers running, we also need to have a maximum limit of the back off, so that we don't ever have the reducer waiting for days to try and fetch map output.  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAPREDUCE-4772) Fetch failures can take way too long for a map to be restarted

Posted by "Robert Joseph Evans (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-4772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13492796#comment-13492796 ] 

Robert Joseph Evans commented on MAPREDUCE-4772:
------------------------------------------------

I am looking into the javac warnings.
                
> Fetch failures can take way too long for a map to be restarted
> --------------------------------------------------------------
>
>                 Key: MAPREDUCE-4772
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4772
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.4
>            Reporter: Robert Joseph Evans
>            Assignee: Robert Joseph Evans
>            Priority: Critical
>         Attachments: MR-4772-0.23.txt, MR-4772-trunk.txt
>
>
> In one particular case we saw a NM go down at just the right time, that most of the reducers got the output of the map tasks, but not all of them.
> The ones that failed to get the output reported to the AM rather quickly that they could not fetch from the NM, but because the other reducers were still running the AM would not relaunch the map task because there weren't more than 50% of the running reducers that had reported fetch failures.  Then because of the exponential back-off for fetches on the reducers it took until 1 hour 45 min for the reduce tasks to hit another 10 fetch failures and report in again. At that point the other reducers had finished and the job relaunched the map task.  If the reducers had still been running at 1:45 I have no idea how long it would have taken for each of the tasks to get to 30 fetch failures.
> We need to trigger the map based off of percentage of reducers shuffling, not percentage of reducers running, we also need to have a maximum limit of the back off, so that we don't ever have the reducer waiting for days to try and fetch map output.  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAPREDUCE-4772) Fetch failures can take way too long for a map to be restarted

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-4772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13492620#comment-13492620 ] 

Hadoop QA commented on MAPREDUCE-4772:
--------------------------------------

{color:red}-1 overall{color}.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12552513/MR-4772-trunk.txt
  against trunk revision .

    {color:green}+1 @author{color}.  The patch does not contain any @author tags.

    {color:green}+1 tests included{color}.  The patch appears to include 2 new or modified test files.

      {color:red}-1 javac{color}.  The applied patch generated 2031 javac compiler warnings (more than the trunk's current 2025 warnings).

    {color:green}+1 javadoc{color}.  The javadoc tool did not generate any warning messages.

    {color:green}+1 eclipse:eclipse{color}.  The patch built with eclipse:eclipse.

    {color:green}+1 findbugs{color}.  The patch does not introduce any new Findbugs (version 1.3.9) warnings.

    {color:green}+1 release audit{color}.  The applied patch does not increase the total number of release audit warnings.

    {color:green}+1 core tests{color}.  The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core.

    {color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2993//testReport/
Javac warnings: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2993//artifact/trunk/patchprocess/diffJavacWarnings.txt
Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2993//console

This message is automatically generated.
                
> Fetch failures can take way too long for a map to be restarted
> --------------------------------------------------------------
>
>                 Key: MAPREDUCE-4772
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4772
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.4
>            Reporter: Robert Joseph Evans
>            Assignee: Robert Joseph Evans
>            Priority: Critical
>         Attachments: MR-4772-0.23.txt, MR-4772-trunk.txt
>
>
> In one particular case we saw a NM go down at just the right time, that most of the reducers got the output of the map tasks, but not all of them.
> The ones that failed to get the output reported to the AM rather quickly that they could not fetch from the NM, but because the other reducers were still running the AM would not relaunch the map task because there weren't more than 50% of the running reducers that had reported fetch failures.  Then because of the exponential back-off for fetches on the reducers it took until 1 hour 45 min for the reduce tasks to hit another 10 fetch failures and report in again. At that point the other reducers had finished and the job relaunched the map task.  If the reducers had still been running at 1:45 I have no idea how long it would have taken for each of the tasks to get to 30 fetch failures.
> We need to trigger the map based off of percentage of reducers shuffling, not percentage of reducers running, we also need to have a maximum limit of the back off, so that we don't ever have the reducer waiting for days to try and fetch map output.  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAPREDUCE-4772) Fetch failures can take way too long for a map to be restarted

Posted by "Robert Joseph Evans (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-4772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13492460#comment-13492460 ] 

Robert Joseph Evans commented on MAPREDUCE-4772:
------------------------------------------------

Boy I feel dumb, I commented on the wrong JIRA and reduced its priority. The above two comments were for MAPREDUCE-4775.  Moving this one back to critical.
                
> Fetch failures can take way too long for a map to be restarted
> --------------------------------------------------------------
>
>                 Key: MAPREDUCE-4772
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4772
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.4
>            Reporter: Robert Joseph Evans
>            Assignee: Robert Joseph Evans
>
> In one particular case we saw a NM go down at just the right time, that most of the reducers got the output of the map tasks, but not all of them.
> The ones that failed to get the output reported to the AM rather quickly that they could not fetch from the NM, but because the other reducers were still running the AM would not relaunch the map task because there weren't more than 50% of the running reducers that had reported fetch failures.  Then because of the exponential back-off for fetches on the reducers it took until 1 hour 45 min for the reduce tasks to hit another 10 fetch failures and report in again. At that point the other reducers had finished and the job relaunched the map task.  If the reducers had still been running at 1:45 I have no idea how long it would have taken for each of the tasks to get to 30 fetch failures.
> We need to trigger the map based off of percentage of reducers shuffling, not percentage of reducers running, we also need to have a maximum limit of the back off, so that we don't ever have the reducer waiting for days to try and fetch map output.  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (MAPREDUCE-4772) Fetch failures can take way too long for a map to be restarted

Posted by "Robert Joseph Evans (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAPREDUCE-4772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Joseph Evans updated MAPREDUCE-4772:
-------------------------------------------

    Priority: Major  (was: Critical)
    
> Fetch failures can take way too long for a map to be restarted
> --------------------------------------------------------------
>
>                 Key: MAPREDUCE-4772
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4772
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.4
>            Reporter: Robert Joseph Evans
>            Assignee: Robert Joseph Evans
>
> In one particular case we saw a NM go down at just the right time, that most of the reducers got the output of the map tasks, but not all of them.
> The ones that failed to get the output reported to the AM rather quickly that they could not fetch from the NM, but because the other reducers were still running the AM would not relaunch the map task because there weren't more than 50% of the running reducers that had reported fetch failures.  Then because of the exponential back-off for fetches on the reducers it took until 1 hour 45 min for the reduce tasks to hit another 10 fetch failures and report in again. At that point the other reducers had finished and the job relaunched the map task.  If the reducers had still been running at 1:45 I have no idea how long it would have taken for each of the tasks to get to 30 fetch failures.
> We need to trigger the map based off of percentage of reducers shuffling, not percentage of reducers running, we also need to have a maximum limit of the back off, so that we don't ever have the reducer waiting for days to try and fetch map output.  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (MAPREDUCE-4772) Fetch failures can take way too long for a map to be restarted

Posted by "Robert Joseph Evans (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAPREDUCE-4772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Joseph Evans updated MAPREDUCE-4772:
-------------------------------------------

    Attachment: MR-4772-trunk.txt
                MR-4772-0.23.txt

This patch changes the AM to restart a map task if 50% of the shuffling reducers report errors for a given map task instead of 50% of the running reducers.

It also changes how often a reduce reports fetch failures.  If a ConnectionException happens then it will report the error immediately and will not wait. A ConnectionException indicates that there is no one listening on the remote port.  This is very different from a timeout where the port is overrun and no one is able to get through.

It also adds in a maximum delay between fetch retries.  In the original code every time a fetch failure happened the reducer would add 30% to the delay.  It would also only report every 10th failure.  This means that the first failure would be reported after about 6 min, the second after 90 min and the third after 20 hours.  This is really bad when there is only one reducer because the AM requires at least three reports for the map task to be restarted.

The default maximum delay is set to 1 min which would change the numbers to be 6 min, 15 min, and 25 min respectively.  25 min still seems very long to wait, but is much better then 20 hours.
                
> Fetch failures can take way too long for a map to be restarted
> --------------------------------------------------------------
>
>                 Key: MAPREDUCE-4772
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4772
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.4
>            Reporter: Robert Joseph Evans
>            Assignee: Robert Joseph Evans
>            Priority: Critical
>         Attachments: MR-4772-0.23.txt, MR-4772-trunk.txt
>
>
> In one particular case we saw a NM go down at just the right time, that most of the reducers got the output of the map tasks, but not all of them.
> The ones that failed to get the output reported to the AM rather quickly that they could not fetch from the NM, but because the other reducers were still running the AM would not relaunch the map task because there weren't more than 50% of the running reducers that had reported fetch failures.  Then because of the exponential back-off for fetches on the reducers it took until 1 hour 45 min for the reduce tasks to hit another 10 fetch failures and report in again. At that point the other reducers had finished and the job relaunched the map task.  If the reducers had still been running at 1:45 I have no idea how long it would have taken for each of the tasks to get to 30 fetch failures.
> We need to trigger the map based off of percentage of reducers shuffling, not percentage of reducers running, we also need to have a maximum limit of the back off, so that we don't ever have the reducer waiting for days to try and fetch map output.  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAPREDUCE-4772) Fetch failures can take way too long for a map to be restarted

Posted by "Robert Joseph Evans (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-4772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13491785#comment-13491785 ] 

Robert Joseph Evans commented on MAPREDUCE-4772:
------------------------------------------------

I am also confused why a reducer could be stalled for over an hour (MAPREDUCE-4772) and not be killed.  I will look into that here too.
                
> Fetch failures can take way too long for a map to be restarted
> --------------------------------------------------------------
>
>                 Key: MAPREDUCE-4772
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4772
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.4
>            Reporter: Robert Joseph Evans
>            Assignee: Robert Joseph Evans
>
> In one particular case we saw a NM go down at just the right time, that most of the reducers got the output of the map tasks, but not all of them.
> The ones that failed to get the output reported to the AM rather quickly that they could not fetch from the NM, but because the other reducers were still running the AM would not relaunch the map task because there weren't more than 50% of the running reducers that had reported fetch failures.  Then because of the exponential back-off for fetches on the reducers it took until 1 hour 45 min for the reduce tasks to hit another 10 fetch failures and report in again. At that point the other reducers had finished and the job relaunched the map task.  If the reducers had still been running at 1:45 I have no idea how long it would have taken for each of the tasks to get to 30 fetch failures.
> We need to trigger the map based off of percentage of reducers shuffling, not percentage of reducers running, we also need to have a maximum limit of the back off, so that we don't ever have the reducer waiting for days to try and fetch map output.  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAPREDUCE-4772) Fetch failures can take way too long for a map to be restarted

Posted by "Jonathan Eagles (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-4772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13492755#comment-13492755 ] 

Jonathan Eagles commented on MAPREDUCE-4772:
--------------------------------------------

Changes are looking good. I agree with the fact that there is no way to configure infinite fetch failure backoff anymore as part of this JIRA. Is there anything that can be changed to remove the javac warnings here? 
                
> Fetch failures can take way too long for a map to be restarted
> --------------------------------------------------------------
>
>                 Key: MAPREDUCE-4772
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4772
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.4
>            Reporter: Robert Joseph Evans
>            Assignee: Robert Joseph Evans
>            Priority: Critical
>         Attachments: MR-4772-0.23.txt, MR-4772-trunk.txt
>
>
> In one particular case we saw a NM go down at just the right time, that most of the reducers got the output of the map tasks, but not all of them.
> The ones that failed to get the output reported to the AM rather quickly that they could not fetch from the NM, but because the other reducers were still running the AM would not relaunch the map task because there weren't more than 50% of the running reducers that had reported fetch failures.  Then because of the exponential back-off for fetches on the reducers it took until 1 hour 45 min for the reduce tasks to hit another 10 fetch failures and report in again. At that point the other reducers had finished and the job relaunched the map task.  If the reducers had still been running at 1:45 I have no idea how long it would have taken for each of the tasks to get to 30 fetch failures.
> We need to trigger the map based off of percentage of reducers shuffling, not percentage of reducers running, we also need to have a maximum limit of the back off, so that we don't ever have the reducer waiting for days to try and fetch map output.  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAPREDUCE-4772) Fetch failures can take way too long for a map to be restarted

Posted by "Hudson (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-4772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13493972#comment-13493972 ] 

Hudson commented on MAPREDUCE-4772:
-----------------------------------

Integrated in Hadoop-Hdfs-0.23-Build #430 (See [https://builds.apache.org/job/Hadoop-Hdfs-0.23-Build/430/])
    svn merge -c 1407118 FIXES: MAPREDUCE-4772. Fetch failures can take way too long for a map to be restarted (bobby) (Revision 1407128)

     Result = SUCCESS
bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1407128
Files : 
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/JobImpl.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestFetchFailure.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/MRJobConfig.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/task/reduce/Fetcher.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/task/reduce/ShuffleScheduler.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/resources/mapred-default.xml
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/test/java/org/apache/hadoop/mapreduce/task/reduce/TestFetcher.java

                
> Fetch failures can take way too long for a map to be restarted
> --------------------------------------------------------------
>
>                 Key: MAPREDUCE-4772
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4772
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.4
>            Reporter: Robert Joseph Evans
>            Assignee: Robert Joseph Evans
>            Priority: Critical
>             Fix For: 3.0.0, 2.0.3-alpha, 0.23.5
>
>         Attachments: MR-4772-0.23.txt, MR-4772-trunk.txt
>
>
> In one particular case we saw a NM go down at just the right time, that most of the reducers got the output of the map tasks, but not all of them.
> The ones that failed to get the output reported to the AM rather quickly that they could not fetch from the NM, but because the other reducers were still running the AM would not relaunch the map task because there weren't more than 50% of the running reducers that had reported fetch failures.  Then because of the exponential back-off for fetches on the reducers it took until 1 hour 45 min for the reduce tasks to hit another 10 fetch failures and report in again. At that point the other reducers had finished and the job relaunched the map task.  If the reducers had still been running at 1:45 I have no idea how long it would have taken for each of the tasks to get to 30 fetch failures.
> We need to trigger the map based off of percentage of reducers shuffling, not percentage of reducers running, we also need to have a maximum limit of the back off, so that we don't ever have the reducer waiting for days to try and fetch map output.  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (MAPREDUCE-4772) Fetch failures can take way too long for a map to be restarted

Posted by "Robert Joseph Evans (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAPREDUCE-4772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Joseph Evans updated MAPREDUCE-4772:
-------------------------------------------

    Status: Patch Available  (was: Open)
    
> Fetch failures can take way too long for a map to be restarted
> --------------------------------------------------------------
>
>                 Key: MAPREDUCE-4772
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4772
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.4
>            Reporter: Robert Joseph Evans
>            Assignee: Robert Joseph Evans
>            Priority: Critical
>         Attachments: MR-4772-0.23.txt, MR-4772-trunk.txt
>
>
> In one particular case we saw a NM go down at just the right time, that most of the reducers got the output of the map tasks, but not all of them.
> The ones that failed to get the output reported to the AM rather quickly that they could not fetch from the NM, but because the other reducers were still running the AM would not relaunch the map task because there weren't more than 50% of the running reducers that had reported fetch failures.  Then because of the exponential back-off for fetches on the reducers it took until 1 hour 45 min for the reduce tasks to hit another 10 fetch failures and report in again. At that point the other reducers had finished and the job relaunched the map task.  If the reducers had still been running at 1:45 I have no idea how long it would have taken for each of the tasks to get to 30 fetch failures.
> We need to trigger the map based off of percentage of reducers shuffling, not percentage of reducers running, we also need to have a maximum limit of the back off, so that we don't ever have the reducer waiting for days to try and fetch map output.  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAPREDUCE-4772) Fetch failures can take way too long for a map to be restarted

Posted by "Hudson (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-4772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13494006#comment-13494006 ] 

Hudson commented on MAPREDUCE-4772:
-----------------------------------

Integrated in Hadoop-Mapreduce-trunk #1251 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1251/])
    MAPREDUCE-4772. Fetch failures can take way too long for a map to be restarted (bobby) (Revision 1407118)

     Result = FAILURE
bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1407118
Files : 
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/JobImpl.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestFetchFailure.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/MRJobConfig.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/task/reduce/Fetcher.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/task/reduce/ShuffleScheduler.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/resources/mapred-default.xml
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/test/java/org/apache/hadoop/mapreduce/task/reduce/TestFetcher.java

                
> Fetch failures can take way too long for a map to be restarted
> --------------------------------------------------------------
>
>                 Key: MAPREDUCE-4772
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4772
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.4
>            Reporter: Robert Joseph Evans
>            Assignee: Robert Joseph Evans
>            Priority: Critical
>             Fix For: 3.0.0, 2.0.3-alpha, 0.23.5
>
>         Attachments: MR-4772-0.23.txt, MR-4772-trunk.txt
>
>
> In one particular case we saw a NM go down at just the right time, that most of the reducers got the output of the map tasks, but not all of them.
> The ones that failed to get the output reported to the AM rather quickly that they could not fetch from the NM, but because the other reducers were still running the AM would not relaunch the map task because there weren't more than 50% of the running reducers that had reported fetch failures.  Then because of the exponential back-off for fetches on the reducers it took until 1 hour 45 min for the reduce tasks to hit another 10 fetch failures and report in again. At that point the other reducers had finished and the job relaunched the map task.  If the reducers had still been running at 1:45 I have no idea how long it would have taken for each of the tasks to get to 30 fetch failures.
> We need to trigger the map based off of percentage of reducers shuffling, not percentage of reducers running, we also need to have a maximum limit of the back off, so that we don't ever have the reducer waiting for days to try and fetch map output.  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAPREDUCE-4772) Fetch failures can take way too long for a map to be restarted

Posted by "Robert Joseph Evans (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-4772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13492595#comment-13492595 ] 

Robert Joseph Evans commented on MAPREDUCE-4772:
------------------------------------------------

Oh the only differences between 0.23 and trunk is that 0.23 includes one extra include in JobImpl that was not needed by trunk.
                
> Fetch failures can take way too long for a map to be restarted
> --------------------------------------------------------------
>
>                 Key: MAPREDUCE-4772
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4772
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.4
>            Reporter: Robert Joseph Evans
>            Assignee: Robert Joseph Evans
>            Priority: Critical
>         Attachments: MR-4772-0.23.txt, MR-4772-trunk.txt
>
>
> In one particular case we saw a NM go down at just the right time, that most of the reducers got the output of the map tasks, but not all of them.
> The ones that failed to get the output reported to the AM rather quickly that they could not fetch from the NM, but because the other reducers were still running the AM would not relaunch the map task because there weren't more than 50% of the running reducers that had reported fetch failures.  Then because of the exponential back-off for fetches on the reducers it took until 1 hour 45 min for the reduce tasks to hit another 10 fetch failures and report in again. At that point the other reducers had finished and the job relaunched the map task.  If the reducers had still been running at 1:45 I have no idea how long it would have taken for each of the tasks to get to 30 fetch failures.
> We need to trigger the map based off of percentage of reducers shuffling, not percentage of reducers running, we also need to have a maximum limit of the back off, so that we don't ever have the reducer waiting for days to try and fetch map output.  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira