Posted to mapreduce-issues@hadoop.apache.org by "Jason Lowe (JIRA)" <ji...@apache.org> on 2012/11/22 02:56:58 UTC

[jira] [Created] (MAPREDUCE-4817) Hardcoded task ping timeout kills tasks localizing large amounts of data

Jason Lowe created MAPREDUCE-4817:
-------------------------------------

             Summary: Hardcoded task ping timeout kills tasks localizing large amounts of data
                 Key: MAPREDUCE-4817
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4817
             Project: Hadoop Map/Reduce
          Issue Type: Bug
          Components: applicationmaster, mr-am
    Affects Versions: 0.23.3, 2.0.3-alpha
            Reporter: Jason Lowe


When a task is launched and spends more than 5 minutes localizing files, the AM will kill the task due to ping timeout.  The AM's TaskHeartbeatHandler currently tracks tasks via a progress timeout and a ping timeout.  The progress timeout can be controlled via mapreduce.task.timeout and even disabled by setting the property to 0.  The ping timeout, however, is hardcoded to 5 minutes and cannot be configured.  Therefore, if the task takes too long localizing, it never starts running and so never pings the AM, and the AM kills it due to ping timeout.
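
To make the failure mode concrete, here is a minimal Java sketch of the two checks as described above. It is not the actual TaskHeartbeatHandler code: the class, method, and parameter names are hypothetical. The only values taken from this report are the 5-minute ping limit and the rule that setting mapreduce.task.timeout to 0 disables the progress check.

```java
// Hypothetical sketch of the checks described in this report;
// not the actual TaskHeartbeatHandler implementation.
final class HeartbeatCheckSketch {
    // Per the report, the ping timeout is fixed at 5 minutes.
    static final long PING_TIMEOUT_MS = 5 * 60 * 1000L;

    /**
     * Returns true if the AM would consider the task timed out.
     *
     * @param nowMs          current time
     * @param lastPingMs     time of the last ping (or registration time if none yet)
     * @param lastProgressMs time of the last progress report
     * @param taskTimeoutMs  value of mapreduce.task.timeout; 0 disables the progress check
     */
    static boolean timedOut(long nowMs, long lastPingMs,
                            long lastProgressMs, long taskTimeoutMs) {
        boolean pingExpired = nowMs > lastPingMs + PING_TIMEOUT_MS;
        boolean progressExpired =
            taskTimeoutMs > 0 && nowMs > lastProgressMs + taskTimeoutMs;
        return pingExpired || progressExpired;
    }

    public static void main(String[] args) {
        long registeredAt = 0L;
        long sixMinutes = 6 * 60 * 1000L;
        // A task still localizing after 6 minutes has never pinged.
        // Even with the progress timeout disabled (0), the ping check fires.
        System.out.println(timedOut(sixMinutes, registeredAt, registeredAt, 0L)); // true
    }
}
```

In this model, raising or disabling mapreduce.task.timeout does not help a slow-localizing task, because the ping check applies regardless of that setting.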

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (MAPREDUCE-4817) Hardcoded task ping timeout kills tasks localizing large amounts of data

Posted by "Thomas Graves (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-4817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Graves updated MAPREDUCE-4817:
-------------------------------------

    Status: Patch Available  (was: Open)
    

[jira] [Updated] (MAPREDUCE-4817) Hardcoded task ping timeout kills tasks localizing large amounts of data

Posted by "Jason Lowe (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-4817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Lowe updated MAPREDUCE-4817:
----------------------------------

    Priority: Critical  (was: Major)

One possible workaround is to abuse the mapreduce.task.timeout.check-interval-ms property with a large value, effectively disabling the timeout checking.  Not ideal since runaway or zombie tasks will no longer be detected as they were before.
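
As a rough illustration of why a very large check interval hides timeouts, the Java sketch below assumes the checker wakes up at fixed multiples of the interval and can only notice an expired task on a wake-up. That scheduling model, the 30-second example interval, and every name in the code are assumptions for illustration, not the AM's actual logic.

```java
// Hypothetical model: the checker wakes at t = interval, 2*interval, ...
// and an expired task is only noticed on a wake-up.
final class CheckIntervalSketch {
    /** Time of the first wake-up at or after the task's expiry time. */
    static long firstDetectionMs(long expiryMs, long checkIntervalMs) {
        long wakeUps = expiryMs / checkIntervalMs;
        if (expiryMs % checkIntervalMs != 0) {
            wakeUps++; // expiry falls between wake-ups; wait for the next one
        }
        return wakeUps * checkIntervalMs;
    }

    public static void main(String[] args) {
        long pingExpiryMs = 5 * 60 * 1000L;

        // With a short example interval, expiry is noticed promptly.
        System.out.println(firstDetectionMs(pingExpiryMs, 30_000L)); // 300000

        // With an interval of one year, the same expiry goes unnoticed
        // until long after any realistic job has finished.
        long oneYearMs = 365L * 24 * 60 * 60 * 1000;
        System.out.println(firstDetectionMs(pingExpiryMs, oneYearMs)); // 31536000000
    }
}
```

The same delay applies to every timeout the checker enforces, which is why runaway or zombie tasks also go undetected under this workaround.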
                

[jira] [Commented] (MAPREDUCE-4817) Hardcoded task ping timeout kills tasks localizing large amounts of data

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-4817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13506439#comment-13506439 ] 

Hudson commented on MAPREDUCE-4817:
-----------------------------------

Integrated in Hadoop-Hdfs-0.23-Build #450 (See [https://builds.apache.org/job/Hadoop-Hdfs-0.23-Build/450/])
    merge -r 1414872:1414873 from trunk. FIXES: MAPREDUCE-4817 (Revision 1414875)

     Result = SUCCESS
tgraves : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1414875
Files : 
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapred/TaskAttemptListenerImpl.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/TaskHeartbeatHandler.java

                

[jira] [Updated] (MAPREDUCE-4817) Hardcoded task ping timeout kills tasks localizing large amounts of data

Posted by "Thomas Graves (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-4817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Graves updated MAPREDUCE-4817:
-------------------------------------

    Attachment: MAPREDUCE-4817.patch

Here is the patch that adds the config for the ping timeout.  I am attaching it because it was already finished before the other comments came in, in case we want to go that way.
                

[jira] [Commented] (MAPREDUCE-4817) Hardcoded task ping timeout kills tasks localizing large amounts of data

Posted by "Thomas Graves (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-4817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13504777#comment-13504777 ] 

Thomas Graves commented on MAPREDUCE-4817:
------------------------------------------

When you say knock off the ping thread, I assume you really mean just the ping timeout check, since the task progress reporting happens in the same thread?

So the ping serves multiple purposes.  First, it notifies the AM that the task has "pinged" in and is still running.  This could be useful even with taskTimeout, since the taskTimeout could be turned off (set to 0), in which case we would never know if the task hung.  Second, the task uses it to check whether the AM is still alive.  If the ping doesn't return true, the task is supposed to exit.  1.X also had the ping check, but it went to the TaskTracker, and the TaskTracker validated that the parent Task of the ping checker thread was still there.

Now with 0.23, the nodemanager watches the processes and reports back to the RM when the AM dies, and if the AM died the other tasks are killed.  But if the entire nodemanager goes down, the task doesn't know the AM went away.  If the task isn't sending progress, the task timeout is set to 0, and this is the last AM retry, the task could hang around forever.

The odds of that seem pretty small, and I guess if we aren't worried about the first case happening, the second probably isn't that interesting either.  But we could also just remove the ping timeout check in the TaskHeartbeatHandler.  What exactly are you proposing?
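
The task-side half of the ping described above can be sketched as follows. This is a minimal Java illustration, not the real task code: the BooleanSupplier stands in for the RPC to the AM, and the class name, method name, and maxPings bound are hypothetical additions so that the example terminates.

```java
import java.util.function.BooleanSupplier;

// Hypothetical sketch of the task-side ping loop; names are illustrative only.
final class TaskPingLoopSketch {
    /**
     * Pings until the AM stops answering true or maxPings is reached.
     * Returns the number of successful pings.
     */
    static int pingUntilAmGone(BooleanSupplier pingAm, int maxPings) {
        int successful = 0;
        for (int i = 0; i < maxPings; i++) {
            if (!pingAm.getAsBoolean()) {
                // Per the comment above, a task whose ping does not
                // return true is supposed to exit at this point.
                break;
            }
            successful++;
        }
        return successful;
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Stub AM that answers the first two pings and then goes away.
        int ok = pingUntilAmGone(() -> ++calls[0] <= 2, 10);
        System.out.println(ok); // 2
    }
}
```

Note that this loop only protects against a task outliving its AM; it says nothing about whether the AM should kill a task that has not pinged yet.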
                

[jira] [Commented] (MAPREDUCE-4817) Hardcoded task ping timeout kills tasks localizing large amounts of data

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-4817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13505677#comment-13505677 ] 

Hadoop QA commented on MAPREDUCE-4817:
--------------------------------------

{color:red}-1 overall{color}.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12555186/MAPREDUCE-4817.patch
  against trunk revision .

    {color:green}+1 @author{color}.  The patch does not contain any @author tags.

    {color:red}-1 tests included{color}.  The patch doesn't appear to include any new or modified tests.
                        Please justify why no new tests are needed for this patch.
                        Also please list what manual steps were performed to verify this patch.

    {color:green}+1 javac{color}.  The applied patch does not increase the total number of javac compiler warnings.

    {color:green}+1 javadoc{color}.  The javadoc tool did not generate any warning messages.

    {color:green}+1 eclipse:eclipse{color}.  The patch built with eclipse:eclipse.

    {color:green}+1 findbugs{color}.  The patch does not introduce any new Findbugs (version 1.3.9) warnings.

    {color:green}+1 release audit{color}.  The applied patch does not increase the total number of release audit warnings.

    {color:green}+1 core tests{color}.  The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app.

    {color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3073//testReport/
Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3073//console

This message is automatically generated.
                

[jira] [Commented] (MAPREDUCE-4817) Hardcoded task ping timeout kills tasks localizing large amounts of data

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-4817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13505817#comment-13505817 ] 

Hudson commented on MAPREDUCE-4817:
-----------------------------------

Integrated in Hadoop-trunk-Commit #3070 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/3070/])
    MAPREDUCE-4817. Hardcoded task ping timeout kills tasks localizing large amounts of data (tgraves) (Revision 1414873)

     Result = FAILURE
tgraves : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1414873
Files : 
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapred/TaskAttemptListenerImpl.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/TaskHeartbeatHandler.java

                

[jira] [Commented] (MAPREDUCE-4817) Hardcoded task ping timeout kills tasks localizing large amounts of data

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-4817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13506452#comment-13506452 ] 

Hudson commented on MAPREDUCE-4817:
-----------------------------------

Integrated in Hadoop-Hdfs-trunk #1241 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1241/])
    MAPREDUCE-4817. Hardcoded task ping timeout kills tasks localizing large amounts of data (tgraves) (Revision 1414873)

     Result = FAILURE
tgraves : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1414873
Files : 
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapred/TaskAttemptListenerImpl.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/TaskHeartbeatHandler.java

                

[jira] [Commented] (MAPREDUCE-4817) Hardcoded task ping timeout kills tasks localizing large amounts of data

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-4817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13506384#comment-13506384 ] 

Hudson commented on MAPREDUCE-4817:
-----------------------------------

Integrated in Hadoop-Yarn-trunk #51 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/51/])
    MAPREDUCE-4817. Hardcoded task ping timeout kills tasks localizing large amounts of data (tgraves) (Revision 1414873)

     Result = SUCCESS
tgraves : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1414873
Files : 
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapred/TaskAttemptListenerImpl.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/TaskHeartbeatHandler.java

                

[jira] [Commented] (MAPREDUCE-4817) Hardcoded task ping timeout kills tasks localizing large amounts of data

Posted by "Thomas Graves (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-4817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13504080#comment-13504080 ] 

Thomas Graves commented on MAPREDUCE-4817:
------------------------------------------

This JIRA will make the ping timeout configurable.  MAPREDUCE-4818 will be for the actual fix to the issue.
                

[jira] [Commented] (MAPREDUCE-4817) Hardcoded task ping timeout kills tasks localizing large amounts of data

Posted by "Robert Joseph Evans (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-4817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13505636#comment-13505636 ] 

Robert Joseph Evans commented on MAPREDUCE-4817:
------------------------------------------------

The patch is simple and straightforward.  I am +1, assuming that Jenkins is OK with it.  I am not sure that we need to update the task.  The ping is used to check whether the task can still reach the AM.  If you want to remove it, go ahead and file a JIRA, but it may have further ramifications.
                

[jira] [Commented] (MAPREDUCE-4817) Hardcoded task ping timeout kills tasks localizing large amounts of data

Posted by "Vinod Kumar Vavilapalli (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-4817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13504333#comment-13504333 ] 

Vinod Kumar Vavilapalli commented on MAPREDUCE-4817:
----------------------------------------------------

I was going to +1 the proposal, but wasn't comfortable, so I dug into some history. Please see my comments at MAPREDUCE-4089. Like I proposed there, should we knock off the ping thread altogether?

But the problem persists in that, for large enough local resources, the taskTimeout may eventually fire instead, which we'll need to address. So, shall we just knock off the ping thread here and let MAPREDUCE-4818 put in the real fix?
                

[jira] [Updated] (MAPREDUCE-4817) Hardcoded task ping timeout kills tasks localizing large amounts of data

Posted by "Thomas Graves (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-4817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Graves updated MAPREDUCE-4817:
-------------------------------------

    Attachment: MAPREDUCE-4817.patch

This patch removes the ping timeout check from the AM's task heartbeat handler.  If we want to remove the other side from each Task, we can do that in a separate JIRA.
                

[jira] [Assigned] (MAPREDUCE-4817) Hardcoded task ping timeout kills tasks localizing large amounts of data

Posted by "Thomas Graves (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-4817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Graves reassigned MAPREDUCE-4817:
----------------------------------------

    Assignee: Thomas Graves
    

[jira] [Updated] (MAPREDUCE-4817) Hardcoded task ping timeout kills tasks localizing large amounts of data

Posted by "Thomas Graves (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-4817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Graves updated MAPREDUCE-4817:
-------------------------------------

          Resolution: Fixed
       Fix Version/s: 0.23.6
                      2.0.3-alpha
                      3.0.0
    Target Version/s: 3.0.0, 2.0.3-alpha, 0.23.6
              Status: Resolved  (was: Patch Available)

Thanks Bobby, I've committed this.
                