You are viewing a plain text version of this content. The canonical link for it is here.

Posted to mapreduce-issues@hadoop.apache.org by "Jason Lowe (Created) (JIRA)" <ji...@apache.org> on 2012/01/26 21:57:40 UTC

[jira] [Created] (MAPREDUCE-3738) NM can hang during shutdown if AppLogAggregatorImpl thread dies unexpectedly

NM can hang during shutdown if AppLogAggregatorImpl thread dies unexpectedly
----------------------------------------------------------------------------

                 Key: MAPREDUCE-3738
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3738
             Project: Hadoop Map/Reduce
          Issue Type: Bug
          Components: mrv2, nodemanager
    Affects Versions: 0.23.1, 0.24.0
            Reporter: Jason Lowe


If an AppLogAggregator thread dies unexpectedly (e.g.: uncaught exception like OutOfMemoryError in the case I saw) then this will lead to a hang during nodemanager shutdown.  The NM calls AppLogAggregatorImpl.join() during shutdown to make sure log aggregation has completed, and that method internally waits for an atomic boolean to be set by the log aggregation thread to indicate it has finished.  Since the thread was killed off earlier due to an uncaught exception, the boolean will never be set and the NM hangs during shutdown repeating something like this every second in the log file:

2012-01-25 22:20:56,366 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl: Waiting for aggregation to complete for application_1326848182580_2806

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAPREDUCE-3738) NM can hang during shutdown if AppLogAggregatorImpl thread dies unexpectedly

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13215325#comment-13215325 ] 

Hudson commented on MAPREDUCE-3738:
-----------------------------------

Integrated in Hadoop-Hdfs-0.23-Commit #574 (See [https://builds.apache.org/job/Hadoop-Hdfs-0.23-Commit/574/])
    merge MAPREDUCE-3738 from trunk (Revision 1293061)

     Result = SUCCESS
sseth : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1293061
Files : 
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/AppLogAggregatorImpl.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/TestLogAggregationService.java

                
> NM can hang during shutdown if AppLogAggregatorImpl thread dies unexpectedly
> ----------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3738
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3738
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2, nodemanager
>    Affects Versions: 0.23.1, 0.24.0
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>            Priority: Critical
>             Fix For: 0.23.2
>
>         Attachments: MAPREDUCE-3738.patch, livehistdump.txt
>
>
> If an AppLogAggregator thread dies unexpectedly (e.g.: uncaught exception like OutOfMemoryError in the case I saw) then this will lead to a hang during nodemanager shutdown.  The NM calls AppLogAggregatorImpl.join() during shutdown to make sure log aggregation has completed, and that method internally waits for an atomic boolean to be set by the log aggregation thread to indicate it has finished.  Since the thread was killed off earlier due to an uncaught exception, the boolean will never be set and the NM hangs during shutdown repeating something like this every second in the log file:
> 2012-01-25 22:20:56,366 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl: Waiting for aggregation to complete for application_1326848182580_2806

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (MAPREDUCE-3738) NM can hang during shutdown if AppLogAggregatorImpl thread dies unexpectedly

Posted by "Siddharth Seth (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAPREDUCE-3738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Siddharth Seth updated MAPREDUCE-3738:
--------------------------------------

          Resolution: Fixed
       Fix Version/s: 0.23.2
    Target Version/s: 0.24.0, 0.23.2  (was: 0.23.2, 0.24.0)
        Release Note: Committed to trunk and branch-0.23. Thanks Jason.
        Hadoop Flags: Reviewed
              Status: Resolved  (was: Patch Available)
    
> NM can hang during shutdown if AppLogAggregatorImpl thread dies unexpectedly
> ----------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3738
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3738
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2, nodemanager
>    Affects Versions: 0.23.1, 0.24.0
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>            Priority: Critical
>             Fix For: 0.23.2
>
>         Attachments: MAPREDUCE-3738.patch, livehistdump.txt
>
>
> If an AppLogAggregator thread dies unexpectedly (e.g.: uncaught exception like OutOfMemoryError in the case I saw) then this will lead to a hang during nodemanager shutdown.  The NM calls AppLogAggregatorImpl.join() during shutdown to make sure log aggregation has completed, and that method internally waits for an atomic boolean to be set by the log aggregation thread to indicate it has finished.  Since the thread was killed off earlier due to an uncaught exception, the boolean will never be set and the NM hangs during shutdown repeating something like this every second in the log file:
> 2012-01-25 22:20:56,366 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl: Waiting for aggregation to complete for application_1326848182580_2806

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAPREDUCE-3738) NM can hang during shutdown if AppLogAggregatorImpl thread dies unexpectedly

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13215581#comment-13215581 ] 

Hudson commented on MAPREDUCE-3738:
-----------------------------------

Integrated in Hadoop-Hdfs-0.23-Build #178 (See [https://builds.apache.org/job/Hadoop-Hdfs-0.23-Build/178/])
    merge MAPREDUCE-3738 from trunk (Revision 1293061)

     Result = SUCCESS
sseth : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1293061
Files : 
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/AppLogAggregatorImpl.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/TestLogAggregationService.java

                
> NM can hang during shutdown if AppLogAggregatorImpl thread dies unexpectedly
> ----------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3738
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3738
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2, nodemanager
>    Affects Versions: 0.23.1, 0.24.0
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>            Priority: Critical
>             Fix For: 0.23.2
>
>         Attachments: MAPREDUCE-3738.patch, livehistdump.txt
>
>
> If an AppLogAggregator thread dies unexpectedly (e.g.: uncaught exception like OutOfMemoryError in the case I saw) then this will lead to a hang during nodemanager shutdown.  The NM calls AppLogAggregatorImpl.join() during shutdown to make sure log aggregation has completed, and that method internally waits for an atomic boolean to be set by the log aggregation thread to indicate it has finished.  Since the thread was killed off earlier due to an uncaught exception, the boolean will never be set and the NM hangs during shutdown repeating something like this every second in the log file:
> 2012-01-25 22:20:56,366 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl: Waiting for aggregation to complete for application_1326848182580_2806

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAPREDUCE-3738) NM can hang during shutdown if AppLogAggregatorImpl thread dies unexpectedly

Posted by "Siddharth Seth (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13215310#comment-13215310 ] 

Siddharth Seth commented on MAPREDUCE-3738:
-------------------------------------------

+1. Looks good.
                
> NM can hang during shutdown if AppLogAggregatorImpl thread dies unexpectedly
> ----------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3738
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3738
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2, nodemanager
>    Affects Versions: 0.23.1, 0.24.0
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>            Priority: Critical
>         Attachments: MAPREDUCE-3738.patch, livehistdump.txt
>
>
> If an AppLogAggregator thread dies unexpectedly (e.g.: uncaught exception like OutOfMemoryError in the case I saw) then this will lead to a hang during nodemanager shutdown.  The NM calls AppLogAggregatorImpl.join() during shutdown to make sure log aggregation has completed, and that method internally waits for an atomic boolean to be set by the log aggregation thread to indicate it has finished.  Since the thread was killed off earlier due to an uncaught exception, the boolean will never be set and the NM hangs during shutdown repeating something like this every second in the log file:
> 2012-01-25 22:20:56,366 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl: Waiting for aggregation to complete for application_1326848182580_2806

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAPREDUCE-3738) NM can hang during shutdown if AppLogAggregatorImpl thread dies unexpectedly

Posted by "Vinod Kumar Vavilapalli (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13194205#comment-13194205 ] 

Vinod Kumar Vavilapalli commented on MAPREDUCE-3738:
----------------------------------------------------

Originally when I wrote this, I had the same suspicion about the join. But later, I made sure all exceptions were caught and that the boolean gets set in all possible cases. OOM/errors are one thing that didn't occur to me.

Can you debug as to why you ran into OOM ? We need to fix that definitely, irrespective of how we want to handle other errors.
                
> NM can hang during shutdown if AppLogAggregatorImpl thread dies unexpectedly
> ----------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3738
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3738
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2, nodemanager
>    Affects Versions: 0.23.1, 0.24.0
>            Reporter: Jason Lowe
>            Priority: Critical
>
> If an AppLogAggregator thread dies unexpectedly (e.g.: uncaught exception like OutOfMemoryError in the case I saw) then this will lead to a hang during nodemanager shutdown.  The NM calls AppLogAggregatorImpl.join() during shutdown to make sure log aggregation has completed, and that method internally waits for an atomic boolean to be set by the log aggregation thread to indicate it has finished.  Since the thread was killed off earlier due to an uncaught exception, the boolean will never be set and the NM hangs during shutdown repeating something like this every second in the log file:
> 2012-01-25 22:20:56,366 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl: Waiting for aggregation to complete for application_1326848182580_2806

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAPREDUCE-3738) NM can hang during shutdown if AppLogAggregatorImpl thread dies unexpectedly

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13215335#comment-13215335 ] 

Hudson commented on MAPREDUCE-3738:
-----------------------------------

Integrated in Hadoop-Mapreduce-0.23-Commit #589 (See [https://builds.apache.org/job/Hadoop-Mapreduce-0.23-Commit/589/])
    merge MAPREDUCE-3738 from trunk (Revision 1293061)

     Result = ABORTED
sseth : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1293061
Files : 
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/AppLogAggregatorImpl.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/TestLogAggregationService.java

                
> NM can hang during shutdown if AppLogAggregatorImpl thread dies unexpectedly
> ----------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3738
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3738
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2, nodemanager
>    Affects Versions: 0.23.1, 0.24.0
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>            Priority: Critical
>             Fix For: 0.23.2
>
>         Attachments: MAPREDUCE-3738.patch, livehistdump.txt
>
>
> If an AppLogAggregator thread dies unexpectedly (e.g.: uncaught exception like OutOfMemoryError in the case I saw) then this will lead to a hang during nodemanager shutdown.  The NM calls AppLogAggregatorImpl.join() during shutdown to make sure log aggregation has completed, and that method internally waits for an atomic boolean to be set by the log aggregation thread to indicate it has finished.  Since the thread was killed off earlier due to an uncaught exception, the boolean will never be set and the NM hangs during shutdown repeating something like this every second in the log file:
> 2012-01-25 22:20:56,366 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl: Waiting for aggregation to complete for application_1326848182580_2806

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAPREDUCE-3738) NM can hang during shutdown if AppLogAggregatorImpl thread dies unexpectedly

Posted by "Vinod Kumar Vavilapalli (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13194207#comment-13194207 ] 

Vinod Kumar Vavilapalli commented on MAPREDUCE-3738:
----------------------------------------------------

Comment race;) Even the stack trace during OOM will help.
                
> NM can hang during shutdown if AppLogAggregatorImpl thread dies unexpectedly
> ----------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3738
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3738
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2, nodemanager
>    Affects Versions: 0.23.1, 0.24.0
>            Reporter: Jason Lowe
>            Priority: Critical
>
> If an AppLogAggregator thread dies unexpectedly (e.g.: uncaught exception like OutOfMemoryError in the case I saw) then this will lead to a hang during nodemanager shutdown.  The NM calls AppLogAggregatorImpl.join() during shutdown to make sure log aggregation has completed, and that method internally waits for an atomic boolean to be set by the log aggregation thread to indicate it has finished.  Since the thread was killed off earlier due to an uncaught exception, the boolean will never be set and the NM hangs during shutdown repeating something like this every second in the log file:
> 2012-01-25 22:20:56,366 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl: Waiting for aggregation to complete for application_1326848182580_2806

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAPREDUCE-3738) NM can hang during shutdown if AppLogAggregatorImpl thread dies unexpectedly

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13215612#comment-13215612 ] 

Hudson commented on MAPREDUCE-3738:
-----------------------------------

Integrated in Hadoop-Mapreduce-0.23-Build #206 (See [https://builds.apache.org/job/Hadoop-Mapreduce-0.23-Build/206/])
    merge MAPREDUCE-3738 from trunk (Revision 1293061)

     Result = FAILURE
sseth : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1293061
Files : 
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/AppLogAggregatorImpl.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/TestLogAggregationService.java

                
> NM can hang during shutdown if AppLogAggregatorImpl thread dies unexpectedly
> ----------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3738
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3738
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2, nodemanager
>    Affects Versions: 0.23.1, 0.24.0
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>            Priority: Critical
>             Fix For: 0.23.2
>
>         Attachments: MAPREDUCE-3738.patch, livehistdump.txt
>
>
> If an AppLogAggregator thread dies unexpectedly (e.g.: uncaught exception like OutOfMemoryError in the case I saw) then this will lead to a hang during nodemanager shutdown.  The NM calls AppLogAggregatorImpl.join() during shutdown to make sure log aggregation has completed, and that method internally waits for an atomic boolean to be set by the log aggregation thread to indicate it has finished.  Since the thread was killed off earlier due to an uncaught exception, the boolean will never be set and the NM hangs during shutdown repeating something like this every second in the log file:
> 2012-01-25 22:20:56,366 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl: Waiting for aggregation to complete for application_1326848182580_2806

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAPREDUCE-3738) NM can hang during shutdown if AppLogAggregatorImpl thread dies unexpectedly

Posted by "Mahadev konar (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13194196#comment-13194196 ] 

Mahadev konar commented on MAPREDUCE-3738:
------------------------------------------

Jason,
 Is there a bug for OOM? What was the reason for that?
                
> NM can hang during shutdown if AppLogAggregatorImpl thread dies unexpectedly
> ----------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3738
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3738
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2, nodemanager
>    Affects Versions: 0.23.1, 0.24.0
>            Reporter: Jason Lowe
>            Priority: Critical
>
> If an AppLogAggregator thread dies unexpectedly (e.g.: uncaught exception like OutOfMemoryError in the case I saw) then this will lead to a hang during nodemanager shutdown.  The NM calls AppLogAggregatorImpl.join() during shutdown to make sure log aggregation has completed, and that method internally waits for an atomic boolean to be set by the log aggregation thread to indicate it has finished.  Since the thread was killed off earlier due to an uncaught exception, the boolean will never be set and the NM hangs during shutdown repeating something like this every second in the log file:
> 2012-01-25 22:20:56,366 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl: Waiting for aggregation to complete for application_1326848182580_2806

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (MAPREDUCE-3738) NM can hang during shutdown if AppLogAggregatorImpl thread dies unexpectedly

Posted by "Jason Lowe (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAPREDUCE-3738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Lowe updated MAPREDUCE-3738:
----------------------------------

    Target Version/s: 0.24.0, 0.23.2
              Status: Patch Available  (was: Open)
    
> NM can hang during shutdown if AppLogAggregatorImpl thread dies unexpectedly
> ----------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3738
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3738
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2, nodemanager
>    Affects Versions: 0.23.1, 0.24.0
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>            Priority: Critical
>         Attachments: MAPREDUCE-3738.patch, livehistdump.txt
>
>
> If an AppLogAggregator thread dies unexpectedly (e.g.: uncaught exception like OutOfMemoryError in the case I saw) then this will lead to a hang during nodemanager shutdown.  The NM calls AppLogAggregatorImpl.join() during shutdown to make sure log aggregation has completed, and that method internally waits for an atomic boolean to be set by the log aggregation thread to indicate it has finished.  Since the thread was killed off earlier due to an uncaught exception, the boolean will never be set and the NM hangs during shutdown repeating something like this every second in the log file:
> 2012-01-25 22:20:56,366 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl: Waiting for aggregation to complete for application_1326848182580_2806

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (MAPREDUCE-3738) NM can hang during shutdown if AppLogAggregatorImpl thread dies unexpectedly

Posted by "Mahadev konar (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAPREDUCE-3738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mahadev konar updated MAPREDUCE-3738:
-------------------------------------

    Priority: Critical  (was: Major)
    
> NM can hang during shutdown if AppLogAggregatorImpl thread dies unexpectedly
> ----------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3738
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3738
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2, nodemanager
>    Affects Versions: 0.23.1, 0.24.0
>            Reporter: Jason Lowe
>            Priority: Critical
>
> If an AppLogAggregator thread dies unexpectedly (e.g.: uncaught exception like OutOfMemoryError in the case I saw) then this will lead to a hang during nodemanager shutdown.  The NM calls AppLogAggregatorImpl.join() during shutdown to make sure log aggregation has completed, and that method internally waits for an atomic boolean to be set by the log aggregation thread to indicate it has finished.  Since the thread was killed off earlier due to an uncaught exception, the boolean will never be set and the NM hangs during shutdown repeating something like this every second in the log file:
> 2012-01-25 22:20:56,366 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl: Waiting for aggregation to complete for application_1326848182580_2806

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Assigned] (MAPREDUCE-3738) NM can hang during shutdown if AppLogAggregatorImpl thread dies unexpectedly

Posted by "Jason Lowe (Assigned) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAPREDUCE-3738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Lowe reassigned MAPREDUCE-3738:
-------------------------------------

    Assignee: Jason Lowe
    
> NM can hang during shutdown if AppLogAggregatorImpl thread dies unexpectedly
> ----------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3738
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3738
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2, nodemanager
>    Affects Versions: 0.23.1, 0.24.0
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>            Priority: Critical
>         Attachments: livehistdump.txt
>
>
> If an AppLogAggregator thread dies unexpectedly (e.g.: uncaught exception like OutOfMemoryError in the case I saw) then this will lead to a hang during nodemanager shutdown.  The NM calls AppLogAggregatorImpl.join() during shutdown to make sure log aggregation has completed, and that method internally waits for an atomic boolean to be set by the log aggregation thread to indicate it has finished.  Since the thread was killed off earlier due to an uncaught exception, the boolean will never be set and the NM hangs during shutdown repeating something like this every second in the log file:
> 2012-01-25 22:20:56,366 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl: Waiting for aggregation to complete for application_1326848182580_2806

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAPREDUCE-3738) NM can hang during shutdown if AppLogAggregatorImpl thread dies unexpectedly

Posted by "Jason Lowe (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13194204#comment-13194204 ] 

Jason Lowe commented on MAPREDUCE-3738:
---------------------------------------

No bug for the OOM yet, unfortunately cluster was re-deployed before grabbing a full heap dump.  I do have the jmap -hist:live output from one of the nodemanagers but haven't had a chance to go through it yet to see if it helps pinpoint where the leak would be.
                
> NM can hang during shutdown if AppLogAggregatorImpl thread dies unexpectedly
> ----------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3738
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3738
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2, nodemanager
>    Affects Versions: 0.23.1, 0.24.0
>            Reporter: Jason Lowe
>            Priority: Critical
>
> If an AppLogAggregator thread dies unexpectedly (e.g.: uncaught exception like OutOfMemoryError in the case I saw) then this will lead to a hang during nodemanager shutdown.  The NM calls AppLogAggregatorImpl.join() during shutdown to make sure log aggregation has completed, and that method internally waits for an atomic boolean to be set by the log aggregation thread to indicate it has finished.  Since the thread was killed off earlier due to an uncaught exception, the boolean will never be set and the NM hangs during shutdown repeating something like this every second in the log file:
> 2012-01-25 22:20:56,366 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl: Waiting for aggregation to complete for application_1326848182580_2806

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAPREDUCE-3738) NM can hang during shutdown if AppLogAggregatorImpl thread dies unexpectedly

Posted by "Hadoop QA (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13215077#comment-13215077 ] 

Hadoop QA commented on MAPREDUCE-3738:
--------------------------------------

+1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12515804/MAPREDUCE-3738.patch
  against trunk revision .

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 3 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 eclipse:eclipse.  The patch built with eclipse:eclipse.

    +1 findbugs.  The patch does not introduce any new Findbugs (version 1.3.9) warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed unit tests in .

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1917//testReport/
Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1917//console

This message is automatically generated.
                
> NM can hang during shutdown if AppLogAggregatorImpl thread dies unexpectedly
> ----------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3738
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3738
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2, nodemanager
>    Affects Versions: 0.23.1, 0.24.0
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>            Priority: Critical
>         Attachments: MAPREDUCE-3738.patch, livehistdump.txt
>
>
> If an AppLogAggregator thread dies unexpectedly (e.g.: uncaught exception like OutOfMemoryError in the case I saw) then this will lead to a hang during nodemanager shutdown.  The NM calls AppLogAggregatorImpl.join() during shutdown to make sure log aggregation has completed, and that method internally waits for an atomic boolean to be set by the log aggregation thread to indicate it has finished.  Since the thread was killed off earlier due to an uncaught exception, the boolean will never be set and the NM hangs during shutdown repeating something like this every second in the log file:
> 2012-01-25 22:20:56,366 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl: Waiting for aggregation to complete for application_1326848182580_2806

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAPREDUCE-3738) NM can hang during shutdown if AppLogAggregatorImpl thread dies unexpectedly

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13215320#comment-13215320 ] 

Hudson commented on MAPREDUCE-3738:
-----------------------------------

Integrated in Hadoop-Common-0.23-Commit #587 (See [https://builds.apache.org/job/Hadoop-Common-0.23-Commit/587/])
    merge MAPREDUCE-3738 from trunk (Revision 1293061)

     Result = SUCCESS
sseth : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1293061
Files : 
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/AppLogAggregatorImpl.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/TestLogAggregationService.java

                
> NM can hang during shutdown if AppLogAggregatorImpl thread dies unexpectedly
> ----------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3738
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3738
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2, nodemanager
>    Affects Versions: 0.23.1, 0.24.0
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>            Priority: Critical
>             Fix For: 0.23.2
>
>         Attachments: MAPREDUCE-3738.patch, livehistdump.txt
>
>
> If an AppLogAggregator thread dies unexpectedly (e.g.: uncaught exception like OutOfMemoryError in the case I saw) then this will lead to a hang during nodemanager shutdown.  The NM calls AppLogAggregatorImpl.join() during shutdown to make sure log aggregation has completed, and that method internally waits for an atomic boolean to be set by the log aggregation thread to indicate it has finished.  Since the thread was killed off earlier due to an uncaught exception, the boolean will never be set and the NM hangs during shutdown repeating something like this every second in the log file:
> 2012-01-25 22:20:56,366 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl: Waiting for aggregation to complete for application_1326848182580_2806

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (MAPREDUCE-3738) NM can hang during shutdown if AppLogAggregatorImpl thread dies unexpectedly

Posted by "Jason Lowe (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAPREDUCE-3738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Lowe updated MAPREDUCE-3738:
----------------------------------

    Attachment: MAPREDUCE-3738.patch

Patch to ensure we always set the finished boolean in the log aggregation thread.

On a side note we haven't seen a reoccurrence of the OOM condition on the nodemanager, so we haven't been able to track down what caused it.
                
> NM can hang during shutdown if AppLogAggregatorImpl thread dies unexpectedly
> ----------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3738
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3738
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2, nodemanager
>    Affects Versions: 0.23.1, 0.24.0
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>            Priority: Critical
>         Attachments: MAPREDUCE-3738.patch, livehistdump.txt
>
>
> If an AppLogAggregator thread dies unexpectedly (e.g.: uncaught exception like OutOfMemoryError in the case I saw) then this will lead to a hang during nodemanager shutdown.  The NM calls AppLogAggregatorImpl.join() during shutdown to make sure log aggregation has completed, and that method internally waits for an atomic boolean to be set by the log aggregation thread to indicate it has finished.  Since the thread was killed off earlier due to an uncaught exception, the boolean will never be set and the NM hangs during shutdown repeating something like this every second in the log file:
> 2012-01-25 22:20:56,366 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl: Waiting for aggregation to complete for application_1326848182580_2806

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (MAPREDUCE-3738) NM can hang during shutdown if AppLogAggregatorImpl thread dies unexpectedly

Posted by "Jason Lowe (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAPREDUCE-3738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Lowe updated MAPREDUCE-3738:
----------------------------------

    Attachment: livehistdump.txt

Attaching hist:live dump from one of the nodemanagers that had hit the OOM error multiple times in the log aggregation threads before eventually trying to shut down.  Unfortunately I don't have a full map dump or stack dump from that process.
                
> NM can hang during shutdown if AppLogAggregatorImpl thread dies unexpectedly
> ----------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3738
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3738
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2, nodemanager
>    Affects Versions: 0.23.1, 0.24.0
>            Reporter: Jason Lowe
>            Priority: Critical
>         Attachments: livehistdump.txt
>
>
> If an AppLogAggregator thread dies unexpectedly (e.g.: uncaught exception like OutOfMemoryError in the case I saw) then this will lead to a hang during nodemanager shutdown.  The NM calls AppLogAggregatorImpl.join() during shutdown to make sure log aggregation has completed, and that method internally waits for an atomic boolean to be set by the log aggregation thread to indicate it has finished.  Since the thread was killed off earlier due to an uncaught exception, the boolean will never be set and the NM hangs during shutdown repeating something like this every second in the log file:
> 2012-01-25 22:20:56,366 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl: Waiting for aggregation to complete for application_1326848182580_2806

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira