Posted to yarn-issues@hadoop.apache.org by "patrick white (JIRA)" <ji...@apache.org> on 2012/08/31 20:21:07 UTC

[jira] [Created] (YARN-68) NodeManager will refuse to shutdown indefinitely due to container log aggregation

patrick white created YARN-68:
---------------------------------

             Summary: NodeManager will refuse to shutdown indefinitely due to container log aggregation
                 Key: YARN-68
                 URL: https://issues.apache.org/jira/browse/YARN-68
             Project: Hadoop YARN
          Issue Type: Bug
          Components: nodemanager
    Affects Versions: 0.23.3
         Environment: QE
            Reporter: patrick white


The NodeManager can get into a state where containermanager.logaggregation.AppLogAggregatorImpl will apparently wait indefinitely for log aggregation to complete for an application, even if that application has terminated abnormally and is no longer present.

The observed behavior is that an attempt to stop the NodeManager daemon returns but has no effect, and the NM log continually displays messages similar to this:

[Thread-1]2012-08-21 17:44:07,581 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl:
Waiting for aggregation to complete for application_1345221477405_2733

The only recovery we found to work was to 'kill -9' the nm process.

What exactly causes the NM to enter this state is unclear, but we see this behavior reliably when the NM has run a task that failed. For example, when debugging Oozie distcp actions, if a distcp map task fails, the NM that was running the container enters this state and a shutdown of that NM never completes. 'Never' in this case means we waited 2 hours before killing the NodeManager process.


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (YARN-68) NodeManager will refuse to shutdown indefinitely due to container log aggregation

Posted by "Robert Joseph Evans (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/YARN-68?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13446370#comment-13446370 ] 

Robert Joseph Evans commented on YARN-68:
-----------------------------------------

The changes look good for the most part.  I would like to see the join method in AppLogAggregator removed.  It looks like it is no longer used, and the implementation has changed so that it would no longer work anyway.
                
> NodeManager will refuse to shutdown indefinitely due to container log aggregation
> ---------------------------------------------------------------------------------
>
>                 Key: YARN-68
>                 URL: https://issues.apache.org/jira/browse/YARN-68
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 0.23.3
>         Environment: QE
>            Reporter: patrick white
>            Assignee: Daryn Sharp
>         Attachments: YARN-68.patch
>
>


[jira] [Updated] (YARN-68) NodeManager will refuse to shutdown indefinitely due to container log aggregation

Posted by "Daryn Sharp (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/YARN-68?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daryn Sharp updated YARN-68:
----------------------------

    Attachment: YARN-68-1.patch
    
> NodeManager will refuse to shutdown indefinitely due to container log aggregation
> ---------------------------------------------------------------------------------
>
>                 Key: YARN-68
>                 URL: https://issues.apache.org/jira/browse/YARN-68
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 0.23.3
>         Environment: QE
>            Reporter: patrick white
>            Assignee: Daryn Sharp
>         Attachments: YARN-68-1.patch, YARN-68.patch
>
>


[jira] [Commented] (YARN-68) NodeManager will refuse to shutdown indefinitely due to container log aggregation

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/YARN-68?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13446379#comment-13446379 ] 

Hadoop QA commented on YARN-68:
-------------------------------

+1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12543326/YARN-68.patch
  against trunk revision .

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 1 new or modified test files.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 eclipse:eclipse.  The patch built with eclipse:eclipse.

    +1 findbugs.  The patch does not introduce any new Findbugs (version 1.3.9) warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-YARN-Build/14//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/14//console

This message is automatically generated.
                
> NodeManager will refuse to shutdown indefinitely due to container log aggregation
> ---------------------------------------------------------------------------------
>
>                 Key: YARN-68
>                 URL: https://issues.apache.org/jira/browse/YARN-68
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 0.23.3
>         Environment: QE
>            Reporter: patrick white
>            Assignee: Daryn Sharp
>         Attachments: YARN-68-1.patch, YARN-68.patch
>
>


[jira] [Updated] (YARN-68) NodeManager will refuse to shutdown indefinitely due to container log aggregation

Posted by "Daryn Sharp (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/YARN-68?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daryn Sharp updated YARN-68:
----------------------------

    Attachment:     (was: HADOOP-8726-1.patch)
    
> NodeManager will refuse to shutdown indefinitely due to container log aggregation
> ---------------------------------------------------------------------------------
>
>                 Key: YARN-68
>                 URL: https://issues.apache.org/jira/browse/YARN-68
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 0.23.3
>         Environment: QE
>            Reporter: patrick white
>            Assignee: Daryn Sharp
>         Attachments: YARN-68-1.patch, YARN-68.patch
>
>


[jira] [Commented] (YARN-68) NodeManager will refuse to shutdown indefinitely due to container log aggregation

Posted by "Hudson (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/YARN-68?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13449601#comment-13449601 ] 

Hudson commented on YARN-68:
----------------------------

Integrated in Hadoop-Hdfs-0.23-Build #366 (See [https://builds.apache.org/job/Hadoop-Hdfs-0.23-Build/366/])
    svn merge -c 1381317 FIXES: YARN-68. NodeManager will refuse to shutdown indefinitely due to container log aggregation (daryn via bobby) (Revision 1381320)

     Result = UNSTABLE
bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1381320
Files : 
* /hadoop/common/branches/branch-0.23/hadoop-yarn-project/CHANGES.txt
* /hadoop/common/branches/branch-0.23/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/AppLogAggregator.java
* /hadoop/common/branches/branch-0.23/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/AppLogAggregatorImpl.java
* /hadoop/common/branches/branch-0.23/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/LogAggregationService.java
* /hadoop/common/branches/branch-0.23/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/TestLogAggregationService.java

                
> NodeManager will refuse to shutdown indefinitely due to container log aggregation
> ---------------------------------------------------------------------------------
>
>                 Key: YARN-68
>                 URL: https://issues.apache.org/jira/browse/YARN-68
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 0.23.3
>         Environment: QE
>            Reporter: patrick white
>            Assignee: Daryn Sharp
>             Fix For: 2.2.0-alpha, 0.23.3
>
>         Attachments: YARN-68-1.patch, YARN-68.patch
>
>


[jira] [Updated] (YARN-68) NodeManager will refuse to shutdown indefinitely due to container log aggregation

Posted by "Daryn Sharp (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/YARN-68?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daryn Sharp updated YARN-68:
----------------------------

    Attachment: HADOOP-8726-1.patch

Your wish is my command!
                
> NodeManager will refuse to shutdown indefinitely due to container log aggregation
> ---------------------------------------------------------------------------------
>
>                 Key: YARN-68
>                 URL: https://issues.apache.org/jira/browse/YARN-68
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 0.23.3
>         Environment: QE
>            Reporter: patrick white
>            Assignee: Daryn Sharp
>         Attachments: YARN-68-1.patch, YARN-68.patch
>
>


[jira] [Commented] (YARN-68) NodeManager will refuse to shutdown indefinitely due to container log aggregation

Posted by "Robert Joseph Evans (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/YARN-68?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13449041#comment-13449041 ] 

Robert Joseph Evans commented on YARN-68:
-----------------------------------------

Looks good +1.  I'll check this in.
                
> NodeManager will refuse to shutdown indefinitely due to container log aggregation
> ---------------------------------------------------------------------------------
>
>                 Key: YARN-68
>                 URL: https://issues.apache.org/jira/browse/YARN-68
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 0.23.3
>         Environment: QE
>            Reporter: patrick white
>            Assignee: Daryn Sharp
>         Attachments: YARN-68-1.patch, YARN-68.patch
>
>


[jira] [Commented] (YARN-68) NodeManager will refuse to shutdown indefinitely due to container log aggregation

Posted by "Hudson (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/YARN-68?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13449047#comment-13449047 ] 

Hudson commented on YARN-68:
----------------------------

Integrated in Hadoop-Common-trunk-Commit #2686 (See [https://builds.apache.org/job/Hadoop-Common-trunk-Commit/2686/])
    YARN-68. NodeManager will refuse to shutdown indefinitely due to container log aggregation (daryn via bobby) (Revision 1381317)

     Result = SUCCESS
bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1381317
Files : 
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/AppLogAggregator.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/AppLogAggregatorImpl.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/LogAggregationService.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/TestLogAggregationService.java

                
> NodeManager will refuse to shutdown indefinitely due to container log aggregation
> ---------------------------------------------------------------------------------
>
>                 Key: YARN-68
>                 URL: https://issues.apache.org/jira/browse/YARN-68
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 0.23.3
>         Environment: QE
>            Reporter: patrick white
>            Assignee: Daryn Sharp
>         Attachments: YARN-68-1.patch, YARN-68.patch
>
>


[jira] [Assigned] (YARN-68) NodeManager will refuse to shutdown indefinitely due to container log aggregation

Posted by "Daryn Sharp (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/YARN-68?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daryn Sharp reassigned YARN-68:
-------------------------------

    Assignee: Daryn Sharp
    
> NodeManager will refuse to shutdown indefinitely due to container log aggregation
> ---------------------------------------------------------------------------------
>
>                 Key: YARN-68
>                 URL: https://issues.apache.org/jira/browse/YARN-68
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 0.23.3
>         Environment: QE
>            Reporter: patrick white
>            Assignee: Daryn Sharp
>


[jira] [Commented] (YARN-68) NodeManager will refuse to shutdown indefinitely due to container log aggregation

Posted by "Hudson (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/YARN-68?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13449094#comment-13449094 ] 

Hudson commented on YARN-68:
----------------------------

Integrated in Hadoop-Mapreduce-trunk-Commit #2710 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Commit/2710/])
    YARN-68. NodeManager will refuse to shutdown indefinitely due to container log aggregation (daryn via bobby) (Revision 1381317)

     Result = FAILURE
bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1381317
Files : 
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/AppLogAggregator.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/AppLogAggregatorImpl.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/LogAggregationService.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/TestLogAggregationService.java

                
> NodeManager will refuse to shutdown indefinitely due to container log aggregation
> ---------------------------------------------------------------------------------
>
>                 Key: YARN-68
>                 URL: https://issues.apache.org/jira/browse/YARN-68
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 0.23.3
>         Environment: QE
>            Reporter: patrick white
>            Assignee: Daryn Sharp
>             Fix For: 2.2.0-alpha, 0.23.3
>
>         Attachments: YARN-68-1.patch, YARN-68.patch
>
>


[jira] [Commented] (YARN-68) NodeManager will refuse to shutdown indefinitely due to container log aggregation

Posted by "Daryn Sharp (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/YARN-68?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13446246#comment-13446246 ] 

Daryn Sharp commented on YARN-68:
---------------------------------

This also prevents the NM from internally restarting after the RM is bounced, or the NM goes out of sync for too long.  The stop sets a boolean to signal the (nonexistent or stuck) thread to finish, and then waits for the (nonexistent or stuck) thread to set another boolean that it's finished.  This will cause the NM to wait forever and be unresponsive to shutdowns or internal restarts.
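
The two-flag handshake described above can be sketched roughly as follows. This is an illustration only, not the NodeManager code and not necessarily what the attached patch does: the class ShutdownSketch, the nested FlagAggregator, and the methods stop and stopPool are hypothetical names invented for this example. The polling loop is given a deadline purely so the demo terminates; without that deadline it would spin forever whenever the worker thread does not exist or is stuck. The second method shows one common alternative, where the owner shuts down a thread pool and waits a bounded time.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

public class ShutdownSketch {

    /** Hypothetical stand-in for a per-application log aggregator. */
    static class FlagAggregator implements Runnable {
        final AtomicBoolean finishRequested = new AtomicBoolean(false);
        final AtomicBoolean finished = new AtomicBoolean(false);

        @Override
        public void run() {
            // Pretend to work until asked to finish or interrupted.
            while (!finishRequested.get()) {
                try {
                    Thread.sleep(10);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    break;
                }
            }
            finished.set(true);
        }

        /**
         * The handshake from the comment: set one flag, then poll the other.
         * If run() was never started (or is stuck), 'finished' never flips.
         * The deadline exists only so this demo returns; it returns false
         * when the handshake did not complete in time.
         */
        boolean stop(long maxWaitMs) {
            finishRequested.set(true);
            long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(maxWaitMs);
            while (!finished.get()) {
                if (System.nanoTime() - deadline >= 0) {
                    return false;
                }
                try {
                    Thread.sleep(10);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
            return true;
        }
    }

    /**
     * Alternative sketch: the owner interrupts the pool's workers and waits
     * a bounded time, so shutdown does not depend on a flag that only the
     * worker can set.
     */
    static boolean stopPool(ExecutorService pool, long maxWaitMs) {
        pool.shutdownNow();
        try {
            return pool.awaitTermination(maxWaitMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        // The "nonexistent thread" case: nobody ever runs this aggregator.
        FlagAggregator orphan = new FlagAggregator();
        System.out.println("flag handshake completed: " + orphan.stop(200));

        // Pool-based shutdown with a live worker.
        ExecutorService pool = Executors.newCachedThreadPool();
        pool.execute(new FlagAggregator());
        System.out.println("pool stopped: " + stopPool(pool, 2000));
    }
}
```

In the first case the handshake cannot complete because nothing will ever set the second flag, which matches the endless "Waiting for aggregation to complete" messages in the report. In the second case the wait is bounded by the caller regardless of what the worker does.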
                
> NodeManager will refuse to shutdown indefinitely due to container log aggregation
> ---------------------------------------------------------------------------------
>
>                 Key: YARN-68
>                 URL: https://issues.apache.org/jira/browse/YARN-68
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 0.23.3
>         Environment: QE
>            Reporter: patrick white
>            Assignee: Daryn Sharp
>


[jira] [Commented] (YARN-68) NodeManager will refuse to shutdown indefinitely due to container log aggregation

Posted by "Hudson (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/YARN-68?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13449654#comment-13449654 ] 

Hudson commented on YARN-68:
----------------------------

Integrated in Hadoop-Mapreduce-trunk #1188 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1188/])
    YARN-68. NodeManager will refuse to shutdown indefinitely due to container log aggregation (daryn via bobby) (Revision 1381317)

     Result = ABORTED
bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1381317
Files : 
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/AppLogAggregator.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/AppLogAggregatorImpl.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/LogAggregationService.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/TestLogAggregationService.java

                
> NodeManager will refuse to shutdown indefinitely due to container log aggregation
> ---------------------------------------------------------------------------------
>
>                 Key: YARN-68
>                 URL: https://issues.apache.org/jira/browse/YARN-68
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 0.23.3
>         Environment: QE
>            Reporter: patrick white
>            Assignee: Daryn Sharp
>             Fix For: 2.2.0-alpha, 0.23.3
>
>         Attachments: YARN-68-1.patch, YARN-68.patch
>
>


[jira] [Updated] (YARN-68) NodeManager will refuse to shutdown indefinitely due to container log aggregation

Posted by "Daryn Sharp (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/YARN-68?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daryn Sharp updated YARN-68:
----------------------------

    Attachment: YARN-68.patch

Try much harder to shut down the aggregators.  Will stop all the threads in the thread pool instead of assuming every aggregator has an active thread.  Better exception handling and setting of state to make it harder to get into a bad state.  It's not perfect because jammed threads can still block shutdown/restart, but the improved logic makes it much less likely.
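
As a rough illustration of stopping the pool rather than waiting on each aggregator, here is a minimal sketch using the standard java.util.concurrent.ExecutorService calls. It is an assumption about the general approach, not the contents of YARN-68.patch; the class name, method name, and grace period are hypothetical.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch; not the code from YARN-68.patch.
class PoolShutdownSketch {
    // Stops every thread in the pool, whether or not a given task ever
    // started. Returns true if the pool terminated within the grace periods.
    public static boolean stopPool(ExecutorService pool, long graceMillis) {
        pool.shutdown(); // accept no new tasks; already submitted tasks continue
        try {
            if (pool.awaitTermination(graceMillis, TimeUnit.MILLISECONDS)) {
                return true;
            }
            // Interrupt running tasks and discard tasks that never started.
            pool.shutdownNow();
            return pool.awaitTermination(graceMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            pool.shutdownNow();
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

A task that ignores interruption would still make stopPool return false here, which matches the caveat that jammed threads can still block shutdown.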
                
> NodeManager will refuse to shutdown indefinitely due to container log aggregation
> ---------------------------------------------------------------------------------
>
>                 Key: YARN-68
>                 URL: https://issues.apache.org/jira/browse/YARN-68
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 0.23.3
>         Environment: QE
>            Reporter: patrick white
>            Assignee: Daryn Sharp
>         Attachments: YARN-68.patch
>
>


[jira] [Commented] (YARN-68) NodeManager will refuse to shutdown indefinitely due to container log aggregation

Posted by "Hudson (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/YARN-68?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13449617#comment-13449617 ] 

Hudson commented on YARN-68:
----------------------------

Integrated in Hadoop-Hdfs-trunk #1157 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1157/])
    YARN-68. NodeManager will refuse to shutdown indefinitely due to container log aggregation (daryn via bobby) (Revision 1381317)

     Result = FAILURE
bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1381317
Files : 
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/AppLogAggregator.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/AppLogAggregatorImpl.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/LogAggregationService.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/TestLogAggregationService.java

                
> NodeManager will refuse to shutdown indefinitely due to container log aggregation
> ---------------------------------------------------------------------------------
>
>                 Key: YARN-68
>                 URL: https://issues.apache.org/jira/browse/YARN-68
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 0.23.3
>         Environment: QE
>            Reporter: patrick white
>            Assignee: Daryn Sharp
>             Fix For: 2.2.0-alpha, 0.23.3
>
>         Attachments: YARN-68-1.patch, YARN-68.patch
>
>


[jira] [Closed] (YARN-68) NodeManager will refuse to shutdown indefinitely due to container log aggregation

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/YARN-68?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arun C Murthy closed YARN-68.
-----------------------------

    
> NodeManager will refuse to shutdown indefinitely due to container log aggregation
> ---------------------------------------------------------------------------------
>
>                 Key: YARN-68
>                 URL: https://issues.apache.org/jira/browse/YARN-68
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 0.23.3
>         Environment: QE
>            Reporter: patrick white
>            Assignee: Daryn Sharp
>             Fix For: 2.0.2-alpha, 0.23.3
>
>         Attachments: YARN-68-1.patch, YARN-68.patch
>
>


[jira] [Commented] (YARN-68) NodeManager will refuse to shutdown indefinitely due to container log aggregation

Posted by "Hudson (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/YARN-68?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13449050#comment-13449050 ] 

Hudson commented on YARN-68:
----------------------------

Integrated in Hadoop-Hdfs-trunk-Commit #2749 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Commit/2749/])
    YARN-68. NodeManager will refuse to shutdown indefinitely due to container log aggregation (daryn via bobby) (Revision 1381317)

     Result = SUCCESS
bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1381317
Files : 
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/AppLogAggregator.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/AppLogAggregatorImpl.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/LogAggregationService.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/TestLogAggregationService.java

                
> NodeManager will refuse to shutdown indefinitely due to container log aggregation
> ---------------------------------------------------------------------------------
>
>                 Key: YARN-68
>                 URL: https://issues.apache.org/jira/browse/YARN-68
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 0.23.3
>         Environment: QE
>            Reporter: patrick white
>            Assignee: Daryn Sharp
>             Fix For: 2.2.0-alpha, 0.23.3
>
>         Attachments: YARN-68-1.patch, YARN-68.patch
>
>
