
Posted to mapreduce-issues@hadoop.apache.org by "Daryn Sharp (JIRA)" <ji...@apache.org> on 2012/06/01 18:49:23 UTC

[jira] [Created] (MAPREDUCE-4302) NM goes down if error encountered during log aggregation

Daryn Sharp created MAPREDUCE-4302:
--------------------------------------

             Summary: NM goes down if error encountered during log aggregation
                 Key: MAPREDUCE-4302
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4302
             Project: Hadoop Map/Reduce
          Issue Type: Bug
          Components: nodemanager
    Affects Versions: 2.0.0-alpha, 0.23.0, trunk
            Reporter: Daryn Sharp
            Assignee: Daryn Sharp
            Priority: Critical


When a container launch request is sent to the NM, if _any_ exception occurs during the init of log aggregation then the NM goes down.  The problem can be induced by situations including, but certainly not limited to: transient RPC connection issues, missing tokens, expired tokens, permissions, full/quota exceeded DFS, etc.  The problem can occur with or without security enabled.

The ramification is that an entire cluster can be brought down rather easily, whether maliciously, accidentally, or via a submission bug.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAPREDUCE-4302) NM goes down if error encountered during log aggregation

Posted by "Hudson (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-4302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13287920#comment-13287920 ] 

Hudson commented on MAPREDUCE-4302:
-----------------------------------

Integrated in Hadoop-Hdfs-0.23-Build #275 (See [https://builds.apache.org/job/Hadoop-Hdfs-0.23-Build/275/])
    svn merge -c 1345362. FIXES: MAPREDUCE-4302. NM goes down if error encountered during log aggregation (Daryn Sharp via bobby) (Revision 1345366)

     Result = SUCCESS
bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1345366
Files : 
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/ContainerManagerImpl.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/application/ApplicationEventType.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/application/ApplicationFinishEvent.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/application/ApplicationImpl.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/LogAggregationService.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/loghandler/NonAggregatingLogHandler.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/TestLogAggregationService.java

                
> NM goes down if error encountered during log aggregation
> --------------------------------------------------------
>
>                 Key: MAPREDUCE-4302
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4302
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 0.23.0, 2.0.0-alpha, trunk
>            Reporter: Daryn Sharp
>            Assignee: Daryn Sharp
>            Priority: Critical
>             Fix For: 0.23.3, 2.0.1-alpha, 3.0.0
>
>         Attachments: MAPREDUCE-4302.patch
>
>
> When a container launch request is sent to the NM, if _any_ exception occurs during the init of log aggregation then the NM goes down.  The problem can be induced by situations including, but certainly not limited to: transient RPC connection issues, missing tokens, expired tokens, permissions, full/quota exceeded DFS, etc.  The problem can occur with or without security enabled.
> The ramification is that an entire cluster can be brought down rather easily, whether maliciously, accidentally, or via a submission bug.


[jira] [Commented] (MAPREDUCE-4302) NM goes down if error encountered during log aggregation

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-4302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13287546#comment-13287546 ] 

Hadoop QA commented on MAPREDUCE-4302:
--------------------------------------

+1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12530572/MAPREDUCE-4302.patch
  against trunk revision .

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 1 new or modified test files.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 eclipse:eclipse.  The patch built with eclipse:eclipse.

    +1 findbugs.  The patch does not introduce any new Findbugs (version 1.3.9) warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed unit tests in hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2430//testReport/
Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2430//console

This message is automatically generated.
                


[jira] [Commented] (MAPREDUCE-4302) NM goes down if error encountered during log aggregation

Posted by "Hudson (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-4302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13287788#comment-13287788 ] 

Hudson commented on MAPREDUCE-4302:
-----------------------------------

Integrated in Hadoop-Common-trunk-Commit #2311 (See [https://builds.apache.org/job/Hadoop-Common-trunk-Commit/2311/])
    MAPREDUCE-4302. NM goes down if error encountered during log aggregation (Daryn Sharp via bobby) (Revision 1345362)

     Result = SUCCESS
bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1345362
Files : 
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/ContainerManagerImpl.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/application/ApplicationEventType.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/application/ApplicationFinishEvent.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/application/ApplicationImpl.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/LogAggregationService.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/loghandler/NonAggregatingLogHandler.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/TestLogAggregationService.java

                


[jira] [Commented] (MAPREDUCE-4302) NM goes down if error encountered during log aggregation

Posted by "Hudson (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-4302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13287777#comment-13287777 ] 

Hudson commented on MAPREDUCE-4302:
-----------------------------------

Integrated in Hadoop-Mapreduce-trunk-Commit #2329 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Commit/2329/])
    MAPREDUCE-4302. NM goes down if error encountered during log aggregation (Daryn Sharp via bobby) (Revision 1345362)

     Result = FAILURE
bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1345362
Files : 
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/ContainerManagerImpl.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/application/ApplicationEventType.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/application/ApplicationFinishEvent.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/application/ApplicationImpl.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/LogAggregationService.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/loghandler/NonAggregatingLogHandler.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/TestLogAggregationService.java

                


[jira] [Updated] (MAPREDUCE-4302) NM goes down if error encountered during log aggregation

Posted by "Daryn Sharp (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAPREDUCE-4302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daryn Sharp updated MAPREDUCE-4302:
-----------------------------------

    Status: Patch Available  (was: Open)
    


[jira] [Closed] (MAPREDUCE-4302) NM goes down if error encountered during log aggregation

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAPREDUCE-4302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arun C Murthy closed MAPREDUCE-4302.
------------------------------------

    
> NM goes down if error encountered during log aggregation
> --------------------------------------------------------
>
>                 Key: MAPREDUCE-4302
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4302
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 0.23.0, 2.0.0-alpha, trunk
>            Reporter: Daryn Sharp
>            Assignee: Daryn Sharp
>            Priority: Critical
>             Fix For: 0.23.3, 2.0.2-alpha
>
>         Attachments: MAPREDUCE-4302.patch
>
>
> When a container launch request is sent to the NM, if _any_ exception occurs during the init of log aggregation then the NM goes down.  The problem can be induced by situations including, but certainly not limited to: transient RPC connection issues, missing tokens, expired tokens, permissions, full/quota exceeded DFS, etc.  The problem can occur with or without security enabled.
> The ramification is that an entire cluster can be brought down rather easily, whether maliciously, accidentally, or via a submission bug.


[jira] [Commented] (MAPREDUCE-4302) NM goes down if error encountered during log aggregation

Posted by "Daryn Sharp (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-4302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13287558#comment-13287558 ] 

Daryn Sharp commented on MAPREDUCE-4302:
----------------------------------------

For a little background, the problem was detected due to an NN token issue.  The NMs all went down because log aggregation init failed to connect to the NN to create its log dirs.  The NMs were started up again, and they all went down again because the AMs were retrying the tasks.  The problem was also induced by restricting permissions on the log dir and by stopping the NN.
                


[jira] [Commented] (MAPREDUCE-4302) NM goes down if error encountered during log aggregation

Posted by "Hudson (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-4302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13287799#comment-13287799 ] 

Hudson commented on MAPREDUCE-4302:
-----------------------------------

Integrated in Hadoop-Hdfs-trunk-Commit #2383 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Commit/2383/])
    MAPREDUCE-4302. NM goes down if error encountered during log aggregation (Daryn Sharp via bobby) (Revision 1345362)

     Result = SUCCESS
bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1345362
Files : 
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/ContainerManagerImpl.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/application/ApplicationEventType.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/application/ApplicationFinishEvent.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/application/ApplicationImpl.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/LogAggregationService.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/loghandler/NonAggregatingLogHandler.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/TestLogAggregationService.java

                


[jira] [Updated] (MAPREDUCE-4302) NM goes down if error encountered during log aggregation

Posted by "Daryn Sharp (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAPREDUCE-4302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daryn Sharp updated MAPREDUCE-4302:
-----------------------------------

    Attachment: MAPREDUCE-4302.patch

Wrap log aggregation init with a try block.  The log init is executed just before localization starts instead of in parallel.  Log init sends back an app finish event on failure, or a log init success event, after which localization begins.

Log init is serialized because it only makes a few directories and spins off a thread.  If log init fails then the job itself is likely to fail, so a log init failure reports a diagnostic of the exception back to the AM.  Otherwise, if and when the job fails, there will be no logs to debug it with.
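
To make the flow above concrete, here is a minimal sketch in plain Java.  It is not the code from the patch: the class LogInitSketch, the enum AppEvent, the interface LogInit and the method initAppLogs are all hypothetical names made up for illustration, and catching Exception (rather than something broader or narrower) is an assumption.  It only shows the shape of the idea: run log init first, turn any failure into an app finish event plus a diagnostic, and otherwise report success so that localization can begin.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, simplified model of the flow; not the actual NodeManager code. */
class LogInitSketch {

    /** Illustrative event names only; the real event types may differ. */
    enum AppEvent { LOG_INIT_SUCCESS, APPLICATION_FINISHED }

    /** Stand-in for the work log init does (making a few dirs, starting a thread). */
    interface LogInit {
        void init(String appId) throws Exception;
    }

    /** Records what would be sent back to the application, in order. */
    final List<String> dispatched = new ArrayList<>();

    /**
     * Runs log init before localization.  Any exception is converted into an
     * app finish event with a diagnostic instead of escaping to the caller.
     */
    AppEvent initAppLogs(String appId, LogInit logInit) {
        AppEvent event;
        try {
            logInit.init(appId);
            // Success: localization may begin after this event.
            event = AppEvent.LOG_INIT_SUCCESS;
        } catch (Exception e) {
            // Failure: keep the exception as a diagnostic and finish the app.
            dispatched.add(appId + " diagnostic: " + e);
            event = AppEvent.APPLICATION_FINISHED;
        }
        dispatched.add(appId + " " + event);
        return event;
    }

    public static void main(String[] args) {
        LogInitSketch nm = new LogInitSketch();
        nm.initAppLogs("app_1", id -> { });
        nm.initAppLogs("app_2", id -> {
            throw new java.io.IOException("Permission denied");
        });
        nm.dispatched.forEach(System.out::println);
    }
}
```

The point of the sketch is that the caller of initAppLogs never sees the exception, so one bad application cannot take the whole process down with it.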
                


[jira] [Commented] (MAPREDUCE-4302) NM goes down if error encountered during log aggregation

Posted by "Hudson (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-4302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13287945#comment-13287945 ] 

Hudson commented on MAPREDUCE-4302:
-----------------------------------

Integrated in Hadoop-Mapreduce-trunk #1098 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1098/])
    MAPREDUCE-4302. NM goes down if error encountered during log aggregation (Daryn Sharp via bobby) (Revision 1345362)

     Result = FAILURE
bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1345362
Files : 
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/ContainerManagerImpl.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/application/ApplicationEventType.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/application/ApplicationFinishEvent.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/application/ApplicationImpl.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/LogAggregationService.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/loghandler/NonAggregatingLogHandler.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/TestLogAggregationService.java

                


[jira] [Commented] (MAPREDUCE-4302) NM goes down if error encountered during log aggregation

Posted by "Daryn Sharp (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-4302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13287709#comment-13287709 ] 

Daryn Sharp commented on MAPREDUCE-4302:
----------------------------------------

Yes, I manually tested by doing things like replacing the log dir with a file to force the mkdir to fail.
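
For anyone wanting to see the same kind of failure in isolation, the trick can be sketched on a local filesystem with java.nio.  This is only an analogy under stated assumptions: the manual test above was presumably against the DFS log dir, whereas this uses a temp directory, and MkdirFailureSketch and mkdirFails are hypothetical names.  Files.createDirectories throws FileAlreadyExistsException (an IOException) when the path exists but is not a directory.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical local-filesystem analogy of "replace the log dir with a file". */
class MkdirFailureSketch {

    /** Returns true if creating a directory at the given path fails. */
    static boolean mkdirFails(Path dir) {
        try {
            Files.createDirectories(dir);
            return false;
        } catch (IOException e) {
            // Raised, for example, when a regular file already occupies the path.
            return true;
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempDirectory("log-init-sketch");
        Path logDir = tmp.resolve("logs");

        // Put a regular file where the log dir is expected.
        Files.createFile(logDir);
        System.out.println("mkdir failed: " + mkdirFails(logDir));

        // Remove the file and try again.
        Files.delete(logDir);
        System.out.println("mkdir failed after removing the file: " + mkdirFails(logDir));
    }
}
```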
                


[jira] [Commented] (MAPREDUCE-4302) NM goes down if error encountered during log aggregation

Posted by "Hudson (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-4302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13287891#comment-13287891 ] 

Hudson commented on MAPREDUCE-4302:
-----------------------------------

Integrated in Hadoop-Hdfs-trunk #1064 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1064/])
    MAPREDUCE-4302. NM goes down if error encountered during log aggregation (Daryn Sharp via bobby) (Revision 1345362)

     Result = FAILURE
bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1345362
Files : 
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/ContainerManagerImpl.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/application/ApplicationEventType.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/application/ApplicationFinishEvent.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/application/ApplicationImpl.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/LogAggregationService.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/loghandler/NonAggregatingLogHandler.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/TestLogAggregationService.java

                


[jira] [Commented] (MAPREDUCE-4302) NM goes down if error encountered during log aggregation

Posted by "Robert Joseph Evans (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-4302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13287713#comment-13287713 ] 

Robert Joseph Evans commented on MAPREDUCE-4302:
------------------------------------------------

Sounds good +1
                


[jira] [Commented] (MAPREDUCE-4302) NM goes down if error encountered during log aggregation

Posted by "Robert Joseph Evans (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-4302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13287707#comment-13287707 ] 

Robert Joseph Evans commented on MAPREDUCE-4302:
------------------------------------------------

The changes look mostly straightforward, and the unit tests look OK.  Have you done any manual testing?
                


[jira] [Updated] (MAPREDUCE-4302) NM goes down if error encountered during log aggregation

Posted by "Robert Joseph Evans (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAPREDUCE-4302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Joseph Evans updated MAPREDUCE-4302:
-------------------------------------------

       Resolution: Fixed
    Fix Version/s: 3.0.0
                   2.0.1-alpha
                   0.23.3
           Status: Resolved  (was: Patch Available)

Thanks Daryn,

I put this into trunk, branch-2, and branch-0.23.
                
