Posted to yarn-issues@hadoop.apache.org by "Jason Lowe (JIRA)" <ji...@apache.org> on 2015/04/10 19:02:12 UTC

[jira] [Commented] (YARN-3476) Nodemanager can fail to delete local logs if log aggregation fails

    [ https://issues.apache.org/jira/browse/YARN-3476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14489921#comment-14489921 ] 

Jason Lowe commented on YARN-3476:
----------------------------------

Snippet from an NM log:

{noformat}
2015-03-15 11:34:34,671 [LogAggregationService #25] ERROR logaggregation.AppLogAggregatorImpl: Couldn't upload logs for container_e03_1424994657328_776201_02_016386. Skipping this container.
2015-03-15 11:34:34,672 [DeletionService #3] INFO nodemanager.LinuxContainerExecutor: Deleting absolute path : null
2015-03-15 11:34:34,751 [LogAggregationService #25] WARN logaggregation.AppLogAggregatorImpl: Aggregation did not complete for application application_1424994657328_776201
2015-03-15 11:34:34,751 [LogAggregationService #25] ERROR yarn.YarnUncaughtExceptionHandler: Thread Thread[LogAggregationService #25,5,main] threw an Exception.
java.lang.IllegalStateException: Cannot close TFile in the middle of key-value insertion.
        at org.apache.hadoop.io.file.tfile.TFile$Writer.close(TFile.java:310)
        at org.apache.hadoop.yarn.logaggregation.AggregatedLogFormat$LogWriter.close(AggregatedLogFormat.java:454)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl.uploadLogsForContainers(AppLogAggregatorImpl.java:285)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl.doAppLogAggregation(AppLogAggregatorImpl.java:415)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl.run(AppLogAggregatorImpl.java:380)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService$2.run(LogAggregationService.java:387)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:722)
{noformat}

Because of the TFile error we never reach the post-aggregation cleanup, such as deleting the application logs, so the logs are leaked on the local disk.
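
One way to keep the cleanup from being skipped can be sketched with a try/finally around the writer close. This is only an illustration of the pattern, not the NM code or a proposed patch: CleanupSketch, closeThenCleanup, and the two Runnable parameters are hypothetical stand-ins for the TFile writer close and the local log deletion.

```java
// Minimal sketch: run cleanup even when closing the writer throws.
// All names here are hypothetical stand-ins, not Hadoop classes.
class CleanupSketch {

    /**
     * Closes the writer, then always runs the cleanup.
     * An exception from closeWriter still propagates to the caller,
     * but only after cleanup has run.
     */
    static void closeThenCleanup(Runnable closeWriter, Runnable cleanup) {
        try {
            closeWriter.run();
        } finally {
            cleanup.run();
        }
    }

    public static void main(String[] args) {
        final boolean[] cleaned = {false};
        try {
            closeThenCleanup(
                // Simulates the failure seen in the NM log above.
                () -> { throw new IllegalStateException(
                    "Cannot close TFile in the middle of key-value insertion."); },
                () -> cleaned[0] = true);
        } catch (IllegalStateException e) {
            System.out.println("close failed: " + e.getMessage());
        }
        System.out.println("cleanup ran: " + cleaned[0]);
    }
}
```

With this shape the exception still reaches the uncaught exception handler unless the caller catches it, but the local logs are no longer left behind.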

Note the "Deleting absolute path : null" log above is probably caused by this logic in AppLogAggregatorImpl:

{code}
        if (uploadedFilePathsInThisCycle.size() > 0) {
          uploadedLogsInThisCycle = true;
        }
        this.delService.delete(this.userUgi.getShortUserName(), null,
          uploadedFilePathsInThisCycle
            .toArray(new Path[uploadedFilePathsInThisCycle.size()]));
{code}

We check whether any file paths were uploaded, but then always try to delete them, even when there are none.
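
A minimal sketch of guarding the delete call so it only fires when there is something to delete. RecordingDeletionService and deleteUploaded are hypothetical stand-ins that use plain strings instead of Path and the real DeletionService; the point is only the placement of the call inside the emptiness check.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Minimal sketch: only schedule a deletion when paths were uploaded.
// All names here are hypothetical stand-ins, not Hadoop classes.
class GuardedDeleteSketch {

    /** Stand-in for the deletion service; records each call it receives. */
    static class RecordingDeletionService {
        final List<String> calls = new ArrayList<>();

        void delete(String user, String subDir, String... baseDirs) {
            calls.add(user + ":" + subDir + ":" + Arrays.toString(baseDirs));
        }
    }

    /**
     * Schedules deletion of the uploaded paths only if the list is non-empty.
     * Returns true when something was uploaded in this cycle.
     */
    static boolean deleteUploaded(RecordingDeletionService delService,
                                  String user, List<String> uploadedPaths) {
        if (uploadedPaths.isEmpty()) {
            // Nothing uploaded: skip the call instead of deleting nothing.
            return false;
        }
        delService.delete(user, null, uploadedPaths.toArray(new String[0]));
        return true;
    }

    public static void main(String[] args) {
        RecordingDeletionService svc = new RecordingDeletionService();
        boolean uploaded = deleteUploaded(svc, "someuser", new ArrayList<String>());
        System.out.println("uploaded: " + uploaded + ", delete calls: " + svc.calls.size());
    }
}
```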

> Nodemanager can fail to delete local logs if log aggregation fails
> ------------------------------------------------------------------
>
>                 Key: YARN-3476
>                 URL: https://issues.apache.org/jira/browse/YARN-3476
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: log-aggregation, nodemanager
>    Affects Versions: 2.6.0
>            Reporter: Jason Lowe
>            Assignee: Rohith
>
> If log aggregation encounters an error trying to upload the file, the underlying TFile can throw an IllegalStateException, which bubbles up through the top of the thread and prevents the application logs from being deleted.


