Posted to yarn-issues@hadoop.apache.org by "Eric Yang (JIRA)" <ji...@apache.org> on 2018/06/06 22:38:00 UTC

[jira] [Commented] (YARN-8403) Nodemanager logs failed to download file with INFO level

    [ https://issues.apache.org/jira/browse/YARN-8403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16504004#comment-16504004 ] 

Eric Yang commented on YARN-8403:
---------------------------------

We cannot aggregate this error message into the app logs because of a race condition: the error occurs before the AM container starts.

The first error can occur when the node manager's disk is full, when the YARN local directory is set up with the wrong permissions, or when a disk is bad. From a system administrator's point of view, it is safer to log this message at ERROR level, because the administrator might want to check it to make sure the cluster is not misconfigured. However, this log entry could also be noise, because a user error in referencing a file name can trigger the same error message.

The second message can be left at WARN level without being aggregated, because both messages trace back to the same root cause: the node manager is unable to download a file for a container.
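
To make the trade-off concrete, here is a minimal sketch of one way a log level could be picked for a failed download. It is not the NodeManager's actual code and not a proposed patch: the class DownloadFailureLogLevel, the helper levelFor, and the rule "an IOException whose message starts with the local directory path points at a node-side problem" are all assumptions made for illustration, and the HDFS path in the second case is made up.

```java
import java.io.FileNotFoundException;
import java.io.IOException;

public class DownloadFailureLogLevel {

    // Only the two levels discussed in this comment.
    enum Level { WARN, ERROR }

    /**
     * Hypothetical helper: walks the cause chain of a download failure and
     * returns ERROR when some IOException message starts with the node's
     * local directory (assumed to indicate a node-side problem such as bad
     * permissions, a full disk, or a bad disk). Anything else is treated as
     * a likely user error and stays at WARN.
     */
    static Level levelFor(Throwable failure, String localDirPrefix) {
        for (Throwable t = failure; t != null; t = t.getCause()) {
            String msg = t.getMessage();
            if (t instanceof IOException && msg != null
                    && msg.startsWith(localDirPrefix)) {
                return Level.ERROR;
            }
        }
        return Level.WARN;
    }

    public static void main(String[] args) {
        String localDir = "/grid/0/hadoop/yarn/local";

        // Shaped like the first trace in the issue: a local write fails.
        Throwable nodeSide = new Exception("Download and unpack failed",
                new FileNotFoundException(localDir
                        + "/filecache/28_tmp/InputDir/input1.txt (Permission denied)"));

        // Made-up user error: the referenced remote file does not exist.
        Throwable userSide = new Exception("Download and unpack failed",
                new FileNotFoundException(
                        "File does not exist: hdfs://namenode:8020/user/someone/missing.txt"));

        System.out.println(levelFor(nodeSide, localDir)); // prints ERROR
        System.out.println(levelFor(userSide, localDir)); // prints WARN
    }
}
```

Matching on message text is fragile, so a rule like this would only reduce the noise from user errors, not eliminate it.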

> Nodemanager logs failed to download file with INFO level
> --------------------------------------------------------
>
>                 Key: YARN-8403
>                 URL: https://issues.apache.org/jira/browse/YARN-8403
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: yarn
>            Reporter: Eric Yang
>            Priority: Major
>
> Some of the container execution related stack traces are printed at INFO or WARN level.
> {code}
> 2018-06-06 03:10:40,077 INFO  localizer.ResourceLocalizationService (ResourceLocalizationService.java:writeCredentials(1312)) - Writing credentials to the nmPrivate file /grid/0/hadoop/yarn/local/nmPrivate/container_e02_1528246317583_0048_01_000001.tokens
> 2018-06-06 03:10:40,087 INFO  localizer.ResourceLocalizationService (ResourceLocalizationService.java:run(975)) - Failed to download resource { { hdfs://ctr-e138-1518143905142-347847-01-000003.hwx.site:8020/user/hrt_qa/Streaming/InputDir, 1528254452720, FILE, null },pending,[(container_e02_1528246317583_0048_01_000001)],6074418082915225,DOWNLOADING}
> org.apache.hadoop.yarn.exceptions.YarnException: Download and unpack failed
>         at org.apache.hadoop.yarn.util.FSDownload.downloadAndUnpack(FSDownload.java:306)
>         at org.apache.hadoop.yarn.util.FSDownload.verifyAndCopy(FSDownload.java:283)
>         at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:409)
>         at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:66)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.FileNotFoundException: /grid/0/hadoop/yarn/local/filecache/28_tmp/InputDir/input1.txt (Permission denied)
>         at java.io.FileOutputStream.open0(Native Method)
>         at java.io.FileOutputStream.open(FileOutputStream.java:270)
>         at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
>         at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.<init>(RawLocalFileSystem.java:236)
>         at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.<init>(RawLocalFileSystem.java:219)
>         at org.apache.hadoop.fs.RawLocalFileSystem.createOutputStreamWithMode(RawLocalFileSystem.java:318)
>         at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:307)
>         at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:338)
>         at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSOutputSummer.<init>(ChecksumFileSystem.java:401)
>         at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:464)
>         at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:443)
>         at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1169)
>         at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1149)
>         at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1038)
>         at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:408)
>         at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:399)
>         at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:381)
>         at org.apache.hadoop.yarn.util.FSDownload.downloadAndUnpack(FSDownload.java:298)
>         ... 9 more
> {code}
> {code}
> 2018-06-06 03:10:41,547 WARN  privileged.PrivilegedOperationExecutor (PrivilegedOperationExecutor.java:executePrivilegedOperation(182)) - IOException executing command:
> java.io.InterruptedIOException: java.lang.InterruptedException
>         at org.apache.hadoop.util.Shell.runCommand(Shell.java:1012)
>         at org.apache.hadoop.util.Shell.run(Shell.java:902)
>         at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1227)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:152)
>         at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.startLocalizer(LinuxContainerExecutor.java:402)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:1229)
> Caused by: java.lang.InterruptedException
>         at java.lang.Object.wait(Native Method)
>         at java.lang.Object.wait(Object.java:502)
>         at java.lang.UNIXProcess.waitFor(UNIXProcess.java:395)
>         at org.apache.hadoop.util.Shell.runCommand(Shell.java:1002)
>         ... 5 more
> 2018-06-06 03:10:41,548 WARN  nodemanager.LinuxContainerExecutor (LinuxContainerExecutor.java:startLocalizer(407)) - Exit code from container container_e02_1528246317583_0048_01_000001 startLocalizer is : -1
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationException: java.io.InterruptedIOException: java.lang.InterruptedException
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:183)
>         at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.startLocalizer(LinuxContainerExecutor.java:402)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:1229)
> Caused by: java.io.InterruptedIOException: java.lang.InterruptedException
>         at org.apache.hadoop.util.Shell.runCommand(Shell.java:1012)
>         at org.apache.hadoop.util.Shell.run(Shell.java:902)
>         at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1227)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:152)
>         ... 2 more
> Caused by: java.lang.InterruptedException
>         at java.lang.Object.wait(Native Method)
>         at java.lang.Object.wait(Object.java:502)
>         at java.lang.UNIXProcess.waitFor(UNIXProcess.java:395)
>         at org.apache.hadoop.util.Shell.runCommand(Shell.java:1002)
>         ... 5 more
> 2018-06-06 03:10:41,548 INFO  localizer.ResourceLocalizationService (ResourceLocalizationService.java:run(1249)) - Localizer failed for container_e02_1528246317583_0048_01_000001
> java.io.IOException: Application application_1528246317583_0048 initialization failed (exitCode=-1) with output: null
>         at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.startLocalizer(LinuxContainerExecutor.java:411)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:1229)
> Caused by: org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationException: java.io.InterruptedIOException: java.lang.InterruptedException
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:183)
>         at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.startLocalizer(LinuxContainerExecutor.java:402)
>         ... 1 more
> Caused by: java.io.InterruptedIOException: java.lang.InterruptedException
>         at org.apache.hadoop.util.Shell.runCommand(Shell.java:1012)
>         at org.apache.hadoop.util.Shell.run(Shell.java:902)
>         at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1227)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:152)
>         ... 2 more
> Caused by: java.lang.InterruptedException
>         at java.lang.Object.wait(Native Method)
>         at java.lang.Object.wait(Object.java:502)
>         at java.lang.UNIXProcess.waitFor(UNIXProcess.java:395)
>         at org.apache.hadoop.util.Shell.runCommand(Shell.java:1002)
>         ... 5 more
> {code}
> These logs are only present in the NM log. (They do not show up in the AM log.)
> These stack traces are at WARN or INFO level. Ideally, exceptions should be printed at ERROR log level.



