Posted to yarn-issues@hadoop.apache.org by "Jason Lowe (JIRA)" <ji...@apache.org> on 2018/08/16 15:05:00 UTC

[jira] [Commented] (YARN-8672) TestContainerManager#testLocalingResourceWhileContainerRunning occasionally times out

    [ https://issues.apache.org/jira/browse/YARN-8672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16582661#comment-16582661 ] 

Jason Lowe commented on YARN-8672:
----------------------------------

Here's some sample output showing the localization failure, which I believe leads to the test timeout:
{noformat}
2018-08-14 23:57:37,636 INFO  [NM ContainerManager dispatcher] containermanager.AuxServices (AuxServices.java:handle(350)) - Got event CONTAINER_INIT for appId application_0_0000
2018-08-14 23:57:37,636 INFO  [NM ContainerManager dispatcher] localizer.ResourceLocalizationService (ResourceLocalizationService.java:handle(789)) - Created localizer for container_0_0000_01_000000
2018-08-14 23:57:37,642 INFO  [LocalizerRunner for container_0_0000_01_000000] localizer.ResourceLocalizationService (ResourceLocalizationService.java:writeCredentials(1315)) - Writing credentials to the nmPrivate file /testptch/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/target/TestContainerManager-localDir/nmPrivate/container_0_0000_01_000000.tokens
2018-08-14 23:57:37,645 INFO  [LocalizerRunner for container_0_0000_01_000000] nodemanager.DefaultContainerExecutor (DefaultContainerExecutor.java:createUserCacheDirs(836)) - Initializing user nobody
2018-08-14 23:57:37,662 INFO  [LocalizerRunner for container_0_0000_01_000000] nodemanager.DefaultContainerExecutor (DefaultContainerExecutor.java:startLocalizer(166)) - Copying from /testptch/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/target/TestContainerManager-localDir/nmPrivate/container_0_0000_01_000000.tokens to /testptch/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/target/TestContainerManager-localDir/usercache/nobody/appcache/application_0_0000/container_0_0000_01_000000.tokens
2018-08-14 23:57:37,663 INFO  [LocalizerRunner for container_0_0000_01_000000] nodemanager.DefaultContainerExecutor (DefaultContainerExecutor.java:startLocalizer(174)) - Localizer CWD set to /testptch/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/target/TestContainerManager-localDir/usercache/nobody/appcache/application_0_0000 = file:/testptch/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/target/TestContainerManager-localDir/usercache/nobody/appcache/application_0_0000
2018-08-14 23:57:37,704 INFO  [NM ContainerManager dispatcher] container.ContainerImpl (ContainerImpl.java:handle(2109)) - Container container_0_0000_01_000000 transitioned from LOCALIZING to SCHEDULED
2018-08-14 23:57:37,705 INFO  [NM ContainerManager dispatcher] scheduler.ContainerScheduler (ContainerScheduler.java:startContainer(541)) - Starting container [container_0_0000_01_000000]
2018-08-14 23:57:37,733 INFO  [NM ContainerManager dispatcher] container.ContainerImpl (ContainerImpl.java:handle(2109)) - Container container_0_0000_01_000000 transitioned from SCHEDULED to RUNNING
2018-08-14 23:57:37,734 INFO  [NM ContainerManager dispatcher] monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:onStartMonitoringContainer(1013)) - Starting resource-monitoring for container_0_0000_01_000000
2018-08-14 23:57:37,771 INFO  [ContainersLauncher #0] nodemanager.DefaultContainerExecutor (DefaultContainerExecutor.java:buildCommandExecutor(370)) - launchContainer: [bash, /testptch/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/target/TestContainerManager-localDir/usercache/nobody/appcache/application_0_0000/container_0_0000_01_000000/default_container_executor.sh]
2018-08-14 23:57:38,635 INFO  [main] containermanager.ContainerManagerImpl (ContainerManagerImpl.java:getContainerStatusInternal(1455)) - Getting container-status for container_0_0000_01_000000
2018-08-14 23:57:38,636 INFO  [main] containermanager.ContainerManagerImpl (ContainerManagerImpl.java:getContainerStatusInternal(1469)) - Returning ContainerStatus: [ContainerId: container_0_0000_01_000000, ExecutionType: GUARANTEED, State: RUNNING, Capability: <memory:1024, vCores:1>, Diagnostics: , ExitStatus: -1000, IP: null, Host: null, ContainerSubState: RUNNING]
2018-08-14 23:57:38,636 INFO  [main] containermanager.TestContainerManager (BaseContainerManagerTest.java:waitForContainerState(338)) - Waiting for container to get into one of states [RUNNING]. Current state is RUNNING
2018-08-14 23:57:38,636 INFO  [main] containermanager.TestContainerManager (BaseContainerManagerTest.java:waitForContainerState(343)) - Container state is RUNNING
2018-08-14 23:57:38,651 INFO  [NM ContainerManager dispatcher] localizer.ResourceLocalizationService (ResourceLocalizationService.java:handle(783)) - New REQUEST_RESOURCE_LOCALIZATION localize request for container_0_0000_01_000000, remove old private localizer.
2018-08-14 23:57:38,651 INFO  [NM ContainerManager dispatcher] localizer.ResourceLocalizationService (ResourceLocalizationService.java:handle(789)) - Created localizer for container_0_0000_01_000000
2018-08-14 23:57:38,656 INFO  [LocalizerRunner for container_0_0000_01_000000] localizer.ResourceLocalizationService (ResourceLocalizationService.java:run(1252)) - Localizer failed for container_0_0000_01_000000
ExitCodeException exitCode=1: chmod: cannot access '/testptch/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/target/TestContainerManager-localDir/nmPrivate/container_0_0000_01_000000.tokens': No such file or directory

	at org.apache.hadoop.util.Shell.runCommand(Shell.java:1008)
	at org.apache.hadoop.util.Shell.run(Shell.java:901)
	at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1213)
	at org.apache.hadoop.util.Shell.execCommand(Shell.java:1307)
	at org.apache.hadoop.util.Shell.execCommand(Shell.java:1289)
	at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:865)
	at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.<init>(RawLocalFileSystem.java:252)
	at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.<init>(RawLocalFileSystem.java:232)
	at org.apache.hadoop.fs.RawLocalFileSystem.createOutputStreamWithMode(RawLocalFileSystem.java:331)
	at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:320)
	at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:351)
	at org.apache.hadoop.fs.FileSystem.primitiveCreate(FileSystem.java:1279)
	at org.apache.hadoop.fs.DelegateToFileSystem.createInternal(DelegateToFileSystem.java:100)
	at org.apache.hadoop.fs.ChecksumFs$ChecksumFSOutputSummer.<init>(ChecksumFs.java:353)
	at org.apache.hadoop.fs.ChecksumFs.createInternal(ChecksumFs.java:400)
	at org.apache.hadoop.fs.AbstractFileSystem.create(AbstractFileSystem.java:605)
	at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:696)
	at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:692)
	at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
	at org.apache.hadoop.fs.FileContext.create(FileContext.java:698)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.writeCredentials(ResourceLocalizationService.java:1314)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:1229)
{noformat}

Note that the failure in the stack trace above comes from the chmod that runs immediately after the local file is created for the output stream, which implies that something asynchronously removed the file in between.  The failure does not always produce exactly the same stack trace, but the common theme is an attempt to access the container tokens file at a point when it is missing.
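
To make the create-then-chmod window concrete, here is a minimal standalone sketch. It is not the NodeManager code: the class name, method name, and file names are all hypothetical, the deletion is done inline by a callback rather than by a real asynchronous thread so the outcome is deterministic, and it assumes a POSIX file system.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.nio.file.attribute.PosixFilePermissions;

// Hypothetical illustration only; none of these names come from Hadoop.
class TokenFileRaceSketch {

    // Creates the file, lets `interloper` run in the gap, then applies
    // permissions. Returns true if the chmod step succeeded and false if
    // the file had vanished by the time chmod ran.
    static boolean createThenChmod(Path file, Runnable interloper) {
        try {
            Files.createFile(file);
            // Stands in for whatever asynchronously removes the tokens file.
            interloper.run();
            try {
                Files.setPosixFilePermissions(
                    file, PosixFilePermissions.fromString("rw-------"));
                return true;
            } catch (NoSuchFileException e) {
                // Same symptom as "chmod: cannot access ...: No such file".
                return false;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("nmPrivate-sketch");
        Path quiet = dir.resolve("quiet.tokens");
        Path raced = dir.resolve("raced.tokens");

        System.out.println("no interference: "
            + createThenChmod(quiet, () -> { }));
        System.out.println("file removed:    "
            + createThenChmod(raced, () -> raced.toFile().delete()));
    }
}
```

The first call should report success and the second should report that the file was missing at chmod time.  In the real test the remover would be running on another thread, which would explain why the failure is intermittent and why the stack trace varies with where the localizer happens to be when the file disappears.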

> TestContainerManager#testLocalingResourceWhileContainerRunning occasionally times out
> -------------------------------------------------------------------------------------
>
>                 Key: YARN-8672
>                 URL: https://issues.apache.org/jira/browse/YARN-8672
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 3.2.0
>            Reporter: Jason Lowe
>            Priority: Major
>
> Precommit builds have been failing in TestContainerManager#testLocalingResourceWhileContainerRunning.  I have been able to reproduce the problem without any patch applied if I run the test enough times.  It looks like something is removing container tokens from the nmPrivate area just as a new localizer starts.


