You are viewing a plain text version of this content. The canonical link for it is here.
Posted to yarn-dev@hadoop.apache.org by "YCozy (Jira)" <ji...@apache.org> on 2020/05/28 16:21:00 UTC

[jira] [Created] (YARN-10294) NodeManager shows a wrong reason when a YARN service fails to start

YCozy created YARN-10294:
----------------------------

             Summary: NodeManager shows a wrong reason when a YARN service fails to start
                 Key: YARN-10294
                 URL: https://issues.apache.org/jira/browse/YARN-10294
             Project: Hadoop YARN
          Issue Type: Bug
          Components: nodemanager
    Affects Versions: 3.3.0
            Reporter: YCozy


We have a YARN cluster and try to start a sleeper service. A NodeManager NM1 gets assigned and tries to start the service. We can see from its log:
{noformat}
2020-05-28 14:48:18,650 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.scheduler.ContainerScheduler: Starting container [container_6_0001_01_000001]
2020-05-28 14:48:18,710 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_6_0001_01_000001 transitioned from SCHEDULED to RUNNING{noformat}
Due to some misconfiguration, the container fails to start. We can also see from the container's serviceam.log:
{noformat}
2020-05-28 14:48:56,651 [Curator-Framework-0] ERROR imps.CuratorFrameworkImpl - Background retry gave up
org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
  at org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:972)
  at org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:943)
  at org.apache.curator.framework.imps.CuratorFrameworkImpl.access$300(CuratorFrameworkImpl.java:66)
  at org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:346)
  at java.util.concurrent.FutureTask.run(FutureTask.java:266)                    
  at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
  at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  at java.lang.Thread.run(Thread.java:748)
2020-05-28 14:49:04,621 [pool-5-thread-1] ERROR service.ServiceScheduler - Failed to register app sleeper1 in registry
org.apache.hadoop.registry.client.exceptions.RegistryIOException: `/registry/users/root/services/yarn-service': Failure of mkdir()  on /registry/users/root/services/yarn-service: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode =        ConnectionLoss for /registry/users/root/services/yarn-service: KeeperErrorCode = ConnectionLoss for /registry/users/root/services/yarn-service
  at org.apache.hadoop.registry.client.impl.zk.CuratorService.operationFailure(CuratorService.java:440)
  at org.apache.hadoop.registry.client.impl.zk.CuratorService.zkMkPath(CuratorService.java:595)
  at org.apache.hadoop.registry.client.impl.zk.RegistryOperationsService.mknode(RegistryOperationsService.java:99)
  at org.apache.hadoop.yarn.service.registry.YarnRegistryViewForProviders.putService(YarnRegistryViewForProviders.java:194)
  at org.apache.hadoop.yarn.service.registry.YarnRegistryViewForProviders.registerSelf(YarnRegistryViewForProviders.java:210)
  at org.apache.hadoop.yarn.service.ServiceScheduler$2.run(ServiceScheduler.java:575)
  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)     
  at java.util.concurrent.FutureTask.run(FutureTask.java:266)                    
  at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
  at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  at java.lang.Thread.run(Thread.java:748)                                       
Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /registry/users/root/services/yarn-service
  at org.apache.zookeeper.KeeperException.create(KeeperException.java:102)       
  at org.apache.zookeeper.KeeperException.create(KeeperException.java:54)        
  at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:1637)                  
  at org.apache.curator.framework.imps.CreateBuilderImpl$17.call(CreateBuilderImpl.java:1180)
  at org.apache.curator.framework.imps.CreateBuilderImpl$17.call(CreateBuilderImpl.java:1156)
  at org.apache.curator.connection.StandardConnectionHandlingPolicy.callWithRetry(StandardConnectionHandlingPolicy.java:64)
  at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:100)              
  at org.apache.curator.framework.imps.CreateBuilderImpl.pathInForeground(CreateBuilderImpl.java:1153)                                                                                                                                                                
  at org.apache.curator.framework.imps.CreateBuilderImpl.protectedPathInForeground(CreateBuilderImpl.java:607)
  at org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:597)
  at org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:575)
  at org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:51)
  at org.apache.hadoop.registry.client.impl.zk.CuratorService.zkMkPath(CuratorService.java:587)
  ... 11 more
2020-05-28 14:49:12,351 [Listener at 0.0.0.0/43147] INFO  service.AbstractService - Service sleeper1 failed in state STARTED
java.lang.NullPointerException                                                   
  at org.apache.hadoop.yarn.api.records.ApplicationId.fromString(ApplicationId.java:120)
  at org.apache.hadoop.yarn.service.ServiceScheduler.recoverComponents(ServiceScheduler.java:463)
  at org.apache.hadoop.yarn.service.ServiceScheduler.serviceStart(ServiceScheduler.java:404)
  at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)                                                                                                                                                                                        
  at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:122)
  at org.apache.hadoop.yarn.service.ServiceMaster.lambda$serviceStart$0(ServiceMaster.java:267)
  at java.security.AccessController.doPrivileged(Native Method)                  
  at javax.security.auth.Subject.doAs(Subject.java:422)                          
  at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1845)
  at org.apache.hadoop.yarn.service.ServiceMaster.serviceStart(ServiceMaster.java:265)
  at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)   
  at org.apache.hadoop.yarn.service.ServiceMaster.main(ServiceMaster.java:346)

...

2020-05-28 14:49:13,463 [Listener at 0.0.0.0/43147] ERROR service.ServiceMaster - Error starting service master
java.lang.NullPointerException                                                   
  at org.apache.hadoop.yarn.api.records.ApplicationId.fromString(ApplicationId.java:120)
  at org.apache.hadoop.yarn.service.ServiceScheduler.recoverComponents(ServiceScheduler.java:463)
  at org.apache.hadoop.yarn.service.ServiceScheduler.serviceStart(ServiceScheduler.java:404)
  at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)   
  at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:122)
  at org.apache.hadoop.yarn.service.ServiceMaster.lambda$serviceStart$0(ServiceMaster.java:267)
  at java.security.AccessController.doPrivileged(Native Method)                  
  at javax.security.auth.Subject.doAs(Subject.java:422)                          
  at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1845)
  at org.apache.hadoop.yarn.service.ServiceMaster.serviceStart(ServiceMaster.java:265)
  at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)   
  at org.apache.hadoop.yarn.service.ServiceMaster.main(ServiceMaster.java:346)   
2020-05-28 14:49:13,463 [IPC Server Responder] INFO  ipc.Server - Stopping IPC Server Responder
2020-05-28 14:49:13,465 [Listener at 0.0.0.0/43147] INFO  util.ExitUtil - Exiting with status 1: Error starting service master{noformat}
However, NM1 shows an incorrect reason for why the container fails to start:
{noformat}
2020-05-28 14:49:13,516 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exception from container-launch with container ID: container_6_0001_01_000001 and exit code: 1
ExitCodeException exitCode=1: bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)
/bin/bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)        
/bin/bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)        

  at org.apache.hadoop.util.Shell.runCommand(Shell.java:1008)                    
  at org.apache.hadoop.util.Shell.run(Shell.java:901)                            
  at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1213)  
  at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:312)
  at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.launchContainer(ContainerLaunch.java:567)
  at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:355)
  at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:103)
  at java.util.concurrent.FutureTask.run(FutureTask.java:266)                    
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  at java.lang.Thread.run(Thread.java:748)                                       
2020-05-28 14:49:13,516 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Exception from container-launch.
2020-05-28 14:49:13,516 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Container id: container_6_0001_01_000001
2020-05-28 14:49:13,516 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Exit code: 1
2020-05-28 14:49:13,516 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Exception message: bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)
2020-05-28 14:49:13,516 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: /bin/bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)
2020-05-28 14:49:13,517 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: /bin/bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)
2020-05-28 14:49:13,519 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Container launch failed : Container exited with a non-zero exit code 1. 
2020-05-28 14:49:13,580 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_6_0001_01_000001 transitioned from RUNNING to EXITED_WITH_FAILURE{noformat}
The error shown by NM1 is actually a warning from the container's prelaunch.err file, which differs from the error as shown in the container's serviceam.log.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-dev-unsubscribe@hadoop.apache.org
For additional commands, e-mail: yarn-dev-help@hadoop.apache.org