Posted to yarn-dev@hadoop.apache.org by "Zbigniew Kostrzewa (JIRA)" <ji...@apache.org> on 2018/11/27 08:33:00 UTC

[jira] [Created] (YARN-9064) Both Resource Managers stay in standby after connection to ZooKeeper was recovered

Zbigniew Kostrzewa created YARN-9064:
----------------------------------------

             Summary: Both Resource Managers stay in standby after connection to ZooKeeper was recovered
                 Key: YARN-9064
                 URL: https://issues.apache.org/jira/browse/YARN-9064
             Project: Hadoop YARN
          Issue Type: Bug
          Components: resourcemanager, yarn
    Affects Versions: 2.6.0
         Environment: * cluster of 31 nodes
* each node is a VM with 60GB of RAM and 8 vcpus
* each VM is running CentOS 7.2 with Hadoop 2.6.0
* Hadoop cluster is secured with Kerberos
* Hadoop cluster is configured with HA
            Reporter: Zbigniew Kostrzewa


I have a 31-node Hadoop 2.6.0 cluster. The cluster is secured with Kerberos and configured for HA. The first 3 nodes hold both slave and master services:
 * Node-1: NameNode, ResourceManager, JournalNode, ZKFC, MapRed Job History Server, DataNode, NodeManager, ZooKeeper and Kerberos
 * Node-2: NameNode, ResourceManager, JournalNode, ZKFC, DataNode, NodeManager, ZooKeeper and Kerberos
 * Node-3: JournalNode, DataNode, NodeManager and ZooKeeper
 * Node-4..Node-31: DataNode and NodeManager

At some point there was a problem with the switch the nodes were connected to, and all the services started losing connectivity.
 1. At first, Kerberos stopped granting any tickets.
 2. This broke the cluster, as Hadoop services could not authenticate to each other.
 3. At some point the ZooKeeper cluster lost its leader and started a re-election.
 4. This resulted in multiple ZooKeeper-related errors and warnings in the ResourceManager and ZKFC logs.
 5. After a while, when the issue with the switch was resolved, most of the services recovered automatically.
 6. "Most" except YARN:
 a. both ResourceManagers were stuck in standby mode
 b. all NodeManagers were shut down
 7. I managed to recover YARN, but it required a manual restart of both ResourceManagers (and starting all NodeManagers).

I have all the logs from the incident, but the most important seem to be these:
{noformat}
2018-11-16 03:21:16,420 INFO org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: Unregistering app attempt : appattempt_1539778834071_0622_000001
2018-11-16 03:21:16,424 INFO org.apache.hadoop.yarn.server.resourcemanager.security.AMRMTokenSecretManager: Application finished, removing password for appattempt_1539778834071_0622_000001
2018-11-16 03:21:16,424 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1539778834071_0622 State change from NEW to ACCEPTED on event = RECOVER
2018-11-16 03:21:16,424 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAppManager: Successfully recovered 622 out of 622 applications
2018-11-16 03:21:16,424 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: The number of failed attempts is 0. The max attempts is 1
2018-11-16 03:21:16,424 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Recovery ended
2018-11-16 03:21:16,425 INFO org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: Registering app attempt : appattempt_1539778834071_0622_000002
2018-11-16 03:21:16,426 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1539778834071_0622_000002 State change from NEW to SUBMITTED on event = START
2018-11-16 03:21:16,427 INFO org.apache.hadoop.yarn.server.resourcemanager.security.RMContainerTokenSecretManager: Rolling master-key for container-tokens
2018-11-16 03:21:16,427 INFO org.apache.hadoop.yarn.server.resourcemanager.security.NMTokenSecretManagerInRM: Rolling master-key for nm-tokens
2018-11-16 03:21:16,427 INFO org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager: Updating the current master key for generating delegation tokens
2018-11-16 03:21:16,427 INFO org.apache.hadoop.yarn.server.resourcemanager.security.RMDelegationTokenSecretManager: storing master key with keyID 32
2018-11-16 03:21:16,427 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Storing RMDTMasterKey.
2018-11-16 03:21:16,440 INFO org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager: Starting expired delegation token remover thread, tokenRemoverScanInterval=60 min(s)
2018-11-16 03:21:16,441 INFO org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager: Updating the current master key for generating delegation tokens
2018-11-16 03:21:16,444 INFO org.apache.hadoop.yarn.server.resourcemanager.security.RMDelegationTokenSecretManager: storing master key with keyID 33
2018-11-16 03:21:16,445 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Storing RMDTMasterKey.
2018-11-16 03:21:16,458 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo: Application application_1539778834071_0622 requests cleared
2018-11-16 03:21:16,459 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler: Added Application Attempt appattempt_1539778834071_0622_000002 to scheduler from user packer
2018-11-16 03:21:16,459 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1539778834071_0622_000002 State change from SUBMITTED to SCHEDULED on event = ATTEMPT_ADDED
2018-11-16 03:21:16,459 INFO org.apache.hadoop.ipc.CallQueueManager: Using callQueue: class java.util.concurrent.LinkedBlockingQueue queueCapacity: 5000
2018-11-16 03:21:16,460 INFO org.apache.hadoop.service.AbstractService: Service org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService failed in state STARTED; cause: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.io.IOException: Failed on local exception: java.net.SocketException: Unresolved address; Host Details : local host is: "node-2.mydomain.com"; destination host is: (unknown):0; 
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.io.IOException: Failed on local exception: java.net.SocketException: Unresolved address; Host Details : local host is: "node-2.mydomain.com"; destination host is: (unknown):0; 
		at org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.getServer(RpcServerFactoryPBImpl.java:139)
		at org.apache.hadoop.yarn.ipc.HadoopYarnProtoRPC.getServer(HadoopYarnProtoRPC.java:65)
		at org.apache.hadoop.yarn.ipc.YarnRPC.getServer(YarnRPC.java:54)
		at org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService.serviceStart(ResourceTrackerService.java:163)
		at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
		at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120)
		at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:611)
		at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
		at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:1091)
		at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1132)
		at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1128)
		at java.security.AccessController.doPrivileged(Native Method)
		at javax.security.auth.Subject.doAs(Subject.java:422)
		at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1917)
		at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1128)
		at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:306)
		at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:132)
		at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:812)
		at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:483)
		at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:546)
		at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
Caused by: java.io.IOException: Failed on local exception: java.net.SocketException: Unresolved address; Host Details : local host is: "node-2.mydomain.com"; destination host is: (unknown):0; 
		at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
		at org.apache.hadoop.ipc.Server.bind(Server.java:522)
		at org.apache.hadoop.ipc.Server$Listener.<init>(Server.java:728)
		at org.apache.hadoop.ipc.Server.<init>(Server.java:2449)
		at org.apache.hadoop.ipc.RPC$Server.<init>(RPC.java:1042)
		at org.apache.hadoop.ipc.ProtobufRpcEngine$Server.<init>(ProtobufRpcEngine.java:535)
		at org.apache.hadoop.ipc.ProtobufRpcEngine.getServer(ProtobufRpcEngine.java:510)
		at org.apache.hadoop.ipc.RPC$Builder.build(RPC.java:887)
		at org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.createServer(RpcServerFactoryPBImpl.java:169)
		at org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.getServer(RpcServerFactoryPBImpl.java:132)
		... 20 more
Caused by: java.net.SocketException: Unresolved address
		at sun.nio.ch.Net.translateToSocketException(Net.java:131)
		at sun.nio.ch.Net.translateException(Net.java:157)
		at sun.nio.ch.Net.translateException(Net.java:163)
		at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:76)
		at org.apache.hadoop.ipc.Server.bind(Server.java:505)
		... 28 more
Caused by: java.nio.channels.UnresolvedAddressException
		at sun.nio.ch.Net.checkAddress(Net.java:101)
		at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:218)
		at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
		... 29 more
2018-11-16 03:21:16,464 INFO org.apache.hadoop.service.AbstractService: Service RMActiveServices failed in state STARTED; cause: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.io.IOException: Failed on local exception: java.net.SocketException: Unresolved address; Host Details : local host is: "node-2.mydomain.com"; destination host is: (unknown):0; 
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.io.IOException: Failed on local exception: java.net.SocketException: Unresolved address; Host Details : local host is: "node-2.mydomain.com"; destination host is: (unknown):0; 
		at org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.getServer(RpcServerFactoryPBImpl.java:139)
		at org.apache.hadoop.yarn.ipc.HadoopYarnProtoRPC.getServer(HadoopYarnProtoRPC.java:65)
		at org.apache.hadoop.yarn.ipc.YarnRPC.getServer(YarnRPC.java:54)
		at org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService.serviceStart(ResourceTrackerService.java:163)
		at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
		at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120)
		at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:611)
		at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
		at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:1091)
		at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1132)
		at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1128)
		at java.security.AccessController.doPrivileged(Native Method)
		at javax.security.auth.Subject.doAs(Subject.java:422)
		at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1917)
		at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1128)
		at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:306)
		at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:132)
		at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:812)
		at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:483)
		at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:546)
		at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
Caused by: java.io.IOException: Failed on local exception: java.net.SocketException: Unresolved address; Host Details : local host is: "node-2.mydomain.com"; destination host is: (unknown):0; 
		at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
		at org.apache.hadoop.ipc.Server.bind(Server.java:522)
		at org.apache.hadoop.ipc.Server$Listener.<init>(Server.java:728)
		at org.apache.hadoop.ipc.Server.<init>(Server.java:2449)
		at org.apache.hadoop.ipc.RPC$Server.<init>(RPC.java:1042)
		at org.apache.hadoop.ipc.ProtobufRpcEngine$Server.<init>(ProtobufRpcEngine.java:535)
		at org.apache.hadoop.ipc.ProtobufRpcEngine.getServer(ProtobufRpcEngine.java:510)
		at org.apache.hadoop.ipc.RPC$Builder.build(RPC.java:887)
		at org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.createServer(RpcServerFactoryPBImpl.java:169)
		at org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.getServer(RpcServerFactoryPBImpl.java:132)
		... 20 more
Caused by: java.net.SocketException: Unresolved address
		at sun.nio.ch.Net.translateToSocketException(Net.java:131)
		at sun.nio.ch.Net.translateException(Net.java:157)
		at sun.nio.ch.Net.translateException(Net.java:163)
		at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:76)
		at org.apache.hadoop.ipc.Server.bind(Server.java:505)
		... 28 more
Caused by: java.nio.channels.UnresolvedAddressException
		at sun.nio.ch.Net.checkAddress(Net.java:101)
		at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:218)
		at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
		... 29 more
2018-11-16 03:21:16,470 ERROR org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Returning, interrupted : java.lang.InterruptedException
2018-11-16 03:21:16,471 INFO org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.ContainerAllocationExpirer thread interrupted
2018-11-16 03:21:16,471 INFO org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: AMLivelinessMonitor thread interrupted
2018-11-16 03:21:16,471 INFO org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: NMLivelinessMonitor thread interrupted
2018-11-16 03:21:16,472 INFO org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: AMLivelinessMonitor thread interrupted
2018-11-16 03:21:16,472 ERROR org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager: ExpiredTokenRemover received java.lang.InterruptedException: sleep interrupted
2018-11-16 03:21:16,473 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping ResourceManager metrics system...
2018-11-16 03:21:16,475 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: ResourceManager metrics system stopped.
2018-11-16 03:21:16,475 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: ResourceManager metrics system shutdown complete.
2018-11-16 03:21:16,475 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: AsyncDispatcher is draining to stop, igonring any new events.
2018-11-16 03:21:16,477 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$VerifyActiveStatusThread thread interrupted! Exiting!
2018-11-16 03:21:16,487 INFO org.apache.zookeeper.ZooKeeper: Session: 0x3671a89731f0000 closed
2018-11-16 03:21:16,488 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down
2018-11-16 03:21:16,489 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Registering class org.apache.hadoop.yarn.server.resourcemanager.RMFatalEventType for class org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMFatalEventDispatcher
2018-11-16 03:21:16,490 INFO org.apache.hadoop.yarn.server.resourcemanager.security.NMTokenSecretManagerInRM: NMTokenKeyRollingInterval: 86400000ms and NMTokenKeyActivationDelay: 900000ms
2018-11-16 03:21:16,490 INFO org.apache.hadoop.yarn.server.resourcemanager.security.RMContainerTokenSecretManager: ContainerTokenKeyRollingInterval: 86400000ms and ContainerTokenKeyActivationDelay: 900000ms
2018-11-16 03:21:16,490 INFO org.apache.hadoop.yarn.server.resourcemanager.security.AMRMTokenSecretManager: AMRMTokenKeyRollingInterval: 86400000ms and AMRMTokenKeyActivationDelay: 900000 ms
2018-11-16 03:21:16,490 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStoreFactory: Using RMStateStore implementation - class org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore
2018-11-16 03:21:16,490 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Registering class org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStoreEventType for class org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler
2018-11-16 03:21:16,491 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Registering class org.apache.hadoop.yarn.server.resourcemanager.NodesListManagerEventType for class org.apache.hadoop.yarn.server.resourcemanager.NodesListManager
2018-11-16 03:21:16,491 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Using Scheduler: org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler
2018-11-16 03:21:16,491 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Registering class org.apache.hadoop.yarn.server.resourcemanager.scheduler.event.SchedulerEventType for class org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher
2018-11-16 03:21:16,491 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Registering class org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppEventType for class org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher
2018-11-16 03:21:16,491 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Registering class org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptEventType for class org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher
2018-11-16 03:21:16,491 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Registering class org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeEventType for class org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$NodeEventDispatcher
2018-11-16 03:21:16,492 INFO org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties from hadoop-metrics2.properties
2018-11-16 03:21:16,493 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot period at 10 second(s).
2018-11-16 03:21:16,493 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: ResourceManager metrics system started
2018-11-16 03:21:16,494 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Registering class org.apache.hadoop.yarn.server.resourcemanager.RMAppManagerEventType for class org.apache.hadoop.yarn.server.resourcemanager.RMAppManager
2018-11-16 03:21:16,494 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Registering class org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncherEventType for class org.apache.hadoop.yarn.server.resourcemanager.amlauncher.ApplicationMasterLauncher
2018-11-16 03:21:16,494 WARN org.apache.hadoop.metrics2.util.MBeans: Failed to register MBean "Hadoop:service=ResourceManager,name=RMNMInfo": Instance already exists.
2018-11-16 03:21:16,494 INFO org.apache.hadoop.yarn.server.resourcemanager.RMNMInfo: Registered RMNMInfo MBean
2018-11-16 03:21:16,494 INFO org.apache.hadoop.yarn.server.resourcemanager.metrics.SystemMetricsPublisher: YARN system metrics publishing service is not enabled
2018-11-16 03:21:16,494 INFO org.apache.hadoop.util.HostsFileReader: Refreshing hosts (include/exclude) list
2018-11-16 03:21:16,496 WARN org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=packer   OPERATION=transitionToActive    TARGET=RMHAProtocolService      RESULT=FAILURE  DESCRIPTION=Exception transitioning to active   PERMISSIONS=
2018-11-16 03:21:16,497 WARN org.apache.hadoop.ha.ActiveStandbyElector: Exception handling the winning of election
org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active
		at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:134)
		at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:812)
		at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:483)
		at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:546)
		at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
Caused by: org.apache.hadoop.ha.ServiceFailedException: Error when transitioning to Active mode
		at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:311)
		at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:132)
		... 4 more
Caused by: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.io.IOException: Failed on local exception: java.net.SocketException: Unresolved address; Host Details : local host is: "node-2.mydomain.com"; destination host is: (unknown):0; 
		at org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.getServer(RpcServerFactoryPBImpl.java:139)
		at org.apache.hadoop.yarn.ipc.HadoopYarnProtoRPC.getServer(HadoopYarnProtoRPC.java:65)
		at org.apache.hadoop.yarn.ipc.YarnRPC.getServer(YarnRPC.java:54)
		at org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService.serviceStart(ResourceTrackerService.java:163)
		at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
		at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120)
		at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:611)
		at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
		at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:1091)
		at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1132)
		at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1128)
		at java.security.AccessController.doPrivileged(Native Method)
		at javax.security.auth.Subject.doAs(Subject.java:422)
		at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1917)
		at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1128)
		at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:306)
		... 5 more
Caused by: java.io.IOException: Failed on local exception: java.net.SocketException: Unresolved address; Host Details : local host is: "node-2.mydomain.com"; destination host is: (unknown):0; 
		at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
		at org.apache.hadoop.ipc.Server.bind(Server.java:522)
		at org.apache.hadoop.ipc.Server$Listener.<init>(Server.java:728)
		at org.apache.hadoop.ipc.Server.<init>(Server.java:2449)
		at org.apache.hadoop.ipc.RPC$Server.<init>(RPC.java:1042)
		at org.apache.hadoop.ipc.ProtobufRpcEngine$Server.<init>(ProtobufRpcEngine.java:535)
		at org.apache.hadoop.ipc.ProtobufRpcEngine.getServer(ProtobufRpcEngine.java:510)
		at org.apache.hadoop.ipc.RPC$Builder.build(RPC.java:887)
		at org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.createServer(RpcServerFactoryPBImpl.java:169)
		at org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.getServer(RpcServerFactoryPBImpl.java:132)
		... 20 more
Caused by: java.net.SocketException: Unresolved address
		at sun.nio.ch.Net.translateToSocketException(Net.java:131)
		at sun.nio.ch.Net.translateException(Net.java:157)
		at sun.nio.ch.Net.translateException(Net.java:163)
		at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:76)
		at org.apache.hadoop.ipc.Server.bind(Server.java:505)
		... 28 more
Caused by: java.nio.channels.UnresolvedAddressException
		at sun.nio.ch.Net.checkAddress(Net.java:101)
		at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:218)
		at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
		... 29 more
2018-11-16 03:21:16,497 INFO org.apache.hadoop.ha.ActiveStandbyElector: Trying to re-establish ZK session
2018-11-16 03:21:16,511 INFO org.apache.zookeeper.ZooKeeper: Session: 0x36681eb8c720002 closed
2018-11-16 03:21:17,513 INFO org.apache.zookeeper.ZooKeeper: Initiating client connection, connectString=node-1.mydomain.com:2181,node-1.mydomain.com:2181,node-1.mydomain.com:2181 sessionTimeout=10000 watcher=org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef@655d597b
2018-11-16 03:21:17,513 ERROR org.apache.zookeeper.client.StaticHostProvider: Unable to connect to server: node-2.mydomain.com:2181
java.net.UnknownHostException: node-2.mydomain.com
		at java.net.InetAddress.getAllByName0(InetAddress.java:1280)
		at java.net.InetAddress.getAllByName(InetAddress.java:1192)
		at java.net.InetAddress.getAllByName(InetAddress.java:1126)
		at org.apache.zookeeper.client.StaticHostProvider.<init>(StaticHostProvider.java:60)
		at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:445)
		at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:380)
		at org.apache.hadoop.ha.ActiveStandbyElector.getNewZooKeeper(ActiveStandbyElector.java:630)
		at org.apache.hadoop.ha.ActiveStandbyElector.createConnection(ActiveStandbyElector.java:774)
		at org.apache.hadoop.ha.ActiveStandbyElector.reEstablishSession(ActiveStandbyElector.java:749)
		at org.apache.hadoop.ha.ActiveStandbyElector.joinElectionInternal(ActiveStandbyElector.java:660)
		at org.apache.hadoop.ha.ActiveStandbyElector.reJoinElection(ActiveStandbyElector.java:688)
		at org.apache.hadoop.ha.ActiveStandbyElector.reJoinElectionAfterFailureToBecomeActive(ActiveStandbyElector.java:530)
		at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:484)
		at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:546)
		at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
2018-11-16 03:21:17,559 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server node-3.mydomain.com/10.242.1.106:2181. Will not attempt to authenticate using SASL (unknown error)
2018-11-16 03:21:17,560 INFO org.apache.zookeeper.ClientCnxn: Socket connection established, initiating session, client: /10.242.1.105:46773, server: node-3.mydomain.com/10.242.1.106:2181
2018-11-16 03:21:17,573 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server node-3.mydomain.com/10.242.1.106:2181, sessionid = 0x3671a89731f0003, negotiated timeout = 10000
2018-11-16 03:21:17,575 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session connected.
2018-11-16 03:21:17,575 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x36681eb8c720002
2018-11-16 03:21:17,575 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down
2018-11-16 03:21:17,585 INFO org.apache.hadoop.conf.Configuration: found resource yarn-site.xml at file:/hadoop-2.6.0-cdh5.14.0/etc/hadoop/yarn-site.xml
2018-11-16 03:21:17,588 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=packer   OPERATION=refreshAdminAcls      TARGET=AdminService     RESULT=SUCCESS
2018-11-16 03:21:17,588 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Already in standby state
2018-11-16 03:21:17,588 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=packer   OPERATION=transitionToStandby   TARGET=RMHAProtocolService      RESULT=SUCCESS
2018-11-16 03:30:57,669 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler: Release request cache is cleaned up
2018-11-16 03:31:16,496 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler: Release request cache is cleaned up
2018-11-19 13:35:36,554 WARN org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:packer/node-2.mydomain.com@SA_REALM (auth:KERBEROS) cause:javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
2018-11-19 13:35:39,353 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for packer/node-2.mydomain.com@SA_REALM (auth:KERBEROS)
2018-11-19 13:35:39,357 INFO SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager: Authorization successful for packer/node-2.mydomain.com@SA_REALM (auth:KERBEROS) for protocol=interface org.apache.hadoop.ha.HAServiceProtocol
2018-11-19 13:35:45,785 WARN org.apache.hadoop.security.authentication.server.AuthenticationFilter: AuthenticationToken ignored: org.apache.hadoop.security.authentication.util.SignerException: Invalid signature
2018-11-21 08:29:19,995 WARN org.apache.hadoop.security.authentication.server.AuthenticationFilter: AuthenticationToken ignored: org.apache.hadoop.security.authentication.util.SignerException: Invalid signature
2018-11-21 08:29:20,001 WARN org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:packer/node-2.mydomain.com@SA_REALM (auth:KERBEROS) cause:javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
2018-11-21 08:29:23,662 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for packer/node-2.mydomain.com@SA_REALM (auth:KERBEROS)
2018-11-21 08:29:23,666 INFO SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager: Authorization successful for packer/node-2.mydomain.com@SA_REALM (auth:KERBEROS) for protocol=interface org.apache.hadoop.ha.HAServiceProtocol
2018-11-21 08:31:37,254 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for packer/node-2.mydomain.com@SA_REALM (auth:KERBEROS)
2018-11-21 08:31:37,258 INFO SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager: Authorization successful for packer/node-2.mydomain.com@SA_REALM (auth:KERBEROS) for protocol=interface org.apache.hadoop.ha.HAServiceProtocol
{noformat}
I have found a few tickets about race conditions in YARN that surface when there are problems connecting to ZooKeeper, but either they should already have been fixed in 2.6.0 or the logs don't match.
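
The innermost exception in the traces above (java.nio.channels.UnresolvedAddressException thrown from ServerSocketChannelImpl.bind) can be reproduced with plain JDK classes. The sketch below is only an illustration, not Hadoop code. It assumes that the ResourceTrackerService bind address could not be resolved when the transition to active was attempted, which the "(unknown):0" destination and the later UnknownHostException for node-2.mydomain.com suggest but do not prove. The class and method names are hypothetical.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.UnresolvedAddressException;

public class UnresolvedBindDemo {

    // Returns true if binding a server channel to the given address is
    // rejected because the address is unresolved.
    public static boolean bindRejectedAsUnresolved(InetSocketAddress address) throws IOException {
        try (ServerSocketChannel channel = ServerSocketChannel.open()) {
            channel.bind(address);
            return false;
        } catch (UnresolvedAddressException e) {
            return true;
        }
    }

    public static void main(String[] args) throws IOException {
        // createUnresolved skips the DNS lookup, which stands in for a lookup
        // that failed during the outage (an assumption about the incident).
        InetSocketAddress unresolved =
                InetSocketAddress.createUnresolved("node-2.mydomain.com", 0);
        System.out.println("unresolved address rejected: "
                + bindRejectedAsUnresolved(unresolved));

        // For comparison: loopback with port 0 lets the OS pick a free port.
        InetSocketAddress loopback =
                new InetSocketAddress(InetAddress.getLoopbackAddress(), 0);
        System.out.println("loopback address rejected: "
                + bindRejectedAsUnresolved(loopback));
    }
}
```

An InetSocketAddress built while name resolution is failing stays unresolved, so the bind is rejected before any socket is opened. This would be consistent with a ResourceManager that wins the election but cannot start its active services, though the sketch says nothing about why neither ResourceManager retried successfully afterwards.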


