Posted to yarn-dev@hadoop.apache.org by "Zack Marsh (JIRA)" <ji...@apache.org> on 2015/06/30 23:24:04 UTC

[jira] [Created] (YARN-3871) ResourceManager down after Blueprint install

Zack Marsh created YARN-3871:
--------------------------------

             Summary: ResourceManager down after Blueprint install 
                 Key: YARN-3871
                 URL: https://issues.apache.org/jira/browse/YARN-3871
             Project: Hadoop YARN
          Issue Type: Bug
    Affects Versions: 2.7.1
         Environment: ambari-2.1.0-1295, hdp-2.3.0.0-2497, sles11sp3

            Reporter: Zack Marsh
            Priority: Critical
         Attachments: yarn-yarn-resourcemanager-piripiri3.log, yarn-yarn-resourcemanager-piripiri3.out

On a 3-Master HDP 2.3 cluster installed with HDP-2.3.0.0-2482 and Ambari-2.1.0-1266, the YARN ResourceManager was down following the Blueprint install.

Note that nothing failed during the Blueprint install itself. The ResourceManager shut down at startup because it could not connect to ZooKeeper.
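
One way to check ZooKeeper reachability from the ResourceManager host is a small probe like the sketch below. It is only an illustration, not something that was run as part of this report: the function name probe_zookeeper and the result labels are made up here, and it assumes the ZooKeeper servers still answer the four-letter "ruok" command (true by default for ZooKeeper 3.4.x; newer releases may require it to be whitelisted).

```python
import socket


def probe_zookeeper(host, port=2181, timeout=2.0):
    """Send ZooKeeper's four-letter 'ruok' command and classify the result.

    Hypothetical helper for illustration; the labels returned here are
    not part of any ZooKeeper or Hadoop API.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(b"ruok")
            reply = sock.recv(16)
    except ConnectionRefusedError:
        # Nothing is listening on the port (matches "Connection refused").
        return "refused"
    except OSError as exc:
        # Timeouts, resets, DNS failures and similar socket-level errors.
        return f"error: {exc}"
    # A healthy server answers "imok"; an empty reply means the server
    # accepted the socket and then closed it without answering.
    return "ok" if reply == b"imok" else "no_reply"
```

Calling probe_zookeeper for each host in the connectString shown in the log (piripiri1, piripiri2 and piripiri3 on port 2181) would show which quorum members refuse connections and which accept a socket but do not answer.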

Excerpt from the ResourceManager log:
{code}
2015-06-26 03:35:47,188 INFO  zookeeper.ZooKeeper (Environment.java:logEnv(100)) - Client environment:java.library.path=:/usr/hdp/2.3.0.0-2482/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.0.0-2482/hadoop/lib/native:/usr/hdp/2.3.0.0-2482/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.0.0-2482/hadoop/lib/native
2015-06-26 03:35:47,188 INFO  zookeeper.ZooKeeper (Environment.java:logEnv(100)) - Client environment:java.io.tmpdir=/tmp
2015-06-26 03:35:47,188 INFO  zookeeper.ZooKeeper (Environment.java:logEnv(100)) - Client environment:java.compiler=<NA>
2015-06-26 03:35:47,188 INFO  zookeeper.ZooKeeper (Environment.java:logEnv(100)) - Client environment:os.name=Linux
2015-06-26 03:35:47,188 INFO  zookeeper.ZooKeeper (Environment.java:logEnv(100)) - Client environment:os.arch=amd64
2015-06-26 03:35:47,188 INFO  zookeeper.ZooKeeper (Environment.java:logEnv(100)) - Client environment:os.version=3.0.101-0.50.TDC.1.R.0-default
2015-06-26 03:35:47,188 INFO  zookeeper.ZooKeeper (Environment.java:logEnv(100)) - Client environment:user.name=yarn
2015-06-26 03:35:47,188 INFO  zookeeper.ZooKeeper (Environment.java:logEnv(100)) - Client environment:user.home=/home/yarn
2015-06-26 03:35:47,188 INFO  zookeeper.ZooKeeper (Environment.java:logEnv(100)) - Client environment:user.dir=/usr/hdp/2.3.0.0-2482/hadoop-yarn
2015-06-26 03:35:47,190 INFO  zookeeper.ZooKeeper (ZooKeeper.java:<init>(438)) - Initiating client connection, connectString=piripiri2.labs.teradata.com:2181,piripiri1.labs.teradata.com:2181,piripiri3.labs.teradata.com:2181 sessionTimeout=10000 watcher=org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef@59d2103b
2015-06-26 03:35:47,209 INFO  zookeeper.ClientCnxn (ClientCnxn.java:logStartConnect(975)) - Opening socket connection to server piripiri2.labs.teradata.com/39.0.40.2:2181. Will not attempt to authenticate using SASL (unknown error)
2015-06-26 03:35:47,276 WARN  zookeeper.ClientCnxn (ClientCnxn.java:run(1102)) - Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:716)
        at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
2015-06-26 03:35:47,380 INFO  zookeeper.ClientCnxn (ClientCnxn.java:logStartConnect(975)) - Opening socket connection to server piripiri3.labs.teradata.com/39.0.40.3:2181. Will not attempt to authenticate using SASL (unknown error)
2015-06-26 03:35:47,381 INFO  zookeeper.ClientCnxn (ClientCnxn.java:primeConnection(852)) - Socket connection established to piripiri3.labs.teradata.com/39.0.40.3:2181, initiating session
2015-06-26 03:35:47,452 INFO  zookeeper.ClientCnxn (ClientCnxn.java:run(1098)) - Unable to read additional data from server sessionid 0x0, likely server has closed socket, closing socket connection and attempting reconnect
2015-06-26 03:35:48,067 INFO  zookeeper.ClientCnxn (ClientCnxn.java:logStartConnect(975)) - Opening socket connection to server piripiri1.labs.teradata.com/39.0.40.1:2181. Will not attempt to authenticate using SASL (unknown error)
2015-06-26 03:35:48,378 WARN  zookeeper.ClientCnxn (ClientCnxn.java:run(1102)) - Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:716)
        at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
2015-06-26 03:35:49,914 INFO  zookeeper.ClientCnxn (ClientCnxn.java:logStartConnect(975)) - Opening socket connection to server piripiri2.labs.teradata.com/39.0.40.2:2181. Will not attempt to authenticate using SASL (unknown error)
2015-06-26 03:35:49,915 WARN  zookeeper.ClientCnxn (ClientCnxn.java:run(1102)) - Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:716)
        at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
2015-06-26 03:35:50,028 INFO  zookeeper.ClientCnxn (ClientCnxn.java:logStartConnect(975)) - Opening socket connection to server piripiri3.labs.teradata.com/39.0.40.3:2181. Will not attempt to authenticate using SASL (unknown error)
2015-06-26 03:35:50,028 INFO  zookeeper.ClientCnxn (ClientCnxn.java:primeConnection(852)) - Socket connection established to piripiri3.labs.teradata.com/39.0.40.3:2181, initiating session
2015-06-26 03:35:50,030 INFO  zookeeper.ClientCnxn (ClientCnxn.java:run(1098)) - Unable to read additional data from server sessionid 0x0, likely server has closed socket, closing socket connection and attempting reconnect
2015-06-26 03:35:50,133 INFO  zookeeper.ClientCnxn (ClientCnxn.java:logStartConnect(975)) - Opening socket connection to server piripiri1.labs.teradata.com/39.0.40.1:2181. Will not attempt to authenticate using SASL (unknown error)
2015-06-26 03:35:50,134 WARN  zookeeper.ClientCnxn (ClientCnxn.java:run(1102)) - Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:716)
        at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
2015-06-26 03:35:52,064 INFO  zookeeper.ClientCnxn (ClientCnxn.java:logStartConnect(975)) - Opening socket connection to server piripiri2.labs.teradata.com/39.0.40.2:2181. Will not attempt to authenticate using SASL (unknown error)
2015-06-26 03:35:52,065 WARN  zookeeper.ClientCnxn (ClientCnxn.java:run(1102)) - Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:716)
        at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
2015-06-26 03:35:52,901 INFO  zookeeper.ClientCnxn (ClientCnxn.java:logStartConnect(975)) - Opening socket connection to server piripiri3.labs.teradata.com/39.0.40.3:2181. Will not attempt to authenticate using SASL (unknown error)
2015-06-26 03:35:52,901 INFO  zookeeper.ClientCnxn (ClientCnxn.java:primeConnection(852)) - Socket connection established to piripiri3.labs.teradata.com/39.0.40.3:2181, initiating session
2015-06-26 03:35:52,902 INFO  zookeeper.ClientCnxn (ClientCnxn.java:run(1098)) - Unable to read additional data from server sessionid 0x0, likely server has closed socket, closing socket connection and attempting reconnect
2015-06-26 03:35:53,570 INFO  zookeeper.ClientCnxn (ClientCnxn.java:logStartConnect(975)) - Opening socket connection to server piripiri1.labs.teradata.com/39.0.40.1:2181. Will not attempt to authenticate using SASL (unknown error)
2015-06-26 03:35:53,571 WARN  zookeeper.ClientCnxn (ClientCnxn.java:run(1102)) - Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:716)
        at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
2015-06-26 03:35:55,541 INFO  zookeeper.ClientCnxn (ClientCnxn.java:logStartConnect(975)) - Opening socket connection to server piripiri2.labs.teradata.com/39.0.40.2:2181. Will not attempt to authenticate using SASL (unknown error)
2015-06-26 03:35:55,542 WARN  zookeeper.ClientCnxn (ClientCnxn.java:run(1102)) - Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:716)
        at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
2015-06-26 03:35:56,513 INFO  zookeeper.ClientCnxn (ClientCnxn.java:logStartConnect(975)) - Opening socket connection to server piripiri3.labs.teradata.com/39.0.40.3:2181. Will not attempt to authenticate using SASL (unknown error)
2015-06-26 03:35:56,514 INFO  zookeeper.ClientCnxn (ClientCnxn.java:primeConnection(852)) - Socket connection established to piripiri3.labs.teradata.com/39.0.40.3:2181, initiating session
2015-06-26 03:35:56,515 INFO  zookeeper.ClientCnxn (ClientCnxn.java:run(1098)) - Unable to read additional data from server sessionid 0x0, likely server has closed socket, closing socket connection and attempting reconnect
2015-06-26 03:35:56,821 INFO  zookeeper.ClientCnxn (ClientCnxn.java:logStartConnect(975)) - Opening socket connection to server piripiri1.labs.teradata.com/39.0.40.1:2181. Will not attempt to authenticate using SASL (unknown error)
2015-06-26 03:35:56,822 WARN  zookeeper.ClientCnxn (ClientCnxn.java:run(1102)) - Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:716)
        at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
2015-06-26 03:35:57,205 ERROR ha.ActiveStandbyElector (ActiveStandbyElector.java:waitForZKConnectionEvent(1044)) - Connection timed out: couldn't connect to ZooKeeper in 10000 milliseconds
2015-06-26 03:35:57,396 INFO  zookeeper.ZooKeeper (ZooKeeper.java:close(684)) - Session: 0x0 closed
2015-06-26 03:35:57,397 INFO  zookeeper.ClientCnxn (ClientCnxn.java:run(512)) - EventThread shut down
2015-06-26 03:35:57,403 INFO  service.AbstractService (AbstractService.java:noteFailure(272)) - Service org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService failed in state INITED; cause: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
        at org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef.waitForZKConnectionEvent(ActiveStandbyElector.java:1047)
        at org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef.access$400(ActiveStandbyElector.java:1018)
        at org.apache.hadoop.ha.ActiveStandbyElector.getNewZooKeeper(ActiveStandbyElector.java:633)
        at org.apache.hadoop.ha.ActiveStandbyElector.createConnection(ActiveStandbyElector.java:767)
        at org.apache.hadoop.ha.ActiveStandbyElector.<init>(ActiveStandbyElector.java:227)
        at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.serviceInit(EmbeddedElectorService.java:92)
        at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
        at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
        at org.apache.hadoop.yarn.server.resourcemanager.AdminService.serviceInit(AdminService.java:149)
        at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
        at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:261)
        at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1226)
2015-06-26 03:35:57,404 INFO  service.AbstractService (AbstractService.java:noteFailure(272)) - Service org.apache.hadoop.yarn.server.resourcemanager.AdminService failed in state INITED; cause: org.apache.hadoop.service.ServiceStateException: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
org.apache.hadoop.service.ServiceStateException: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
        at org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
        at org.apache.hadoop.service.AbstractService.init(AbstractService.java:172)
        at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
        at org.apache.hadoop.yarn.server.resourcemanager.AdminService.serviceInit(AdminService.java:149)
        at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
        at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:261)
        at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1226)
Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
        at org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef.waitForZKConnectionEvent(ActiveStandbyElector.java:1047)
        at org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef.access$400(ActiveStandbyElector.java:1018)
        at org.apache.hadoop.ha.ActiveStandbyElector.getNewZooKeeper(ActiveStandbyElector.java:633)
        at org.apache.hadoop.ha.ActiveStandbyElector.createConnection(ActiveStandbyElector.java:767)
        at org.apache.hadoop.ha.ActiveStandbyElector.<init>(ActiveStandbyElector.java:227)
        at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.serviceInit(EmbeddedElectorService.java:92)
        at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
        ... 7 more
2015-06-26 03:35:57,404 INFO  service.AbstractService (AbstractService.java:noteFailure(272)) - Service ResourceManager failed in state INITED; cause: org.apache.hadoop.service.ServiceStateException: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
org.apache.hadoop.service.ServiceStateException: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
        at org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
        at org.apache.hadoop.service.AbstractService.init(AbstractService.java:172)
        at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
        at org.apache.hadoop.yarn.server.resourcemanager.AdminService.serviceInit(AdminService.java:149)
        at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
        at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:261)
        at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1226)
Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
        at org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef.waitForZKConnectionEvent(ActiveStandbyElector.java:1047)
        at org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef.access$400(ActiveStandbyElector.java:1018)
        at org.apache.hadoop.ha.ActiveStandbyElector.getNewZooKeeper(ActiveStandbyElector.java:633)
        at org.apache.hadoop.ha.ActiveStandbyElector.createConnection(ActiveStandbyElector.java:767)
        at org.apache.hadoop.ha.ActiveStandbyElector.<init>(ActiveStandbyElector.java:227)
        at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.serviceInit(EmbeddedElectorService.java:92)
        at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
        ... 7 more
2015-06-26 03:35:57,405 INFO  resourcemanager.ResourceManager (ResourceManager.java:transitionToStandby(1068)) - Transitioning to standby state
2015-06-26 03:35:57,405 INFO  resourcemanager.ResourceManager (ResourceManager.java:transitionToStandby(1075)) - Transitioned to standby state
2015-06-26 03:35:57,405 FATAL resourcemanager.ResourceManager (ResourceManager.java:main(1230)) - Error starting ResourceManager
org.apache.hadoop.service.ServiceStateException: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
        at org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
        at org.apache.hadoop.service.AbstractService.init(AbstractService.java:172)
        at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
        at org.apache.hadoop.yarn.server.resourcemanager.AdminService.serviceInit(AdminService.java:149)
        at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
        at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:261)
        at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1226)
Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
        at org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef.waitForZKConnectionEvent(ActiveStandbyElector.java:1047)
        at org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef.access$400(ActiveStandbyElector.java:1018)
        at org.apache.hadoop.ha.ActiveStandbyElector.getNewZooKeeper(ActiveStandbyElector.java:633)
        at org.apache.hadoop.ha.ActiveStandbyElector.createConnection(ActiveStandbyElector.java:767)
        at org.apache.hadoop.ha.ActiveStandbyElector.<init>(ActiveStandbyElector.java:227)
        at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.serviceInit(EmbeddedElectorService.java:92)
        at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
        ... 7 more
2015-06-26 03:35:57,407 INFO  resourcemanager.ResourceManager (LogAdapter.java:info(45)) - SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down ResourceManager at piripiri3/39.0.40.3
************************************************************/
{code}
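
To make the pattern in the excerpt easier to see, the connection attempts can be tallied per ZooKeeper server. The following is a rough sketch that only assumes the ClientCnxn message wording shown above; the function name summarize_zk_attempts and the outcome labels are invented for illustration.

```python
import re
from collections import Counter, defaultdict

# Message fragments taken from the ClientCnxn lines in the excerpt above.
OPENING = re.compile(r"Opening socket connection to server (\S+?)/")
REFUSED = "java.net.ConnectException: Connection refused"
DROPPED = "Unable to read additional data from server"


def summarize_zk_attempts(lines):
    """Tally, per ZooKeeper host, how each connection attempt ended."""
    outcomes = defaultdict(Counter)
    current = None  # host named by the most recent "Opening socket" line
    for line in lines:
        match = OPENING.search(line)
        if match:
            current = match.group(1)
        elif current and REFUSED in line:
            outcomes[current]["refused"] += 1
            current = None
        elif current and DROPPED in line:
            # The socket was established, then the server closed it.
            outcomes[current]["closed_by_server"] += 1
            current = None
    return {host: dict(counts) for host, counts in outcomes.items()}
```

The function accepts any iterable of lines, so it can be given an open file handle for the attached yarn-yarn-resourcemanager-piripiri3.log. Reading the excerpt by hand gives the same picture: every attempt against piripiri1 and piripiri2 is refused, while piripiri3 accepts the socket and then closes it before a session is established.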

This issue was observed again on a 3-Master cluster installed with HDP-2.3.0.0-2497 and Ambari-2.1.0-1295.




