Posted to yarn-issues@hadoop.apache.org by "Chandni Singh (JIRA)" <ji...@apache.org> on 2018/06/27 00:52:00 UTC
[jira] [Comment Edited] (YARN-8409)
ActiveStandbyElectorBasedElectorService is failing with NPE
[ https://issues.apache.org/jira/browse/YARN-8409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16524290#comment-16524290 ]
Chandni Singh edited comment on YARN-8409 at 6/27/18 12:51 AM:
---------------------------------------------------------------
This happens when the RM is started immediately after the ZooKeeper leader is killed.
The {{zkClient}} reference in {{ActiveStandbyElector}} is null, which causes the NPE.
Below is the chain of calls:
# In the {{ActiveStandbyElector}} constructor, {{reEstablishSession()}} is invoked at line 274.
# {{reEstablishSession}} tries to create the ZooKeeper connection at line 825.
# {{createConnection}} calls {{connectToZookeeper}} at line 850 to initialize {{zkClient}}.
# However, {{connectToZookeeper}} throws an {{IOException}} because of a session timeout.
# As a result, {{zkClient}} is never initialized and remains {{null}}.
{{ActiveStandbyElectorBasedElectorService}} currently does not check whether the elector is connected to ZooKeeper. It calls {{elector.ensureParentZNode()}} unconditionally, which then throws the NPE.
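The call chain above can be reproduced with a small self-contained sketch. This is not the Hadoop code: {{ElectorSketch}}, {{FakeZkClient}}, {{isConnected()}} and {{initService()}} are hypothetical stand-ins, and the sketch assumes the {{IOException}} is caught somewhere along the chain so that the constructor still returns. The guard shown is only one possible way to avoid the NPE, not necessarily the fix that will be committed.

```java
import java.io.IOException;

// Simplified model of the failure described above. All names are
// hypothetical stand-ins; none of this is the real Hadoop or ZooKeeper API.
class ElectorSketch {

    /** Stand-in for the ZooKeeper client handle. */
    static class FakeZkClient {
        void create(String path) {
            // A real client would create the znode; nothing to do here.
        }
    }

    /** Stripped-down model of the elector's connection handling. */
    static class Elector {
        private FakeZkClient zkClient;      // stays null if connecting fails
        private final boolean zkReachable;  // simulates ZooKeeper availability

        Elector(boolean zkReachable) {
            this.zkReachable = zkReachable;
            reEstablishSession();           // result is ignored, as in the report
        }

        private boolean reEstablishSession() {
            try {
                createConnection();
                return true;
            } catch (IOException e) {
                // Assumption: the failure is swallowed here, so the
                // constructor completes with zkClient still null.
                return false;
            }
        }

        private void createConnection() throws IOException {
            zkClient = connectToZookeeper();
        }

        private FakeZkClient connectToZookeeper() throws IOException {
            if (!zkReachable) {
                throw new IOException("session timeout");
            }
            return new FakeZkClient();
        }

        /** Hypothetical check that a caller could use as a guard. */
        boolean isConnected() {
            return zkClient != null;
        }

        void ensureParentZNode() {
            zkClient.create("/parent");     // throws NPE when zkClient is null
        }
    }

    /** Models the service init step, with or without a connection guard. */
    static String initService(boolean zkReachable, boolean guard) {
        Elector elector = new Elector(zkReachable);
        if (guard && !elector.isConnected()) {
            return "not-connected";
        }
        try {
            elector.ensureParentZNode();
            return "ok";
        } catch (NullPointerException e) {
            return "npe";                   // caught only to make the demo observable
        }
    }

    public static void main(String[] args) {
        System.out.println(initService(true, false));   // ok
        System.out.println(initService(false, false));  // npe
        System.out.println(initService(false, true));   // not-connected
    }
}
```

With the guard in place, the service can fail with a clear "not connected" error or retry, instead of surfacing a bare NPE from inside the elector.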
was (Author: csingh):
This happens when RM is started immediately after killing zookeeper leader. The {{zkClient}} is null.
> ActiveStandbyElectorBasedElectorService is failing with NPE
> -----------------------------------------------------------
>
> Key: YARN-8409
> URL: https://issues.apache.org/jira/browse/YARN-8409
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 3.1.1
> Reporter: Yesha Vora
> Assignee: Chandni Singh
> Priority: Major
>
> In an RM-HA environment, kill the ZK leader and then perform an RM failover.
> Sometimes the active RM hits an NPE and fails to come up.
> {code:java}
> 2018-06-08 10:31:03,007 INFO client.ZooKeeperSaslClient (ZooKeeperSaslClient.java:run(289)) - Client will use GSSAPI as SASL mechanism.
> 2018-06-08 10:31:03,008 INFO zookeeper.ClientCnxn (ClientCnxn.java:logStartConnect(1019)) - Opening socket connection to server xxx/xxx:2181. Will attempt to SASL-authenticate using Login Context section 'Client'
> 2018-06-08 10:31:03,009 WARN zookeeper.ClientCnxn (ClientCnxn.java:run(1146)) - Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
> java.net.ConnectException: Connection refused
> at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
> at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
> at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
> at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1125)
> 2018-06-08 10:31:03,344 INFO service.AbstractService (AbstractService.java:noteFailure(267)) - Service org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService failed in state INITED
> java.lang.NullPointerException
> at org.apache.hadoop.ha.ActiveStandbyElector$3.run(ActiveStandbyElector.java:1033)
> at org.apache.hadoop.ha.ActiveStandbyElector$3.run(ActiveStandbyElector.java:1030)
> at org.apache.hadoop.ha.ActiveStandbyElector.zkDoWithRetries(ActiveStandbyElector.java:1095)
> at org.apache.hadoop.ha.ActiveStandbyElector.zkDoWithRetries(ActiveStandbyElector.java:1087)
> at org.apache.hadoop.ha.ActiveStandbyElector.createWithRetries(ActiveStandbyElector.java:1030)
> at org.apache.hadoop.ha.ActiveStandbyElector.ensureParentZNode(ActiveStandbyElector.java:347)
> at org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.serviceInit(ActiveStandbyElectorBasedElectorService.java:110)
> at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
> at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108)
> at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:336)
> at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
> at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1479)
> 2018-06-08 10:31:03,345 INFO ha.ActiveStandbyElector (ActiveStandbyElector.java:quitElection(409)) - Yielding from election{code}