You are viewing a plain text version of this content. The canonical link for it is here.
Posted to common-issues@hadoop.apache.org by "Lin Yiqun (JIRA)" <ji...@apache.org> on 2015/12/25 09:12:49 UTC
[jira] [Commented] (HADOOP-12680) Loss of zookeeper quorum lead all
the namenode to be standby state
[ https://issues.apache.org/jira/browse/HADOOP-12680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15071414#comment-15071414 ]
Lin Yiqun commented on HADOOP-12680:
------------------------------------
I show the some of zkfc log in this case:
{code}
2015-12-24 17:33:43,873 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session expired. Entering neutral mode and rejoining...
2015-12-24 17:33:43,873 INFO org.apache.hadoop.ha.ActiveStandbyElector: Trying to re-establish ZK session
2015-12-24 17:33:43,875 INFO org.apache.zookeeper.ZooKeeper: Initiating client connection, connectString=10.13.8.24:2181,10.13.8.25:2181,10.13.8.26:2181,10.13.8.27:2181,10.13.7.33:2181 sessionTimeout=30000 watcher=org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef@56d70b02
2015-12-24 17:33:43,884 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server 10.13.8.25/10.13.8.25:2181. Will not attempt to authenticate using SASL (unknown error)
2015-12-24 17:33:43,884 WARN org.apache.zookeeper.ClientCnxn: Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
2015-12-24 17:33:43,905 INFO org.apache.zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper service, session 0x451703dcdf7d107 has expired, closing socket connection
2015-12-24 17:33:43,985 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server 10.13.7.33/10.13.7.33:2181. Will not attempt to authenticate using SASL (unknown error)
2015-12-24 17:33:43,985 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to 10.13.7.33/10.13.7.33:2181, initiating session
2015-12-24 17:33:43,985 INFO org.apache.zookeeper.ClientCnxn: Unable to read additional data from server sessionid 0x0, likely server has closed socket, closing socket connection and attempting reconnect
2015-12-24 17:33:44,712 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server 10.13.8.24/10.13.8.24:2181. Will not attempt to authenticate using SASL (unknown error)
2015-12-24 17:33:44,712 WARN org.apache.zookeeper.ClientCnxn: Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
2015-12-24 17:33:45,806 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server 10.13.8.26/10.13.8.26:2181. Will not attempt to authenticate using SASL (unknown error)
2015-12-24 17:33:45,807 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to 10.13.8.26/10.13.8.26:2181, initiating session
2015-12-24 17:33:45,807 INFO org.apache.zookeeper.ClientCnxn: Unable to read additional data from server sessionid 0x0, likely server has closed socket, closing socket connection and attempting reconnect
2015-12-24 17:33:46,549 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server 10.13.8.27/10.13.8.27:2181. Will not attempt to authenticate using SASL (unknown error)
2015-12-24 17:33:46,550 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to 10.13.8.27/10.13.8.27:2181, initiating session
2015-12-24 17:33:46,561 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server 10.13.8.27/10.13.8.27:2181, sessionid = 0x451d35639b5002a, negotiated timeout = 30000
2015-12-24 17:33:46,563 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down
2015-12-24 17:33:46,564 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session connected.
2015-12-24 17:33:46,573 INFO org.apache.hadoop.ha.ZKFailoverController: ZK Election indicated that NameNode at qihe2192/10.12.2.192:9000 should become standby
2015-12-24 17:33:46,575 INFO org.apache.hadoop.ha.ZKFailoverController: Successfully transitioned NameNode at qihe2192/10.12.2.192:9000 to standby state
2015-12-24 17:47:21,517 WARN org.apache.hadoop.ha.HealthMonitor: Transport-level exception trying to monitor health of NameNode at qihe2192/10.12.2.192:9000: java.io.IOException: Connection reset by peer Failed on local exception: java.io.IOException: Connection reset by peer; Host Details : local host is: "qihe2192/10.12.2.192"; destination host is: "qihe2192":9000;
{code}
{code}
2015-12-24 17:33:44,860 INFO org.apache.zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper service, session 0x551703eef8b00c2 has expired, closing socket connection
2015-12-24 17:33:44,861 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session expired. Entering neutral mode and rejoining...
2015-12-24 17:33:44,861 INFO org.apache.hadoop.ha.ActiveStandbyElector: Trying to re-establish ZK session
2015-12-24 17:33:44,862 INFO org.apache.zookeeper.ZooKeeper: Initiating client connection, connectString=10.13.8.24:2181,10.13.8.25:2181,10.13.8.26:2181,10.13.8.27:2181,10.13.7.33:2181 sessionTimeout=30000 watcher=org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef@5eefe70b
2015-12-24 17:33:44,863 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server 10.13.8.27/10.13.8.27:2181. Will not attempt to authenticate using SASL (unknown error)
2015-12-24 17:33:44,863 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to 10.13.8.27/10.13.8.27:2181, initiating session
2015-12-24 17:33:44,871 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server 10.13.8.27/10.13.8.27:2181, sessionid = 0x451d35639b50012, negotiated timeout = 30000
2015-12-24 17:33:44,873 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down
2015-12-24 17:33:44,874 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session connected.
2015-12-24 17:33:44,892 INFO org.apache.hadoop.ha.ZKFailoverController: ZK Election indicated that NameNode at qihe2182/10.12.2.182:9000 should become standby
2015-12-24 17:33:44,928 INFO org.apache.hadoop.ha.ZKFailoverController: Successfully transitioned NameNode at qihe2182/10.12.2.182:9000 to standby state
2015-12-24 17:47:20,883 WARN org.apache.hadoop.ha.HealthMonitor: Transport-level exception trying to monitor health of NameNode at qihe2182/10.12.2.182:9000: java.io.IOException: Connection reset by peer Failed on local exception: java.io.IOException: Connection reset by peer; Host Details : local host is: "qihe2182/10.12.2.182"; destination host is: "qihe2182":9000;
2015-12-24 17:47:20,883 INFO org.apache.hadoop.ha.HealthMonitor: Entering state SERVICE_NOT_RESPONDING
{code}
In {{2015-12-24 17:33}}, namenode are all transitioned to standby state.
> Loss of zookeeper quorum lead all the namenode to be standby state
> ------------------------------------------------------------------
>
> Key: HADOOP-12680
> URL: https://issues.apache.org/jira/browse/HADOOP-12680
> Project: Hadoop Common
> Issue Type: Bug
> Components: ha
> Affects Versions: 2.7.1
> Reporter: Lin Yiqun
>
> When I am upgrading my zookeeper cluster, and will change the ip address of zk nodes. And I found two namenodes of my hadoop cluster got loss of connection with zk. And when I revocer the zk cluster, the two namenodes are both transitioned to standby state and this makes cluster can't provide service. I found the reason may be is following:
> {code}
> /**
> * If the elector gets disconnected from Zookeeper and does not know about
> * the lock state, then it will notify the service via the enterNeutralMode
> * interface. The service may choose to ignore this or stop doing state
> * changing operations. Upon reconnection, the elector verifies the leader
> * status and calls back on the becomeActive and becomeStandby app
> * interfaces. <br/>
> * Zookeeper disconnects can happen due to network issues or loss of
> * Zookeeper quorum. Thus enterNeutralMode can be used to guard against
> * split-brain issues. In such situations it might be prudent to call
> * becomeStandby too. However, such state change operations might be
> * expensive and enterNeutralMode can help guard against doing that for
> * transient issues.
> */
> void enterNeutralMode();
> {code}
> May be we should create a thread to monitor the stat of namenodes and don't let them all to be standby state.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)