Posted to common-issues@hadoop.apache.org by "Lin Yiqun (JIRA)" <ji...@apache.org> on 2015/12/25 09:12:49 UTC

[jira] [Commented] (HADOOP-12680) Loss of zookeeper quorum lead all the namenode to be standby state

    [ https://issues.apache.org/jira/browse/HADOOP-12680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15071414#comment-15071414 ] 

Lin Yiqun commented on HADOOP-12680:
------------------------------------

Here is part of the ZKFC log from each NameNode in this case:
{code}
2015-12-24 17:33:43,873 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session expired. Entering neutral mode and rejoining...
2015-12-24 17:33:43,873 INFO org.apache.hadoop.ha.ActiveStandbyElector: Trying to re-establish ZK session
2015-12-24 17:33:43,875 INFO org.apache.zookeeper.ZooKeeper: Initiating client connection, connectString=10.13.8.24:2181,10.13.8.25:2181,10.13.8.26:2181,10.13.8.27:2181,10.13.7.33:2181 sessionTimeout=30000 watcher=org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef@56d70b02
2015-12-24 17:33:43,884 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server 10.13.8.25/10.13.8.25:2181. Will not attempt to authenticate using SASL (unknown error)
2015-12-24 17:33:43,884 WARN org.apache.zookeeper.ClientCnxn: Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
        at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
2015-12-24 17:33:43,905 INFO org.apache.zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper service, session 0x451703dcdf7d107 has expired, closing socket connection
2015-12-24 17:33:43,985 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server 10.13.7.33/10.13.7.33:2181. Will not attempt to authenticate using SASL (unknown error)
2015-12-24 17:33:43,985 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to 10.13.7.33/10.13.7.33:2181, initiating session
2015-12-24 17:33:43,985 INFO org.apache.zookeeper.ClientCnxn: Unable to read additional data from server sessionid 0x0, likely server has closed socket, closing socket connection and attempting reconnect
2015-12-24 17:33:44,712 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server 10.13.8.24/10.13.8.24:2181. Will not attempt to authenticate using SASL (unknown error)
2015-12-24 17:33:44,712 WARN org.apache.zookeeper.ClientCnxn: Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
        at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
2015-12-24 17:33:45,806 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server 10.13.8.26/10.13.8.26:2181. Will not attempt to authenticate using SASL (unknown error)
2015-12-24 17:33:45,807 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to 10.13.8.26/10.13.8.26:2181, initiating session
2015-12-24 17:33:45,807 INFO org.apache.zookeeper.ClientCnxn: Unable to read additional data from server sessionid 0x0, likely server has closed socket, closing socket connection and attempting reconnect
2015-12-24 17:33:46,549 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server 10.13.8.27/10.13.8.27:2181. Will not attempt to authenticate using SASL (unknown error)
2015-12-24 17:33:46,550 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to 10.13.8.27/10.13.8.27:2181, initiating session
2015-12-24 17:33:46,561 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server 10.13.8.27/10.13.8.27:2181, sessionid = 0x451d35639b5002a, negotiated timeout = 30000
2015-12-24 17:33:46,563 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down
2015-12-24 17:33:46,564 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session connected.
2015-12-24 17:33:46,573 INFO org.apache.hadoop.ha.ZKFailoverController: ZK Election indicated that NameNode at qihe2192/10.12.2.192:9000 should become standby
2015-12-24 17:33:46,575 INFO org.apache.hadoop.ha.ZKFailoverController: Successfully transitioned NameNode at qihe2192/10.12.2.192:9000 to standby state
2015-12-24 17:47:21,517 WARN org.apache.hadoop.ha.HealthMonitor: Transport-level exception trying to monitor health of NameNode at qihe2192/10.12.2.192:9000: java.io.IOException: Connection reset by peer Failed on local exception: java.io.IOException: Connection reset by peer; Host Details : local host is: "qihe2192/10.12.2.192"; destination host is: "qihe2192":9000;
{code}
{code}
2015-12-24 17:33:44,860 INFO org.apache.zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper service, session 0x551703eef8b00c2 has expired, closing socket connection
2015-12-24 17:33:44,861 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session expired. Entering neutral mode and rejoining...
2015-12-24 17:33:44,861 INFO org.apache.hadoop.ha.ActiveStandbyElector: Trying to re-establish ZK session
2015-12-24 17:33:44,862 INFO org.apache.zookeeper.ZooKeeper: Initiating client connection, connectString=10.13.8.24:2181,10.13.8.25:2181,10.13.8.26:2181,10.13.8.27:2181,10.13.7.33:2181 sessionTimeout=30000 watcher=org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef@5eefe70b
2015-12-24 17:33:44,863 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server 10.13.8.27/10.13.8.27:2181. Will not attempt to authenticate using SASL (unknown error)
2015-12-24 17:33:44,863 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to 10.13.8.27/10.13.8.27:2181, initiating session
2015-12-24 17:33:44,871 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server 10.13.8.27/10.13.8.27:2181, sessionid = 0x451d35639b50012, negotiated timeout = 30000
2015-12-24 17:33:44,873 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down
2015-12-24 17:33:44,874 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session connected.
2015-12-24 17:33:44,892 INFO org.apache.hadoop.ha.ZKFailoverController: ZK Election indicated that NameNode at qihe2182/10.12.2.182:9000 should become standby
2015-12-24 17:33:44,928 INFO org.apache.hadoop.ha.ZKFailoverController: Successfully transitioned NameNode at qihe2182/10.12.2.182:9000 to standby state
2015-12-24 17:47:20,883 WARN org.apache.hadoop.ha.HealthMonitor: Transport-level exception trying to monitor health of NameNode at qihe2182/10.12.2.182:9000: java.io.IOException: Connection reset by peer Failed on local exception: java.io.IOException: Connection reset by peer; Host Details : local host is: "qihe2182/10.12.2.182"; destination host is: "qihe2182":9000;
2015-12-24 17:47:20,883 INFO org.apache.hadoop.ha.HealthMonitor: Entering state SERVICE_NOT_RESPONDING
{code}
At {{2015-12-24 17:33}}, both NameNodes were transitioned to standby state.

> Loss of zookeeper quorum lead all the namenode to be standby state
> ------------------------------------------------------------------
>
>                 Key: HADOOP-12680
>                 URL: https://issues.apache.org/jira/browse/HADOOP-12680
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: ha
>    Affects Versions: 2.7.1
>            Reporter: Lin Yiqun
>
> While upgrading my ZooKeeper cluster, which involved changing the IP addresses of the ZK nodes, I found that both NameNodes of my Hadoop cluster lost their connection to ZK. When I recovered the ZK cluster, both NameNodes were transitioned to standby state, which left the cluster unable to provide service. I think the reason may be the following:
> {code}
> /**
>      * If the elector gets disconnected from Zookeeper and does not know about
>      * the lock state, then it will notify the service via the enterNeutralMode
>      * interface. The service may choose to ignore this or stop doing state
>      * changing operations. Upon reconnection, the elector verifies the leader
>      * status and calls back on the becomeActive and becomeStandby app
>      * interfaces. <br/>
>      * Zookeeper disconnects can happen due to network issues or loss of
>      * Zookeeper quorum. Thus enterNeutralMode can be used to guard against
>      * split-brain issues. In such situations it might be prudent to call
>      * becomeStandby too. However, such state change operations might be
>      * expensive and enterNeutralMode can help guard against doing that for
>      * transient issues.
>      */
>     void enterNeutralMode();
> {code}
> Maybe we should create a thread to monitor the state of the NameNodes and prevent them from all being in standby state at the same time.
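
The monitoring idea in the description above could look roughly like the sketch below. This is only an illustration under stated assumptions, not Hadoop code: the class {{AllStandbyMonitor}}, its {{State}} enum, the state probe, and the alarm callback are all hypothetical names, and the sketch only detects the "no active NameNode" condition. It deliberately does not promote a node on its own, since doing so without going through the elector and fencing could cause the split-brain that {{enterNeutralMode}} is meant to guard against.

```java
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

/**
 * Hypothetical sketch: polls the HA state of every NameNode and raises an
 * alarm once all of them have reported STANDBY for several polls in a row.
 * None of these names exist in Hadoop; they are assumptions for illustration.
 */
final class AllStandbyMonitor {
    /** Simplified HA states (an assumption, not Hadoop's own types). */
    enum State { ACTIVE, STANDBY, UNKNOWN }

    private final Supplier<List<State>> stateProbe; // how states are fetched is left open
    private final int threshold;                    // consecutive all-standby polls before alarming
    private final Runnable onAllStandby;            // e.g. log loudly or page an operator
    private int consecutive = 0;

    public AllStandbyMonitor(Supplier<List<State>> stateProbe, int threshold, Runnable onAllStandby) {
        this.stateProbe = stateProbe;
        this.threshold = threshold;
        this.onAllStandby = onAllStandby;
    }

    /** True only if there is at least one node and every node reports STANDBY. */
    public static boolean allStandby(List<State> states) {
        return !states.isEmpty() && states.stream().allMatch(s -> s == State.STANDBY);
    }

    /** One polling step. Returns true if this step fired the alarm. */
    public synchronized boolean checkOnce() {
        if (allStandby(stateProbe.get())) {
            consecutive++;
            if (consecutive == threshold) { // fire once per all-standby episode
                onAllStandby.run();
                return true;
            }
        } else {
            consecutive = 0; // an ACTIVE or UNKNOWN node ends the episode
        }
        return false;
    }

    /** Runs checkOnce() periodically on a daemon thread; caller shuts the executor down. */
    public ScheduledExecutorService start(long periodMillis) {
        ScheduledExecutorService ses = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "all-standby-monitor");
            t.setDaemon(true);
            return t;
        });
        ses.scheduleAtFixedRate(this::checkOnce, periodMillis, periodMillis, TimeUnit.MILLISECONDS);
        return ses;
    }

    public static void main(String[] args) {
        // Simulated poll results: one healthy sample, then both nodes stuck in standby.
        List<List<State>> samples = List.of(
            List.of(State.ACTIVE, State.STANDBY),
            List.of(State.STANDBY, State.STANDBY),
            List.of(State.STANDBY, State.STANDBY),
            List.of(State.STANDBY, State.STANDBY));
        Iterator<List<State>> it = samples.iterator();
        AllStandbyMonitor monitor = new AllStandbyMonitor(
            it::next, 2, () -> System.out.println("ALERT: no active NameNode"));
        for (int i = 0; i < samples.size(); i++) {
            System.out.println("poll " + i + " fired=" + monitor.checkOnce());
        }
    }
}
```

Requiring several consecutive all-standby polls is a design choice to avoid alarming during a normal failover, when both nodes can briefly be in standby. What the alarm should trigger (for example, making the elector rejoin the election) is left open here.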
