Posted to hdfs-dev@hadoop.apache.org by "LAXMAN KUMAR SAHOO (JIRA)" <ji...@apache.org> on 2014/11/27 08:44:12 UTC

[jira] [Created] (HDFS-7451) Namenode HA failover happens very frequently from active to standby

LAXMAN KUMAR SAHOO created HDFS-7451:
----------------------------------------

             Summary: Namenode HA failover happens very frequently from active to standby
                 Key: HDFS-7451
                 URL: https://issues.apache.org/jira/browse/HDFS-7451
             Project: Hadoop HDFS
          Issue Type: Bug
            Reporter: LAXMAN KUMAR SAHOO


We have two NameNodes with HA enabled. For the last couple of days we have been observing very frequent failovers from active to standby. Below are the log details from the active NameNode at the time of a failover. Is there a fix for this issue?

Namenode logs:

{code}
2014-11-25 22:24:02,020 WARN org.apache.hadoop.ipc.Server: IPC Server Responder, call org.apache.hadoop.hdfs.protocol.ClientProtocol.getListing from 10.2.16.214:40751: output error
2014-11-25 22:24:02,020 INFO org.apache.hadoop.ipc.Server: IPC Server handler 23 on 8020 caught an exception
java.nio.channels.ClosedChannelException
        at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:265)
        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:474)
        at org.apache.hadoop.ipc.Server.channelWrite(Server.java:2195)
        at org.apache.hadoop.ipc.Server.access$2000(Server.java:110)
        at org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:979)
        at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:1045)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1798)

2014-11-25 22:24:10,631 INFO org.apache.hadoop.hdfs.server.namenode.FileJournalManager: Finalizing edits file /sda/dfs/namenode/current/edits_inprogress_0000000001643676954 -> /sda/dfs/namenode/current/edits_0000000001643676954-0000000001643677390
2014-11-25 22:24:10,631 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Closing
java.lang.Exception
        at org.apache.hadoop.hdfs.qjournal.client.IPCLoggerChannel.close(IPCLoggerChannel.java:182)
        at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.close(AsyncLoggerSet.java:102)
        at org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.close(QuorumJournalManager.java:446)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalAndStream.close(JournalSet.java:107)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet$4.apply(JournalSet.java:222)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:347)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet.close(JournalSet.java:219)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLog.close(FSEditLog.java:308)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.stopActiveServices(FSNamesystem.java:939)
        at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.stopActiveServices(NameNode.java:1365)
        at org.apache.hadoop.hdfs.server.namenode.ha.ActiveState.exitState(ActiveState.java:70)
        at org.apache.hadoop.hdfs.server.namenode.ha.HAState.setStateInternal(HAState.java:61)
        at org.apache.hadoop.hdfs.server.namenode.ha.ActiveState.setState(ActiveState.java:52)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.transitionToStandby(NameNode.java:1278)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.transitionToStandby(NameNodeRpcServer.java:1046)
        at org.apache.hadoop.ha.protocolPB.HAServiceProtocolServerSideTranslatorPB.transitionToStandby(HAServiceProtocolServerSideTranslatorPB.java:119)
        at org.apache.hadoop.ha.proto.HAServiceProtocolProtos$HAServiceProtocolService$2.callBlockingMethod(HAServiceProtocolProtos.java:3635)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:453)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1002)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1752)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1748)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)


2014-11-25 22:24:10,632 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Starting services required for standby state
2014-11-25 22:24:10,633 INFO org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Will roll logs on active node at dc1-had03-m002.dc01.revsci.net/10.2.16.92:8020 every 120 seconds.
2014-11-25 22:24:10,634 INFO org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer: Starting standby checkpoint thread...
Checkpointing active NN at dc1-had03-m002.dc01.revsci.net:50070
Serving checkpoints at dc1-had03-m001.dc01.revsci.net/10.2.16.91:50070
{code}

zkfc logs:
{code}
2014-11-25 22:24:12,192 INFO org.apache.zookeeper.ClientCnxn: Unable to read additional data from server sessionid 0x449b8ce9a110255, likely server has closed socket, closing socket connection and attempting reconnect
2014-11-25 22:24:12,293 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session disconnected. Entering neutral mode...
2014-11-25 22:24:12,950 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server dc1-had03-zook06.dc01.revsci.net/10.2.16.205:2181. Will not attempt to authenticate using SASL (unknown error)
2014-11-25 22:24:12,951 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to dc1-had03-zook06.dc01.revsci.net/10.2.16.205:2181, initiating session
2014-11-25 22:24:12,952 INFO org.apache.zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper service, session 0x449b8ce9a110255 has expired, closing socket connection
2014-11-25 22:24:12,952 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session expired. Entering neutral mode and rejoining...
2014-11-25 22:24:12,952 INFO org.apache.hadoop.ha.ActiveStandbyElector: Trying to re-establish ZK session
2014-11-25 22:24:12,952 INFO org.apache.zookeeper.ZooKeeper: Initiating client connection, connectString=dc1-had03-zook01.dc01.revsci.net:2181,dc1-had03-zook02.dc01.revsci.net:2181,dc1-had03-zook03.dc01.revsci.net:2181,dc1-had03-zook04.dc01.revsci.net:2181,dc1-had03-zook05.dc01.revsci.net:2181,dc1-had03-zook06.dc01.revsci.net:2181 sessionTimeout=5000 watcher=org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef@7f042529
2014-11-25 22:24:12,954 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server dc1-had03-zook01.dc01.revsci.net/10.2.16.200:2181. Will not attempt to authenticate using SASL (unknown error)
2014-11-25 22:24:12,954 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to dc1-had03-zook01.dc01.revsci.net/10.2.16.200:2181, initiating session
2014-11-25 22:24:13,172 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server dc1-had03-zook01.dc01.revsci.net/10.2.16.200:2181, sessionid = 0x149a7f9b6d60263, negotiated timeout = 5000
2014-11-25 22:24:13,173 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session connected.
2014-11-25 22:24:13,173 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x449b8ce9a110255
2014-11-25 22:24:13,173 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down
2014-11-25 22:24:13,389 INFO org.apache.hadoop.ha.ZKFailoverController: ZK Election indicated that NameNode at dc1-had03-m001.dc01.revsci.net/10.2.16.91:8020 should become standby
2014-11-25 22:24:13,391 INFO org.apache.hadoop.ha.ZKFailoverController: Successfully transitioned NameNode at dc1-had03-m001.dc01.revsci.net/10.2.16.91:8020 to standby state
{code}
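
To see how often this happens, the zkfc log can be scanned for the two messages shown above ("Session expired" from ActiveStandbyElector and "to standby state" from ZKFailoverController). The following is only a minimal sketch: the function name `find_session_events` and the event labels are made up for illustration, and it assumes the log lines use the same timestamp/level/logger layout as the excerpt above.

```python
import re

# Assumed line layout, taken from the excerpt above:
#   "2014-11-25 22:24:12,952 INFO <logger class>: <message>"
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3} (\w+) (\S+): (.*)$"
)


def find_session_events(lines):
    """Return (timestamp, kind) tuples for ZK session expiries and
    transitions to standby found in zkfc log lines (hypothetical helper)."""
    events = []
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # stack-trace or continuation line, no timestamp prefix
        ts, _level, logger, msg = m.groups()
        if logger.endswith("ActiveStandbyElector") and msg.startswith("Session expired"):
            events.append((ts, "session_expired"))
        elif logger.endswith("ZKFailoverController") and "to standby state" in msg:
            events.append((ts, "became_standby"))
    return events


# Two lines copied from the zkfc log above.
sample = [
    "2014-11-25 22:24:12,952 INFO org.apache.hadoop.ha.ActiveStandbyElector: "
    "Session expired. Entering neutral mode and rejoining...",
    "2014-11-25 22:24:13,391 INFO org.apache.hadoop.ha.ZKFailoverController: "
    "Successfully transitioned NameNode at "
    "dc1-had03-m001.dc01.revsci.net/10.2.16.91:8020 to standby state",
]

for ts, kind in find_session_events(sample):
    print(ts, kind)
```

Running this over a full day of zkfc logs would give the times of each session expiry and each transition to standby, which could then be compared against the NameNode log timestamps.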

-laxman


