Posted to hdfs-dev@hadoop.apache.org by "Vinayakumar B (JIRA)" <ji...@apache.org> on 2014/11/27 10:53:12 UTC

[jira] [Reopened] (HDFS-7451) Namenode HA failover happens very frequently from active to standby

     [ https://issues.apache.org/jira/browse/HDFS-7451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinayakumar B reopened HDFS-7451:
---------------------------------

> Namenode HA failover happens very frequently from active to standby
> -------------------------------------------------------------------
>
>                 Key: HDFS-7451
>                 URL: https://issues.apache.org/jira/browse/HDFS-7451
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: LAXMAN KUMAR SAHOO
>            Assignee: LAXMAN KUMAR SAHOO
>
> We have two NameNodes with HA enabled. For the last couple of days we have been observing that failover from active to standby happens very frequently. Below are the logs from the active NameNode at the time the failover happens. Is there any fix to get rid of this issue?
> Namenode logs:
> {code}
> 2014-11-25 22:24:02,020 WARN org.apache.hadoop.ipc.Server: IPC Server Responder, call org.apache.hadoop.hdfs.protocol.ClientProtocol.getListing from 10.2.16.214:40751: output error
> 2014-11-25 22:24:02,020 INFO org.apache.hadoop.ipc.Server: IPC Server handler 23 on 8020 caught an exception
> java.nio.channels.ClosedChannelException
>         at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:265)
>         at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:474)
>         at org.apache.hadoop.ipc.Server.channelWrite(Server.java:2195)
>         at org.apache.hadoop.ipc.Server.access$2000(Server.java:110)
>         at org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:979)
>         at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:1045)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1798)
> 2014-11-25 22:24:10,631 INFO org.apache.hadoop.hdfs.server.namenode.FileJournalManager: Finalizing edits file /sda/dfs/namenode/current/edits_inprogress_0000000001643676954 -> /sda/dfs/namenode/current/edits_0000000001643676954-0000000001643677390
> 2014-11-25 22:24:10,631 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Closing
> java.lang.Exception
>         at org.apache.hadoop.hdfs.qjournal.client.IPCLoggerChannel.close(IPCLoggerChannel.java:182)
>         at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.close(AsyncLoggerSet.java:102)
>         at org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.close(QuorumJournalManager.java:446)
>         at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalAndStream.close(JournalSet.java:107)
>         at org.apache.hadoop.hdfs.server.namenode.JournalSet$4.apply(JournalSet.java:222)
>         at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:347)
>         at org.apache.hadoop.hdfs.server.namenode.JournalSet.close(JournalSet.java:219)
>         at org.apache.hadoop.hdfs.server.namenode.FSEditLog.close(FSEditLog.java:308)
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.stopActiveServices(FSNamesystem.java:939)
>         at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.stopActiveServices(NameNode.java:1365)
>         at org.apache.hadoop.hdfs.server.namenode.ha.ActiveState.exitState(ActiveState.java:70)
>         at org.apache.hadoop.hdfs.server.namenode.ha.HAState.setStateInternal(HAState.java:61)
>         at org.apache.hadoop.hdfs.server.namenode.ha.ActiveState.setState(ActiveState.java:52)
>         at org.apache.hadoop.hdfs.server.namenode.NameNode.transitionToStandby(NameNode.java:1278)
>         at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.transitionToStandby(NameNodeRpcServer.java:1046)
>         at org.apache.hadoop.ha.protocolPB.HAServiceProtocolServerSideTranslatorPB.transitionToStandby(HAServiceProtocolServerSideTranslatorPB.java:119)
>         at org.apache.hadoop.ha.proto.HAServiceProtocolProtos$HAServiceProtocolService$2.callBlockingMethod(HAServiceProtocolProtos.java:3635)
>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:453)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1002)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1752)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1748)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:415)
> 2014-11-25 22:24:10,632 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Starting services required for standby state
> 2014-11-25 22:24:10,633 INFO org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Will roll logs on active node at dc1-had03-m002.dc01.revsci.net/10.2.16.92:8020 every 120 seconds.
> 2014-11-25 22:24:10,634 INFO org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer: Starting standby checkpoint thread...
> Checkpointing active NN at dc1-had03-m002.dc01.revsci.net:50070
> Serving checkpoints at dc1-had03-m001.dc01.revsci.net/10.2.16.91:50070
> {code}
> zkfc logs:
> {code}
> 2014-11-25 22:24:12,192 INFO org.apache.zookeeper.ClientCnxn: Unable to read additional data from server sessionid 0x449b8ce9a110255, likely server has closed socket, closing socket connection and attempting reconnect
> 2014-11-25 22:24:12,293 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session disconnected. Entering neutral mode...
> 2014-11-25 22:24:12,950 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server dc1-had03-zook06.dc01.revsci.net/10.2.16.205:2181. Will not attempt to authenticate using SASL (unknown error)
> 2014-11-25 22:24:12,951 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to dc1-had03-zook06.dc01.revsci.net/10.2.16.205:2181, initiating session
> 2014-11-25 22:24:12,952 INFO org.apache.zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper service, session 0x449b8ce9a110255 has expired, closing socket connection
> 2014-11-25 22:24:12,952 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session expired. Entering neutral mode and rejoining...
> 2014-11-25 22:24:12,952 INFO org.apache.hadoop.ha.ActiveStandbyElector: Trying to re-establish ZK session
> 2014-11-25 22:24:12,952 INFO org.apache.zookeeper.ZooKeeper: Initiating client connection, connectString=dc1-had03-zook01.dc01.revsci.net:2181,dc1-had03-zook02.dc01.revsci.net:2181,dc1-had03-zook03.dc01.revsci.net:2181,dc1-had03-zook04.dc01.revsci.net:2181,dc1-had03-zook05.dc01.revsci.net:2181,dc1-had03-zook06.dc01.revsci.net:2181 sessionTimeout=5000 watcher=org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef@7f042529
> 2014-11-25 22:24:12,954 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server dc1-had03-zook01.dc01.revsci.net/10.2.16.200:2181. Will not attempt to authenticate using SASL (unknown error)
> 2014-11-25 22:24:12,954 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to dc1-had03-zook01.dc01.revsci.net/10.2.16.200:2181, initiating session
> 2014-11-25 22:24:13,172 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server dc1-had03-zook01.dc01.revsci.net/10.2.16.200:2181, sessionid = 0x149a7f9b6d60263, negotiated timeout = 5000
> 2014-11-25 22:24:13,173 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session connected.
> 2014-11-25 22:24:13,173 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x449b8ce9a110255
> 2014-11-25 22:24:13,173 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down
> 2014-11-25 22:24:13,389 INFO org.apache.hadoop.ha.ZKFailoverController: ZK Election indicated that NameNode at dc1-had03-m001.dc01.revsci.net/10.2.16.91:8020 should become standby
> 2014-11-25 22:24:13,391 INFO org.apache.hadoop.ha.ZKFailoverController: Successfully transitioned NameNode at dc1-had03-m001.dc01.revsci.net/10.2.16.91:8020 to standby state
> {code}
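> The zkfc log above shows the ZooKeeper session expiring (negotiated timeout = 5000 ms), which is what makes ZKFC move this NameNode to standby. For reference only (an illustrative sketch, not a confirmed fix for this issue), that timeout corresponds to ha.zookeeper.session-timeout.ms in core-site.xml; the snippet below just spells out the value already visible in the log:
> {code}
> <!-- core-site.xml (sketch): ZKFC ZooKeeper session timeout.
>      5000 ms matches the negotiated timeout seen in the zkfc log above;
>      a larger value would tolerate longer NameNode GC pauses or ZooKeeper
>      hiccups before the session expires and a failover is triggered. -->
> <property>
>   <name>ha.zookeeper.session-timeout.ms</name>
>   <value>5000</value>
> </property>
> {code}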
> -laxman


