Posted to hdfs-dev@hadoop.apache.org by "Zesheng Wu (JIRA)" <ji...@apache.org> on 2014/08/06 07:37:12 UTC

[jira] [Created] (HDFS-6827) NameNode double standby

Zesheng Wu created HDFS-6827:
--------------------------------

             Summary: NameNode double standby
                 Key: HDFS-6827
                 URL: https://issues.apache.org/jira/browse/HDFS-6827
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: ha
    Affects Versions: 2.4.1
            Reporter: Zesheng Wu
            Assignee: Zesheng Wu


In our production cluster we encountered the following scenario: the active NameNode (ANN) crashed due to a journal write timeout and was restarted automatically by the watchdog, but after the restart both NNs were in standby state.

The logs for this scenario follow:
# NN1 went down due to a journal write timeout:
{color:red}2014-08-03,23:02:02,219{color} INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG
# ZKFC1 detected "connection reset by peer"
{color:red}2014-08-03,23:02:02,560{color} ERROR org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:xx@xx.HADOOP (auth:KERBEROS) cause:java.io.IOException: {color:red}Connection reset by peer{color}
# NN1 was restarted successfully by the watchdog:
2014-08-03,23:02:07,884 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: Web-server up at: xx:13201
2014-08-03,23:02:07,884 INFO org.apache.hadoop.ipc.Server: IPC Server Responder: starting
{color:red}2014-08-03,23:02:07,884{color} INFO org.apache.hadoop.ipc.Server: IPC Server listener on 13200: starting
2014-08-03,23:02:08,742 INFO org.apache.hadoop.ipc.Server: RPC server clean thread started!
2014-08-03,23:02:08,743 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: Registered DFSClientInformation MBean
2014-08-03,23:02:08,744 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: NameNode up at: xx/xx:13200
2014-08-03,23:02:08,744 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Starting services required for standby state
# ZKFC1 retried the connection and considered NN1 healthy:
{color:red}2014-08-03,23:02:08,292{color} INFO org.apache.hadoop.ipc.Client: Retrying connect to server: xx/xx:13200. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1 SECONDS)
# ZKFC1 still considered NN1 a healthy active NN and did not trigger a failover. As a result, both NNs were standby.

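The root cause of this bug is that the NN was restarted so quickly that the ZKFC health monitor never registered the outage: its connection retry succeeded against the restarted process, which had already come back up in standby state.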
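
To make the race easier to reason about, here is a minimal Python sketch of the timeline. It is only a model, not ZKFC code: the class NameNode and the functions monitor_declares_failure and final_states are hypothetical names, and the rule "a failure is declared only if every probe fails" is a simplifying assumption about the health monitor. The times are seconds relative to the SHUTDOWN_MSG in the logs above (listener back at +5.665s, ZKFC probes at +0.341s and +6.073s).

```python
from dataclasses import dataclass


@dataclass
class NameNode:
    """Hypothetical model of one NN; times are seconds after the crash."""
    down_at: float  # process exits here
    up_at: float    # IPC listener accepts connections again here

    def reachable(self, t: float) -> bool:
        # The NN refuses connections only between crash and restart.
        return not (self.down_at <= t < self.up_at)

    def ha_state(self, t: float) -> str:
        # Per the logs, the restarted NN starts services for standby state.
        return "active" if t < self.down_at else "standby"


def monitor_declares_failure(nn: NameNode, probe_times: list[float]) -> bool:
    # Assumption: the monitor reports a failure only if the initial
    # attempt and every retry fail to connect.
    return not any(nn.reachable(t) for t in probe_times)


def final_states(nn1: NameNode, probe_times: list[float], end: float):
    # NN2 starts as standby and is promoted only if a failover is triggered.
    nn2_state = "standby"
    if monitor_declares_failure(nn1, probe_times):
        nn2_state = "active"
    return nn1.ha_state(end), nn2_state


probes = [0.341, 6.073]  # connection reset, then the retry, from the logs

# Fast restart as observed: the retry lands after the listener is back.
print(final_states(NameNode(0.0, 5.665), probes, end=10.0))

# Hypothetical slow restart: both probes fail, so a failover happens.
print(final_states(NameNode(0.0, 30.0), probes, end=40.0))
```

In this model the fast restart leaves both NNs in standby, while the slow restart lets the other NN become active. The difference is entirely whether the restart completes before the retry, which matches the sequence in the logs.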


