Posted to hdfs-dev@hadoop.apache.org by "He Xiaoqiao (JIRA)" <ji...@apache.org> on 2018/07/23 14:14:00 UTC
[jira] [Created] (HDFS-13760) improve ZKFC fencing action when network of ZKFC interrupt

He Xiaoqiao created HDFS-13760:
----------------------------------

             Summary: improve ZKFC fencing action when network of ZKFC interrupt
                 Key: HDFS-13760
                 URL: https://issues.apache.org/jira/browse/HDFS-13760
             Project: Hadoop HDFS
          Issue Type: Improvement
          Components: ha
            Reporter: He Xiaoqiao


When the host running the Active NameNode and its ZKFC suffers a prolonged network fault, HDFS becomes unavailable: the ZKFC on the Standby NameNode host can never complete ssh fencing, because it cannot ssh to the Active NameNode host. In this situation, clients cannot connect to the Active NameNode and fail over to the Standby, but the Standby cannot serve READ/WRITE requests.
{code:xml}
2018-07-23 15:57:10,836 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: rz-data-hdp-nn14.rz.sankuai.com/10.16.70.34:8060. Already tried 40 time(s); maxRetries=45
2018-07-23 15:57:30,856 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: rz-data-hdp-nn14.rz.sankuai.com/10.16.70.34:8060. Already tried 41 time(s); maxRetries=45
2018-07-23 15:57:50,872 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: rz-data-hdp-nn14.rz.sankuai.com/10.16.70.34:8060. Already tried 42 time(s); maxRetries=45
2018-07-23 15:58:10,892 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: rz-data-hdp-nn14.rz.sankuai.com/10.16.70.34:8060. Already tried 43 time(s); maxRetries=45
2018-07-23 15:58:30,912 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: rz-data-hdp-nn14.rz.sankuai.com/10.16.70.34:8060. Already tried 44 time(s); maxRetries=45
2018-07-23 15:58:50,933 INFO org.apache.hadoop.ha.ZKFailoverController: get old active state exception: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending local=/ip:port remote=hostname]
2018-07-23 15:58:50,933 INFO org.apache.hadoop.ha.ActiveStandbyElector: old active is not healthy. need to create znode
2018-07-23 15:58:50,933 INFO org.apache.hadoop.ha.ActiveStandbyElector: Elector callbacks for NameNode at standbynn start create node, now time: 45179010079342817
2018-07-23 15:58:50,936 INFO org.apache.hadoop.ha.ActiveStandbyElector: CreateNode result: 0 code:OK for path: /hadoop-ha/ns/ActiveStandbyElectorLock connectionState: CONNECTED  for elector id=469098346 appData=0a07727a2d6e6e313312046e6e31331a1f727a2d646174612d6864702d6e6e31332e727a2e73616e6b7561692e636f6d20e83e28d33e cb=Elector callbacks for NameNode at standbynamenode
2018-07-23 15:58:50,936 INFO org.apache.hadoop.ha.ActiveStandbyElector: Checking for any old active which needs to be fenced...
2018-07-23 15:58:50,938 INFO org.apache.hadoop.ha.ActiveStandbyElector: Old node exists: 0a07727a2d6e6e313312046e6e31341a1f727a2d646174612d6864702d6e6e31342e727a2e73616e6b7561692e636f6d20e83e28d33e
2018-07-23 15:58:50,939 INFO org.apache.hadoop.ha.ZKFailoverController: Should fence: NameNode at activenamenode
2018-07-23 15:59:10,960 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: activenamenode. Already tried 0 time(s); maxRetries=1
2018-07-23 15:59:30,980 WARN org.apache.hadoop.ha.FailoverController: Unable to gracefully make NameNode at activenamenode standby (unable to connect)
org.apache.hadoop.net.ConnectTimeoutException: Call From standbynamenode to activenamenode failed on socket timeout exception: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending local=ip:port remote=activenamenode]; For more details see:  http://wiki.apache.org/hadoop/SocketTimeout
{code}
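
To put rough numbers on the log above: the graceful-fence attempt with maxRetries=1 starts at 15:58:50 and fails at 15:59:30, i.e. two connect attempts of 20000 ms each. The sketch below generalizes that to "one initial attempt plus maxRetries retries, each running to the full connect timeout". This is an assumption read off the timestamps, not a statement about how org.apache.hadoop.ipc.Client is implemented, and the class and method names are purely illustrative.

```java
import java.time.Duration;

// Illustrative only: estimates how long a caller is blocked while connect
// attempts time out, based on the pattern visible in the log timestamps.
public class ConnectRetryBudget {

    /**
     * Worst-case blocking time, ASSUMING one initial attempt plus maxRetries
     * retries, each lasting the full connect timeout, with no extra sleep.
     */
    public static Duration worstCase(int maxRetries, Duration connectTimeout) {
        return connectTimeout.multipliedBy(maxRetries + 1L);
    }

    public static void main(String[] args) {
        Duration timeout = Duration.ofMillis(20000); // "20000 millis timeout" in the log

        // maxRetries=1: matches the 15:58:50 -> 15:59:30 window in the log.
        System.out.println(worstCase(1, timeout).getSeconds());  // 40

        // maxRetries=45: the same pattern would take a little over 15 minutes.
        System.out.println(worstCase(45, timeout).getSeconds()); // 920
    }
}
```

Under that assumption, the standby ZKFC spends roughly 15 minutes on connect retries before it even reaches the fencing step.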

I propose that when the Active NameNode host suffers a network fault, its local ZKFC force that NameNode to become Standby, so that the other ZKFC can hold the ZNode for the election and transition the other NameNode to Active even if ssh fencing fails.
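
A minimal sketch of the decision logic this proposal implies, to make the discussion concrete. Everything here is hypothetical: the class, the enum, and the method names do not exist in Hadoop, and the grace period (how long the remote ZKFC waits so that the isolated side has had time to demote itself) is my own assumption rather than part of the proposal text.

```java
// Hypothetical sketch of the proposed behaviour; none of these names are Hadoop APIs.
public class ZkfcNetworkFaultSketch {

    /** Action a ZKFC would take under the proposal. */
    public enum Action { STAY_ACTIVE, SELF_DEMOTE_TO_STANDBY, BECOME_ACTIVE, KEEP_WAITING }

    /**
     * ZKFC co-located with the Active NameNode: a lost ZooKeeper session is
     * treated as evidence of a network fault, so the local NameNode is
     * demoted without waiting to be fenced from the other host.
     */
    public static Action onActiveSide(boolean zkSessionLost) {
        return zkSessionLost ? Action.SELF_DEMOTE_TO_STANDBY : Action.STAY_ACTIVE;
    }

    /**
     * ZKFC on the other host, after it has acquired the lock znode.
     * If ssh fencing fails, it proceeds anyway once the old active has been
     * unreachable for at least graceMs (ASSUMED safety margin, not in the proposal).
     */
    public static Action onStandbySide(boolean sshFenceSucceeded,
                                       long unreachableMs, long graceMs) {
        if (sshFenceSucceeded) {
            return Action.BECOME_ACTIVE;
        }
        return unreachableMs >= graceMs ? Action.BECOME_ACTIVE : Action.KEEP_WAITING;
    }

    public static void main(String[] args) {
        System.out.println(onActiveSide(true));
        System.out.println(onStandbySide(false, 60_000, 30_000));
        System.out.println(onStandbySide(false, 10_000, 30_000));
    }
}
```

The open question in such a design is how the standby side can be confident that the self-demotion actually happened, since by definition it cannot reach the isolated host to check.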

There is no patch available yet; suggestions are very welcome.



