Posted to user@hadoop.apache.org by Alexandr Porunov <al...@gmail.com> on 2016/07/19 13:52:33 UTC

ZKFC does not work in Hadoop HA

Hello,

I have a problem with ZKFC.
I have configured High Availability for Hadoop with QJM.
The problem is that when I turn off the active master node (or kill the
NameNode process), the standby node does not change its status from
standby to active. It just remains the standby node.

I watched the ZKFC log file when I turned off the active node. ZKFC
started trying to connect to the active node (which had already died) to
change its status from active to standby.
But since the active node was dead, connecting to it was impossible.
Then I turned the active master node back on. After that, my standby node
connected to the old active master node, changed the old active node's
status from active to standby, and changed the standby node's status from
standby to active.

It is really strange: after the active node crashes, ZKFC keeps trying to
connect to the dead node, and until a connection is established it refuses
to switch the standby node to active.

Why does this happen?

Here is my ZKFC log (truncated, because it repeats; after this part the
logger keeps writing the same thing):

2016-07-19 14:43:21,943 INFO org.apache.hadoop.ha.ActiveStandbyElector:
Checking for any old active which needs to be fenced...
2016-07-19 14:43:21,957 INFO org.apache.hadoop.ha.ActiveStandbyElector: Old
node exists: 0a0a68612d636c757374657212036e6e311a12686164
6f6f704163746976654d617374657220d43e28d33e
2016-07-19 14:43:21,978 INFO org.apache.hadoop.ha.ZKFailoverController:
Should fence: NameNode at hadoopActiveMaster/192.168.0.80:8020
2016-07-19 14:43:22,995 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: hadoopActiveMaster/192.168.0.80:8020. Already tried 0 time(s);
retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=1,
sleepTime=1000 MILLISECONDS)
2016-07-19 14:43:23,001 WARN org.apache.hadoop.ha.FailoverController:
Unable to gracefully make NameNode at hadoopActiveMaster/192.168.0.80:8020
standby (unable to connect)
java.net.ConnectException: Call From hadoopStandby/192.168.0.81 to
hadoopActiveMaster:8020 failed on connection exception:
java.net.ConnectException: Connection refused; For more details see:
http://wiki.apache.org/hadoop/ConnectionRefused
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
Method)
        at
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
        at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:792)
        at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:732)
        at org.apache.hadoop.ipc.Client.call(Client.java:1479)
        at org.apache.hadoop.ipc.Client.call(Client.java:1412)
        at
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
        at com.sun.proxy.$Proxy9.transitionToStandby(Unknown Source)
        at
org.apache.hadoop.ha.protocolPB.HAServiceProtocolClientSideTranslatorPB.transitionToStandby(HAServiceProtocolClientSideTranslatorPB.java:112)
        at
org.apache.hadoop.ha.FailoverController.tryGracefulFence(FailoverController.java:172)
        at
org.apache.hadoop.ha.ZKFailoverController.doFence(ZKFailoverController.java:514)
        at
org.apache.hadoop.ha.ZKFailoverController.fenceOldActive(ZKFailoverController.java:505)
        at
org.apache.hadoop.ha.ZKFailoverController.access$1100(ZKFailoverController.java:61)
        at
org.apache.hadoop.ha.ZKFailoverController$ElectorCallbacks.fenceOldActive(ZKFailoverController.java:892)
        at
org.apache.hadoop.ha.ActiveStandbyElector.fenceOldActive(ActiveStandbyElector.java:910)
        at
org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:809)
        at
org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:418)
        at
org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599)
        at
org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
Caused by: java.net.ConnectException: Connection refused
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
        at
org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:531)
        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:495)
        at
org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:614)
        at
org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:712)
        at
org.apache.hadoop.ipc.Client$Connection.access$2900(Client.java:375)
        at org.apache.hadoop.ipc.Client.getConnection(Client.java:1528)
        at org.apache.hadoop.ipc.Client.call(Client.java:1451)
        ... 14 more
2016-07-19 14:43:23,007 INFO org.apache.hadoop.ha.NodeFencer: ======
Beginning Service Fencing Process... ======
2016-07-19 14:43:23,007 INFO org.apache.hadoop.ha.NodeFencer: Trying method
1/1: org.apache.hadoop.ha.SshFenceByTcpPort(null)
2016-07-19 14:43:23,064 INFO org.apache.hadoop.ha.SshFenceByTcpPort:
Connecting to hadoopActiveMaster...
2016-07-19 14:43:23,066 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
Connecting to hadoopActiveMaster port 22
2016-07-19 14:43:23,073 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
Connection established
2016-07-19 14:43:23,088 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
Remote version string: SSH-2.0-OpenSSH_6.6.1
2016-07-19 14:43:23,089 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
Local version string: SSH-2.0-JSCH-0.1.42
2016-07-19 14:43:23,089 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
CheckCiphers:
aes256-ctr,aes192-ctr,aes128-ctr,aes256-cbc,aes192-cbc,aes128-cbc,3des-ctr,arcfour,arcfour128,arcfour256
2016-07-19 14:43:23,445 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
aes256-ctr is not available.
2016-07-19 14:43:23,445 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
aes192-ctr is not available.
2016-07-19 14:43:23,445 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
aes256-cbc is not available.
2016-07-19 14:43:23,445 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
aes192-cbc is not available.
2016-07-19 14:43:23,445 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
arcfour256 is not available.
2016-07-19 14:43:23,445 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
SSH_MSG_KEXINIT sent
2016-07-19 14:43:23,446 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
SSH_MSG_KEXINIT received
2016-07-19 14:43:23,446 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
kex: server->client aes128-ctr hmac-md5 none
2016-07-19 14:43:23,446 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
kex: client->server aes128-ctr hmac-md5 none
2016-07-19 14:43:23,478 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
SSH_MSG_KEXDH_INIT sent
2016-07-19 14:43:23,479 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
expecting SSH_MSG_KEXDH_REPLY
2016-07-19 14:43:23,493 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
ssh_rsa_verify: signature true
2016-07-19 14:43:23,495 WARN org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
Permanently added 'hadoopActiveMaster' (RSA) to the list of known hosts.
2016-07-19 14:43:23,495 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
SSH_MSG_NEWKEYS sent
2016-07-19 14:43:23,495 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
SSH_MSG_NEWKEYS received
2016-07-19 14:43:23,519 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
SSH_MSG_SERVICE_REQUEST sent
2016-07-19 14:43:23,519 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
SSH_MSG_SERVICE_ACCEPT received
2016-07-19 14:43:23,524 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
Authentications that can continue:
gssapi-with-mic,publickey,keyboard-interactive,password
2016-07-19 14:43:23,524 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
Next authentication method: gssapi-with-mic
2016-07-19 14:43:23,527 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
Authentications that can continue: publickey,keyboard-interactive,password
2016-07-19 14:43:23,527 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
Next authentication method: publickey
2016-07-19 14:43:23,617 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
Authentication succeeded (publickey).
2016-07-19 14:43:23,624 INFO org.apache.hadoop.ha.SshFenceByTcpPort:
Connected to hadoopActiveMaster
2016-07-19 14:43:23,624 INFO org.apache.hadoop.ha.SshFenceByTcpPort:
Looking for process running on port 8020
2016-07-19 14:43:23,705 WARN org.apache.hadoop.ha.SshFenceByTcpPort:
PATH=$PATH:/sbin:/usr/sbin fuser -v -k -n tcp 8020 via ssh: bash: fuser:
command not found
2016-07-19 14:43:23,706 INFO org.apache.hadoop.ha.SshFenceByTcpPort: rc: 127
2016-07-19 14:43:23,706 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
Disconnecting from hadoopActiveMaster port 22
2016-07-19 14:43:23,717 WARN org.apache.hadoop.ha.NodeFencer: Fencing
method org.apache.hadoop.ha.SshFenceByTcpPort(null) was unsuccessful.
2016-07-19 14:43:23,718 ERROR org.apache.hadoop.ha.NodeFencer: Unable to
fence service by any configured method.
2016-07-19 14:43:23,719 WARN org.apache.hadoop.ha.ActiveStandbyElector:
Exception handling the winning of election
java.lang.RuntimeException: Unable to fence NameNode at hadoopActiveMaster/
192.168.0.80:8020
        at
org.apache.hadoop.ha.ZKFailoverController.doFence(ZKFailoverController.java:533)
        at
org.apache.hadoop.ha.ZKFailoverController.fenceOldActive(ZKFailoverController.java:505)
        at
org.apache.hadoop.ha.ZKFailoverController.access$1100(ZKFailoverController.java:61)
        at
org.apache.hadoop.ha.ZKFailoverController$ElectorCallbacks.fenceOldActive(ZKFailoverController.java:892)
        at
org.apache.hadoop.ha.ActiveStandbyElector.fenceOldActive(ActiveStandbyElector.java:910)
        at
org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:809)
        at
org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:418)
        at
org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599)
        at
org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
2016-07-19 14:43:23,719 INFO org.apache.hadoop.ha.ActiveStandbyElector:
Trying to re-establish ZK session
2016-07-19 14:43:23,725 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
Caught an exception, leaving main loop due to Socket closed
2016-07-19 14:43:23,746 INFO org.apache.zookeeper.ZooKeeper: Session:
0x35602bbb71e0002 closed
2016-07-19 14:43:24,750 INFO org.apache.zookeeper.ZooKeeper: Initiating
client connection,
connectString=hadoopActiveMaster:2181,hadoopStandby:2181,hadoopSlave1:2181
sessionTimeout=5000
watcher=org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef@6a02f3d6
2016-07-19 14:43:24,760 INFO org.apache.zookeeper.ClientCnxn: Opening
socket connection to server hadoopActiveMaster/192.168.0.80:2181. Will not
attempt to authenticate using SASL (unknown error)
2016-07-19 14:43:24,762 INFO org.apache.zookeeper.ClientCnxn: Socket
connection established to hadoopActiveMaster/192.168.0.80:2181, initiating
session
2016-07-19 14:43:24,773 INFO org.apache.zookeeper.ClientCnxn: Session
establishment complete on server hadoopActiveMaster/192.168.0.80:2181,
sessionid = 0x15602bba9e00003, negotiated timeout = 5000
2016-07-19 14:43:24,778 INFO org.apache.zookeeper.ClientCnxn: EventThread
shut down
2016-07-19 14:43:24,782 INFO org.apache.hadoop.ha.ActiveStandbyElector:
Session connected.


Please help me solve this problem with my Hadoop HA configuration.

Sincerely,
Alexandr

Re: ZKFC does not work in Hadoop HA

Posted by Rakesh Radhakrishnan <ra...@apache.org>.
Good to hear the problem is resolved and you are able to continue.

Regards,
Rakesh

On Tue, Jul 19, 2016 at 10:31 PM, Alexandr Porunov <
alexandr.porunov@gmail.com> wrote:

> Rakesh,
>
> Thank you very much. I missed it. I didn't have the "fuser" command on my
> nodes. I've just installed it, and ZKFC now works properly!
>
> Best regards,
> Alexandr
>
> On Tue, Jul 19, 2016 at 5:29 PM, Rakesh Radhakrishnan <ra...@apache.org>
> wrote:
>
>> Hi Alexandr,
>>
>> I can see the following warning message in your logs; it is the reason
>> fencing is unsuccessful. Could you please check that the 'fuser' command
>> works on your system?
>>
>> 2016-07-19 14:43:23,705 WARN org.apache.hadoop.ha.SshFenceByTcpPort:
>> PATH=$PATH:/sbin:/usr/sbin fuser -v -k -n tcp 8020 via ssh: bash: fuser:
>> command not found
>> 2016-07-19 14:43:23,706 INFO org.apache.hadoop.ha.SshFenceByTcpPort: rc:
>> 127
>> 2016-07-19 14:43:23,706 INFO
>> org.apache.hadoop.ha.SshFenceByTcpPort.jsch: Disconnecting from
>> hadoopActiveMaster port 22
>>
>> Also, I'd suggest visiting
>> https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html
>> to learn more about the fencing logic. On that page you can
>> search for the "*dfs.ha.fencing.methods*" configuration.
>>
>> Regards,
>> Rakesh
>>
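An aside on the warning quoted above: the "rc: 127" that SshFenceByTcpPort reports is the shell's "command not found" exit status, i.e. the remote bash could not find `fuser`. This is easy to reproduce; the command name below is made up purely to trigger that status:

```shell
# Exit status 127 is bash's "command not found" code -- the same rc
# SshFenceByTcpPort logs when 'fuser' is absent on the fenced host.
bash -c 'no_such_command_xyz' 2>/dev/null
echo "rc: $?"   # prints: rc: 127
```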

Re: ZKFC does not work in Hadoop HA

Posted by Alexandr Porunov <al...@gmail.com>.
Rakesh,

Thank you very much. I missed it. I didn't have the "fuser" command on my
nodes. I've just installed it, and ZKFC now works properly!

Best regards,
Alexandr
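For anyone hitting the same symptom: sshfence runs `PATH=$PATH:/sbin:/usr/sbin fuser -v -k -n tcp 8020` over ssh, so `fuser` must be present on every NameNode host. A quick check might look like this (on most distributions `fuser` ships in the `psmisc` package, but confirm for yours):

```shell
# Check that the fuser binary sshfence depends on is reachable via the
# PATH the fencer uses. The psmisc package name is an assumption --
# verify it for your distribution before installing.
if PATH="$PATH:/sbin:/usr/sbin" command -v fuser >/dev/null 2>&1; then
    echo "fuser found"
else
    echo "fuser missing: install psmisc (yum install psmisc / apt-get install psmisc)"
fi
```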

On Tue, Jul 19, 2016 at 5:29 PM, Rakesh Radhakrishnan <ra...@apache.org>
wrote:

> Hi Alexandr,
>
> I can see the following warning message in your logs; it is the reason
> fencing is unsuccessful. Could you please check that the 'fuser' command
> works on your system?
>
> 2016-07-19 14:43:23,705 WARN org.apache.hadoop.ha.SshFenceByTcpPort:
> PATH=$PATH:/sbin:/usr/sbin fuser -v -k -n tcp 8020 via ssh: bash: fuser:
> command not found
> 2016-07-19 14:43:23,706 INFO org.apache.hadoop.ha.SshFenceByTcpPort: rc:
> 127
> 2016-07-19 14:43:23,706 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
> Disconnecting from hadoopActiveMaster port 22
>
> Also, I'd suggest visiting
> https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html
> to learn more about the fencing logic. On that page you can
> search for the "*dfs.ha.fencing.methods*" configuration.
>
> Regards,
> Rakesh
>
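The dfs.ha.fencing.methods setting mentioned in the quoted reply goes in hdfs-site.xml. The sketch below follows the Hadoop 2.7 HA documentation; the key path and the shell(/bin/true) fallback are illustrative assumptions (that fallback makes fencing always report success, which is only safe when the old active machine is known to be completely down):

```xml
<!-- Fencing methods are tried in order, one per line. -->
<property>
  <name>dfs.ha.fencing.methods</name>
  <value>sshfence
shell(/bin/true)</value>
</property>
<!-- Private key the ZKFC uses for the sshfence connection
     (the path is an example). -->
<property>
  <name>dfs.ha.fencing.ssh.private-key-files</name>
  <value>/home/hdfs/.ssh/id_rsa</value>
</property>
```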
>> 2016-07-19 14:43:23,445 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
>> aes256-cbc is not available.
>> 2016-07-19 14:43:23,445 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
>> aes192-cbc is not available.
>> 2016-07-19 14:43:23,445 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
>> arcfour256 is not available.
>> 2016-07-19 14:43:23,445 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
>> SSH_MSG_KEXINIT sent
>> 2016-07-19 14:43:23,446 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
>> SSH_MSG_KEXINIT received
>> 2016-07-19 14:43:23,446 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
>> kex: server->client aes128-ctr hmac-md5 none
>> 2016-07-19 14:43:23,446 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
>> kex: client->server aes128-ctr hmac-md5 none
>> 2016-07-19 14:43:23,478 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
>> SSH_MSG_KEXDH_INIT sent
>> 2016-07-19 14:43:23,479 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
>> expecting SSH_MSG_KEXDH_REPLY
>> 2016-07-19 14:43:23,493 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
>> ssh_rsa_verify: signature true
>> 2016-07-19 14:43:23,495 WARN org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
>> Permanently added 'hadoopActiveMaster' (RSA) to the list of known hosts.
>> 2016-07-19 14:43:23,495 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
>> SSH_MSG_NEWKEYS sent
>> 2016-07-19 14:43:23,495 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
>> SSH_MSG_NEWKEYS received
>> 2016-07-19 14:43:23,519 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
>> SSH_MSG_SERVICE_REQUEST sent
>> 2016-07-19 14:43:23,519 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
>> SSH_MSG_SERVICE_ACCEPT received
>> 2016-07-19 14:43:23,524 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
>> Authentications that can continue:
>> gssapi-with-mic,publickey,keyboard-interactive,password
>> 2016-07-19 14:43:23,524 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
>> Next authentication method: gssapi-with-mic
>> 2016-07-19 14:43:23,527 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
>> Authentications that can continue: publickey,keyboard-interactive,password
>> 2016-07-19 14:43:23,527 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
>> Next authentication method: publickey
>> 2016-07-19 14:43:23,617 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
>> Authentication succeeded (publickey).
>> 2016-07-19 14:43:23,624 INFO org.apache.hadoop.ha.SshFenceByTcpPort:
>> Connected to hadoopActiveMaster
>> 2016-07-19 14:43:23,624 INFO org.apache.hadoop.ha.SshFenceByTcpPort:
>> Looking for process running on port 8020
>> 2016-07-19 14:43:23,705 WARN org.apache.hadoop.ha.SshFenceByTcpPort:
>> PATH=$PATH:/sbin:/usr/sbin fuser -v -k -n tcp 8020 via ssh: bash: fuser:
>> command not found
>> 2016-07-19 14:43:23,706 INFO org.apache.hadoop.ha.SshFenceByTcpPort: rc:
>> 127
>> 2016-07-19 14:43:23,706 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
>> Disconnecting from hadoopActiveMaster port 22
>> 2016-07-19 14:43:23,717 WARN org.apache.hadoop.ha.NodeFencer: Fencing
>> method org.apache.hadoop.ha.SshFenceByTcpPort(null) was unsuccessful.
>> 2016-07-19 14:43:23,718 ERROR org.apache.hadoop.ha.NodeFencer: Unable to
>> fence service by any configured method.
>> 2016-07-19 14:43:23,719 WARN org.apache.hadoop.ha.ActiveStandbyElector:
>> Exception handling the winning of election
>> java.lang.RuntimeException: Unable to fence NameNode at
>> hadoopActiveMaster/192.168.0.80:8020
>>         at
>> org.apache.hadoop.ha.ZKFailoverController.doFence(ZKFailoverController.java:533)
>>         at
>> org.apache.hadoop.ha.ZKFailoverController.fenceOldActive(ZKFailoverController.java:505)
>>         at
>> org.apache.hadoop.ha.ZKFailoverController.access$1100(ZKFailoverController.java:61)
>>         at
>> org.apache.hadoop.ha.ZKFailoverController$ElectorCallbacks.fenceOldActive(ZKFailoverController.java:892)
>>         at
>> org.apache.hadoop.ha.ActiveStandbyElector.fenceOldActive(ActiveStandbyElector.java:910)
>>         at
>> org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:809)
>>         at
>> org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:418)
>>         at
>> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599)
>>         at
>> org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
>> 2016-07-19 14:43:23,719 INFO org.apache.hadoop.ha.ActiveStandbyElector:
>> Trying to re-establish ZK session
>> 2016-07-19 14:43:23,725 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
>> Caught an exception, leaving main loop due to Socket closed
>> 2016-07-19 14:43:23,746 INFO org.apache.zookeeper.ZooKeeper: Session:
>> 0x35602bbb71e0002 closed
>> 2016-07-19 14:43:24,750 INFO org.apache.zookeeper.ZooKeeper: Initiating
>> client connection,
>> connectString=hadoopActiveMaster:2181,hadoopStandby:2181,hadoopSlave1:2181
>> sessionTimeout=5000
>> watcher=org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef@6a02f3d6
>> 2016-07-19 14:43:24,760 INFO org.apache.zookeeper.ClientCnxn: Opening
>> socket connection to server hadoopActiveMaster/192.168.0.80:2181. Will
>> not attempt to authenticate using SASL (unknown error)
>> 2016-07-19 14:43:24,762 INFO org.apache.zookeeper.ClientCnxn: Socket
>> connection established to hadoopActiveMaster/192.168.0.80:2181,
>> initiating session
>> 2016-07-19 14:43:24,773 INFO org.apache.zookeeper.ClientCnxn: Session
>> establishment complete on server hadoopActiveMaster/192.168.0.80:2181,
>> sessionid = 0x15602bba9e00003, negotiated timeout = 5000
>> 2016-07-19 14:43:24,778 INFO org.apache.zookeeper.ClientCnxn: EventThread
>> shut down
>> 2016-07-19 14:43:24,782 INFO org.apache.hadoop.ha.ActiveStandbyElector:
>> Session connected.
>>
>>
>> Please, help me to solve the problem with the configuration of Hadoop HA
>>
>> Sincerely,
>> Alexandr
>>
>
>

Re: ZKFC do not work in Hadoop HA

Posted by Rakesh Radhakrishnan <ra...@apache.org>.
Hi Alexandr,

I can see the following warning message in your logs; it is the reason the
fencing attempt failed. ZKFC will not promote the standby NameNode until the
old active has been successfully fenced, so the failed fence is what keeps
your standby node stuck in the standby state. Could you please check that the
'fuser' command is available on your active master node?

2016-07-19 14:43:23,705 WARN org.apache.hadoop.ha.SshFenceByTcpPort:
PATH=$PATH:/sbin:/usr/sbin fuser -v -k -n tcp 8020 via ssh: bash: fuser:
command not found
2016-07-19 14:43:23,706 INFO org.apache.hadoop.ha.SshFenceByTcpPort: rc: 127
2016-07-19 14:43:23,706 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
Disconnecting from hadoopActiveMaster port 22
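To verify, here is a quick check you could run (a sketch, assuming a POSIX
shell; run it on every host that can ever hold the active NameNode role,
since the surviving ZKFC SSHes into the failed node and runs fuser there):

```shell
# SshFenceByTcpPort runs "fuser -v -k -n tcp 8020" on the remote host with
# PATH extended by /sbin:/usr/sbin, so check with the same PATH.
if PATH="$PATH:/sbin:/usr/sbin" command -v fuser >/dev/null 2>&1; then
    echo "fuser found: $(PATH="$PATH:/sbin:/usr/sbin" command -v fuser)"
else
    # fuser usually ships in the 'psmisc' package (CentOS/RHEL and
    # Debian/Ubuntu alike) - adjust for your distribution.
    echo "fuser missing - install it, e.g.: yum install psmisc"
fi
```

The "rc: 127" in your log is the shell's "command not found" exit code, which
matches the missing fuser binary.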

Also, I'd suggest visiting
https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html
to understand more about the fencing logic. On that page, search for the
"*dfs.ha.fencing.methods*" configuration.
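For reference, a fencing configuration in hdfs-site.xml could look like the
sketch below (the key path is hypothetical - use your own; the second method,
shell(/bin/true), is an optional fallback that lets failover proceed even
when the dead host is unreachable over SSH, which the page above discusses -
understand the trade-offs before enabling it):

```xml
<!-- Sketch only: multiple fencing methods are newline-separated and tried
     in order until one succeeds. -->
<property>
  <name>dfs.ha.fencing.methods</name>
  <value>sshfence
shell(/bin/true)</value>
</property>
<property>
  <name>dfs.ha.fencing.ssh.private-key-files</name>
  <value>/home/hadoop/.ssh/id_rsa</value>
</property>
```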

Regards,
Rakesh

On Tue, Jul 19, 2016 at 7:22 PM, Alexandr Porunov <
alexandr.porunov@gmail.com> wrote:

> [original message quoted in full above; trimmed]