Posted to hdfs-user@hadoop.apache.org by YouPeng Yang <yy...@gmail.com> on 2013/12/03 03:23:58 UTC

Can not auto-failover when unplug network interface

Hi
   Another auto-failover testing problem:

   My HA setup can auto-failover after I kill the active NN. But when I
unplug the network interface to simulate a hardware failure, the
auto-failover does not seem to work even after waiting for a while - the
zkfc logs are in [1].

   I'm using the default sshfence.






[1] zkfc
logs----------------------------------------------------------------------------------------
2013-12-03 10:05:56,650 INFO org.apache.hadoop.ha.NodeFencer: ======
Beginning Service Fencing Process... ======
2013-12-03 10:05:56,650 INFO org.apache.hadoop.ha.NodeFencer: Trying method
1/1: org.apache.hadoop.ha.SshFenceByTcpPort(null)
2013-12-03 10:05:56,651 INFO org.apache.hadoop.ha.SshFenceByTcpPort:
Connecting to hadoop3...
2013-12-03 10:05:56,651 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
Connecting to hadoop3 port 22
2013-12-03 10:05:59,648 WARN org.apache.hadoop.ha.SshFenceByTcpPort: Unable
to connect to hadoop3 as user hadoop
com.jcraft.jsch.JSchException: java.net.NoRouteToHostException: No route to
host
    at com.jcraft.jsch.Util.createSocket(Util.java:386)
    at com.jcraft.jsch.Session.connect(Session.java:182)
    at
org.apache.hadoop.ha.SshFenceByTcpPort.tryFence(SshFenceByTcpPort.java:100)
    at org.apache.hadoop.ha.NodeFencer.fence(NodeFencer.java:97)
    at
org.apache.hadoop.ha.ZKFailoverController.doFence(ZKFailoverController.java:521)
    at
org.apache.hadoop.ha.ZKFailoverController.fenceOldActive(ZKFailoverController.java:494)
    at
org.apache.hadoop.ha.ZKFailoverController.access$1100(ZKFailoverController.java:59)
    at
org.apache.hadoop.ha.ZKFailoverController$ElectorCallbacks.fenceOldActive(ZKFailoverController.java:837)
    at
org.apache.hadoop.ha.ActiveStandbyElector.fenceOldActive(ActiveStandbyElector.java:900)
    at
org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:799)
    at
org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:415)
    at
org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:596)
    at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495)
2013-12-03 10:05:59,649 WARN org.apache.hadoop.ha.NodeFencer: Fencing
method org.apache.hadoop.ha.SshFenceByTcpPort(null) was unsuccessful.
2013-12-03 10:05:59,649 ERROR org.apache.hadoop.ha.NodeFencer: Unable to
fence service by any configured method.
2013-12-03 10:05:59,650 WARN org.apache.hadoop.ha.ActiveStandbyElector:
Exception handling the winning of election
java.lang.RuntimeException: Unable to fence NameNode at hadoop3/
10.7.23.124:8020
    at
org.apache.hadoop.ha.ZKFailoverController.doFence(ZKFailoverController.java:522)
    at
org.apache.hadoop.ha.ZKFailoverController.fenceOldActive(ZKFailoverController.java:494)
    at
org.apache.hadoop.ha.ZKFailoverController.access$1100(ZKFailoverController.java:59)
    at
org.apache.hadoop.ha.ZKFailoverController$ElectorCallbacks.fenceOldActive(ZKFailoverController.java:837)
    at
org.apache.hadoop.ha.ActiveStandbyElector.fenceOldActive(ActiveStandbyElector.java:900)
    at
org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:799)
    at
org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:415)
    at
org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:596)
    at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495)
2013-12-03 10:05:59,650 INFO org.apache.hadoop.ha.ActiveStandbyElector:
Trying to re-establish ZK session
2013-12-03 10:05:59,676 INFO org.apache.zookeeper.ZooKeeper: Session:
0x142931031810260 closed
2013-12-03 10:06:00,678 INFO org.apache.zookeeper.ZooKeeper: Initiating
client connection, connectString=hadoop1:2181,hadoop2:2181,hadoop3:2181
sessionTimeout=5000
watcher=org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef@5ce2acea
2013-12-03 10:06:00,681 INFO org.apache.zookeeper.ClientCnxn: Opening
socket connection to server hadoop1/10.7.23.122:2181. Will not attempt to
authenticate using SASL (Unable to locate a login configuration)
2013-12-03 10:06:00,681 INFO org.apache.zookeeper.ClientCnxn: Socket
connection established to hadoop1/10.7.23.122:2181, initiating session
2013-12-03 10:06:00,709 INFO org.apache.zookeeper.ClientCnxn: Session
establishment complete on server hadoop1/10.7.23.122:2181, sessionid =
0x142931031810261, negotiated timeout = 5000
2013-12-03 10:06:00,711 INFO org.apache.zookeeper.ClientCnxn: EventThread
shut down

Re: Can not auto-failover when unplug network interface

Posted by YouPeng Yang <yy...@gmail.com>.
Hi Yu

   I think that when the NIC is unplugged, the ssh connection cannot get
through, because the standby cannot reach the failed active NN.
If that is the case, sshfence will fail.
   Am I right?
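For exactly this scenario, the Hadoop HA documentation suggests listing a
fallback fencing method after sshfence, so that failover can still proceed
when the old active's host is unreachable. With quorum journal storage
(qjournal://), shell(/bin/true) is commonly used as that fallback, because the
JournalNodes already prevent the old active from continuing to write. A sketch
of the change to hdfs-site.xml (fencing methods are newline-separated):

```xml
<property>
  <name>dfs.ha.fencing.methods</name>
  <!-- Try sshfence first; if the host is unreachable (e.g. NIC unplugged),
       fall back to shell(/bin/true) so the standby can still become active. -->
  <value>sshfence
shell(/bin/true)</value>
</property>
```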


2013/12/3 YouPeng Yang <yy...@gmail.com>

> Hi Yu
>
>   Thanks for your response.
>   I'm sure my ssh setup is good. Ssh from the active NN to the standby NN
> needs no password.
>
>
>
>
>
>
>
> I attached my config
> ------core-site.xml-----------------
>
> <configuration>
>  <property>
>      <name>fs.defaultFS</name>
>      <value>hdfs://lklcluster</value>
>      <final>true</final>
>  </property>
>
>  <property>
>      <name>hadoop.tmp.dir</name>
>      <value>/home/hadoop/tmp2</value>
>  </property>
>
>
> </configuration>
>
>
> -------hdfs-site.xml----------
> ---
>
> <configuration>
>  <property>
>      <name>dfs.namenode.name.dir</name>
>     <value>/home/hadoop/namedir2</value>
>  </property>
>
>  <property>
>      <name>dfs.datanode.data.dir</name>
>      <value>/home/hadoop/datadir2</value>
>  </property>
>
>  <property>
>    <name>dfs.nameservices</name>
>    <value>lklcluster</value>
> </property>
>
> <property>
>     <name>dfs.ha.namenodes.lklcluster</name>
>     <value>nn1,nn2</value>
> </property>
> <property>
>   <name>dfs.namenode.rpc-address.lklcluster.nn1</name>
>   <value>hadoop2:8020</value>
> </property>
> <property>
>     <name>dfs.namenode.rpc-address.lklcluster.nn2</name>
>     <value>hadoop3:8020</value>
> </property>
>
> <property>
>   <name>dfs.namenode.http-address.lklcluster.nn1</name>
>     <value>hadoop2:50070</value>
> </property>
>
> <property>
>     <name>dfs.namenode.http-address.lklcluster.nn2</name>
>     <value>hadoop3:50070</value>
> </property>
>
> <property>
>   <name>dfs.namenode.shared.edits.dir</name>
>
> <value>qjournal://hadoop1:8485;hadoop2:8485;hadoop3:8485/lklcluster</value>
> </property>
> <property>
>   <name>dfs.client.failover.proxy.provider.lklcluster</name>
>
> <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
> </property>
> <property>
>   <name>dfs.ha.fencing.methods</name>
>   <value>sshfence</value>
> </property>
>
> <property>
>   <name>dfs.ha.fencing.ssh.private-key-files</name>
>    <value>/home/hadoop/.ssh/id_rsa</value>
> </property>
>
> <property>
>     <name>dfs.ha.fencing.ssh.connect-timeout</name>
>      <value>5000</value>
> </property>
>
> <property>
>   <name>dfs.journalnode.edits.dir</name>
>    <value>/home/hadoop/journal/data</value>
> </property>
>
> <property>
>    <name>dfs.ha.automatic-failover.enabled</name>
>       <value>true</value>
> </property>
>
> <property>
>      <name>ha.zookeeper.quorum</name>
>      <value>hadoop1:2181,hadoop2:2181,hadoop3:2181</value>
> </property>
>
> </configuration>
>
>
>
> 2013/12/3 Azuryy Yu <az...@gmail.com>
>
>> This is still because your fence method is configured improperly.
>> Please paste your fence configuration, and double check that you can ssh
>> from the active NN to the standby NN without a password.
>>
>>
>> On Tue, Dec 3, 2013 at 10:23 AM, YouPeng Yang <yy...@gmail.com>wrote:
>>
>>> Hi
>>>    Another auto-failover testing problem:
>>>
>>>    My HA setup can auto-failover after I kill the active NN. But when I
>>> unplug the network interface to simulate a hardware failure, the
>>> auto-failover does not seem to work even after waiting for a while - the
>>> zkfc logs are in [1].
>>>
>>>    I'm using the default sshfence.
>>>
>>>
>>>
>>>
>>>
>>>
>>> [1] zkfc
>>> logs----------------------------------------------------------------------------------------
>>> 2013-12-03 10:05:56,650 INFO org.apache.hadoop.ha.NodeFencer: ======
>>> Beginning Service Fencing Process... ======
>>> 2013-12-03 10:05:56,650 INFO org.apache.hadoop.ha.NodeFencer: Trying
>>> method 1/1: org.apache.hadoop.ha.SshFenceByTcpPort(null)
>>> 2013-12-03 10:05:56,651 INFO org.apache.hadoop.ha.SshFenceByTcpPort:
>>> Connecting to hadoop3...
>>> 2013-12-03 10:05:56,651 INFO
>>> org.apache.hadoop.ha.SshFenceByTcpPort.jsch: Connecting to hadoop3 port 22
>>> 2013-12-03 10:05:59,648 WARN org.apache.hadoop.ha.SshFenceByTcpPort:
>>> Unable to connect to hadoop3 as user hadoop
>>> com.jcraft.jsch.JSchException: java.net.NoRouteToHostException: No route
>>> to host
>>>     at com.jcraft.jsch.Util.createSocket(Util.java:386)
>>>     at com.jcraft.jsch.Session.connect(Session.java:182)
>>>     at
>>> org.apache.hadoop.ha.SshFenceByTcpPort.tryFence(SshFenceByTcpPort.java:100)
>>>     at org.apache.hadoop.ha.NodeFencer.fence(NodeFencer.java:97)
>>>     at
>>> org.apache.hadoop.ha.ZKFailoverController.doFence(ZKFailoverController.java:521)
>>>     at
>>> org.apache.hadoop.ha.ZKFailoverController.fenceOldActive(ZKFailoverController.java:494)
>>>     at
>>> org.apache.hadoop.ha.ZKFailoverController.access$1100(ZKFailoverController.java:59)
>>>     at
>>> org.apache.hadoop.ha.ZKFailoverController$ElectorCallbacks.fenceOldActive(ZKFailoverController.java:837)
>>>     at
>>> org.apache.hadoop.ha.ActiveStandbyElector.fenceOldActive(ActiveStandbyElector.java:900)
>>>     at
>>> org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:799)
>>>     at
>>> org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:415)
>>>     at
>>> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:596)
>>>     at
>>> org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495)
>>> 2013-12-03 10:05:59,649 WARN org.apache.hadoop.ha.NodeFencer: Fencing
>>> method org.apache.hadoop.ha.SshFenceByTcpPort(null) was unsuccessful.
>>> 2013-12-03 10:05:59,649 ERROR org.apache.hadoop.ha.NodeFencer: Unable to
>>> fence service by any configured method.
>>> 2013-12-03 10:05:59,650 WARN org.apache.hadoop.ha.ActiveStandbyElector:
>>> Exception handling the winning of election
>>> java.lang.RuntimeException: Unable to fence NameNode at hadoop3/
>>> 10.7.23.124:8020
>>>     at
>>> org.apache.hadoop.ha.ZKFailoverController.doFence(ZKFailoverController.java:522)
>>>     at
>>> org.apache.hadoop.ha.ZKFailoverController.fenceOldActive(ZKFailoverController.java:494)
>>>     at
>>> org.apache.hadoop.ha.ZKFailoverController.access$1100(ZKFailoverController.java:59)
>>>     at
>>> org.apache.hadoop.ha.ZKFailoverController$ElectorCallbacks.fenceOldActive(ZKFailoverController.java:837)
>>>     at
>>> org.apache.hadoop.ha.ActiveStandbyElector.fenceOldActive(ActiveStandbyElector.java:900)
>>>     at
>>> org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:799)
>>>     at
>>> org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:415)
>>>     at
>>> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:596)
>>>     at
>>> org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495)
>>> 2013-12-03 10:05:59,650 INFO org.apache.hadoop.ha.ActiveStandbyElector:
>>> Trying to re-establish ZK session
>>> 2013-12-03 10:05:59,676 INFO org.apache.zookeeper.ZooKeeper: Session:
>>> 0x142931031810260 closed
>>> 2013-12-03 10:06:00,678 INFO org.apache.zookeeper.ZooKeeper: Initiating
>>> client connection, connectString=hadoop1:2181,hadoop2:2181,hadoop3:2181
>>> sessionTimeout=5000
>>> watcher=org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef@5ce2acea
>>> 2013-12-03 10:06:00,681 INFO org.apache.zookeeper.ClientCnxn: Opening
>>> socket connection to server hadoop1/10.7.23.122:2181. Will not attempt
>>> to authenticate using SASL (Unable to locate a login configuration)
>>> 2013-12-03 10:06:00,681 INFO org.apache.zookeeper.ClientCnxn: Socket
>>> connection established to hadoop1/10.7.23.122:2181, initiating session
>>> 2013-12-03 10:06:00,709 INFO org.apache.zookeeper.ClientCnxn: Session
>>> establishment complete on server hadoop1/10.7.23.122:2181, sessionid =
>>> 0x142931031810261, negotiated timeout = 5000
>>> 2013-12-03 10:06:00,711 INFO org.apache.zookeeper.ClientCnxn:
>>> EventThread shut down
>>>
>>
>>
>
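Azuryy's suggestion to double-check ssh has two parts: key-based auth, and
basic TCP reachability. The "No route to host" in the zkfc log above suggests
the failure here is network-level rather than an ssh-key problem. A quick way
to confirm reachability of the old active's ssh port from the ZKFC host, as a
minimal sketch (the hostname is an assumption taken from this thread):

```python
import socket

def ssh_port_reachable(host: str, port: int = 22, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "No route to host", timeouts, refused connections, DNS errors.
        return False

# Usage (hypothetical hostname from this thread; replace with your own):
# ssh_port_reachable("hadoop3")
```

If this returns False while the NIC is unplugged, sshfence cannot possibly
succeed, and a fallback fencing method is needed for failover to complete.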

Re: Can not auto-failover when unplug network interface

Posted by YouPeng Yang <yy...@gmail.com>.
Hi Yu

   I think when unplug the nic ,the ssh could not make through because it
can not connect to  failed  active NN.
Suppose that ,the sshfence will failed.
   Am I right?


2013/12/3 YouPeng Yang <yy...@gmail.com>

> Hi Yu
>
>   Thanks for your response.
>   I'm sure my ssh setup is good. Ssh from  act NN to stanby nn need no
> password.
>
>
>
>
>
>
>
> I attached my config
> ------core-site.xml-----------------
>
> <configuration>
>  <property>
>      <name>fs.defaultFS</name>
>      <value>hdfs://lklcluster</value>
>      <final>true</final>
>  </property>
>
>  <property>
>      <name>hadoop.tmp.dir</name>
>      <value>/home/hadoop/tmp2</value>
>  </property>
>
>
> </configuration>
>
>
> -------hdfs-site.xml----------
> ---
>
> <configuration>
>  <property>
>      <name>dfs.namenode.name.dir</name>
>     <value>/home/hadoop/namedir2</value>
>  </property>
>
>  <property>
>      <name>dfs.datanode.data.dir</name>
>      <value>/home/hadoop/datadir2</value>
>  </property>
>
>  <property>
>    <name>dfs.nameservices</name>
>    <value>lklcluster</value>
> </property>
>
> <property>
>     <name>dfs.ha.namenodes.lklcluster</name>
>     <value>nn1,nn2</value>
> </property>
> <property>
>   <name>dfs.namenode.rpc-address.lklcluster.nn1</name>
>   <value>hadoop2:8020</value>
> </property>
> <property>
>     <name>dfs.namenode.rpc-address.lklcluster.nn2</name>
>     <value>hadoop3:8020</value>
> </property>
>
> <property>
>   <name>dfs.namenode.http-address.lklcluster.nn1</name>
>     <value>hadoop2:50070</value>
> </property>
>
> <property>
>     <name>dfs.namenode.http-address.lklcluster.nn2</name>
>     <value>hadoop3:50070</value>
> </property>
>
> <property>
>   <name>dfs.namenode.shared.edits.dir</name>
>
> <value>qjournal://hadoop1:8485;hadoop2:8485;hadoop3:8485/lklcluster</value>
> </property>
> <property>
>   <name>dfs.client.failover.proxy.provider.lklcluster</name>
>
> <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
> </property>
> <property>
>   <name>dfs.ha.fencing.methods</name>
>   <value>sshfence</value>
> </property>
>
> <property>
>   <name>dfs.ha.fencing.ssh.private-key-files</name>
>    <value>/home/hadoop/.ssh/id_rsa</value>
> </property>
>
> <property>
>     <name>dfs.ha.fencing.ssh.connect-timeout</name>
>      <value>5000</value>
> </property>
>
> <property>
>   <name>dfs.journalnode.edits.dir</name>
>    <value>/home/hadoop/journal/data</value>
> </property>
>
> <property>
>    <name>dfs.ha.automatic-failover.enabled</name>
>       <value>true</value>
> </property>
>
> <property>
>      <name>ha.zookeeper.quorum</name>
>      <value>hadoop1:2181,hadoop2:2181,hadoop3:2181</value>
> </property>
>
> </configuration>
>
>
>
> 2013/12/3 Azuryy Yu <az...@gmail.com>
>
>> This is still because your fence method configuraed improperly.
>> plseae paste your fence configuration. and double check you can ssh on
>> active NN to standby NN without password.
>>
>>
>> On Tue, Dec 3, 2013 at 10:23 AM, YouPeng Yang <yy...@gmail.com>wrote:
>>
>>> Hi
>>>    Another auto-failover testing problem:
>>>
>>>    My HA can auto-failover after I kill the active NN.When it comes to
>>> the unplug  network interface to simulate the hardware fail,the
>>> auto-failover seems  not to work after   wait for times -the zkfc logs as
>>> [1].
>>>
>>>    I'm using the default sshfence.
>>>
>>>
>>>
>>>
>>>
>>>
>>> [1] zkfc
>>> logs----------------------------------------------------------------------------------------
>>> 2013-12-03 10:05:56,650 INFO org.apache.hadoop.ha.NodeFencer: ======
>>> Beginning Service Fencing Process... ======
>>> 2013-12-03 10:05:56,650 INFO org.apache.hadoop.ha.NodeFencer: Trying
>>> method 1/1: org.apache.hadoop.ha.SshFenceByTcpPort(null)
>>> 2013-12-03 10:05:56,651 INFO org.apache.hadoop.ha.SshFenceByTcpPort:
>>> Connecting to hadoop3...
>>> 2013-12-03 10:05:56,651 INFO
>>> org.apache.hadoop.ha.SshFenceByTcpPort.jsch: Connecting to hadoop3 port 22
>>> 2013-12-03 10:05:59,648 WARN org.apache.hadoop.ha.SshFenceByTcpPort:
>>> Unable to connect to hadoop3 as user hadoop
>>> com.jcraft.jsch.JSchException: java.net.NoRouteToHostException: No route
>>> to host
>>>     at com.jcraft.jsch.Util.createSocket(Util.java:386)
>>>     at com.jcraft.jsch.Session.connect(Session.java:182)
>>>     at
>>> org.apache.hadoop.ha.SshFenceByTcpPort.tryFence(SshFenceByTcpPort.java:100)
>>>     at org.apache.hadoop.ha.NodeFencer.fence(NodeFencer.java:97)
>>>     at
>>> org.apache.hadoop.ha.ZKFailoverController.doFence(ZKFailoverController.java:521)
>>>     at
>>> org.apache.hadoop.ha.ZKFailoverController.fenceOldActive(ZKFailoverController.java:494)
>>>     at
>>> org.apache.hadoop.ha.ZKFailoverController.access$1100(ZKFailoverController.java:59)
>>>     at
>>> org.apache.hadoop.ha.ZKFailoverController$ElectorCallbacks.fenceOldActive(ZKFailoverController.java:837)
>>>     at
>>> org.apache.hadoop.ha.ActiveStandbyElector.fenceOldActive(ActiveStandbyElector.java:900)
>>>     at
>>> org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:799)
>>>     at
>>> org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:415)
>>>     at
>>> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:596)
>>>     at
>>> org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495)
>>> 2013-12-03 10:05:59,649 WARN org.apache.hadoop.ha.NodeFencer: Fencing
>>> method org.apache.hadoop.ha.SshFenceByTcpPort(null) was unsuccessful.
>>> 2013-12-03 10:05:59,649 ERROR org.apache.hadoop.ha.NodeFencer: Unable to
>>> fence service by any configured method.
>>> 2013-12-03 10:05:59,650 WARN org.apache.hadoop.ha.ActiveStandbyElector:
>>> Exception handling the winning of election
>>> java.lang.RuntimeException: Unable to fence NameNode at hadoop3/
>>> 10.7.23.124:8020
>>>     at
>>> org.apache.hadoop.ha.ZKFailoverController.doFence(ZKFailoverController.java:522)
>>>     at
>>> org.apache.hadoop.ha.ZKFailoverController.fenceOldActive(ZKFailoverController.java:494)
>>>     at
>>> org.apache.hadoop.ha.ZKFailoverController.access$1100(ZKFailoverController.java:59)
>>>     at
>>> org.apache.hadoop.ha.ZKFailoverController$ElectorCallbacks.fenceOldActive(ZKFailoverController.java:837)
>>>     at
>>> org.apache.hadoop.ha.ActiveStandbyElector.fenceOldActive(ActiveStandbyElector.java:900)
>>>     at
>>> org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:799)
>>>     at
>>> org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:415)
>>>     at
>>> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:596)
>>>     at
>>> org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495)
>>> 2013-12-03 10:05:59,650 INFO org.apache.hadoop.ha.ActiveStandbyElector:
>>> Trying to re-establish ZK session
>>> 2013-12-03 10:05:59,676 INFO org.apache.zookeeper.ZooKeeper: Session:
>>> 0x142931031810260 closed
>>> 2013-12-03 10:06:00,678 INFO org.apache.zookeeper.ZooKeeper: Initiating
>>> client connection, connectString=hadoop1:2181,hadoop2:2181,hadoop3:2181
>>> sessionTimeout=5000
>>> watcher=org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef@5ce2acea
>>> 2013-12-03 10:06:00,681 INFO org.apache.zookeeper.ClientCnxn: Opening
>>> socket connection to server hadoop1/10.7.23.122:2181. Will not attempt
>>> to authenticate using SASL (Unable to locate a login configuration)
>>> 2013-12-03 10:06:00,681 INFO org.apache.zookeeper.ClientCnxn: Socket
>>> connection established to hadoop1/10.7.23.122:2181, initiating session
>>> 2013-12-03 10:06:00,709 INFO org.apache.zookeeper.ClientCnxn: Session
>>> establishment complete on server hadoop1/10.7.23.122:2181, sessionid =
>>> 0x142931031810261, negotiated timeout = 5000
>>> 2013-12-03 10:06:00,711 INFO org.apache.zookeeper.ClientCnxn:
>>> EventThread shut down
>>>
>>
>>
>

Re: Can not auto-failover when unplug network interface

Posted by YouPeng Yang <yy...@gmail.com>.
Hi Yu

   I think when unplug the nic ,the ssh could not make through because it
can not connect to  failed  active NN.
Suppose that ,the sshfence will failed.
   Am I right?


2013/12/3 YouPeng Yang <yy...@gmail.com>

> Hi Yu
>
>   Thanks for your response.
>   I'm sure my ssh setup is good. Ssh from  act NN to stanby nn need no
> password.
>
>
>
>
>
>
>
> I attached my config
> ------core-site.xml-----------------
>
> <configuration>
>  <property>
>      <name>fs.defaultFS</name>
>      <value>hdfs://lklcluster</value>
>      <final>true</final>
>  </property>
>
>  <property>
>      <name>hadoop.tmp.dir</name>
>      <value>/home/hadoop/tmp2</value>
>  </property>
>
>
> </configuration>
>
>
> -------hdfs-site.xml----------
> ---
>
> <configuration>
>  <property>
>      <name>dfs.namenode.name.dir</name>
>     <value>/home/hadoop/namedir2</value>
>  </property>
>
>  <property>
>      <name>dfs.datanode.data.dir</name>
>      <value>/home/hadoop/datadir2</value>
>  </property>
>
>  <property>
>    <name>dfs.nameservices</name>
>    <value>lklcluster</value>
> </property>
>
> <property>
>     <name>dfs.ha.namenodes.lklcluster</name>
>     <value>nn1,nn2</value>
> </property>
> <property>
>   <name>dfs.namenode.rpc-address.lklcluster.nn1</name>
>   <value>hadoop2:8020</value>
> </property>
> <property>
>     <name>dfs.namenode.rpc-address.lklcluster.nn2</name>
>     <value>hadoop3:8020</value>
> </property>
>
> <property>
>   <name>dfs.namenode.http-address.lklcluster.nn1</name>
>     <value>hadoop2:50070</value>
> </property>
>
> <property>
>     <name>dfs.namenode.http-address.lklcluster.nn2</name>
>     <value>hadoop3:50070</value>
> </property>
>
> <property>
>   <name>dfs.namenode.shared.edits.dir</name>
>
> <value>qjournal://hadoop1:8485;hadoop2:8485;hadoop3:8485/lklcluster</value>
> </property>
> <property>
>   <name>dfs.client.failover.proxy.provider.lklcluster</name>
>
> <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
> </property>
> <property>
>   <name>dfs.ha.fencing.methods</name>
>   <value>sshfence</value>
> </property>
>
> <property>
>   <name>dfs.ha.fencing.ssh.private-key-files</name>
>    <value>/home/hadoop/.ssh/id_rsa</value>
> </property>
>
> <property>
>     <name>dfs.ha.fencing.ssh.connect-timeout</name>
>      <value>5000</value>
> </property>
>
> <property>
>   <name>dfs.journalnode.edits.dir</name>
>    <value>/home/hadoop/journal/data</value>
> </property>
>
> <property>
>    <name>dfs.ha.automatic-failover.enabled</name>
>       <value>true</value>
> </property>
>
> <property>
>      <name>ha.zookeeper.quorum</name>
>      <value>hadoop1:2181,hadoop2:2181,hadoop3:2181</value>
> </property>
>
> </configuration>
>
>
>
> 2013/12/3 Azuryy Yu <az...@gmail.com>
>
>> This is still because your fence method configuraed improperly.
>> plseae paste your fence configuration. and double check you can ssh on
>> active NN to standby NN without password.
>>
>>
>> On Tue, Dec 3, 2013 at 10:23 AM, YouPeng Yang <yy...@gmail.com>wrote:
>>
>>> Hi
>>>    Another auto-failover testing problem:
>>>
>>>    My HA can auto-failover after I kill the active NN.When it comes to
>>> the unplug  network interface to simulate the hardware fail,the
>>> auto-failover seems  not to work after   wait for times -the zkfc logs as
>>> [1].
>>>
>>>    I'm using the default sshfence.
>>>
>>>
>>>
>>>
>>>
>>>
>>> [1] zkfc
>>> logs----------------------------------------------------------------------------------------
>>> 2013-12-03 10:05:56,650 INFO org.apache.hadoop.ha.NodeFencer: ======
>>> Beginning Service Fencing Process... ======
>>> 2013-12-03 10:05:56,650 INFO org.apache.hadoop.ha.NodeFencer: Trying
>>> method 1/1: org.apache.hadoop.ha.SshFenceByTcpPort(null)
>>> 2013-12-03 10:05:56,651 INFO org.apache.hadoop.ha.SshFenceByTcpPort:
>>> Connecting to hadoop3...
>>> 2013-12-03 10:05:56,651 INFO
>>> org.apache.hadoop.ha.SshFenceByTcpPort.jsch: Connecting to hadoop3 port 22
>>> 2013-12-03 10:05:59,648 WARN org.apache.hadoop.ha.SshFenceByTcpPort:
>>> Unable to connect to hadoop3 as user hadoop
>>> com.jcraft.jsch.JSchException: java.net.NoRouteToHostException: No route
>>> to host
>>>     at com.jcraft.jsch.Util.createSocket(Util.java:386)
>>>     at com.jcraft.jsch.Session.connect(Session.java:182)
>>>     at
>>> org.apache.hadoop.ha.SshFenceByTcpPort.tryFence(SshFenceByTcpPort.java:100)
>>>     at org.apache.hadoop.ha.NodeFencer.fence(NodeFencer.java:97)
>>>     at
>>> org.apache.hadoop.ha.ZKFailoverController.doFence(ZKFailoverController.java:521)
>>>     at
>>> org.apache.hadoop.ha.ZKFailoverController.fenceOldActive(ZKFailoverController.java:494)
>>>     at
>>> org.apache.hadoop.ha.ZKFailoverController.access$1100(ZKFailoverController.java:59)
>>>     at
>>> org.apache.hadoop.ha.ZKFailoverController$ElectorCallbacks.fenceOldActive(ZKFailoverController.java:837)
>>>     at
>>> org.apache.hadoop.ha.ActiveStandbyElector.fenceOldActive(ActiveStandbyElector.java:900)
>>>     at
>>> org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:799)
>>>     at
>>> org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:415)
>>>     at
>>> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:596)
>>>     at
>>> org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495)
>>> 2013-12-03 10:05:59,649 WARN org.apache.hadoop.ha.NodeFencer: Fencing
>>> method org.apache.hadoop.ha.SshFenceByTcpPort(null) was unsuccessful.
>>> 2013-12-03 10:05:59,649 ERROR org.apache.hadoop.ha.NodeFencer: Unable to
>>> fence service by any configured method.
>>> 2013-12-03 10:05:59,650 WARN org.apache.hadoop.ha.ActiveStandbyElector:
>>> Exception handling the winning of election
>>> java.lang.RuntimeException: Unable to fence NameNode at hadoop3/
>>> 10.7.23.124:8020
>>>     at
>>> org.apache.hadoop.ha.ZKFailoverController.doFence(ZKFailoverController.java:522)
>>>     at
>>> org.apache.hadoop.ha.ZKFailoverController.fenceOldActive(ZKFailoverController.java:494)
>>>     at
>>> org.apache.hadoop.ha.ZKFailoverController.access$1100(ZKFailoverController.java:59)
>>>     at
>>> org.apache.hadoop.ha.ZKFailoverController$ElectorCallbacks.fenceOldActive(ZKFailoverController.java:837)
>>>     at
>>> org.apache.hadoop.ha.ActiveStandbyElector.fenceOldActive(ActiveStandbyElector.java:900)
>>>     at
>>> org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:799)
>>>     at
>>> org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:415)
>>>     at
>>> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:596)
>>>     at
>>> org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495)
>>> 2013-12-03 10:05:59,650 INFO org.apache.hadoop.ha.ActiveStandbyElector:
>>> Trying to re-establish ZK session
>>> 2013-12-03 10:05:59,676 INFO org.apache.zookeeper.ZooKeeper: Session:
>>> 0x142931031810260 closed
>>> 2013-12-03 10:06:00,678 INFO org.apache.zookeeper.ZooKeeper: Initiating
>>> client connection, connectString=hadoop1:2181,hadoop2:2181,hadoop3:2181
>>> sessionTimeout=5000
>>> watcher=org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef@5ce2acea
>>> 2013-12-03 10:06:00,681 INFO org.apache.zookeeper.ClientCnxn: Opening
>>> socket connection to server hadoop1/10.7.23.122:2181. Will not attempt
>>> to authenticate using SASL (Unable to locate a login configuration)
>>> 2013-12-03 10:06:00,681 INFO org.apache.zookeeper.ClientCnxn: Socket
>>> connection established to hadoop1/10.7.23.122:2181, initiating session
>>> 2013-12-03 10:06:00,709 INFO org.apache.zookeeper.ClientCnxn: Session
>>> establishment complete on server hadoop1/10.7.23.122:2181, sessionid =
>>> 0x142931031810261, negotiated timeout = 5000
>>> 2013-12-03 10:06:00,711 INFO org.apache.zookeeper.ClientCnxn:
>>> EventThread shut down
>>>
>>
>>
>

Re: Can not auto-failover when unplug network interface

Posted by YouPeng Yang <yy...@gmail.com>.
Hi Yu

   I think when unplug the nic ,the ssh could not make through because it
can not connect to  failed  active NN.
Suppose that ,the sshfence will failed.
   Am I right?


2013/12/3 YouPeng Yang <yy...@gmail.com>

> Hi Yu
>
>   Thanks for your response.
>   I'm sure my ssh setup is good. Ssh from  act NN to stanby nn need no
> password.
>
>
>
>
>
>
>
> I attached my config
> ------core-site.xml-----------------
>
> <configuration>
>  <property>
>      <name>fs.defaultFS</name>
>      <value>hdfs://lklcluster</value>
>      <final>true</final>
>  </property>
>
>  <property>
>      <name>hadoop.tmp.dir</name>
>      <value>/home/hadoop/tmp2</value>
>  </property>
>
>
> </configuration>
>
>
> -------hdfs-site.xml----------
> ---
>
> <configuration>
>  <property>
>      <name>dfs.namenode.name.dir</name>
>     <value>/home/hadoop/namedir2</value>
>  </property>
>
>  <property>
>      <name>dfs.datanode.data.dir</name>
>      <value>/home/hadoop/datadir2</value>
>  </property>
>
>  <property>
>    <name>dfs.nameservices</name>
>    <value>lklcluster</value>
> </property>
>
> <property>
>     <name>dfs.ha.namenodes.lklcluster</name>
>     <value>nn1,nn2</value>
> </property>
> <property>
>   <name>dfs.namenode.rpc-address.lklcluster.nn1</name>
>   <value>hadoop2:8020</value>
> </property>
> <property>
>     <name>dfs.namenode.rpc-address.lklcluster.nn2</name>
>     <value>hadoop3:8020</value>
> </property>
>
> <property>
>   <name>dfs.namenode.http-address.lklcluster.nn1</name>
>     <value>hadoop2:50070</value>
> </property>
>
> <property>
>     <name>dfs.namenode.http-address.lklcluster.nn2</name>
>     <value>hadoop3:50070</value>
> </property>
>
> <property>
>   <name>dfs.namenode.shared.edits.dir</name>
>
> <value>qjournal://hadoop1:8485;hadoop2:8485;hadoop3:8485/lklcluster</value>
> </property>
> <property>
>   <name>dfs.client.failover.proxy.provider.lklcluster</name>
>
> <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
> </property>
> <property>
>   <name>dfs.ha.fencing.methods</name>
>   <value>sshfence</value>
> </property>
>
> <property>
>   <name>dfs.ha.fencing.ssh.private-key-files</name>
>    <value>/home/hadoop/.ssh/id_rsa</value>
> </property>
>
> <property>
>     <name>dfs.ha.fencing.ssh.connect-timeout</name>
>      <value>5000</value>
> </property>
>
> <property>
>   <name>dfs.journalnode.edits.dir</name>
>    <value>/home/hadoop/journal/data</value>
> </property>
>
> <property>
>    <name>dfs.ha.automatic-failover.enabled</name>
>       <value>true</value>
> </property>
>
> <property>
>      <name>ha.zookeeper.quorum</name>
>      <value>hadoop1:2181,hadoop2:2181,hadoop3:2181</value>
> </property>
>
> </configuration>
>
>
>
> 2013/12/3 Azuryy Yu <az...@gmail.com>
>
>> This is still because your fence method configuraed improperly.
>> plseae paste your fence configuration. and double check you can ssh on
>> active NN to standby NN without password.
>>
>>
>> On Tue, Dec 3, 2013 at 10:23 AM, YouPeng Yang <yy...@gmail.com>wrote:
>>
>>> Hi
>>>    Another auto-failover testing problem:
>>>
>>>    My HA setup can auto-failover after I kill the active NN. However,
>>> when I unplug the network interface to simulate a hardware failure, the
>>> auto-failover does not seem to work, even after waiting for some time -
>>> the zkfc logs are shown in [1].
>>>
>>>    I'm using the default sshfence.
>>>
>>>
>>>
>>>
>>>
>>>
>>> [1] zkfc
>>> logs----------------------------------------------------------------------------------------
>>> 2013-12-03 10:05:56,650 INFO org.apache.hadoop.ha.NodeFencer: ======
>>> Beginning Service Fencing Process... ======
>>> 2013-12-03 10:05:56,650 INFO org.apache.hadoop.ha.NodeFencer: Trying
>>> method 1/1: org.apache.hadoop.ha.SshFenceByTcpPort(null)
>>> 2013-12-03 10:05:56,651 INFO org.apache.hadoop.ha.SshFenceByTcpPort:
>>> Connecting to hadoop3...
>>> 2013-12-03 10:05:56,651 INFO
>>> org.apache.hadoop.ha.SshFenceByTcpPort.jsch: Connecting to hadoop3 port 22
>>> 2013-12-03 10:05:59,648 WARN org.apache.hadoop.ha.SshFenceByTcpPort:
>>> Unable to connect to hadoop3 as user hadoop
>>> com.jcraft.jsch.JSchException: java.net.NoRouteToHostException: No route
>>> to host
>>>     at com.jcraft.jsch.Util.createSocket(Util.java:386)
>>>     at com.jcraft.jsch.Session.connect(Session.java:182)
>>>     at
>>> org.apache.hadoop.ha.SshFenceByTcpPort.tryFence(SshFenceByTcpPort.java:100)
>>>     at org.apache.hadoop.ha.NodeFencer.fence(NodeFencer.java:97)
>>>     at
>>> org.apache.hadoop.ha.ZKFailoverController.doFence(ZKFailoverController.java:521)
>>>     at
>>> org.apache.hadoop.ha.ZKFailoverController.fenceOldActive(ZKFailoverController.java:494)
>>>     at
>>> org.apache.hadoop.ha.ZKFailoverController.access$1100(ZKFailoverController.java:59)
>>>     at
>>> org.apache.hadoop.ha.ZKFailoverController$ElectorCallbacks.fenceOldActive(ZKFailoverController.java:837)
>>>     at
>>> org.apache.hadoop.ha.ActiveStandbyElector.fenceOldActive(ActiveStandbyElector.java:900)
>>>     at
>>> org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:799)
>>>     at
>>> org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:415)
>>>     at
>>> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:596)
>>>     at
>>> org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495)
>>> 2013-12-03 10:05:59,649 WARN org.apache.hadoop.ha.NodeFencer: Fencing
>>> method org.apache.hadoop.ha.SshFenceByTcpPort(null) was unsuccessful.
>>> 2013-12-03 10:05:59,649 ERROR org.apache.hadoop.ha.NodeFencer: Unable to
>>> fence service by any configured method.
>>> 2013-12-03 10:05:59,650 WARN org.apache.hadoop.ha.ActiveStandbyElector:
>>> Exception handling the winning of election
>>> java.lang.RuntimeException: Unable to fence NameNode at hadoop3/
>>> 10.7.23.124:8020
>>>     at
>>> org.apache.hadoop.ha.ZKFailoverController.doFence(ZKFailoverController.java:522)
>>>     at
>>> org.apache.hadoop.ha.ZKFailoverController.fenceOldActive(ZKFailoverController.java:494)
>>>     at
>>> org.apache.hadoop.ha.ZKFailoverController.access$1100(ZKFailoverController.java:59)
>>>     at
>>> org.apache.hadoop.ha.ZKFailoverController$ElectorCallbacks.fenceOldActive(ZKFailoverController.java:837)
>>>     at
>>> org.apache.hadoop.ha.ActiveStandbyElector.fenceOldActive(ActiveStandbyElector.java:900)
>>>     at
>>> org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:799)
>>>     at
>>> org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:415)
>>>     at
>>> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:596)
>>>     at
>>> org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495)
>>> 2013-12-03 10:05:59,650 INFO org.apache.hadoop.ha.ActiveStandbyElector:
>>> Trying to re-establish ZK session
>>> 2013-12-03 10:05:59,676 INFO org.apache.zookeeper.ZooKeeper: Session:
>>> 0x142931031810260 closed
>>> 2013-12-03 10:06:00,678 INFO org.apache.zookeeper.ZooKeeper: Initiating
>>> client connection, connectString=hadoop1:2181,hadoop2:2181,hadoop3:2181
>>> sessionTimeout=5000
>>> watcher=org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef@5ce2acea
>>> 2013-12-03 10:06:00,681 INFO org.apache.zookeeper.ClientCnxn: Opening
>>> socket connection to server hadoop1/10.7.23.122:2181. Will not attempt
>>> to authenticate using SASL (Unable to locate a login configuration)
>>> 2013-12-03 10:06:00,681 INFO org.apache.zookeeper.ClientCnxn: Socket
>>> connection established to hadoop1/10.7.23.122:2181, initiating session
>>> 2013-12-03 10:06:00,709 INFO org.apache.zookeeper.ClientCnxn: Session
>>> establishment complete on server hadoop1/10.7.23.122:2181, sessionid =
>>> 0x142931031810261, negotiated timeout = 5000
>>> 2013-12-03 10:06:00,711 INFO org.apache.zookeeper.ClientCnxn:
>>> EventThread shut down
>>>
>>
>>
>
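
[The fallback semantics NodeFencer implements — try each configured fencing method in order, stop at the first that exits successfully — can be sketched in shell. This is an illustration only, not Hadoop's actual code; the failing SSH attempt is simulated with a stub exit code standing in for the NoRouteToHostException seen in the log above.]

```shell
#!/bin/sh
# Simulated fencing methods, tried in the order they would appear in
# dfs.ha.fencing.methods. Exit status 0 means "target fenced".

fence_sshfence() {
  # Stand-in for an SSH connect that fails with "No route to host".
  return 255
}

fence_shell_true() {
  # shell(/bin/true) fallback: always reports success.
  /bin/true
}

for method in fence_sshfence fence_shell_true; do
  if "$method"; then
    echo "fenced via $method"
    break
  fi
done
```

With only sshfence configured (as in the thread), the loop above would exhaust its methods and fencing would fail — which is exactly why the ZKFC logs "Unable to fence service by any configured method" and aborts the failover.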

>> 2013-12-03 10:06:00,681 INFO org.apache.zookeeper.ClientCnxn: Socket
>> connection established to hadoop1/10.7.23.122:2181, initiating session
>> 2013-12-03 10:06:00,709 INFO org.apache.zookeeper.ClientCnxn: Session
>> establishment complete on server hadoop1/10.7.23.122:2181, sessionid =
>> 0x142931031810261, negotiated timeout = 5000
>> 2013-12-03 10:06:00,711 INFO org.apache.zookeeper.ClientCnxn: EventThread
>> shut down
>>
>
>

Re: Can not auto-failover when unplug network interface

Posted by Azuryy Yu <az...@gmail.com>.
This is still because your fence method is configured improperly.
Please paste your fence configuration, and double-check that you can ssh from
the active NN to the standby NN without a password.
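For reference, sshfence is typically configured in hdfs-site.xml along these lines (the user and private-key path below are illustrative and must match the account the ZKFC runs as):

```xml
<!-- hdfs-site.xml: illustrative sshfence setup; adjust user/key path to your cluster -->
<property>
  <name>dfs.ha.fencing.methods</name>
  <!-- methods are tried in order, one per line -->
  <value>sshfence
shell(/bin/true)</value>
</property>
<property>
  <name>dfs.ha.fencing.ssh.private-key-files</name>
  <value>/home/hadoop/.ssh/id_rsa</value>
</property>
```

Note that sshfence by itself can never succeed when the old active host is completely unreachable, which is exactly the unplugged-NIC scenario (the "No route to host" in the log above). Appending shell(/bin/true) as a last-resort method lets the ZKFC treat fencing as successful and proceed with the failover in that case; with quorum-journal-based HA the JournalNodes still prevent the old active from writing edits, but you should confirm that trade-off is acceptable for your setup.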


On Tue, Dec 3, 2013 at 10:23 AM, YouPeng Yang <yy...@gmail.com>wrote:

> Hi
>    Another auto-failover testing problem:
>
>    My HA can auto-failover after I kill the active NN.When it comes to the
> unplug  network interface to simulate the hardware fail,the auto-failover
> seems  not to work after   wait for times -the zkfc logs as [1].
>
>    I'm using the default sshfence.
>
>
>
>
>
>
> [1] zkfc
> logs----------------------------------------------------------------------------------------
> 2013-12-03 10:05:56,650 INFO org.apache.hadoop.ha.NodeFencer: ======
> Beginning Service Fencing Process... ======
> 2013-12-03 10:05:56,650 INFO org.apache.hadoop.ha.NodeFencer: Trying
> method 1/1: org.apache.hadoop.ha.SshFenceByTcpPort(null)
> 2013-12-03 10:05:56,651 INFO org.apache.hadoop.ha.SshFenceByTcpPort:
> Connecting to hadoop3...
> 2013-12-03 10:05:56,651 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
> Connecting to hadoop3 port 22
> 2013-12-03 10:05:59,648 WARN org.apache.hadoop.ha.SshFenceByTcpPort:
> Unable to connect to hadoop3 as user hadoop
> com.jcraft.jsch.JSchException: java.net.NoRouteToHostException: No route
> to host
>     at com.jcraft.jsch.Util.createSocket(Util.java:386)
>     at com.jcraft.jsch.Session.connect(Session.java:182)
>     at
> org.apache.hadoop.ha.SshFenceByTcpPort.tryFence(SshFenceByTcpPort.java:100)
>     at org.apache.hadoop.ha.NodeFencer.fence(NodeFencer.java:97)
>     at
> org.apache.hadoop.ha.ZKFailoverController.doFence(ZKFailoverController.java:521)
>     at
> org.apache.hadoop.ha.ZKFailoverController.fenceOldActive(ZKFailoverController.java:494)
>     at
> org.apache.hadoop.ha.ZKFailoverController.access$1100(ZKFailoverController.java:59)
>     at
> org.apache.hadoop.ha.ZKFailoverController$ElectorCallbacks.fenceOldActive(ZKFailoverController.java:837)
>     at
> org.apache.hadoop.ha.ActiveStandbyElector.fenceOldActive(ActiveStandbyElector.java:900)
>     at
> org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:799)
>     at
> org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:415)
>     at
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:596)
>     at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495)
> 2013-12-03 10:05:59,649 WARN org.apache.hadoop.ha.NodeFencer: Fencing
> method org.apache.hadoop.ha.SshFenceByTcpPort(null) was unsuccessful.
> 2013-12-03 10:05:59,649 ERROR org.apache.hadoop.ha.NodeFencer: Unable to
> fence service by any configured method.
> 2013-12-03 10:05:59,650 WARN org.apache.hadoop.ha.ActiveStandbyElector:
> Exception handling the winning of election
> java.lang.RuntimeException: Unable to fence NameNode at hadoop3/
> 10.7.23.124:8020
>     at
> org.apache.hadoop.ha.ZKFailoverController.doFence(ZKFailoverController.java:522)
>     at
> org.apache.hadoop.ha.ZKFailoverController.fenceOldActive(ZKFailoverController.java:494)
>     at
> org.apache.hadoop.ha.ZKFailoverController.access$1100(ZKFailoverController.java:59)
>     at
> org.apache.hadoop.ha.ZKFailoverController$ElectorCallbacks.fenceOldActive(ZKFailoverController.java:837)
>     at
> org.apache.hadoop.ha.ActiveStandbyElector.fenceOldActive(ActiveStandbyElector.java:900)
>     at
> org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:799)
>     at
> org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:415)
>     at
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:596)
>     at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495)
> 2013-12-03 10:05:59,650 INFO org.apache.hadoop.ha.ActiveStandbyElector:
> Trying to re-establish ZK session
> 2013-12-03 10:05:59,676 INFO org.apache.zookeeper.ZooKeeper: Session:
> 0x142931031810260 closed
> 2013-12-03 10:06:00,678 INFO org.apache.zookeeper.ZooKeeper: Initiating
> client connection, connectString=hadoop1:2181,hadoop2:2181,hadoop3:2181
> sessionTimeout=5000
> watcher=org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef@5ce2acea
> 2013-12-03 10:06:00,681 INFO org.apache.zookeeper.ClientCnxn: Opening
> socket connection to server hadoop1/10.7.23.122:2181. Will not attempt to
> authenticate using SASL (Unable to locate a login configuration)
> 2013-12-03 10:06:00,681 INFO org.apache.zookeeper.ClientCnxn: Socket
> connection established to hadoop1/10.7.23.122:2181, initiating session
> 2013-12-03 10:06:00,709 INFO org.apache.zookeeper.ClientCnxn: Session
> establishment complete on server hadoop1/10.7.23.122:2181, sessionid =
> 0x142931031810261, negotiated timeout = 5000
> 2013-12-03 10:06:00,711 INFO org.apache.zookeeper.ClientCnxn: EventThread
> shut down
>
