Posted to user@hbase.apache.org by Bharath Vissapragada <bh...@cloudera.com> on 2014/12/01 12:57:30 UTC

Re: After hadoop QJM failover,hbase can not write

Did you override "dfs.client.retry.policy.enabled" to "true" in the
regionserver configs?
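
For reference, such an override would live in the hdfs-site.xml that the
regionservers load from the HBase conf directory. A purely illustrative
sketch (leaving the property unset keeps the default of "false", which is
what a later reply in this thread recommends):

<!-- illustrative only: an explicit override in the regionservers' hdfs-site.xml -->
<property>
        <name>dfs.client.retry.policy.enabled</name>
        <value>false</value>
</property>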

On Mon, Dec 1, 2014 at 9:13 AM, 聪聪 <17...@qq.com> wrote:

> hi there:
> I have run into a problem that is troubling me.
>
>
> I am using hadoop-2.3.0-cdh5.1.0, with namenode HA using the
> Quorum Journal Manager (QJM) feature. The dfs.ha.fencing.methods option is
> set as follows:
> <property>
>         <name>dfs.ha.fencing.methods</name>
>         <value>sshfence
>                shell(q_hadoop_fence.sh $target_host $target_port)
>         </value>
> </property>
>
>
>
> or
>
>
> <property>
>         <name>dfs.ha.fencing.methods</name>
>         <value>sshfence
>                shell(/bin/true)
>         </value>
> </property>
>
>
>
> I used iptables to simulate a crash of the active namenode machine. After
> automatic failover completed, hdfs could write normally (for example,
> ./bin/hdfs dfs -put a.txt /tmp), but hbase still could not write.
> After a very long time hbase could write again, but I could not measure
> how long it took.
> I want to ask:
> 1. Why can hbase not write once hdfs has completed its failover?
> 2. After hdfs completes failover, how long until hbase can write again?
> 3. Do any particular parameters influence this time?
>
> Looking forward to your responses!
> Attached is the regionserver log; writes only succeed once the following
> content appears:
> 2014-12-01 11:35:16,965 INFO  [MemStoreFlusher.6] regionserver.HRegion:
> Finished memstore flush of ~7.9 K/8096, currentsize=0/0 for region
> t,,1417403859247.645d0fbe63663fabfb73025d3eb99524. in 46ms, sequenceid=48,
> compaction requested=false
> 2014-12-01 11:35:17,755 WARN  [RpcServer.reader=1,port=60020]
> ipc.RpcServer: RpcServer.listener,port=60020: count of bytes read: 0
> java.io.IOException: Connection reset by peer
>         at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>         at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>         at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
>         at sun.nio.ch.IOUtil.read(IOUtil.java:197)
>         at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
>         at
> org.apache.hadoop.hbase.ipc.RpcServer.channelRead(RpcServer.java:2248)
>         at
> org.apache.hadoop.hbase.ipc.RpcServer$Connection.readAndProcess(RpcServer.java:1427)
>         at
> org.apache.hadoop.hbase.ipc.RpcServer$Listener.doRead(RpcServer.java:802)
>         at
> org.apache.hadoop.hbase.ipc.RpcServer$Listener$Reader.doRunLoop(RpcServer.java:593)
>         at
> org.apache.hadoop.hbase.ipc.RpcServer$Listener$Reader.run(RpcServer.java:568)
>         at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:744)‍
>
>
>
> Part of the datanode log follows:
>
>
> 2014-11-28 16:51:56,420 INFO org.apache.hadoop.ipc.Client: Retrying
> connect to server: l-hbase1.dba.dev.cn0/10.86.36.217:8020. Already tried
> 8 time(s); retry policy is
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000
> MILLISECONDS)
> 2014-11-28 16:52:12,421 INFO org.apache.hadoop.ipc.Client: Retrying
> connect to server: l-hbase1.dba.dev.cn0/10.86.36.217:8020. Already tried
> 9 time(s); retry policy is
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000
> MILLISECONDS)
> 2014-11-28 16:52:27,422 WARN
> org.apache.hadoop.hdfs.server.datanode.DataNode: IOException in offerService
> java.net.ConnectException: Call From
> l-hbase3.dba.dev.cn0.qunar.com/10.86.36.219 to l-hbase1.dba.dev.cn0:8020
> failed on connection exception: java.net.ConnectException:
>  Connection timed out; For more details see:
> http://wiki.apache.org/hadoop/ConnectionRefused
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
> Method)
>         at
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>         at
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>         at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>         at
> org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:783)
>         at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:730)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1413)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1362)
>         at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
>         at com.sun.proxy.$Proxy9.sendHeartbeat(Unknown Source)
>         at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
>         at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:606)
>         at
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186)
>         at
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
>         at com.sun.proxy.$Proxy9.sendHeartbeat(Unknown Source)
>         at
> org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.sendHeartbeat(DatanodeProtocolClientSideTranslatorPB.java:178)
>         at
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:566)
>         at
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:664)
>         at
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:834)
>         at java.lang.Thread.run(Thread.java:744)
> Caused by: java.net.ConnectException: Connection timed out
>         at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>         at
> sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:735)
>         at
> org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
>         at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:529)
>         at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:493)
>         at
> org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:604)
>         at
> org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:699)
>         at
> org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:367)
>         at org.apache.hadoop.ipc.Client.getConnection(Client.java:1461)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1380)
>         ... 14 more
> 2014-11-28 16:52:43,424 INFO org.apache.hadoop.ipc.Client: Retrying
> connect to server: l-hbase1.dba.dev.cn0/10.86.36.217:8020. Already tried
> 0 time(s); retry policy is
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000
> MILLISECONDS)
> 2014-11-28 16:52:59,424 INFO org.apache.hadoop.ipc.Client: Retrying
> connect to server: l-hbase1.dba.dev.cn0/10.86.36.217:8020. Already tried
> 1 time(s); retry policy is
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000
> MILLISECONDS)
> 2014-11-28 16:53:15,425 INFO org.apache.hadoop.ipc.Client: Retrying
> connect to server: l-hbase1.dba.dev.cn0/10.86.36.217:8020. Already tried
> 2 time(s); retry policy is
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000
> MILLISECONDS)‍




-- 
Bharath Vissapragada
<http://www.cloudera.com>

Re: After hadoop QJM failover,hbase can not write

Posted by 聪聪 <17...@qq.com>.
I found the cause of the problem: I had used iptables to simulate a crash of the
active namenode machine. When I instead shut down the active namenode machine
manually, hbase was able to write again soon after the hadoop QJM failover.
I still do not know the underlying reason.
Thank you again for your reply!
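
For what it is worth, the long delay in the iptables case is consistent with
clients waiting on TCP connect timeouts (the "Connection timed out" errors in
the logs) instead of getting an immediate connection-refused as they would
after a real shutdown, so retries and failover proceed much more slowly.
As a hedged illustration only, the Hadoop IPC client's connect behaviour is
governed by core-site.xml settings like the following; the values shown are
the assumed Hadoop 2.x defaults, so check core-default.xml for your release:

<!-- illustrative core-site.xml fragment; values are assumed 2.x defaults -->
<property>
        <name>ipc.client.connect.timeout</name>
        <value>20000</value>
</property>
<property>
        <name>ipc.client.connect.max.retries.on.timeouts</name>
        <value>45</value>
</property>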




------------------ Original Message ------------------
From: "Bharath Vissapragada";<bh...@cloudera.com>;
Sent: Monday, December 8, 2014, 6:35 PM
To: "hbase-user"<us...@hbase.apache.org>;

Subject: Re: After hadoop QJM failover,hbase can not write




Re: After hadoop QJM failover,hbase can not write

Posted by Bharath Vissapragada <bh...@cloudera.com>.
Sorry if my previous comment was unclear. dfs.client.retry.policy.enabled
should be set to "false" (which is the default). Overriding it to "true"
will make the HA client pick the wrong retry policy; I just wanted to make
sure you hadn't overridden it with a wrong setting. Regarding your question
about speeding up the failover, a quick look at the codebase suggests the
following configs might be relevant:

dfs.client.failover.max.attempts
dfs.client.failover.sleep.base.millis
dfs.client.failover.sleep.max.millis
dfs.client.retry.max.attempts

However, I suggest asking this question on the hdfs lists, as you might get a
more relevant answer there, since hbase is oblivious to hdfs failover.
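
As an illustration only (not a recommendation), these client-side knobs would
be set in the hdfs-site.xml on the HBase master/regionserver classpath; the
values below are assumptions made up for the example:

<!-- illustrative hdfs-site.xml fragment; values are assumptions, tune for your cluster -->
<property>
        <name>dfs.client.failover.max.attempts</name>
        <value>15</value>
</property>
<property>
        <name>dfs.client.failover.sleep.base.millis</name>
        <value>500</value>
</property>
<property>
        <name>dfs.client.failover.sleep.max.millis</name>
        <value>15000</value>
</property>
<property>
        <name>dfs.client.retry.max.attempts</name>
        <value>10</value>
</property>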







-- 
Bharath Vissapragada
<http://www.cloudera.com>

Re: After hadoop QJM failover,hbase can not write

Posted by 聪聪 <17...@qq.com>.
The complete related log is attached!


------------------ Original Message ------------------
From: "175998806";<17...@qq.com>;
Sent: Monday, December 1, 2014, 11:21 PM
To: "user"<us...@hbase.apache.org>;

Subject: Re: After hadoop QJM failover,hbase can not write



Thank you!
Following your suggestion, I set "dfs.client.retry.policy.enabled" to "true" in core-site.xml and restarted so that it would take effect. I can see some changes in the hbase master log: retry information now appears, but it still takes a long time before writes succeed. I want to ask: how long until hbase can write? What is the retry policy, and which parameters configure it?
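
In case it helps interpret the log below: the
RetryPolicy[MultipleLinearRandomRetry[6x10000ms, 10x60000ms], TryOnceThenFail]
that shows up once dfs.client.retry.policy.enabled is "true" is driven by
dfs.client.retry.policy.spec, a comma-separated list of (sleep-millis,
retry-count) pairs. A hedged sketch of the default spec that matches the log
(verify against hdfs-default.xml for your release):

<!-- illustrative hdfs-site.xml fragment: "10000,6,60000,10" means 6 retries
     sleeping ~10s each, then 10 retries sleeping ~60s each -->
<property>
        <name>dfs.client.retry.policy.spec</name>
        <value>10000,6,60000,10</value>
</property>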


Attached is the hbase master log:
2014-12-01 22:47:30,487 INFO  [master:l-hbase2:60000-SendThread(l-hbase2.dba.dev.cn0.qunar.com:2181)] zookeeper.ClientCnxn: Session establishment complete on server l-hbase2.dba.dev.cn0.qunar.com/10.86.36.218:2181, sessionid = 0x14a0640d2100007, negotiated timeout = 40000
2014-12-01 22:48:38,729 INFO  [master:l-hbase2:60000.oldLogCleaner] ipc.Client: Retrying connect to server: l-hbase1.dba.dev.cn0/10.86.36.217:8020. Already tried 0 time(s); retry policy is RetryPolicy[MultipleLinearRandomRetry[6x10000ms, 10x60000ms], TryOnceThenFail]
2014-12-01 22:49:07,748 INFO  [master:l-hbase2:60000.oldLogCleaner] ipc.Client: Retrying connect to server: l-hbase1.dba.dev.cn0/10.86.36.217:8020. Already tried 1 time(s); retry policy is RetryPolicy[MultipleLinearRandomRetry[6x10000ms, 10x60000ms], TryOnceThenFail]
2014-12-01 22:49:22,534 DEBUG [l-hbase2.dba.dev.cn0.qunar.com,60000,1417445031747-BalancerChore] balancer.BaseLoadBalancer: Not running balancer because only 1 active regionserver(s)
2014-12-01 22:49:34,080 INFO  [master:l-hbase2:60000.oldLogCleaner] ipc.Client: Retrying connect to server: l-hbase1.dba.dev.cn0/10.86.36.217:8020. Already tried 2 time(s); retry policy is RetryPolicy[MultipleLinearRandomRetry[6x10000ms, 10x60000ms], TryOnceThenFail]
2014-12-01 22:49:54,752 INFO  [master:l-hbase2:60000.oldLogCleaner] ipc.Client: Retrying connect to server: l-hbase1.dba.dev.cn0/10.86.36.217:8020. Already tried 3 time(s); retry policy is RetryPolicy[MultipleLinearRandomRetry[6x10000ms, 10x60000ms], TryOnceThenFail]
2014-12-01 22:50:19,014 INFO  [master:l-hbase2:60000.oldLogCleaner] ipc.Client: Retrying connect to server: l-hbase1.dba.dev.cn0/10.86.36.217:8020. Already tried 4 time(s); retry policy is RetryPolicy[MultipleLinearRandomRetry[6x10000ms, 10x60000ms], TryOnceThenFail]
2014-12-01 22:50:44,438 INFO  [master:l-hbase2:60000.oldLogCleaner] ipc.Client: Retrying connect to server: l-hbase1.dba.dev.cn0/10.86.36.217:8020. Already tried 5 time(s); retry policy is RetryPolicy[MultipleLinearRandomRetry[6x10000ms, 10x60000ms], TryOnceThenFail]
2014-12-01 22:51:05,546 INFO  [master:l-hbase2:60000.oldLogCleaner] ipc.Client: Retrying connect to server: l-hbase1.dba.dev.cn0/10.86.36.217:8020. Already tried 6 time(s); retry policy is RetryPolicy[MultipleLinearRandomRetry[6x10000ms, 10x60000ms], TryOnceThenFail]
2014-12-01 22:51:58,980 INFO  [master:l-hbase2:60000.oldLogCleaner] ipc.Client: Retrying connect to server: l-hbase1.dba.dev.cn0/10.86.36.217:8020. Already tried 7 time(s); retry policy is RetryPolicy[MultipleLinearRandomRetry[6x10000ms, 10x60000ms], TryOnceThenFail]
2014-12-01 22:53:33,330 INFO  [master:l-hbase2:60000.oldLogCleaner] ipc.Client: Retrying connect to server: l-hbase1.dba.dev.cn0/10.86.36.217:8020. Already tried 8 time(s); retry policy is RetryPolicy[MultipleLinearRandomRetry[6x10000ms, 10x60000ms], TryOnceThenFail]
2014-12-01 22:54:22,533 DEBUG [l-hbase2.dba.dev.cn0.qunar.com,60000,1417445031747-BalancerChore] balancer.BaseLoadBalancer: Not running balancer because only 1 active regionserver(s)
2014-12-01 22:54:30,953 INFO  [master:l-hbase2:60000.oldLogCleaner] ipc.Client: Retrying connect to server: l-hbase1.dba.dev.cn0/10.86.36.217:8020. Already tried 9 time(s); retry policy is RetryPolicy[MultipleLinearRandomRetry[6x10000ms, 10x60000ms], TryOnceThenFail]
2014-12-01 22:55:43,189 INFO  [master:l-hbase2:60000.oldLogCleaner] ipc.Client: Retrying connect to server: l-hbase1.dba.dev.cn0/10.86.36.217:8020. Already tried 10 time(s); retry policy is RetryPolicy[MultipleLinearRandomRetry[6x10000ms, 10x60000ms], TryOnceThenFail]
2014-12-01 22:56:49,457 INFO  [master:l-hbase2:60000.oldLogCleaner] ipc.Client: Retrying connect to server: l-hbase1.dba.dev.cn0/10.86.36.217:8020. Already tried 11 time(s); retry policy is RetryPolicy[MultipleLinearRandomRetry[6x10000ms, 10x60000ms], TryOnceThenFail]
2014-12-01 22:58:29,088 INFO  [master:l-hbase2:60000.oldLogCleaner] ipc.Client: Retrying connect to server: l-hbase1.dba.dev.cn0/10.86.36.217:8020. Already tried 12 time(s); retry policy is RetryPolicy[MultipleLinearRandomRetry[6x10000ms, 10x60000ms], TryOnceThenFail]
2014-12-01 22:59:22,532 DEBUG [l-hbase2.dba.dev.cn0.qunar.com,60000,1417445031747-BalancerChore] balancer.BaseLoadBalancer: Not running balancer because only 1 active regionserver(s)
2014-12-01 22:59:25,346 INFO  [master:l-hbase2:60000.oldLogCleaner] ipc.Client: Retrying connect to server: l-hbase1.dba.dev.cn0/10.86.36.217:8020. Already tried 13 time(s); retry policy is RetryPolicy[MultipleLinearRandomRetry[6x10000ms, 10x60000ms], TryOnceThenFail]
2014-12-01 23:00:55,023 INFO  [master:l-hbase2:60000.oldLogCleaner] ipc.Client: Retrying connect to server: l-hbase1.dba.dev.cn0/10.86.36.217:8020. Already tried 14 time(s); retry policy is RetryPolicy[MultipleLinearRandomRetry[6x10000ms, 10x60000ms], TryOnceThenFail]
2014-12-01 23:01:59,966 INFO  [master:l-hbase2:60000.oldLogCleaner] ipc.Client: Retrying connect to server: l-hbase1.dba.dev.cn0/10.86.36.217:8020. Already tried 15 time(s); retry policy is RetryPolicy[MultipleLinearRandomRetry[6x10000ms, 10x60000ms], TryOnceThenFail]
2014-12-01 23:02:46,067 INFO  [master:l-hbase2:60000.oldLogCleaner] ipc.Client: Retrying connect to server: l-hbase1.dba.dev.cn0/10.86.36.217:8020. Already tried 16 time(s); retry policy is RetryPolicy[MultipleLinearRandomRetry[6x10000ms, 10x60000ms], TryOnceThenFail]
2014-12-01 23:03:01,073 INFO  [master:l-hbase2:60000.oldLogCleaner] retry.RetryInvocationHandler: Exception while invoking getListing of class ClientNamenodeProtocolTranslatorPB over l-hbase1.dba.dev.cn0/10.86.36.217:8020. Trying to fail over immediately.
java.net.ConnectException: Call From l-hbase2.dba.dev.cn0.qunar.com/10.86.36.218 to l-hbase1.dba.dev.cn0:8020 failed on connection exception: java.net.ConnectException: Connection timed out; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
	at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:783)
	at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:730)
	at org.apache.hadoop.ipc.Client.call(Client.java:1415)
	at org.apache.hadoop.ipc.Client.call(Client.java:1364)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
	at com.sun.proxy.$Proxy17.getListing(Unknown Source)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getListing(ClientNamenodeProtocolTranslatorPB.java:546)
	at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
	at com.sun.proxy.$Proxy18.getListing(Unknown Source)
	at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:294)
	at com.sun.proxy.$Proxy20.getListing(Unknown Source)
	at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1906)
	at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1889)
	at org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:654)
	at org.apache.hadoop.hdfs.DistributedFileSystem.access$600(DistributedFileSystem.java:104)
	at org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:716)
	at org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:712)
	at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
	at org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:712)
	at org.apache.hadoop.hbase.util.FSUtils.listStatus(FSUtils.java:1555)
	at org.apache.hadoop.hbase.util.FSUtils.listStatus(FSUtils.java:1575)
	at org.apache.hadoop.hbase.master.cleaner.CleanerChore.chore(CleanerChore.java:123)
	at org.apache.hadoop.hbase.Chore.run(Chore.java:87)
	at java.lang.Thread.run(Thread.java:744)
Caused by: java.net.ConnectException: Connection timed out
	at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
	at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:735)
	at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
	at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:529)
	at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:493)
	at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:606)
	at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:700)
	at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:367)
	at org.apache.hadoop.ipc.Client.getConnection(Client.java:1463)
	at org.apache.hadoop.ipc.Client.call(Client.java:1382)
	... 28 more
2014-12-01 23:03:01,082 DEBUG [master:l-hbase2:60000.oldLogCleaner] master.ReplicationLogCleaner: Didn't find this log in ZK, deleting: l-hbase3.dba.dev.cn0.qunar.com%2C60020%2C1417428218321.1417442628891.meta










Large attachments sent via QQ Mail:

hadoop-hadoop-namenode-l-hbase2.dba.dev.cn0.log (149.38M, expires 2015-01-02 17:34) download page: http://mail.qq.com/cgi-bin/ftnExs_download?t=exs_ftn_download&k=0733316350d8cd9335b44c691737564a5257050750015c544e030154541a5057070a1c02060f51480706545a570e005d0603045a3118640d02575e0c411a0c04075c5e131c590508065d5e07541a08480b51501054054a0101521f0754414a060d031f0f5e506458&code=c31c17de




hbase-hadoop-master-l-hbase2.dba.dev.cn0.log (5.05M, expires 2015-01-02 17:34) download page: http://mail.qq.com/cgi-bin/ftnExs_download?t=exs_ftn_download&k=02313162e675989e35b64c684330054e5a070852015153054e0709015d1d030356011c5a5008014c510401005055520207535201651c370901504207485856050c5e414f0851441506431c0e485855001054034c0152564f0754474c065e074f0f5e566258&code=c11be07a




hbase-hadoop-regionserver-l-hbase3.dba.dev.cn0.log (7.33M, expires 2015-01-02 17:34) download page: http://mail.qq.com/cgi-bin/ftnExs_download?t=exs_ftn_download&k=5f35323400f5b9c76eb24f3e173601170d5150510602040b15025407541b070b5e561f0c055306150f050a070205570f5e510502310433505a5441511c5e525c575a421943535451575b41514340564a15591f5c5357405d0b1b56565018575d4e1b515a01185f575f350f&code=85241638

Re: After hadoop QJM failover,hbase can not write

Posted by 聪聪 <17...@qq.com>.
Thank you!
Following your suggestion, I set "dfs.client.retry.policy.enabled" to "true" in core-site.xml and restarted so the change takes effect. I can see a difference in the HBase master log: retry messages now appear. However, it still takes a long time before HBase becomes writable. How long should it take for HBase to be able to write again after the failover? What retry policy is being used, and which parameters can be tuned to shorten this time?
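
For reference, the override described above is an ordinary Hadoop client property. A minimal sketch of the corresponding core-site.xml entry (standard Hadoop property format, with the value described in this thread) would look roughly like this:

<property>
        <name>dfs.client.retry.policy.enabled</name>
        <value>true</value>
</property>

It has to be visible to every HDFS client involved, so the HBase master and regionservers would read it from their own configuration directory, and the daemons need a restart for it to take effect.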


Attached is the relevant part of the HBase master log:
2014-12-01 22:47:30,487 INFO  [master:l-hbase2:60000-SendThread(l-hbase2.dba.dev.cn0.qunar.com:2181)] zookeeper.ClientCnxn: Session establishment complete on server l-hbase2.dba.dev.cn0.qunar.com/10.86.36.218:2181, sessionid = 0x14a0640d2100007, negotiated timeout = 40000
2014-12-01 22:48:38,729 INFO  [master:l-hbase2:60000.oldLogCleaner] ipc.Client: Retrying connect to server: l-hbase1.dba.dev.cn0/10.86.36.217:8020. Already tried 0 time(s); retry policy is RetryPolicy[MultipleLinearRandomRetry[6x10000ms, 10x60000ms], TryOnceThenFail]
2014-12-01 22:49:07,748 INFO  [master:l-hbase2:60000.oldLogCleaner] ipc.Client: Retrying connect to server: l-hbase1.dba.dev.cn0/10.86.36.217:8020. Already tried 1 time(s); retry policy is RetryPolicy[MultipleLinearRandomRetry[6x10000ms, 10x60000ms], TryOnceThenFail]
2014-12-01 22:49:22,534 DEBUG [l-hbase2.dba.dev.cn0.qunar.com,60000,1417445031747-BalancerChore] balancer.BaseLoadBalancer: Not running balancer because only 1 active regionserver(s)
2014-12-01 22:49:34,080 INFO  [master:l-hbase2:60000.oldLogCleaner] ipc.Client: Retrying connect to server: l-hbase1.dba.dev.cn0/10.86.36.217:8020. Already tried 2 time(s); retry policy is RetryPolicy[MultipleLinearRandomRetry[6x10000ms, 10x60000ms], TryOnceThenFail]
2014-12-01 22:49:54,752 INFO  [master:l-hbase2:60000.oldLogCleaner] ipc.Client: Retrying connect to server: l-hbase1.dba.dev.cn0/10.86.36.217:8020. Already tried 3 time(s); retry policy is RetryPolicy[MultipleLinearRandomRetry[6x10000ms, 10x60000ms], TryOnceThenFail]
2014-12-01 22:50:19,014 INFO  [master:l-hbase2:60000.oldLogCleaner] ipc.Client: Retrying connect to server: l-hbase1.dba.dev.cn0/10.86.36.217:8020. Already tried 4 time(s); retry policy is RetryPolicy[MultipleLinearRandomRetry[6x10000ms, 10x60000ms], TryOnceThenFail]
2014-12-01 22:50:44,438 INFO  [master:l-hbase2:60000.oldLogCleaner] ipc.Client: Retrying connect to server: l-hbase1.dba.dev.cn0/10.86.36.217:8020. Already tried 5 time(s); retry policy is RetryPolicy[MultipleLinearRandomRetry[6x10000ms, 10x60000ms], TryOnceThenFail]
2014-12-01 22:51:05,546 INFO  [master:l-hbase2:60000.oldLogCleaner] ipc.Client: Retrying connect to server: l-hbase1.dba.dev.cn0/10.86.36.217:8020. Already tried 6 time(s); retry policy is RetryPolicy[MultipleLinearRandomRetry[6x10000ms, 10x60000ms], TryOnceThenFail]
2014-12-01 22:51:58,980 INFO  [master:l-hbase2:60000.oldLogCleaner] ipc.Client: Retrying connect to server: l-hbase1.dba.dev.cn0/10.86.36.217:8020. Already tried 7 time(s); retry policy is RetryPolicy[MultipleLinearRandomRetry[6x10000ms, 10x60000ms], TryOnceThenFail]
2014-12-01 22:53:33,330 INFO  [master:l-hbase2:60000.oldLogCleaner] ipc.Client: Retrying connect to server: l-hbase1.dba.dev.cn0/10.86.36.217:8020. Already tried 8 time(s); retry policy is RetryPolicy[MultipleLinearRandomRetry[6x10000ms, 10x60000ms], TryOnceThenFail]
2014-12-01 22:54:22,533 DEBUG [l-hbase2.dba.dev.cn0.qunar.com,60000,1417445031747-BalancerChore] balancer.BaseLoadBalancer: Not running balancer because only 1 active regionserver(s)
2014-12-01 22:54:30,953 INFO  [master:l-hbase2:60000.oldLogCleaner] ipc.Client: Retrying connect to server: l-hbase1.dba.dev.cn0/10.86.36.217:8020. Already tried 9 time(s); retry policy is RetryPolicy[MultipleLinearRandomRetry[6x10000ms, 10x60000ms], TryOnceThenFail]
2014-12-01 22:55:43,189 INFO  [master:l-hbase2:60000.oldLogCleaner] ipc.Client: Retrying connect to server: l-hbase1.dba.dev.cn0/10.86.36.217:8020. Already tried 10 time(s); retry policy is RetryPolicy[MultipleLinearRandomRetry[6x10000ms, 10x60000ms], TryOnceThenFail]
2014-12-01 22:56:49,457 INFO  [master:l-hbase2:60000.oldLogCleaner] ipc.Client: Retrying connect to server: l-hbase1.dba.dev.cn0/10.86.36.217:8020. Already tried 11 time(s); retry policy is RetryPolicy[MultipleLinearRandomRetry[6x10000ms, 10x60000ms], TryOnceThenFail]
2014-12-01 22:58:29,088 INFO  [master:l-hbase2:60000.oldLogCleaner] ipc.Client: Retrying connect to server: l-hbase1.dba.dev.cn0/10.86.36.217:8020. Already tried 12 time(s); retry policy is RetryPolicy[MultipleLinearRandomRetry[6x10000ms, 10x60000ms], TryOnceThenFail]
2014-12-01 22:59:22,532 DEBUG [l-hbase2.dba.dev.cn0.qunar.com,60000,1417445031747-BalancerChore] balancer.BaseLoadBalancer: Not running balancer because only 1 active regionserver(s)
2014-12-01 22:59:25,346 INFO  [master:l-hbase2:60000.oldLogCleaner] ipc.Client: Retrying connect to server: l-hbase1.dba.dev.cn0/10.86.36.217:8020. Already tried 13 time(s); retry policy is RetryPolicy[MultipleLinearRandomRetry[6x10000ms, 10x60000ms], TryOnceThenFail]
2014-12-01 23:00:55,023 INFO  [master:l-hbase2:60000.oldLogCleaner] ipc.Client: Retrying connect to server: l-hbase1.dba.dev.cn0/10.86.36.217:8020. Already tried 14 time(s); retry policy is RetryPolicy[MultipleLinearRandomRetry[6x10000ms, 10x60000ms], TryOnceThenFail]
2014-12-01 23:01:59,966 INFO  [master:l-hbase2:60000.oldLogCleaner] ipc.Client: Retrying connect to server: l-hbase1.dba.dev.cn0/10.86.36.217:8020. Already tried 15 time(s); retry policy is RetryPolicy[MultipleLinearRandomRetry[6x10000ms, 10x60000ms], TryOnceThenFail]
2014-12-01 23:02:46,067 INFO  [master:l-hbase2:60000.oldLogCleaner] ipc.Client: Retrying connect to server: l-hbase1.dba.dev.cn0/10.86.36.217:8020. Already tried 16 time(s); retry policy is RetryPolicy[MultipleLinearRandomRetry[6x10000ms, 10x60000ms], TryOnceThenFail]
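
As far as I can tell, the MultipleLinearRandomRetry[6x10000ms, 10x60000ms] policy printed above is the connection-retry schedule the HDFS client switches to once dfs.client.retry.policy.enabled is true: roughly six attempts spaced about 10 seconds apart, then ten attempts spaced about 60 seconds apart, and each attempt can additionally block on the TCP connect timeout because the iptables rules silently drop packets to the old namenode. That adds up to the ten-to-fifteen-minute gap between the first retry at 22:48 and the "Trying to fail over immediately" exception at 23:03 shown earlier in the thread. I believe the schedule is controlled by dfs.client.retry.policy.spec, a list of sleep-milliseconds,retry-count pairs whose default should be 10000,6,60000,10 and which is only consulted when the retry policy is enabled. A shorter, purely illustrative spec might look like:

<property>
        <name>dfs.client.retry.policy.spec</name>
        <value>2000,3,10000,3</value>
</property>

The value above is a placeholder, not a recommendation; how long the client should keep retrying the old namenode before failing over is exactly the trade-off being asked about here.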