Posted to user@hbase.apache.org by mail list <lo...@gmail.com> on 2014/12/05 06:01:23 UTC

Slow waitForAckedSeqno took too long time

Hi, all

I deployed HBase 0.98.6-cdh5.2.0 on 3 machines:

l-hbase1.dev.dba.cn0(hadoop namenode active, HMaster active)
l-hbase2.dev.dba.cn0(hadoop namenode standby, HMaster standby, hadoop datanode)
l-hbase3.dev.dba.cn0(regionserver, hadoop datanode)

Then I shut down l-hbase1.dev.dba.cn0, but HBase could not serve requests until about 15 minutes later.
I checked the logs and found the following in the region server's log:

2014-12-05 12:03:19,169 WARN  [regionserver60020-WAL.AsyncSyncer0] hdfs.DFSClient: Slow waitForAckedSeqno took 927762ms (threshold=30000ms)
2014-12-05 12:03:19,186 INFO  [regionserver60020-WAL.AsyncSyncer0] wal.FSHLog: Slow sync cost: 927779 ms, current pipeline: [10.86.36.219:50010]
2014-12-05 12:03:19,186 DEBUG [regionserver60020.logRoller] regionserver.LogRoller: HLog roll requested
2014-12-05 12:03:19,236 WARN  [regionserver60020-WAL.AsyncSyncer1] hdfs.DFSClient: Slow waitForAckedSeqno took 867706ms (threshold=30000ms)

It seems the WAL async sync took too long, which delayed the region server's recovery, but I am not sure whether these log entries matter.
Can anybody explain the reason, and how can I reduce the recovery time?
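
For what it's worth, a stall of roughly 15 minutes is what the stock Hadoop 2.x client-side connect settings work out to when the active NameNode host goes down hard: the host no longer answers, so each IPC connect attempt times out instead of being refused, and with the defaults ipc.client.connect.timeout = 20000 ms and ipc.client.connect.max.retries.on.timeouts = 45, a DFSClient can block for about 45 x 20 s = 900 s before giving up on the dead address, which happens to line up with the delay seen here. A minimal sketch of the client-side settings that typically govern this stall (values are illustrative, not taken from this cluster; ipc.client.* normally lives in core-site.xml and dfs.client.failover.* in hdfs-site.xml):

<!-- Illustrative values only, not this cluster's actual configuration. -->
<property>
  <!-- How long one connect attempt to a NameNode may take (default 20000 ms). -->
  <name>ipc.client.connect.timeout</name>
  <value>10000</value>
</property>
<property>
  <!-- How many timed-out connect attempts are retried (default 45, i.e. ~15 min total). -->
  <name>ipc.client.connect.max.retries.on.timeouts</name>
  <value>3</value>
</property>
<property>
  <!-- How many times the HA failover proxy fails over between NameNodes (default 15). -->
  <name>dfs.client.failover.max.attempts</name>
  <value>15</value>
</property>
<property>
  <!-- Exponential backoff between failover attempts, capped by sleep.max (defaults 500/15000 ms). -->
  <name>dfs.client.failover.sleep.base.millis</name>
  <value>500</value>
</property>
<property>
  <name>dfs.client.failover.sleep.max.millis</name>
  <value>15000</value>
</property>

Since the RegionServer's WAL writes go through the same DFSClient, shrinking the connect timeout and retries is the usual way to shorten how long a sync can block during an unplanned failover; whether that fully explains the 927-second waitForAckedSeqno here would need the namenode and datanode logs to confirm.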



Fwd: Slow waitForAckedSeqno took too long time

Posted by Ted Yu <yu...@gmail.com>.
Forwarding to hdfs-dev@ since the logs pertain to the namenode.

Cheers

---------- Forwarded message ----------
From: louis.hust <lo...@gmail.com>
Date: Fri, Dec 5, 2014 at 8:52 PM
Subject: Re: Slow waitForAckedSeqno took too long time
To: user@hbase.apache.org


Hi Ted,

Thanks for your reply.

Actually, I shut down l-hbase1.dev.dba.cn0 at 11:47:51 and HBase recovered at 12:03.

The log below is the ZKFC log on l-hbase2:

{log}
2014-12-05 11:18:47,077 INFO org.apache.hadoop.ha.ActiveStandbyElector:
Already in election. Not re-connecting.
2014-12-05 11:48:16,023 INFO org.apache.hadoop.ha.ActiveStandbyElector:
Checking for any old active which needs to be fenced...
2014-12-05 11:48:16,025 INFO org.apache.hadoop.ha.ActiveStandbyElector: Old
node exists:
0a096d79636c757374657212036e6e311a146c2d6862617365312e6462612e6465762e636e3020d43e28d33e
2014-12-05 11:48:16,028 INFO org.apache.hadoop.ha.ZKFailoverController:
Should fence: NameNode at l-hbase1.dba.dev.cn0/10.86.36.217:8020
2014-12-05 11:48:32,031 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: l-hbase1.dba.dev.cn0/10.86.36.217:8020. Already tried 0 time(s);
retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=1,
sleepTime=1000 MILLISECONDS)
2014-12-05 11:48:47,033 WARN org.apache.hadoop.ha.FailoverController:
Unable to gracefully make NameNode at l-hbase1.dba.dev.cn0/10.86.36.217:8020
standby (unable to connect)
java.net.ConnectException: Call From l-hbase2.dba.dev.cn0.
qunar.com/10.86.36.218 to l-hbase1.dba.dev.cn0:8020 failed on connection
exception: java.net.ConnectException: Connection timed out; For more
details see:  http://wiki.apache.org/hadoop/ConnectionRefused
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:783)
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:730)
at org.apache.hadoop.ipc.Client.call(Client.java:1413)
at org.apache.hadoop.ipc.Client.call(Client.java:1362)
at
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
at com.sun.proxy.$Proxy8.transitionToStandby(Unknown Source)
at
org.apache.hadoop.ha.protocolPB.HAServiceProtocolClientSideTranslatorPB.transitionToStandby(HAServiceProtocolClientSideTranslatorPB.java:112)
at
org.apache.hadoop.ha.FailoverController.tryGracefulFence(FailoverController.java:172)
at
org.apache.hadoop.ha.ZKFailoverController.doFence(ZKFailoverController.java:503)
at
org.apache.hadoop.ha.ZKFailoverController.fenceOldActive(ZKFailoverController.java:494)
at
org.apache.hadoop.ha.ZKFailoverController.access$1100(ZKFailoverController.java:59)
at
org.apache.hadoop.ha.ZKFailoverController$ElectorCallbacks.fenceOldActive(ZKFailoverController.java:837)
at
org.apache.hadoop.ha.ActiveStandbyElector.fenceOldActive(ActiveStandbyElector.java:901)
at
org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:800)
at
org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:415)
at
org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599)
at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
Caused by: java.net.ConnectException: Connection timed out
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:735)
at
org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:529)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:493)
at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:604)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:699)
at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:367)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1461)
at org.apache.hadoop.ipc.Client.call(Client.java:1380)
... 14 more
2014-12-05 11:48:47,034 INFO org.apache.hadoop.ha.NodeFencer: ======
Beginning Service Fencing Process... ======
2014-12-05 11:48:47,034 INFO org.apache.hadoop.ha.NodeFencer: Trying method
1/1: org.apache.hadoop.ha.ShellCommandFencer(/bin/true)
2014-12-05 11:48:47,040 INFO org.apache.hadoop.ha.ShellCommandFencer:
Launched fencing command '/bin/true' with pid 24513
2014-12-05 11:48:47,044 INFO org.apache.hadoop.ha.NodeFencer: ======
Fencing successful by method
org.apache.hadoop.ha.ShellCommandFencer(/bin/true) ======
2014-12-05 11:48:47,044 INFO org.apache.hadoop.ha.ActiveStandbyElector:
Writing znode /hadoop-ha/mycluster/ActiveBreadCrumb to indicate that the
local node is the most recent active...
2014-12-05 11:48:47,063 INFO org.apache.hadoop.ha.ZKFailoverController:
Trying to make NameNode at l-hbase2.dba.dev.cn0/10.86.36.218:8020 active...
2014-12-05 11:48:47,284 INFO org.apache.hadoop.ha.ZKFailoverController:
Successfully transitioned NameNode at l-hbase2.dba.dev.cn0/10.86.36.218:8020
to active state
2014-12-05 12:29:56,906 INFO
org.apache.hadoop.hdfs.tools.DFSZKFailoverController: Allowed RPC access
from hadoop (auth:SIMPLE) at 10.86.36.217
2014-12-05 12:29:56,907 INFO org.apache.hadoop.ha.ZKFailoverController:
Requested by hadoop (auth:SIMPLE) at 10.86.36.217 to cede active role.
2014-12-05 12:29:56,943 INFO org.apache.hadoop.ha.ZKFailoverController:
Successfully ensured local node is in standby mode
2014-12-05 12:29:56,943 INFO org.apache.hadoop.ha.ActiveStandbyElector:
Yielding from election
2014-12-05 12:29:56,943 INFO org.apache.hadoop.ha.ActiveStandbyElector:
Deleting bread-crumb of active node…

{log}


The namenode log is a bit large, so I have attached it.
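
One detail worth calling out from the ZKFC log above: graceful fencing of the old active failed (the host was unreachable), and the configured fencing method is ShellCommandFencer(/bin/true), which always reports success without actually fencing anything. That is harmless when the old NameNode host is truly down, as it was here, and it is not what caused the delay (the standby became active at 11:48:47), but it never really isolates a misbehaving active. A minimal hdfs-site.xml sketch of a more common fencing setup, with placeholder paths rather than this cluster's values:

<property>
  <!-- Try ssh fencing first; fall back to a no-op shell fence so failover
       can still proceed when the old active host is completely down. -->
  <name>dfs.ha.fencing.methods</name>
  <value>sshfence
shell(/bin/true)</value>
</property>
<property>
  <!-- Placeholder path: private key the ZKFC uses to ssh into the old active. -->
  <name>dfs.ha.fencing.ssh.private-key-files</name>
  <value>/home/hadoop/.ssh/id_rsa</value>
</property>
<property>
  <!-- Give up on ssh fencing after 30 s (the default). -->
  <name>dfs.ha.fencing.ssh.connect-timeout</name>
  <value>30000</value>
</property>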


On Dec 5, 2014, at 23:01, Ted Yu <yu...@gmail.com> wrote:

The warning was logged by DFSOutputStream.

What was the load on hdfs around 2014-12-05 12:03 ?
Have you checked namenode log ?

Cheers

On Thu, Dec 4, 2014 at 9:01 PM, mail list <lo...@gmail.com> wrote:

Hi ,all

I deploy Hbase0.98.6-cdh5.2.0 on 3 machine:

l-hbase1.dev.dba.cn0(hadoop namenode active, HMaster active)
l-hbase2.dev.dba.cn0(hadoop namenode standby, HMaster standby, hadoop
datanode)
l-hbase3.dev.dba.cn0(regionserver, hadoop datanode)

Then I shutdown the l-hbase1.dev.dba.cn0,  But HBase can not work until
about 15mins later.
I check the log and find the following log in the region server’s log:

2014-12-05 12:03:19,169 WARN  [regionserver60020-WAL.AsyncSyncer0]
hdfs.DFSClient: Slow waitForAckedSeqno took 927762ms (threshold=30000ms)
2014-12-05 12:03:19,186 INFO  [regionserver60020-WAL.AsyncSyncer0]
wal.FSHLog: Slow sync cost: 927779 ms, current pipeline: [
10.86.36.219:50010]
2014-12-05 12:03:19,186 DEBUG [regionserver60020.logRoller]
regionserver.LogRoller: HLog roll requested
2014-12-05 12:03:19,236 WARN  [regionserver60020-WAL.AsyncSyncer1]
hdfs.DFSClient: Slow waitForAckedSeqno took 867706ms (threshold=30000ms)

It seems the WAL Asysnc took too long time for region server recovery? I
don’t know if the log matters ?
Can any body explain the reason? and how to reduce the time for recovery?

Re: Slow waitForAckedSeqno took too long time

Posted by Esteban Gutierrez <es...@cloudera.com>.
Hello Louis,

Just looking at the timeouts and the services you are running in this
cluster, I think the issue could be caused by memory oversubscription, or by
some sort of network issue that triggers a NameNode failover and makes the
Region Servers hang until HDFS is available again. Make sure the heap sizes
of the processes don't exceed the physical memory on each node, and rule out
any other network problems. Also, since your cluster is very small, you
should probably consider not using HA for HDFS at this point; it only
consumes additional resources, and you won't get any real advantage from it
in a 3-node cluster. Are you by any chance using VMs, or are those 3 nodes
physical boxes?

cheers,
esteban.


--
Cloudera, Inc.
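
Regarding the suggestion above to skip HDFS HA on a cluster this small: in configuration terms, non-HA just means fs.defaultFS in core-site.xml points at a single NameNode RPC address instead of the logical nameservice ("mycluster" in the ZKFC log). A rough sketch, using a hostname from this thread; note that actually converting back from HA also means removing the nameservice, ZKFC, and standby NameNode configuration, which is not shown here:

<!-- core-site.xml, non-HA: clients talk to one NameNode directly. -->
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://l-hbase1.dev.dba.cn0:8020</value>
</property>

<!-- For comparison, the HA form implied by this thread's logs, where
     "mycluster" is the logical nameservice the failover proxy resolves. -->
<!--
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://mycluster</value>
</property>
-->

Whether dropping HA is worth it depends on how much you rely on NameNode failover; on three nodes the standby plus ZKFC mainly costs memory, which feeds back into the oversubscription concern above.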


On Fri, Dec 5, 2014 at 8:52 PM, louis.hust <lo...@gmail.com> wrote:

> Hi Ted,
>
> Thanks for your reply.
>
> Actually I shutdown the l-hbase1.dev.dba.cn0 at the time 11:47:51 and the
> hbase recover at 12:03
>
> the below log is the zkfc log on l-hbase2
>
> {log}
> 2014-12-05 11:18:47,077 INFO org.apache.hadoop.ha.ActiveStandbyElector:
> Already in election. Not re-connecting.
> 2014-12-05 11:48:16,023 INFO org.apache.hadoop.ha.ActiveStandbyElector:
> Checking for any old active which needs to be fenced...
> 2014-12-05 11:48:16,025 INFO org.apache.hadoop.ha.ActiveStandbyElector:
> Old node exists:
> 0a096d79636c757374657212036e6e311a146c2d6862617365312e6462612e6465762e636e3020d43e28d33e
> 2014-12-05 11:48:16,028 INFO org.apache.hadoop.ha.ZKFailoverController:
> Should fence: NameNode at l-hbase1.dba.dev.cn0/10.86.36.217:8020
> 2014-12-05 11:48:32,031 INFO org.apache.hadoop.ipc.Client: Retrying
> connect to server: l-hbase1.dba.dev.cn0/10.86.36.217:8020. Already tried
> 0 time(s); retry policy is
> RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1000
> MILLISECONDS)
> 2014-12-05 11:48:47,033 WARN org.apache.hadoop.ha.FailoverController:
> Unable to gracefully make NameNode at l-hbase1.dba.dev.cn0/
> 10.86.36.217:8020 standby (unable to connect)
> java.net.ConnectException: Call From l-hbase2.dba.dev.cn0.
> qunar.com/10.86.36.218 to l-hbase1.dba.dev.cn0:8020 failed on connection
> exception: java.net.ConnectException: Connection timed out; For more
> details see:  http://wiki.apache.org/hadoop/ConnectionRefused
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> at
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
> at
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
> at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:783)
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:730)
> at org.apache.hadoop.ipc.Client.call(Client.java:1413)
> at org.apache.hadoop.ipc.Client.call(Client.java:1362)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
> at com.sun.proxy.$Proxy8.transitionToStandby(Unknown Source)
> at
> org.apache.hadoop.ha.protocolPB.HAServiceProtocolClientSideTranslatorPB.transitionToStandby(HAServiceProtocolClientSideTranslatorPB.java:112)
> at
> org.apache.hadoop.ha.FailoverController.tryGracefulFence(FailoverController.java:172)
> at
> org.apache.hadoop.ha.ZKFailoverController.doFence(ZKFailoverController.java:503)
> at
> org.apache.hadoop.ha.ZKFailoverController.fenceOldActive(ZKFailoverController.java:494)
> at
> org.apache.hadoop.ha.ZKFailoverController.access$1100(ZKFailoverController.java:59)
> at
> org.apache.hadoop.ha.ZKFailoverController$ElectorCallbacks.fenceOldActive(ZKFailoverController.java:837)
> at
> org.apache.hadoop.ha.ActiveStandbyElector.fenceOldActive(ActiveStandbyElector.java:901)
> at
> org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:800)
> at
> org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:415)
> at
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599)
> at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
> Caused by: java.net.ConnectException: Connection timed out
> at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
> at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:735)
> at
> org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
> at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:529)
> at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:493)
> at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:604)
> at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:699)
> at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:367)
> at org.apache.hadoop.ipc.Client.getConnection(Client.java:1461)
> at org.apache.hadoop.ipc.Client.call(Client.java:1380)
> ... 14 more
> 2014-12-05 11:48:47,034 INFO org.apache.hadoop.ha.NodeFencer: ======
> Beginning Service Fencing Process... ======
> 2014-12-05 11:48:47,034 INFO org.apache.hadoop.ha.NodeFencer: Trying
> method 1/1: org.apache.hadoop.ha.ShellCommandFencer(/bin/true)
> 2014-12-05 11:48:47,040 INFO org.apache.hadoop.ha.ShellCommandFencer:
> Launched fencing command '/bin/true' with pid 24513
> 2014-12-05 11:48:47,044 INFO org.apache.hadoop.ha.NodeFencer: ======
> Fencing successful by method
> org.apache.hadoop.ha.ShellCommandFencer(/bin/true) ======
> 2014-12-05 11:48:47,044 INFO org.apache.hadoop.ha.ActiveStandbyElector:
> Writing znode /hadoop-ha/mycluster/ActiveBreadCrumb to indicate that the
> local node is the most recent active...
> 2014-12-05 11:48:47,063 INFO org.apache.hadoop.ha.ZKFailoverController:
> Trying to make NameNode at l-hbase2.dba.dev.cn0/10.86.36.218:8020
> active...
> 2014-12-05 11:48:47,284 INFO org.apache.hadoop.ha.ZKFailoverController:
> Successfully transitioned NameNode at l-hbase2.dba.dev.cn0/
> 10.86.36.218:8020 to active state
> 2014-12-05 12:29:56,906 INFO
> org.apache.hadoop.hdfs.tools.DFSZKFailoverController: Allowed RPC access
> from hadoop (auth:SIMPLE) at 10.86.36.217
> 2014-12-05 12:29:56,907 INFO org.apache.hadoop.ha.ZKFailoverController:
> Requested by hadoop (auth:SIMPLE) at 10.86.36.217 to cede active role.
> 2014-12-05 12:29:56,943 INFO org.apache.hadoop.ha.ZKFailoverController:
> Successfully ensured local node is in standby mode
> 2014-12-05 12:29:56,943 INFO org.apache.hadoop.ha.ActiveStandbyElector:
> Yielding from election
> 2014-12-05 12:29:56,943 INFO org.apache.hadoop.ha.ActiveStandbyElector:
> Deleting bread-crumb of active node…
>
> {log}
>
>
> the name node log is a bit large, so attach it.
>
>
> On Dec 5, 2014, at 23:01, Ted Yu <yu...@gmail.com> wrote:
>
> The warning was logged by DFSOutputStream.
>
> What was the load on hdfs around 2014-12-05 12:03 ?
> Have you checked namenode log ?
>
> Cheers
>
> On Thu, Dec 4, 2014 at 9:01 PM, mail list <lo...@gmail.com> wrote:
>
> Hi ,all
>
> I deploy Hbase0.98.6-cdh5.2.0 on 3 machine:
>
> l-hbase1.dev.dba.cn0(hadoop namenode active, HMaster active)
> l-hbase2.dev.dba.cn0(hadoop namenode standby, HMaster standby, hadoop
> datanode)
> l-hbase3.dev.dba.cn0(regionserver, hadoop datanode)
>
> Then I shutdown the l-hbase1.dev.dba.cn0,  But HBase can not work until
> about 15mins later.
> I check the log and find the following log in the region server’s log:
>
> 2014-12-05 12:03:19,169 WARN  [regionserver60020-WAL.AsyncSyncer0]
> hdfs.DFSClient: Slow waitForAckedSeqno took 927762ms (threshold=30000ms)
> 2014-12-05 12:03:19,186 INFO  [regionserver60020-WAL.AsyncSyncer0]
> wal.FSHLog: Slow sync cost: 927779 ms, current pipeline: [
> 10.86.36.219:50010]
> 2014-12-05 12:03:19,186 DEBUG [regionserver60020.logRoller]
> regionserver.LogRoller: HLog roll requested
> 2014-12-05 12:03:19,236 WARN  [regionserver60020-WAL.AsyncSyncer1]
> hdfs.DFSClient: Slow waitForAckedSeqno took 867706ms (threshold=30000ms)
>
> It seems the WAL Asysnc took too long time for region server recovery? I
> don’t know if the log matters ?
> Can any body explain the reason? and how to reduce the time for recovery?
>
>
>
>
>
>

Re: Slow waitForAckedSeqno took too long time

Posted by "louis.hust" <lo...@gmail.com>.
Hi Ted,

Thanks for your reply.

Actually, I shut down l-hbase1.dev.dba.cn0 at 11:47:51 and HBase recovered at 12:03.

The log below is the ZKFC log on l-hbase2:

{log}
2014-12-05 11:18:47,077 INFO org.apache.hadoop.ha.ActiveStandbyElector: Already in election. Not re-connecting.
2014-12-05 11:48:16,023 INFO org.apache.hadoop.ha.ActiveStandbyElector: Checking for any old active which needs to be fenced...
2014-12-05 11:48:16,025 INFO org.apache.hadoop.ha.ActiveStandbyElector: Old node exists: 0a096d79636c757374657212036e6e311a146c2d6862617365312e6462612e6465762e636e3020d43e28d33e
2014-12-05 11:48:16,028 INFO org.apache.hadoop.ha.ZKFailoverController: Should fence: NameNode at l-hbase1.dba.dev.cn0/10.86.36.217:8020
2014-12-05 11:48:32,031 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: l-hbase1.dba.dev.cn0/10.86.36.217:8020. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1000 MILLISECONDS)
2014-12-05 11:48:47,033 WARN org.apache.hadoop.ha.FailoverController: Unable to gracefully make NameNode at l-hbase1.dba.dev.cn0/10.86.36.217:8020 standby (unable to connect)
java.net.ConnectException: Call From l-hbase2.dba.dev.cn0.qunar.com/10.86.36.218 to l-hbase1.dba.dev.cn0:8020 failed on connection exception: java.net.ConnectException: Connection timed out; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
	at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:783)
	at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:730)
	at org.apache.hadoop.ipc.Client.call(Client.java:1413)
	at org.apache.hadoop.ipc.Client.call(Client.java:1362)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
	at com.sun.proxy.$Proxy8.transitionToStandby(Unknown Source)
	at org.apache.hadoop.ha.protocolPB.HAServiceProtocolClientSideTranslatorPB.transitionToStandby(HAServiceProtocolClientSideTranslatorPB.java:112)
	at org.apache.hadoop.ha.FailoverController.tryGracefulFence(FailoverController.java:172)
	at org.apache.hadoop.ha.ZKFailoverController.doFence(ZKFailoverController.java:503)
	at org.apache.hadoop.ha.ZKFailoverController.fenceOldActive(ZKFailoverController.java:494)
	at org.apache.hadoop.ha.ZKFailoverController.access$1100(ZKFailoverController.java:59)
	at org.apache.hadoop.ha.ZKFailoverController$ElectorCallbacks.fenceOldActive(ZKFailoverController.java:837)
	at org.apache.hadoop.ha.ActiveStandbyElector.fenceOldActive(ActiveStandbyElector.java:901)
	at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:800)
	at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:415)
	at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599)
	at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
Caused by: java.net.ConnectException: Connection timed out
	at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
	at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:735)
	at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
	at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:529)
	at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:493)
	at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:604)
	at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:699)
	at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:367)
	at org.apache.hadoop.ipc.Client.getConnection(Client.java:1461)
	at org.apache.hadoop.ipc.Client.call(Client.java:1380)
	... 14 more
2014-12-05 11:48:47,034 INFO org.apache.hadoop.ha.NodeFencer: ====== Beginning Service Fencing Process... ======
2014-12-05 11:48:47,034 INFO org.apache.hadoop.ha.NodeFencer: Trying method 1/1: org.apache.hadoop.ha.ShellCommandFencer(/bin/true)
2014-12-05 11:48:47,040 INFO org.apache.hadoop.ha.ShellCommandFencer: Launched fencing command '/bin/true' with pid 24513
2014-12-05 11:48:47,044 INFO org.apache.hadoop.ha.NodeFencer: ====== Fencing successful by method org.apache.hadoop.ha.ShellCommandFencer(/bin/true) ======
2014-12-05 11:48:47,044 INFO org.apache.hadoop.ha.ActiveStandbyElector: Writing znode /hadoop-ha/mycluster/ActiveBreadCrumb to indicate that the local node is the most recent active...
2014-12-05 11:48:47,063 INFO org.apache.hadoop.ha.ZKFailoverController: Trying to make NameNode at l-hbase2.dba.dev.cn0/10.86.36.218:8020 active...
2014-12-05 11:48:47,284 INFO org.apache.hadoop.ha.ZKFailoverController: Successfully transitioned NameNode at l-hbase2.dba.dev.cn0/10.86.36.218:8020 to active state
2014-12-05 12:29:56,906 INFO org.apache.hadoop.hdfs.tools.DFSZKFailoverController: Allowed RPC access from hadoop (auth:SIMPLE) at 10.86.36.217
2014-12-05 12:29:56,907 INFO org.apache.hadoop.ha.ZKFailoverController: Requested by hadoop (auth:SIMPLE) at 10.86.36.217 to cede active role.
2014-12-05 12:29:56,943 INFO org.apache.hadoop.ha.ZKFailoverController: Successfully ensured local node is in standby mode
2014-12-05 12:29:56,943 INFO org.apache.hadoop.ha.ActiveStandbyElector: Yielding from election
2014-12-05 12:29:56,943 INFO org.apache.hadoop.ha.ActiveStandbyElector: Deleting bread-crumb of active node…

{log}


The namenode log is a bit large, so I have attached it.


On Dec 5, 2014, at 23:01, Ted Yu <yu...@gmail.com> wrote:

> The warning was logged by DFSOutputStream.
> 
> What was the load on hdfs around 2014-12-05 12:03 ?
> Have you checked namenode log ?
> 
> Cheers
> 
> On Thu, Dec 4, 2014 at 9:01 PM, mail list <lo...@gmail.com> wrote:
> 
>> Hi ,all
>> 
>> I deploy Hbase0.98.6-cdh5.2.0 on 3 machine:
>> 
>> l-hbase1.dev.dba.cn0(hadoop namenode active, HMaster active)
>> l-hbase2.dev.dba.cn0(hadoop namenode standby, HMaster standby, hadoop
>> datanode)
>> l-hbase3.dev.dba.cn0(regionserver, hadoop datanode)
>> 
>> Then I shutdown the l-hbase1.dev.dba.cn0,  But HBase can not work until
>> about 15mins later.
>> I check the log and find the following log in the region server’s log:
>> 
>> 2014-12-05 12:03:19,169 WARN  [regionserver60020-WAL.AsyncSyncer0]
>> hdfs.DFSClient: Slow waitForAckedSeqno took 927762ms (threshold=30000ms)
>> 2014-12-05 12:03:19,186 INFO  [regionserver60020-WAL.AsyncSyncer0]
>> wal.FSHLog: Slow sync cost: 927779 ms, current pipeline: [
>> 10.86.36.219:50010]
>> 2014-12-05 12:03:19,186 DEBUG [regionserver60020.logRoller]
>> regionserver.LogRoller: HLog roll requested
>> 2014-12-05 12:03:19,236 WARN  [regionserver60020-WAL.AsyncSyncer1]
>> hdfs.DFSClient: Slow waitForAckedSeqno took 867706ms (threshold=30000ms)
>> 
>> It seems the WAL Asysnc took too long time for region server recovery? I
>> don’t know if the log matters ?
>> Can any body explain the reason? and how to reduce the time for recovery?
>> 
>> 
>> 


Re: Slow waitForAckedSeqno took too long time

Posted by Ted Yu <yu...@gmail.com>.
The warning was logged by DFSOutputStream.

What was the load on HDFS around 2014-12-05 12:03?
Have you checked the namenode log?

Cheers
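
Also, as far as I can tell the "(threshold=30000ms)" in that DFSClient warning is only the reporting threshold, not a timeout: dfs.client.slow.io.warning.threshold.ms (default 30000 ms) controls when DFSOutputStream logs a slow waitForAckedSeqno, so changing it only changes the logging, not the 927-second stall itself. For reference (hdfs-site.xml on the client side, value shown is the default):

<property>
  <!-- DFSClient logs "Slow waitForAckedSeqno" / slow flush warnings when an
       operation exceeds this many milliseconds; it does not abort anything. -->
  <name>dfs.client.slow.io.warning.threshold.ms</name>
  <value>30000</value>
</property>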

On Thu, Dec 4, 2014 at 9:01 PM, mail list <lo...@gmail.com> wrote:

> Hi ,all
>
> I deploy Hbase0.98.6-cdh5.2.0 on 3 machine:
>
> l-hbase1.dev.dba.cn0(hadoop namenode active, HMaster active)
> l-hbase2.dev.dba.cn0(hadoop namenode standby, HMaster standby, hadoop
> datanode)
> l-hbase3.dev.dba.cn0(regionserver, hadoop datanode)
>
> Then I shutdown the l-hbase1.dev.dba.cn0,  But HBase can not work until
> about 15mins later.
> I check the log and find the following log in the region server’s log:
>
> 2014-12-05 12:03:19,169 WARN  [regionserver60020-WAL.AsyncSyncer0]
> hdfs.DFSClient: Slow waitForAckedSeqno took 927762ms (threshold=30000ms)
> 2014-12-05 12:03:19,186 INFO  [regionserver60020-WAL.AsyncSyncer0]
> wal.FSHLog: Slow sync cost: 927779 ms, current pipeline: [
> 10.86.36.219:50010]
> 2014-12-05 12:03:19,186 DEBUG [regionserver60020.logRoller]
> regionserver.LogRoller: HLog roll requested
> 2014-12-05 12:03:19,236 WARN  [regionserver60020-WAL.AsyncSyncer1]
> hdfs.DFSClient: Slow waitForAckedSeqno took 867706ms (threshold=30000ms)
>
> It seems the WAL Asysnc took too long time for region server recovery? I
> don’t know if the log matters ?
> Can any body explain the reason? and how to reduce the time for recovery?
>
>
>

Re: Slow waitForAckedSeqno took too long time

Posted by mail list <lo...@gmail.com>.
I also captured the RegionServer stack trace on the region server, shown below:

"RS_OPEN_META-l-hbase3:60020-0-WAL.AsyncNotifier" prio=10 tid=0x00007f7e7c259000 nid=0x3d1 in Object.wait() [0x00007f7e5eb90000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x00000000d59e4808> (a java.lang.Object)
        at java.lang.Object.wait(Object.java:503)
        at org.apache.hadoop.hbase.regionserver.wal.FSHLog$AsyncNotifier.run(FSHLog.java:1338)
        - locked <0x00000000d59e4808> (a java.lang.Object)
        at java.lang.Thread.run(Thread.java:744)

"RS_OPEN_META-l-hbase3:60020-0-WAL.AsyncSyncer4" prio=10 tid=0x00007f7e7c257000 nid=0x3d0 in Object.wait() [0x00007f7e5ec91000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x00000000d59e4498> (a java.lang.Object)
        at java.lang.Object.wait(Object.java:503)
        at org.apache.hadoop.hbase.regionserver.wal.FSHLog$AsyncSyncer.run(FSHLog.java:1209)
        - locked <0x00000000d59e4498> (a java.lang.Object)
        at java.lang.Thread.run(Thread.java:744)

"RS_OPEN_META-l-hbase3:60020-0-WAL.AsyncSyncer3" prio=10 tid=0x00007f7e7c255000 nid=0x3cf in Object.wait() [0x00007f7e5ed92000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x00000000d5990570> (a java.lang.Object)
        at java.lang.Object.wait(Object.java:503)
        at org.apache.hadoop.hbase.regionserver.wal.FSHLog$AsyncSyncer.run(FSHLog.java:1209)
        - locked <0x00000000d5990570> (a java.lang.Object)
        at java.lang.Thread.run(Thread.java:744)


On Dec 5, 2014, at 13:01, mail list <lo...@gmail.com> wrote:

> Hi ,all
> 
> I deploy Hbase0.98.6-cdh5.2.0 on 3 machine:
> 
> l-hbase1.dev.dba.cn0(hadoop namenode active, HMaster active)
> l-hbase2.dev.dba.cn0(hadoop namenode standby, HMaster standby, hadoop datanode)
> l-hbase3.dev.dba.cn0(regionserver, hadoop datanode)
> 
> Then I shutdown the l-hbase1.dev.dba.cn0,  But HBase can not work until about 15mins later.
> I check the log and find the following log in the region server’s log:
> 
> 2014-12-05 12:03:19,169 WARN  [regionserver60020-WAL.AsyncSyncer0] hdfs.DFSClient: Slow waitForAckedSeqno took 927762ms (threshold=30000ms)
> 2014-12-05 12:03:19,186 INFO  [regionserver60020-WAL.AsyncSyncer0] wal.FSHLog: Slow sync cost: 927779 ms, current pipeline: [10.86.36.219:50010]
> 2014-12-05 12:03:19,186 DEBUG [regionserver60020.logRoller] regionserver.LogRoller: HLog roll requested
> 2014-12-05 12:03:19,236 WARN  [regionserver60020-WAL.AsyncSyncer1] hdfs.DFSClient: Slow waitForAckedSeqno took 867706ms (threshold=30000ms)
> 
> It seems the WAL Asysnc took too long time for region server recovery? I don’t know if the log matters ?
> Can any body explain the reason? and how to reduce the time for recovery?
> 
>