Posted to user@hbase.apache.org by Pere Kyle <pe...@whisper.sh> on 2014/11/07 04:28:45 UTC

Re: Hbase Unusable after auto split to 1024 regions SOLVED

First of all I would like to thank you all for your help on this issue, especially Ted, Bryan, and Nick. 

It turned out the issue was the Ganglia monitoring daemon (gmond).

When the regions split, more HBase metrics were sent to the Ganglia daemon. This saturated the I/O on the single ephemeral store on each instance (/mnt), confirmed via iotop and atop. That load appeared to make ZooKeeper time out, since it requires a commit to disk, which in turn slowed all RPC actions to the master. The issue was resolved by removing the HBase and Thrift metrics from Ganglia and restarting the region servers. So it seems, yet again, that it is a bad idea to have only one drive per machine; I will eventually migrate these instances to I2 instances.
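
For anyone hitting the same problem, the change amounted to pointing the hbase metrics context away from Ganglia in conf/hadoop-metrics.properties on each region server and restarting it. Roughly like the sketch below; this is not our exact config, and the gmond host/port are placeholders:

  # conf/hadoop-metrics.properties (0.94)
  # Publish HBase metrics to the null context instead of gmond:
  hbase.class=org.apache.hadoop.metrics.spi.NullContext
  # Previously this pointed at the Ganglia context, e.g.:
  # hbase.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
  # hbase.period=10
  # hbase.servers=GMOND_HOST:8649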

Regards,
Pere

On Nov 6, 2014, at 4:20 PM, Pere Kyle <pe...@whisper.sh> wrote:

> So I have another symptom which is quite odd. When trying to take a snapshot of the table with no writes coming in (I stopped Thrift), it continually times out when trying to flush (I don't believe I have the option of a non-flush snapshot in 0.94). Every single time I get:
> 
> ERROR: org.apache.hadoop.hbase.snapshot.HBaseSnapshotException: org.apache.hadoop.hbase.snapshot.HBaseSnapshotException: Snapshot { ss=backup_weaver table=weaver_events type=FLUSH } had an error.  Procedure backup_weaver { waiting=[ip-10-227-42-142.us-west-2.compute.internal,60020,1415302661297, ip-10-227-42-252.us-west-2.compute.internal,60020,1415304752318, ip-10-231-21-106.us-west-2.compute.internal,60020,1415306503049, ip-10-230-130-102.us-west-2.compute.internal,60020,1415296951057, ip-10-231-138-119.us-west-2.compute.internal,60020,1415303920176, ip-10-224-53-183.us-west-2.compute.internal,60020,1415311138483, ip-10-250-1-140.us-west-2.compute.internal,60020,1415311984665, ip-10-227-40-150.us-west-2.compute.internal,60020,1415313275623, ip-10-231-139-198.us-west-2.compute.internal,60020,1415295324957, ip-10-250-77-76.us-west-2.compute.internal,60020,1415297345932, ip-10-248-42-35.us-west-2.compute.internal,60020,1415312717768, ip-10-227-45-74.us-west-2.compute.internal,60020,1415296135484, ip-10-227-43-49.us-west-2.compute.internal,60020,1415303176867, ip-10-230-130-121.us-west-2.compute.internal,60020,1415294726028, ip-10-224-49-168.us-west-2.compute.internal,60020,1415312488614, ip-10-227-0-82.us-west-2.compute.internal,60020,1415301974178, ip-10-224-0-167.us-west-2.compute.internal,60020,1415309549108] done=[] }
> 	at org.apache.hadoop.hbase.master.snapshot.SnapshotManager.isSnapshotDone(SnapshotManager.java:362)
> 	at org.apache.hadoop.hbase.master.HMaster.isSnapshotDone(HMaster.java:2313)
> 	at sun.reflect.GeneratedMethodAccessor19.invoke(Unknown Source)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 	at java.lang.reflect.Method.invoke(Method.java:606)
> 	at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Server.call(WritableRpcEngine.java:354)
> 	at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1434)
> Caused by: org.apache.hadoop.hbase.errorhandling.TimeoutException via timer-java.util.Timer@239e8159:org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed! Source:Timeout caused Foreign Exception Start:1415319201016, End:1415319261016, diff:60000, max:60000 ms
> 	at org.apache.hadoop.hbase.errorhandling.ForeignExceptionDispatcher.rethrowException(ForeignExceptionDispatcher.java:85)
> 	at org.apache.hadoop.hbase.master.snapshot.TakeSnapshotHandler.rethrowExceptionIfFailed(TakeSnapshotHandler.java:285)
> 	at org.apache.hadoop.hbase.master.snapshot.SnapshotManager.isSnapshotDone(SnapshotManager.java:352)
> 	... 6 more
> Caused by: org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed! Source:Timeout caused Foreign Exception Start:1415319201016, End:1415319261016, diff:60000, max:60000 ms
> 	at org.apache.hadoop.hbase.errorhandling.TimeoutExceptionInjector$1.run(TimeoutExceptionInjector.java:68)
> 	at java.util.TimerThread.mainLoop(Timer.java:555)
> 	at java.util.TimerThread.run(Timer.java:505)
> 
> 
> I do not have a single write coming in, so how in the world could these tables not be flushed? I could understand an error the first time or two, but how could it still not be flushed after several attempts? Now I can't even get the data off the node to a new cluster. Any help would be greatly appreciated.
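> 
> One thing I may try: the diff/max of 60000 ms in that trace looks like the default master-side snapshot timeout, so raising it in hbase-site.xml on the master and retrying from the shell might give the flush enough time. The property name below is my best guess for 0.94, so treat it as an assumption:
> 
>   <!-- hbase-site.xml on the master: allow the flush snapshot more than the default 60s -->
>   <property>
>     <name>hbase.snapshot.master.timeout.millis</name>
>     <value>300000</value>
>   </property>
> 
> and then, from the hbase shell:
> 
>   snapshot 'weaver_events', 'backup_weaver'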
> 
> Thanks,
> -Pere
> 
> 
> 
> On Nov 6, 2014, at 2:09 PM, Nick Dimiduk <nd...@gmail.com> wrote:
> 
>> One other thought: you might try tracing your requests to see where the
>> slowness happens. Recent versions of PerformanceEvaluation support this
>> feature and can be used directly or as an example for adding tracing to
>> your application.
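>> 
>> For example, something along these lines (the traceRate flag is from memory, so
>> check the usage output of your PerformanceEvaluation version):
>> 
>>   hbase org.apache.hadoop.hbase.PerformanceEvaluation --traceRate=100 --nomapred randomRead 1
>> 
>> which starts an HTrace span every 100 rows during a single-client randomRead run.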
>> 
>> On Thursday, November 6, 2014, Pere Kyle <pe...@whisper.sh> wrote:
>> 
>>> Bryan,
>>> 
>>> Thanks again for the incredibly useful reply.
>>> 
>>> I have confirmed that the callQueueLen is in fact 0, with a max value of 2 over the last week (per Ganglia).
>>> 
>>> hbase.hstore.compaction.max was set to 15 on the nodes, up from the previous 7.
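>>> 
>>> For reference, that is set in hbase-site.xml on the region servers, i.e. something like:
>>> 
>>>   <property>
>>>     <name>hbase.hstore.compaction.max</name>
>>>     <value>15</value>
>>>   </property>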
>>> 
>>> Freezes (laggy responses) on the cluster are frequent and affect both reads and writes. I noticed iowait spikes on the nodes.
>>> 
>>> The cluster swings between working 100% and serving nothing/timing out, for no discernible reason.
>>> 
>>> Looking through the logs, I see tons of responseTooSlow warnings; this is the only regular occurrence in the logs:
>>> hbase-hadoop-regionserver-ip-10-230-130-121.us-west-2.compute.internal.log:2014-11-06 03:54:31,640 WARN org.apache.hadoop.ipc.HBaseServer (IPC Server handler 39 on 60020): (responseTooSlow): {"processingtimems":14573,"call":"multi(org.apache.hadoop.hbase.client.MultiAction@c67b2ac), rpc version=1, client version=29, methodsFingerPrint=-540141542","client":"10.231.139.198:57223","starttimems":1415246057066,"queuetimems":20640,"class":"HRegionServer","responsesize":0,"method":"multi"}
>>> hbase-hadoop-regionserver-ip-10-230-130-121.us-west-2.compute.internal.log:2014-11-06 03:54:31,640 WARN org.apache.hadoop.ipc.HBaseServer (IPC Server handler 42 on 60020): (responseTooSlow): {"processingtimems":45660,"call":"multi(org.apache.hadoop.hbase.client.MultiAction@6c034090), rpc version=1, client version=29, methodsFingerPrint=-540141542","client":"10.231.21.106:41126","starttimems":1415246025979,"queuetimems":202,"class":"HRegionServer","responsesize":0,"method":"multi"}
>>> hbase-hadoop-regionserver-ip-10-230-130-121.us-west-2.compute.internal.log:2014-11-06 03:54:31,642 WARN org.apache.hadoop.ipc.HBaseServer (IPC Server handler 46 on 60020): (responseTooSlow): {"processingtimems":14620,"call":"multi(org.apache.hadoop.hbase.client.MultiAction@4fc3bb1f), rpc version=1, client version=29, methodsFingerPrint=-540141542","client":"10.230.130.102:54068","starttimems":1415246057021,"queuetimems":27565,"class":"HRegionServer","responsesize":0,"method":"multi"}
>>> hbase-hadoop-regionserver-ip-10-230-130-121.us-west-2.compute.internal.log:2014-11-06 03:54:31,642 WARN org.apache.hadoop.ipc.HBaseServer (IPC Server handler 35 on 60020): (responseTooSlow): {"processingtimems":13431,"call":"multi(org.apache.hadoop.hbase.client.MultiAction@3b321922), rpc version=1, client version=29, methodsFingerPrint=-540141542","client":"10.227.42.252:60493","starttimems":1415246058210,"queuetimems":1134,"class":"HRegionServer","responsesize":0,"method":"multi"}
>>> On Nov 6, 2014, at 12:38 PM, Bryan Beaudreault <bbeaudreault@hubspot.com> wrote:
>>> 
>>>> blockingStoreFiles
>>> 
>>> 
>