Posted to user@hbase.apache.org by Hayden Marchant <ha...@amobee.com> on 2014/10/23 07:57:49 UTC
Orphaned aborted snapshot

Hi all,


I am running HBase 0.94.6 on a 20-node cluster, and am taking daily snapshots of our single table (keeping only the snapshots from the last 3 days). Yesterday, I started seeing the following messages in the log of one of the region servers that had to be restarted:


2014-10-22 08:29:19,982 INFO org.apache.hadoop.ipc.HBaseServer: REPL IPC Server handler 1 on 60020: starting
2014-10-22 08:29:19,982 INFO org.apache.hadoop.ipc.HBaseServer: REPL IPC Server handler 2 on 60020: starting
2014-10-22 08:29:19,986 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Serving as njhdslave40,60020,1413980958234, RPC listening on njhdslave40/172.30.120.180:60020, sessionid=0x2482ca09984b22a
2014-10-22 08:29:19,986 INFO org.apache.hadoop.hbase.regionserver.SplitLogWorker: SplitLogWorker njhdslave40,60020,1413980958234 starting
2014-10-22 08:29:19,988 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Registered RegionServer MXBean
2014-10-22 08:29:20,024 INFO org.apache.hadoop.hbase.procedure.ProcedureMember: Received abort on procedure with no local subprocedure upd-2014_10_19, ignoring it.
org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable via timer-java.util.Timer@2cef133c:org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable: org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed! Source:Timeout caused Foreign Exception Start:1413728408736, End:1413728468758, diff:60022, max:60000 ms
    at org.apache.hadoop.hbase.errorhandling.ForeignException.deserialize(ForeignException.java:171)
    at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.abort(ZKProcedureMemberRpcs.java:320)
    at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.watchForAbortedProcedures(ZKProcedureMemberRpcs.java:143)
    at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.start(ZKProcedureMemberRpcs.java:340)
    at org.apache.hadoop.hbase.regionserver.snapshot.RegionServerSnapshotManager.start(RegionServerSnapshotManager.java:141)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:734)
    at java.lang.Thread.run(Thread.java:662)
Caused by: org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable: org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed! Source:Timeout caused Foreign Exception Start:1413728408736, End:1413728468758, diff:60022, max:60000 ms
    at org.apache.hadoop.hbase.errorhandling.TimeoutExceptionInjector$1.run(TimeoutExceptionInjector.java:71)
    at java.util.TimerThread.mainLoop(Timer.java:512)
    at java.util.TimerThread.run(Timer.java:462)
2014-10-22 08:29:29,526 WARN org.apache.hadoop.conf.Configuration: hadoop.native.lib is deprecated. Instead, use io.native.lib.available




After looking at the code, I see that the RegionServerSnapshotManager watches for aborted procedure nodes on startup, and reports the exception above when it finds one. A few days ago, we had some issues with one of the servers, and I guess the creation of the daily snapshot was aborted. Indeed, looking in ZooKeeper, we see a record of an aborted snapshot from 19 October.
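
As a sanity check on the date, the Start and End values in the TimeoutException look like epoch milliseconds. The Python sketch below decodes them under that assumption; decode_timeout is a helper name made up for this illustration, not anything from HBase.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def decode_timeout(start_ms, end_ms, max_ms):
    """Decode the Start/End/max fields of the logged TimeoutException.

    Assumes Start and End are milliseconds since the Unix epoch (UTC).
    """
    start = EPOCH + timedelta(milliseconds=start_ms)
    end = EPOCH + timedelta(milliseconds=end_ms)
    elapsed_ms = end_ms - start_ms
    return {
        "start": start,
        "end": end,
        "elapsed_ms": elapsed_ms,
        # How far past the configured limit the procedure ran.
        "overrun_ms": elapsed_ms - max_ms,
    }

# Values copied from the log line above.
info = decode_timeout(1413728408736, 1413728468758, 60000)
print(info["start"].isoformat(), info["elapsed_ms"], info["overrun_ms"])
```

With the values from the log, the start time falls on 2014-10-19 (UTC) and the elapsed time is 60022 ms, i.e. 22 ms over the 60000 ms limit, which matches the upd-2014_10_19 snapshot name.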



Here is a dump of the relevant ZooKeeper node:

[zk: slave:2181 (CONNECTED) 2] ls /hbase/online-snapshot/abort
[upd-2014_10_19]

Just to confirm, I restarted another region server, and saw the same error. It seems that the cluster is working correctly, and new snapshots are being created. 


My questions are: are these error messages expected, which process is responsible for automatically cleaning up the 'abort' node, and are there any orphaned HLogs from the aborted snapshot that need to be cleaned up manually?

Thanks,
Hayden