You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@hbase.apache.org by Ian Brooks <i....@sensewhere.com> on 2014/09/23 11:28:08 UTC

Snapshot failure

Hi,

I'm seeing an issue on our hbase cluster which is preventing snapshots from working. So far the only way i can get it working again is to restart all the regionservers which is not ideal.

I'm running hbase 0.98.3 on hadoop 2.4.0, the stacktraces are

>From host given the snapshot command via hbase shell 

2014-09-23 10:00:16,116 INFO  [regionserver16020-EventThread] procedure.ZKProcedureMemberRpcs: Received procedure start children changed event: /hbase/online-snapshot/acquired
2014-09-23 10:00:16,116 DEBUG [regionserver16020-EventThread] procedure.ZKProcedureMemberRpcs: Looking for new procedures under znode:'/hbase/online-snapshot/acquired'
2014-09-23 10:00:16,117 DEBUG [regionserver16020-EventThread] procedure.ZKProcedureMemberRpcs: Found procedure znode: /hbase/online-snapshot/acquired/offlineScans-20140923-100009
2014-09-23 10:00:16,117 DEBUG [regionserver16020-EventThread] procedure.ZKProcedureMemberRpcs: start proc data length is 59
2014-09-23 10:00:16,117 DEBUG [regionserver16020-EventThread] procedure.ZKProcedureMemberRpcs: Found data for znode:/hbase/online-snapshot/acquired/offlineScans-20140923-100009
2014-09-23 10:00:16,118 DEBUG [regionserver16020-EventThread] snapshot.RegionServerSnapshotManager: Launching subprocedure for snapshot offlineScans-20140923-100009 from table offlineScans
2014-09-23 10:00:16,118 DEBUG [regionserver16020-EventThread] procedure.ProcedureMember: Submitting new Subprocedure:offlineScans-20140923-100009
2014-09-23 10:00:16,120 DEBUG [member: '############,16020,1410126483043' subprocedure-pool1-thread-314] procedure.Subprocedure: Starting subprocedure 'offlineScans-20140923-100009' with timeout 60000ms
2014-09-23 10:00:16,120 DEBUG [member: '############,16020,1410126483043' subprocedure-pool1-thread-314] errorhandling.TimeoutExceptionInjector: Scheduling process timer to run in: 60000 ms
2014-09-23 10:00:16,121 DEBUG [member: '############,16020,1410126483043' subprocedure-pool1-thread-314] procedure.Subprocedure: Subprocedure 'offlineScans-20140923-100009' starting 'acquire' stage
2014-09-23 10:00:16,121 DEBUG [member: '############,16020,1410126483043' subprocedure-pool1-thread-314] procedure.Subprocedure: Subprocedure 'offlineScans-20140923-100009' locally acquired
2014-09-23 10:00:16,121 DEBUG [member: '############,16020,1410126483043' subprocedure-pool1-thread-314] procedure.ZKProcedureMemberRpcs: Member: '##########,16020,1410126483043' joining acquired barrier for procedure (offlineScans-20140923-100009) in zk
2014-09-23 10:00:16,123 DEBUG [member: '############,16020,1410126483043' subprocedure-pool1-thread-314] procedure.ZKProcedureMemberRpcs: Watch for global barrier reached:/hbase/online-snapshot/reached/offlineScans-20140923-100009
2014-09-23 10:00:16,124 DEBUG [member: '############,16020,1410126483043' subprocedure-pool1-thread-314] procedure.Subprocedure: Subprocedure 'offlineScans-20140923-100009' coordinator notified of 'acquire', waiting on 'reached' or 'abort' from coordinator
2014-09-23 10:00:16,133 INFO  [regionserver16020-EventThread] procedure.ZKProcedureMemberRpcs: Received created event:/hbase/online-snapshot/abort/offlineScans-20140923-100009
2014-09-23 10:00:16,134 DEBUG [regionserver16020-EventThread] procedure.ZKProcedureMemberRpcs: Aborting procedure member for znode /hbase/online-snapshot/abort/offlineScans-20140923-100009
2014-09-23 10:00:16,134 DEBUG [regionserver16020-EventThread] procedure.ProcedureMember: Request received to abort procedure offlineScans-20140923-100009
org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable via ###############,16020,1408633802567:org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable: 
        at org.apache.hadoop.hbase.errorhandling.ForeignException.deserialize(ForeignException.java:171)
        at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.abort(ZKProcedureMemberRpcs.java:320)
        at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs$1.nodeCreated(ZKProcedureMemberRpcs.java:94)
        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:327)
        at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
        at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
Caused by: org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable: 
        at org.apache.hadoop.hbase.procedure.Subprocedure.cancel(Subprocedure.java:270)
        at org.apache.hadoop.hbase.procedure.ProcedureMember.submitSubprocedure(ProcedureMember.java:171)
        at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.startNewSubprocedure(ZKProcedureMemberRpcs.java:214)
        at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.waitForNewProcedures(ZKProcedureMemberRpcs.java:172)
        at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.access$100(ZKProcedureMemberRpcs.java:55)
        at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs$1.nodeChildrenChanged(ZKProcedureMemberRpcs.java:107)
        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:348)
        ... 2 more



>From the regionserver it was trying to contact


2014-09-23 10:00:16,124 INFO  [regionserver16020-EventThread] procedure.ZKProcedureMemberRpcs: Received procedure start children changed event: /hbase/online-snapshot/acquired
2014-09-23 10:00:16,124 DEBUG [regionserver16020-EventThread] procedure.ZKProcedureMemberRpcs: Looking for new procedures under znode:'/hbase/online-snapshot/acquired'
2014-09-23 10:00:16,125 DEBUG [regionserver16020-EventThread] procedure.ZKProcedureMemberRpcs: Found procedure znode: /hbase/online-snapshot/acquired/offlineScans-20140923-100009
2014-09-23 10:00:16,127 DEBUG [regionserver16020-EventThread] procedure.ZKProcedureMemberRpcs: start proc data length is 59
2014-09-23 10:00:16,127 DEBUG [regionserver16020-EventThread] procedure.ZKProcedureMemberRpcs: Found data for znode:/hbase/online-snapshot/acquired/offlineScans-20140923-100009
2014-09-23 10:00:16,127 DEBUG [regionserver16020-EventThread] snapshot.RegionServerSnapshotManager: Launching subprocedure for snapshot offlineScans-20140923-100009 from table offlineScans
2014-09-23 10:00:16,127 DEBUG [regionserver16020-EventThread] procedure.ProcedureMember: Submitting new Subprocedure:offlineScans-20140923-100009
2014-09-23 10:00:16,128 ERROR [regionserver16020-EventThread] procedure.Subprocedure: Subprocedure pool is full!
2014-09-23 10:00:16,128 DEBUG [regionserver16020-EventThread] errorhandling.ForeignExceptionDispatcher: offlineScans-20140923-100009 accepting received exception
2014-09-23 10:00:16,128 DEBUG [regionserver16020-EventThread] procedure.ZKProcedureMemberRpcs: Aborting procedure (offlineScans-20140923-100009) in zk
2014-09-23 10:00:16,134 DEBUG [regionserver16020-EventThread] procedure.ZKProcedureMemberRpcs: Finished creating abort znode:/hbase/online-snapshot/abort/offlineScans-20140923-100009
2014-09-23 10:00:16,134 ERROR [regionserver16020-EventThread] procedure.ProcedureMember: Failed to start subprocedure 'offlineScans-20140923-100009'
2014-09-23 10:00:16,134 INFO  [regionserver16020-EventThread] procedure.ZKProcedureMemberRpcs: Received created event:/hbase/online-snapshot/abort/offlineScans-20140923-100009
2014-09-23 10:00:16,134 DEBUG [regionserver16020-EventThread] procedure.ZKProcedureMemberRpcs: Aborting procedure member for znode /hbase/online-snapshot/abort/offlineScans-20140923-100009
2014-09-23 10:00:16,145 DEBUG [regionserver16020-EventThread] procedure.ProcedureMember: Request received to abort procedure offlineScans-20140923-100009
org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable via ###########,16020,1408633802567:org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable:
        at org.apache.hadoop.hbase.errorhandling.ForeignException.deserialize(ForeignException.java:171)
        at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.abort(ZKProcedureMemberRpcs.java:320)
        at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs$1.nodeCreated(ZKProcedureMemberRpcs.java:94)
        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:327)
        at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
        at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
Caused by: org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable:
        at org.apache.hadoop.hbase.procedure.Subprocedure.cancel(Subprocedure.java:270)
        at org.apache.hadoop.hbase.procedure.ProcedureMember.submitSubprocedure(ProcedureMember.java:171)
        at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.startNewSubprocedure(ZKProcedureMemberRpcs.java:214)
        at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.waitForNewProcedures(ZKProcedureMemberRpcs.java:172)
        at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.access$100(ZKProcedureMemberRpcs.java:55)
        at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs$1.nodeChildrenChanged(ZKProcedureMemberRpcs.java:107)
        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:348)
        ... 2 more

Any ideas on how I can resolve this?

-Ian


Re: Snapshot failure

Posted by Ian Brooks <i....@sensewhere.com>.
Hi,

I applied the suggested timeout changes and it lasted longer before it started failing, but its now just throwing the same errors on "procedure.Subprocedure: Subprocedure pool is full!"

Looking through the logs, it looks like the nodes may not correctly updating they completion state after finishing a snapshot, If i go through and restart the offending region servers it will work again for a while, then stop again. As far as I can tell, no snapshot is actually taking anywhere near the set value of 10 minutes to run.

eg.

2014-09-25 21:03:15,713 INFO  [regionserver16020-EventThread] procedure.ZKProcedureMemberRpcs: Received procedure start children changed event: /hbase/online-snapshot/acquired
2014-09-25 21:03:15,713 DEBUG [regionserver16020-EventThread] procedure.ZKProcedureMemberRpcs: Looking for new procedures under znode:'/hbase/online-snapshot/acquired'
2014-09-25 21:03:15,715 DEBUG [regionserver16020-EventThread] procedure.ZKProcedureMemberRpcs: Found procedure znode: /hbase/online-snapshot/acquired/requestData-20140925-210256
2014-09-25 21:03:15,718 DEBUG [regionserver16020-EventThread] procedure.ZKProcedureMemberRpcs: start proc data length is 57
2014-09-25 21:03:15,718 DEBUG [regionserver16020-EventThread] procedure.ZKProcedureMemberRpcs: Found data for znode:/hbase/online-snapshot/acquired/requestData-20140925-210256
2014-09-25 21:03:15,718 DEBUG [regionserver16020-EventThread] snapshot.RegionServerSnapshotManager: Launching subprocedure for snapshot requestData-20140925-210256 from table requestData
2014-09-25 21:03:15,718 DEBUG [regionserver16020-EventThread] procedure.ProcedureMember: Submitting new Subprocedure:requestData-20140925-210256
2014-09-25 21:03:15,719 DEBUG [member: '###########-005,16020,1411640680181' subprocedure-pool1-thread-1] procedure.Subprocedure: Starting subprocedure 'requestData-20140925-210256' with timeout 600000ms
2014-09-25 21:03:15,719 DEBUG [member: '###########-005,16020,1411640680181' subprocedure-pool1-thread-1] errorhandling.TimeoutExceptionInjector: Scheduling process timer to run in: 600000 ms
2014-09-25 21:03:15,720 DEBUG [member: '###########-005,16020,1411640680181' subprocedure-pool1-thread-1] procedure.Subprocedure: Subprocedure 'requestData-20140925-210256' starting 'acquire' stage
2014-09-25 21:03:15,720 DEBUG [member: '###########-005,16020,1411640680181' subprocedure-pool1-thread-1] procedure.Subprocedure: Subprocedure 'requestData-20140925-210256' locally acquired
2014-09-25 21:03:15,720 DEBUG [member: '###########-005,16020,1411640680181' subprocedure-pool1-thread-1] procedure.ZKProcedureMemberRpcs: Member: '###########-005,16020,1411640680181' joining acquired barrier fo
r procedure (requestData-20140925-210256) in zk
2014-09-25 21:03:15,724 DEBUG [member: '###########-005,16020,1411640680181' subprocedure-pool1-thread-1] procedure.ZKProcedureMemberRpcs: Watch for global barrier reached:/hbase/online-snapshot/reached/request
Data-20140925-210256
2014-09-25 21:03:15,739 DEBUG [member: '###########-005,16020,1411640680181' subprocedure-pool1-thread-1] procedure.Subprocedure: Subprocedure 'requestData-20140925-210256' coordinator notified of 'acquire', wa
iting on 'reached' or 'abort' from coordinator
2014-09-25 21:03:15,825 INFO  [regionserver16020-EventThread] procedure.ZKProcedureMemberRpcs: Received created event:/hbase/online-snapshot/reached/requestData-20140925-210256
2014-09-25 21:03:15,825 DEBUG [regionserver16020-EventThread] procedure.ZKProcedureMemberRpcs: Recieved reached global barrier:/hbase/online-snapshot/reached/requestData-20140925-210256
2014-09-25 21:03:15,825 DEBUG [member: '###########-005,16020,1411640680181' subprocedure-pool1-thread-1] procedure.Subprocedure: Subprocedure 'requestData-20140925-210256' received 'reached' from coordinator.

...

2014-09-25 21:03:38,026 DEBUG [member: '###########-005,16020,1411640680181' subprocedure-pool1-thread-1] snapshot.RegionServerSnapshotManager: Completed 5 local region snapshots.
2014-09-25 21:03:38,026 DEBUG [member: '###########-005,16020,1411640680181' subprocedure-pool1-thread-1] snapshot.RegionServerSnapshotManager: cancelling 0 tasks for snapshot ###########-005,16020,1411640680181
2014-09-25 21:03:38,026 DEBUG [member: '###########-005,16020,1411640680181' subprocedure-pool1-thread-1] procedure.Subprocedure: Subprocedure 'requestData-20140925-210256' locally completed
2014-09-25 21:03:38,026 DEBUG [member: '###########-005,16020,1411640680181' subprocedure-pool1-thread-1] procedure.ZKProcedureMemberRpcs: Marking procedure  'requestData-20140925-210256' completed for member '
###########-005,16020,1411640680181' in zk
2014-09-25 21:03:38,028 DEBUG [member: '###########-005,16020,1411640680181' subprocedure-pool1-thread-1] procedure.Subprocedure: Subprocedure 'requestData-20140925-210256' has notified controller of completion
2014-09-25 21:03:38,028 DEBUG [member: '###########-005,16020,1411640680181' subprocedure-pool1-thread-1] errorhandling.TimeoutExceptionInjector: Marking timer as complete - no error notifications will be recei
ved for this timer.
2014-09-25 21:03:38,028 DEBUG [member: '###########-005,16020,1411640680181' subprocedure-pool1-thread-1] procedure.Subprocedure: Subprocedure 'requestData-20140925-210256' completed.
2


Which suggests that "requestData-20140925-210256" completed happily, however, 10 minutes later the following comes up in the logs


2014-09-25 21:13:15,777 INFO  [regionserver16020-EventThread] procedure.ZKProcedureMemberRpcs: Received created event:/hbase/online-snapshot/abort/requestData-20140925-210256
2014-09-25 21:13:15,777 DEBUG [regionserver16020-EventThread] procedure.ZKProcedureMemberRpcs: Aborting procedure member for znode /hbase/online-snapshot/abort/requestData-20140925-210256
2014-09-25 21:13:15,778 DEBUG [regionserver16020-EventThread] procedure.ProcedureMember: Request received to abort procedure requestData-20140925-210256
org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable via timer-java.util.Timer@5cb99ea6:org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable: org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed! Source:Timeout caused Foreign Exception Start:1411675395727, End:1411675995727, diff:600000, max:600000 ms
        at org.apache.hadoop.hbase.errorhandling.ForeignException.deserialize(ForeignException.java:171)
        at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.abort(ZKProcedureMemberRpcs.java:320)
        at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs$1.nodeCreated(ZKProcedureMemberRpcs.java:94)
        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:327)
        at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
        at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
Caused by: org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable: org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed! Source:Timeout caused Foreign Exception Start:1411675395727, End:1411675995727, diff:600000, max:600000 ms
        at org.apache.hadoop.hbase.errorhandling.TimeoutExceptionInjector$1.run(TimeoutExceptionInjector.java:70)
        at java.util.TimerThread.mainLoop(Timer.java:555)
        at java.util.TimerThread.run(Timer.java:505)
2014-09-25 21:13:15,780 INFO  [regionserver16020-EventThread] procedure.ProcedureMember: Received abort on procedure with no local subprocedure requestData-20140925-210256, ignoring it.
org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable via timer-java.util.Timer@5cb99ea6:org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable: org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed! Source:Timeout caused Foreign Exception Start:1411675395727, End:1411675995727, diff:600000, max:600000 ms
        at org.apache.hadoop.hbase.errorhandling.ForeignException.deserialize(ForeignException.java:171)
        at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.abort(ZKProcedureMemberRpcs.java:320)
        at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs$1.nodeCreated(ZKProcedureMemberRpcs.java:94)
        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:327)
        at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
        at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
Caused by: org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable: org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed! Source:Timeout caused Foreign Exception Start:1411675395727, End:1411675995727, diff:600000, max:600000 ms
        at org.apache.hadoop.hbase.errorhandling.TimeoutExceptionInjector$1.run(TimeoutExceptionInjector.java:70)
        at java.util.TimerThread.mainLoop(Timer.java:555)
        at java.util.TimerThread.run(Timer.java:505)





-Ian Brooks


On Tuesday 23 September 2014 06:52:34 Ted Yu wrote:
> Here're the config parameters related to controlling snapshot timeout:
> 
>   <property>
>     <name>hbase.snapshot.master.timeoutMillis</name>
>     <!-- Change from default of 60s to 600s to allow for slow flushing of
> tables -->
>     <value>600000</value>
>     <description>
>       This is the time HBase master waits for the snapshot operation to
> complete.
>       Do not confuse this hbase.snapshot.master.timeout.millis, which
> although
>       sounding similar, serves a very different purpose.
>       Note: This property has a completely different meaning before hbase
> version
>       0.94.11 and should not enabled on a cluster using snapshots and
> running
>       a version before 0.94.11.
>     </description>
>   </property>
>   <property>
>     <name>hbase.snapshot.master.timeout.millis</name>
>     <!-- Change from default of 60s to 600s to allow for slow flushing of
> tables -->
>     <value>600000</value>
>     <description>
>       This is the timeout the master indicates the client to wait when it
> takes
>       the snapshot. The client actually waits longer than this due to
> exponential
>       backoff. See HBaseAdmin.snapshot for the exact algorithm.
>     </description>
>   </property>
>   <property>
>     <name>hbase.snapshot.region.timeout</name>
>     <!-- Change from default of 60s to 600s to allow for slow flushing of
> tables -->
>     <value>600000</value>
>     <description>
>       This is the time the regionserver waits to complete all of its
> activities
>       for a snapshot operation.
>     </description>
>   </property>

Re: Snapshot failure

Posted by Ian Brooks <i....@sensewhere.com>.
Hi,

I got a change finally to apply the timeout changes and it lasted longer before it started failing, but its now just throwing the same errors on "procedure.Subprocedure: Subprocedure pool is full!"

Looking through the logs, it looks like the nodes may not correctly updating they completion state after finishing a snapshot,

eg.

2014-09-25 21:03:15,713 INFO  [regionserver16020-EventThread] procedure.ZKProcedureMemberRpcs: Received procedure start children changed event: /hbase/online-snapshot/acquired
2014-09-25 21:03:15,713 DEBUG [regionserver16020-EventThread] procedure.ZKProcedureMemberRpcs: Looking for new procedures under znode:'/hbase/online-snapshot/acquired'
2014-09-25 21:03:15,715 DEBUG [regionserver16020-EventThread] procedure.ZKProcedureMemberRpcs: Found procedure znode: /hbase/online-snapshot/acquired/requestData-20140925-210256
2014-09-25 21:03:15,718 DEBUG [regionserver16020-EventThread] procedure.ZKProcedureMemberRpcs: start proc data length is 57
2014-09-25 21:03:15,718 DEBUG [regionserver16020-EventThread] procedure.ZKProcedureMemberRpcs: Found data for znode:/hbase/online-snapshot/acquired/requestData-20140925-210256
2014-09-25 21:03:15,718 DEBUG [regionserver16020-EventThread] snapshot.RegionServerSnapshotManager: Launching subprocedure for snapshot requestData-20140925-210256 from table requestData
2014-09-25 21:03:15,718 DEBUG [regionserver16020-EventThread] procedure.ProcedureMember: Submitting new Subprocedure:requestData-20140925-210256
2014-09-25 21:03:15,719 DEBUG [member: '###########-005,16020,1411640680181' subprocedure-pool1-thread-1] procedure.Subprocedure: Starting subprocedure 'requestData-20140925-210256' with timeout 600000ms
2014-09-25 21:03:15,719 DEBUG [member: '###########-005,16020,1411640680181' subprocedure-pool1-thread-1] errorhandling.TimeoutExceptionInjector: Scheduling process timer to run in: 600000 ms
2014-09-25 21:03:15,720 DEBUG [member: '###########-005,16020,1411640680181' subprocedure-pool1-thread-1] procedure.Subprocedure: Subprocedure 'requestData-20140925-210256' starting 'acquire' stage
2014-09-25 21:03:15,720 DEBUG [member: '###########-005,16020,1411640680181' subprocedure-pool1-thread-1] procedure.Subprocedure: Subprocedure 'requestData-20140925-210256' locally acquired
2014-09-25 21:03:15,720 DEBUG [member: '###########-005,16020,1411640680181' subprocedure-pool1-thread-1] procedure.ZKProcedureMemberRpcs: Member: '###########-005,16020,1411640680181' joining acquired barrier fo
r procedure (requestData-20140925-210256) in zk
2014-09-25 21:03:15,724 DEBUG [member: '###########-005,16020,1411640680181' subprocedure-pool1-thread-1] procedure.ZKProcedureMemberRpcs: Watch for global barrier reached:/hbase/online-snapshot/reached/request
Data-20140925-210256
2014-09-25 21:03:15,739 DEBUG [member: '###########-005,16020,1411640680181' subprocedure-pool1-thread-1] procedure.Subprocedure: Subprocedure 'requestData-20140925-210256' coordinator notified of 'acquire', wa
iting on 'reached' or 'abort' from coordinator
2014-09-25 21:03:15,825 INFO  [regionserver16020-EventThread] procedure.ZKProcedureMemberRpcs: Received created event:/hbase/online-snapshot/reached/requestData-20140925-210256
2014-09-25 21:03:15,825 DEBUG [regionserver16020-EventThread] procedure.ZKProcedureMemberRpcs: Recieved reached global barrier:/hbase/online-snapshot/reached/requestData-20140925-210256
2014-09-25 21:03:15,825 DEBUG [member: '###########-005,16020,1411640680181' subprocedure-pool1-thread-1] procedure.Subprocedure: Subprocedure 'requestData-20140925-210256' received 'reached' from coordinator.

...

2014-09-25 21:03:38,026 DEBUG [member: '###########-005,16020,1411640680181' subprocedure-pool1-thread-1] snapshot.RegionServerSnapshotManager: Completed 5 local region snapshots.
2014-09-25 21:03:38,026 DEBUG [member: '###########-005,16020,1411640680181' subprocedure-pool1-thread-1] snapshot.RegionServerSnapshotManager: cancelling 0 tasks for snapshot ###########-005,16020,1411640680181
2014-09-25 21:03:38,026 DEBUG [member: '###########-005,16020,1411640680181' subprocedure-pool1-thread-1] procedure.Subprocedure: Subprocedure 'requestData-20140925-210256' locally completed
2014-09-25 21:03:38,026 DEBUG [member: '###########-005,16020,1411640680181' subprocedure-pool1-thread-1] procedure.ZKProcedureMemberRpcs: Marking procedure  'requestData-20140925-210256' completed for member '
###########-005,16020,1411640680181' in zk
2014-09-25 21:03:38,028 DEBUG [member: '###########-005,16020,1411640680181' subprocedure-pool1-thread-1] procedure.Subprocedure: Subprocedure 'requestData-20140925-210256' has notified controller of completion
2014-09-25 21:03:38,028 DEBUG [member: '###########-005,16020,1411640680181' subprocedure-pool1-thread-1] errorhandling.TimeoutExceptionInjector: Marking timer as complete - no error notifications will be recei
ved for this timer.
2014-09-25 21:03:38,028 DEBUG [member: '###########-005,16020,1411640680181' subprocedure-pool1-thread-1] procedure.Subprocedure: Subprocedure 'requestData-20140925-210256' completed.
2


Which suggests that "requestData-20140925-210256" completed happily, however, 10 minutes later the following comes up in the logs


2014-09-25 21:13:15,777 INFO  [regionserver16020-EventThread] procedure.ZKProcedureMemberRpcs: Received created event:/hbase/online-snapshot/abort/requestData-20140925-210256
2014-09-25 21:13:15,777 DEBUG [regionserver16020-EventThread] procedure.ZKProcedureMemberRpcs: Aborting procedure member for znode /hbase/online-snapshot/abort/requestData-20140925-210256
2014-09-25 21:13:15,778 DEBUG [regionserver16020-EventThread] procedure.ProcedureMember: Request received to abort procedure requestData-20140925-210256
org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable via timer-java.util.Timer@5cb99ea6:org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable: org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed! Source:Timeout caused Foreign Exception Start:1411675395727, End:1411675995727, diff:600000, max:600000 ms
        at org.apache.hadoop.hbase.errorhandling.ForeignException.deserialize(ForeignException.java:171)
        at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.abort(ZKProcedureMemberRpcs.java:320)
        at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs$1.nodeCreated(ZKProcedureMemberRpcs.java:94)
        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:327)
        at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
        at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
Caused by: org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable: org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed! Source:Timeout caused Foreign Exception Start:1411675395727, End:1411675995727, diff:600000, max:600000 ms
        at org.apache.hadoop.hbase.errorhandling.TimeoutExceptionInjector$1.run(TimeoutExceptionInjector.java:70)
        at java.util.TimerThread.mainLoop(Timer.java:555)
        at java.util.TimerThread.run(Timer.java:505)
2014-09-25 21:13:15,780 INFO  [regionserver16020-EventThread] procedure.ProcedureMember: Received abort on procedure with no local subprocedure requestData-20140925-210256, ignoring it.
org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable via timer-java.util.Timer@5cb99ea6:org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable: org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed! Source:Timeout caused Foreign Exception Start:1411675395727, End:1411675995727, diff:600000, max:600000 ms
        at org.apache.hadoop.hbase.errorhandling.ForeignException.deserialize(ForeignException.java:171)
        at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.abort(ZKProcedureMemberRpcs.java:320)
        at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs$1.nodeCreated(ZKProcedureMemberRpcs.java:94)
        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:327)
        at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
        at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
Caused by: org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable: org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed! Source:Timeout caused Foreign Exception Start:1411675395727, End:1411675995727, diff:600000, max:600000 ms
        at org.apache.hadoop.hbase.errorhandling.TimeoutExceptionInjector$1.run(TimeoutExceptionInjector.java:70)
        at java.util.TimerThread.mainLoop(Timer.java:555)
        at java.util.TimerThread.run(Timer.java:505)





-Ian Brooks


On Tuesday 23 September 2014 06:52:34 Ted Yu wrote:
> Here're the config parameters related to controlling snapshot timeout:
> 
>   <property>
>     <name>hbase.snapshot.master.timeoutMillis</name>
>     <!-- Change from default of 60s to 600s to allow for slow flushing of
> tables -->
>     <value>600000</value>
>     <description>
>       This is the time HBase master waits for the snapshot operation to
> complete.
>       Do not confuse this hbase.snapshot.master.timeout.millis, which
> although
>       sounding similar, serves a very different purpose.
>       Note: This property has a completely different meaning before hbase
> version
>       0.94.11 and should not enabled on a cluster using snapshots and
> running
>       a version before 0.94.11.
>     </description>
>   </property>
>   <property>
>     <name>hbase.snapshot.master.timeout.millis</name>
>     <!-- Change from default of 60s to 600s to allow for slow flushing of
> tables -->
>     <value>600000</value>
>     <description>
>       This is the timeout the master indicates the client to wait when it
> takes
>       the snapshot. The client actually waits longer than this due to
> exponential
>       backoff. See HBaseAdmin.snapshot for the exact algorithm.
>     </description>
>   </property>
>   <property>
>     <name>hbase.snapshot.region.timeout</name>
>     <!-- Change from default of 60s to 600s to allow for slow flushing of
> tables -->
>     <value>600000</value>
>     <description>
>       This is the time the regionserver waits to complete all of its
> activities
>       for a snapshot operation.
>     </description>
>   </property>

Re: Snapshot failure

Posted by Ian Brooks <i....@sensewhere.com>.
Hi,

I got a change finally to apply the timeout changes and it lasted longer before it started failing, but its now just throwing the same errors on "procedure.Subprocedure: Subprocedure pool is full!"

Looking through the logs, it looks like the nodes may not correctly updating they completion state after finishing a snapshot,

eg.

2014-09-25 21:03:15,713 INFO  [regionserver16020-EventThread] procedure.ZKProcedureMemberRpcs: Received procedure start children changed event: /hbase/online-snapshot/acquired
2014-09-25 21:03:15,713 DEBUG [regionserver16020-EventThread] procedure.ZKProcedureMemberRpcs: Looking for new procedures under znode:'/hbase/online-snapshot/acquired'
2014-09-25 21:03:15,715 DEBUG [regionserver16020-EventThread] procedure.ZKProcedureMemberRpcs: Found procedure znode: /hbase/online-snapshot/acquired/requestData-20140925-210256
2014-09-25 21:03:15,718 DEBUG [regionserver16020-EventThread] procedure.ZKProcedureMemberRpcs: start proc data length is 57
2014-09-25 21:03:15,718 DEBUG [regionserver16020-EventThread] procedure.ZKProcedureMemberRpcs: Found data for znode:/hbase/online-snapshot/acquired/requestData-20140925-210256
2014-09-25 21:03:15,718 DEBUG [regionserver16020-EventThread] snapshot.RegionServerSnapshotManager: Launching subprocedure for snapshot requestData-20140925-210256 from table requestData
2014-09-25 21:03:15,718 DEBUG [regionserver16020-EventThread] procedure.ProcedureMember: Submitting new Subprocedure:requestData-20140925-210256
2014-09-25 21:03:15,719 DEBUG [member: '###########-005,16020,1411640680181' subprocedure-pool1-thread-1] procedure.Subprocedure: Starting subprocedure 'requestData-20140925-210256' with timeout 600000ms
2014-09-25 21:03:15,719 DEBUG [member: '###########-005,16020,1411640680181' subprocedure-pool1-thread-1] errorhandling.TimeoutExceptionInjector: Scheduling process timer to run in: 600000 ms
2014-09-25 21:03:15,720 DEBUG [member: '###########-005,16020,1411640680181' subprocedure-pool1-thread-1] procedure.Subprocedure: Subprocedure 'requestData-20140925-210256' starting 'acquire' stage
2014-09-25 21:03:15,720 DEBUG [member: '###########-005,16020,1411640680181' subprocedure-pool1-thread-1] procedure.Subprocedure: Subprocedure 'requestData-20140925-210256' locally acquired
2014-09-25 21:03:15,720 DEBUG [member: '###########-005,16020,1411640680181' subprocedure-pool1-thread-1] procedure.ZKProcedureMemberRpcs: Member: '###########-005,16020,1411640680181' joining acquired barrier fo
r procedure (requestData-20140925-210256) in zk
2014-09-25 21:03:15,724 DEBUG [member: '###########-005,16020,1411640680181' subprocedure-pool1-thread-1] procedure.ZKProcedureMemberRpcs: Watch for global barrier reached:/hbase/online-snapshot/reached/request
Data-20140925-210256
2014-09-25 21:03:15,739 DEBUG [member: '###########-005,16020,1411640680181' subprocedure-pool1-thread-1] procedure.Subprocedure: Subprocedure 'requestData-20140925-210256' coordinator notified of 'acquire', wa
iting on 'reached' or 'abort' from coordinator
2014-09-25 21:03:15,825 INFO  [regionserver16020-EventThread] procedure.ZKProcedureMemberRpcs: Received created event:/hbase/online-snapshot/reached/requestData-20140925-210256
2014-09-25 21:03:15,825 DEBUG [regionserver16020-EventThread] procedure.ZKProcedureMemberRpcs: Recieved reached global barrier:/hbase/online-snapshot/reached/requestData-20140925-210256
2014-09-25 21:03:15,825 DEBUG [member: '###########-005,16020,1411640680181' subprocedure-pool1-thread-1] procedure.Subprocedure: Subprocedure 'requestData-20140925-210256' received 'reached' from coordinator.

...

2014-09-25 21:03:38,026 DEBUG [member: '###########-005,16020,1411640680181' subprocedure-pool1-thread-1] snapshot.RegionServerSnapshotManager: Completed 5 local region snapshots.
2014-09-25 21:03:38,026 DEBUG [member: '###########-005,16020,1411640680181' subprocedure-pool1-thread-1] snapshot.RegionServerSnapshotManager: cancelling 0 tasks for snapshot ###########-005,16020,1411640680181
2014-09-25 21:03:38,026 DEBUG [member: '###########-005,16020,1411640680181' subprocedure-pool1-thread-1] procedure.Subprocedure: Subprocedure 'requestData-20140925-210256' locally completed
2014-09-25 21:03:38,026 DEBUG [member: '###########-005,16020,1411640680181' subprocedure-pool1-thread-1] procedure.ZKProcedureMemberRpcs: Marking procedure  'requestData-20140925-210256' completed for member '
###########-005,16020,1411640680181' in zk
2014-09-25 21:03:38,028 DEBUG [member: '###########-005,16020,1411640680181' subprocedure-pool1-thread-1] procedure.Subprocedure: Subprocedure 'requestData-20140925-210256' has notified controller of completion
2014-09-25 21:03:38,028 DEBUG [member: '###########-005,16020,1411640680181' subprocedure-pool1-thread-1] errorhandling.TimeoutExceptionInjector: Marking timer as complete - no error notifications will be recei
ved for this timer.
2014-09-25 21:03:38,028 DEBUG [member: '###########-005,16020,1411640680181' subprocedure-pool1-thread-1] procedure.Subprocedure: Subprocedure 'requestData-20140925-210256' completed.
2


Which suggests that "requestData-20140925-210256" completed happily, however, 10 minutes later the following comes up in the logs


2014-09-25 21:13:15,777 INFO  [regionserver16020-EventThread] procedure.ZKProcedureMemberRpcs: Received created event:/hbase/online-snapshot/abort/requestData-20140925-210256
2014-09-25 21:13:15,777 DEBUG [regionserver16020-EventThread] procedure.ZKProcedureMemberRpcs: Aborting procedure member for znode /hbase/online-snapshot/abort/requestData-20140925-210256
2014-09-25 21:13:15,778 DEBUG [regionserver16020-EventThread] procedure.ProcedureMember: Request received to abort procedure requestData-20140925-210256
org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable via timer-java.util.Timer@5cb99ea6:org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable: org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed! Source:Timeout caused Foreign Exception Start:1411675395727, End:1411675995727, diff:600000, max:600000 ms
        at org.apache.hadoop.hbase.errorhandling.ForeignException.deserialize(ForeignException.java:171)
        at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.abort(ZKProcedureMemberRpcs.java:320)
        at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs$1.nodeCreated(ZKProcedureMemberRpcs.java:94)
        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:327)
        at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
        at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
Caused by: org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable: org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed! Source:Timeout caused Foreign Exception Start:1411675395727, End:1411675995727, diff:600000, max:600000 ms
        at org.apache.hadoop.hbase.errorhandling.TimeoutExceptionInjector$1.run(TimeoutExceptionInjector.java:70)
        at java.util.TimerThread.mainLoop(Timer.java:555)
        at java.util.TimerThread.run(Timer.java:505)
2014-09-25 21:13:15,780 INFO  [regionserver16020-EventThread] procedure.ProcedureMember: Received abort on procedure with no local subprocedure requestData-20140925-210256, ignoring it.
org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable via timer-java.util.Timer@5cb99ea6:org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable: org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed! Source:Timeout caused Foreign Exception Start:1411675395727, End:1411675995727, diff:600000, max:600000 ms
        at org.apache.hadoop.hbase.errorhandling.ForeignException.deserialize(ForeignException.java:171)
        at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.abort(ZKProcedureMemberRpcs.java:320)
        at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs$1.nodeCreated(ZKProcedureMemberRpcs.java:94)
        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:327)
        at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
        at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
Caused by: org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable: org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed! Source:Timeout caused Foreign Exception Start:1411675395727, End:1411675995727, diff:600000, max:600000 ms
        at org.apache.hadoop.hbase.errorhandling.TimeoutExceptionInjector$1.run(TimeoutExceptionInjector.java:70)
        at java.util.TimerThread.mainLoop(Timer.java:555)
        at java.util.TimerThread.run(Timer.java:505)





-Ian Brooks


On Tuesday 23 September 2014 06:52:34 Ted Yu wrote:
> Here're the config parameters related to controlling snapshot timeout:
> 
>   <property>
>     <name>hbase.snapshot.master.timeoutMillis</name>
>     <!-- Change from default of 60s to 600s to allow for slow flushing of
> tables -->
>     <value>600000</value>
>     <description>
>       This is the time HBase master waits for the snapshot operation to
> complete.
>       Do not confuse this hbase.snapshot.master.timeout.millis, which
> although
>       sounding similar, serves a very different purpose.
>       Note: This property has a completely different meaning before hbase
> version
>       0.94.11 and should not enabled on a cluster using snapshots and
> running
>       a version before 0.94.11.
>     </description>
>   </property>
>   <property>
>     <name>hbase.snapshot.master.timeout.millis</name>
>     <!-- Change from default of 60s to 600s to allow for slow flushing of
> tables -->
>     <value>600000</value>
>     <description>
>       This is the timeout the master indicates the client to wait when it
> takes
>       the snapshot. The client actually waits longer than this due to
> exponential
>       backoff. See HBaseAdmin.snapshot for the exact algorithm.
>     </description>
>   </property>
>   <property>
>     <name>hbase.snapshot.region.timeout</name>
>     <!-- Change from default of 60s to 600s to allow for slow flushing of
> tables -->
>     <value>600000</value>
>     <description>
>       This is the time the regionserver waits to complete all of its
> activities
>       for a snapshot operation.
>     </description>
>   </property>
> 
> On Tue, Sep 23, 2014 at 6:05 AM, Ian Brooks <i....@sensewhere.com> wrote:
> 
> > Hi,
> >
> > I'm running snapshots on 7 tables every hour, the last run managed 15
> > rounds of snapshots before starting to fail.
> >
> > I did notice the following while getting the success count
> >
> > 2014-09-10 11:01:55,453 DEBUG
> > [rs(##############,16020,1410126483043)-snapshot-pool151-thread-1]
> > snapshot.FlushSnapshotSubprocedure: Closing region operation on
> > rawScans,250000000000,1408114528441.ddb14a3d8f90cc18b6dfd4f28df5c534.
> > 2014-09-10 11:01:55,453 DEBUG
> > [rs(##############,16020,1410126483043)-snapshot-pool151-thread-1]
> > snapshot.FlushSnapshotSubprocedure: Starting region operation on
> > rawScans,960000000000,1408114528443.0b3010a2712ff1b6acec0705f5d171d9.
> > 2014-09-10 11:01:55,453 DEBUG
> > [rs(##############,16020,1410126483043)-snapshot-pool151-thread-1]
> > snapshot.FlushSnapshotSubprocedure: Flush Snapshotting region
> > rawScans,960000000000,1408114528443.0b3010a2712ff1b6acec0705f5d171d9.
> > started...
> > 2014-09-10 11:01:55,454 WARN  [member:
> > '##############,16020,1410126483043' subprocedure-pool1-thread-1]
> > snapshot.RegionServerSnapshotManager: Rethrowing ForeignException from
> > SnapshotSubprocedurePool
> > java.util.concurrent.ExecutionException:
> > org.apache.hadoop.hbase.errorhandling.TimeoutException via
> > timer-java.util.Timer@45c34a8:org.apache.hadoop.hbase.errorhandling.TimeoutException:
> > Timeout elapsed! Source:Timeout caused Foreign Exception
> > Start:1410343254867, End:1410343314867, diff:60000, max:60000 ms
> >         at java.util.concurrent.FutureTask.report(FutureTask.java:122)
> >         at java.util.concurrent.FutureTask.get(FutureTask.java:188)
> >         at
> > org.apache.hadoop.hbase.regionserver.snapshot.RegionServerSnapshotManager$SnapshotSubprocedurePool.waitForOutstandingTasks(RegionServerSnapshotManager.java:295)
> >         at
> > org.apache.hadoop.hbase.regionserver.snapshot.FlushSnapshotSubprocedure.flushSnapshot(FlushSnapshotSubprocedure.java:118)
> >         at
> > org.apache.hadoop.hbase.regionserver.snapshot.FlushSnapshotSubprocedure.insideBarrier(FlushSnapshotSubprocedure.java:137)
> >         at
> > org.apache.hadoop.hbase.procedure.Subprocedure.call(Subprocedure.java:181)
> >         at
> > org.apache.hadoop.hbase.procedure.Subprocedure.call(Subprocedure.java:52)
> >         at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> >         at
> > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> >         at
> > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> >         at java.lang.Thread.run(Thread.java:745)
> > Caused by: org.apache.hadoop.hbase.errorhandling.TimeoutException via
> > timer-java.util.Timer@45c34a8:org.apache.hadoop.hbase.errorhandling.TimeoutException:
> > Timeout elapsed! Source:Timeout caused Foreign Exception
> > Start:1410343254867, End:1410343314867, diff:60000, max:60000 ms
> >         at
> > org.apache.hadoop.hbase.errorhandling.ForeignExceptionDispatcher.rethrowException(ForeignExceptionDispatcher.java:83)
> >         at
> > org.apache.hadoop.hbase.regionserver.HRegion.addRegionToSnapshot(HRegion.java:2816)
> >         at
> > org.apache.hadoop.hbase.regionserver.snapshot.FlushSnapshotSubprocedure$RegionSnapshotTask.call(FlushSnapshotSubprocedure.java:83)
> >         at
> > org.apache.hadoop.hbase.regionserver.snapshot.FlushSnapshotSubprocedure$RegionSnapshotTask.call(FlushSnapshotSubprocedure.java:65)
> >         at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> >         at
> > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
> >         ... 4 more
> > Caused by: org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout
> > elapsed! Source:Timeout caused Foreign Exception Start:1410343254867,
> > End:1410343314867, diff:60000, max:60000 ms
> >         at
> > org.apache.hadoop.hbase.errorhandling.TimeoutExceptionInjector$1.run(TimeoutExceptionInjector.java:67)
> >         at java.util.TimerThread.mainLoop(Timer.java:555)
> >         at java.util.TimerThread.run(Timer.java:505)
> >
> > Is there a configuration option that I need to increase to allow the
> > snapshots to run for more than 60 seconds on a given region? That said, I'm
> > guessing the above shouldn't cause the SubProcedure pool to fill up
> > completely.
> >
> > -Ian
> >
> > On Tuesday 23 September 2014 04:33:39 Ted Yu wrote:
> > > Subprocedure pool is full.
> > >
> > > How many snapshot requests did you submit before seeing the following ?
> > >
> > > Cheers
> > >
> > > On Sep 23, 2014, at 2:28 AM, Ian Brooks <i....@sensewhere.com> wrote:
> > >
> > > > Hi,
> > > >
> > > > I'm seeing an issue on our hbase cluster which is preventing snapshots
> > from working. So far the only way i can get it working again is to restart
> > all the regionservers which is not ideal.
> > > >
> > > > I'm running hbase 0.98.3 on hadoop 2.4.0, the stacktraces are
> > > >
> > > > From host given the snapshot command via hbase shell
> > > >
> > > > 2014-09-23 10:00:16,116 INFO  [regionserver16020-EventThread]
> > procedure.ZKProcedureMemberRpcs: Received procedure start children changed
> > event: /hbase/online-snapshot/acquired
> > > > 2014-09-23 10:00:16,116 DEBUG [regionserver16020-EventThread]
> > procedure.ZKProcedureMemberRpcs: Looking for new procedures under
> > znode:'/hbase/online-snapshot/acquired'
> > > > 2014-09-23 10:00:16,117 DEBUG [regionserver16020-EventThread]
> > procedure.ZKProcedureMemberRpcs: Found procedure znode:
> > /hbase/online-snapshot/acquired/offlineScans-20140923-100009
> > > > 2014-09-23 10:00:16,117 DEBUG [regionserver16020-EventThread]
> > procedure.ZKProcedureMemberRpcs: start proc data length is 59
> > > > 2014-09-23 10:00:16,117 DEBUG [regionserver16020-EventThread]
> > procedure.ZKProcedureMemberRpcs: Found data for
> > znode:/hbase/online-snapshot/acquired/offlineScans-20140923-100009
> > > > 2014-09-23 10:00:16,118 DEBUG [regionserver16020-EventThread]
> > snapshot.RegionServerSnapshotManager: Launching subprocedure for snapshot
> > offlineScans-20140923-100009 from table offlineScans
> > > > 2014-09-23 10:00:16,118 DEBUG [regionserver16020-EventThread]
> > procedure.ProcedureMember: Submitting new
> > Subprocedure:offlineScans-20140923-100009
> > > > 2014-09-23 10:00:16,120 DEBUG [member:
> > '############,16020,1410126483043' subprocedure-pool1-thread-314]
> > procedure.Subprocedure: Starting subprocedure
> > 'offlineScans-20140923-100009' with timeout 60000ms
> > > > 2014-09-23 10:00:16,120 DEBUG [member:
> > '############,16020,1410126483043' subprocedure-pool1-thread-314]
> > errorhandling.TimeoutExceptionInjector: Scheduling process timer to run in:
> > 60000 ms
> > > > 2014-09-23 10:00:16,121 DEBUG [member:
> > '############,16020,1410126483043' subprocedure-pool1-thread-314]
> > procedure.Subprocedure: Subprocedure 'offlineScans-20140923-100009'
> > starting 'acquire' stage
> > > > 2014-09-23 10:00:16,121 DEBUG [member:
> > '############,16020,1410126483043' subprocedure-pool1-thread-314]
> > procedure.Subprocedure: Subprocedure 'offlineScans-20140923-100009' locally
> > acquired
> > > > 2014-09-23 10:00:16,121 DEBUG [member:
> > '############,16020,1410126483043' subprocedure-pool1-thread-314]
> > procedure.ZKProcedureMemberRpcs: Member: '##########,16020,1410126483043'
> > joining acquired barrier for procedure (offlineScans-20140923-100009) in zk
> > > > 2014-09-23 10:00:16,123 DEBUG [member:
> > '############,16020,1410126483043' subprocedure-pool1-thread-314]
> > procedure.ZKProcedureMemberRpcs: Watch for global barrier
> > reached:/hbase/online-snapshot/reached/offlineScans-20140923-100009
> > > > 2014-09-23 10:00:16,124 DEBUG [member:
> > '############,16020,1410126483043' subprocedure-pool1-thread-314]
> > procedure.Subprocedure: Subprocedure 'offlineScans-20140923-100009'
> > coordinator notified of 'acquire', waiting on 'reached' or 'abort' from
> > coordinator
> > > > 2014-09-23 10:00:16,133 INFO  [regionserver16020-EventThread]
> > procedure.ZKProcedureMemberRpcs: Received created
> > event:/hbase/online-snapshot/abort/offlineScans-20140923-100009
> > > > 2014-09-23 10:00:16,134 DEBUG [regionserver16020-EventThread]
> > procedure.ZKProcedureMemberRpcs: Aborting procedure member for znode
> > /hbase/online-snapshot/abort/offlineScans-20140923-100009
> > > > 2014-09-23 10:00:16,134 DEBUG [regionserver16020-EventThread]
> > procedure.ProcedureMember: Request received to abort procedure
> > offlineScans-20140923-100009
> > > > org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable
> > via
> > ###############,16020,1408633802567:org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable:
> > > >        at
> > org.apache.hadoop.hbase.errorhandling.ForeignException.deserialize(ForeignException.java:171)
> > > >        at
> > org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.abort(ZKProcedureMemberRpcs.java:320)
> > > >        at
> > org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs$1.nodeCreated(ZKProcedureMemberRpcs.java:94)
> > > >        at
> > org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:327)
> > > >        at
> > org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
> > > >        at
> > org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
> > > > Caused by:
> > org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable:
> > > >        at
> > org.apache.hadoop.hbase.procedure.Subprocedure.cancel(Subprocedure.java:270)
> > > >        at
> > org.apache.hadoop.hbase.procedure.ProcedureMember.submitSubprocedure(ProcedureMember.java:171)
> > > >        at
> > org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.startNewSubprocedure(ZKProcedureMemberRpcs.java:214)
> > > >        at
> > org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.waitForNewProcedures(ZKProcedureMemberRpcs.java:172)
> > > >        at
> > org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.access$100(ZKProcedureMemberRpcs.java:55)
> > > >        at
> > org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs$1.nodeChildrenChanged(ZKProcedureMemberRpcs.java:107)
> > > >        at
> > org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:348)
> > > >        ... 2 more
> > > >
> > > >
> > > >
> > > > From the regionserver it was trying to contact
> > > >
> > > >
> > > > 2014-09-23 10:00:16,124 INFO  [regionserver16020-EventThread]
> > procedure.ZKProcedureMemberRpcs: Received procedure start children changed
> > event: /hbase/online-snapshot/acquired
> > > > 2014-09-23 10:00:16,124 DEBUG [regionserver16020-EventThread]
> > procedure.ZKProcedureMemberRpcs: Looking for new procedures under
> > znode:'/hbase/online-snapshot/acquired'
> > > > 2014-09-23 10:00:16,125 DEBUG [regionserver16020-EventThread]
> > procedure.ZKProcedureMemberRpcs: Found procedure znode:
> > /hbase/online-snapshot/acquired/offlineScans-20140923-100009
> > > > 2014-09-23 10:00:16,127 DEBUG [regionserver16020-EventThread]
> > procedure.ZKProcedureMemberRpcs: start proc data length is 59
> > > > 2014-09-23 10:00:16,127 DEBUG [regionserver16020-EventThread]
> > procedure.ZKProcedureMemberRpcs: Found data for
> > znode:/hbase/online-snapshot/acquired/offlineScans-20140923-100009
> > > > 2014-09-23 10:00:16,127 DEBUG [regionserver16020-EventThread]
> > snapshot.RegionServerSnapshotManager: Launching subprocedure for snapshot
> > offlineScans-20140923-100009 from table offlineScans
> > > > 2014-09-23 10:00:16,127 DEBUG [regionserver16020-EventThread]
> > procedure.ProcedureMember: Submitting new
> > Subprocedure:offlineScans-20140923-100009
> > > > 2014-09-23 10:00:16,128 ERROR [regionserver16020-EventThread]
> > procedure.Subprocedure: Subprocedure pool is full!
> > > > 2014-09-23 10:00:16,128 DEBUG [regionserver16020-EventThread]
> > errorhandling.ForeignExceptionDispatcher: offlineScans-20140923-100009
> > accepting received exception
> > > > 2014-09-23 10:00:16,128 DEBUG [regionserver16020-EventThread]
> > procedure.ZKProcedureMemberRpcs: Aborting procedure
> > (offlineScans-20140923-100009) in zk
> > > > 2014-09-23 10:00:16,134 DEBUG [regionserver16020-EventThread]
> > procedure.ZKProcedureMemberRpcs: Finished creating abort
> > znode:/hbase/online-snapshot/abort/offlineScans-20140923-100009
> > > > 2014-09-23 10:00:16,134 ERROR [regionserver16020-EventThread]
> > procedure.ProcedureMember: Failed to start subprocedure
> > 'offlineScans-20140923-100009'
> > > > 2014-09-23 10:00:16,134 INFO  [regionserver16020-EventThread]
> > procedure.ZKProcedureMemberRpcs: Received created
> > event:/hbase/online-snapshot/abort/offlineScans-20140923-100009
> > > > 2014-09-23 10:00:16,134 DEBUG [regionserver16020-EventThread]
> > procedure.ZKProcedureMemberRpcs: Aborting procedure member for znode
> > /hbase/online-snapshot/abort/offlineScans-20140923-100009
> > > > 2014-09-23 10:00:16,145 DEBUG [regionserver16020-EventThread]
> > procedure.ProcedureMember: Request received to abort procedure
> > offlineScans-20140923-100009
> > > > org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable
> > via
> > ###########,16020,1408633802567:org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable:
> > > >        at
> > org.apache.hadoop.hbase.errorhandling.ForeignException.deserialize(ForeignException.java:171)
> > > >        at
> > org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.abort(ZKProcedureMemberRpcs.java:320)
> > > >        at
> > org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs$1.nodeCreated(ZKProcedureMemberRpcs.java:94)
> > > >        at
> > org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:327)
> > > >        at
> > org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
> > > >        at
> > org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
> > > > Caused by:
> > org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable:
> > > >        at
> > org.apache.hadoop.hbase.procedure.Subprocedure.cancel(Subprocedure.java:270)
> > > >        at
> > org.apache.hadoop.hbase.procedure.ProcedureMember.submitSubprocedure(ProcedureMember.java:171)
> > > >        at
> > org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.startNewSubprocedure(ZKProcedureMemberRpcs.java:214)
> > > >        at
> > org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.waitForNewProcedures(ZKProcedureMemberRpcs.java:172)
> > > >        at
> > org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.access$100(ZKProcedureMemberRpcs.java:55)
> > > >        at
> > org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs$1.nodeChildrenChanged(ZKProcedureMemberRpcs.java:107)
> > > >        at
> > org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:348)
> > > >        ... 2 more
> > > >
> > > > Any ideas on how I can resolve this?
> > > >
> > > > -Ian
> > > >
> >

Re: Snapshot failure

Posted by Ted Yu <yu...@gmail.com>.
Here're the config parameters related to controlling snapshot timeout:

  <property>
    <name>hbase.snapshot.master.timeoutMillis</name>
    <!-- Change from default of 60s to 600s to allow for slow flushing of
tables -->
    <value>600000</value>
    <description>
      This is the time HBase master waits for the snapshot operation to
complete.
      Do not confuse this hbase.snapshot.master.timeout.millis, which
although
      sounding similar, serves a very different purpose.
      Note: This property has a completely different meaning before hbase
version
      0.94.11 and should not enabled on a cluster using snapshots and
running
      a version before 0.94.11.
    </description>
  </property>
  <property>
    <name>hbase.snapshot.master.timeout.millis</name>
    <!-- Change from default of 60s to 600s to allow for slow flushing of
tables -->
    <value>600000</value>
    <description>
      This is the timeout the master indicates the client to wait when it
takes
      the snapshot. The client actually waits longer than this due to
exponential
      backoff. See HBaseAdmin.snapshot for the exact algorithm.
    </description>
  </property>
  <property>
    <name>hbase.snapshot.region.timeout</name>
    <!-- Change from default of 60s to 600s to allow for slow flushing of
tables -->
    <value>600000</value>
    <description>
      This is the time the regionserver waits to complete all of its
activities
      for a snapshot operation.
    </description>
  </property>

On Tue, Sep 23, 2014 at 6:05 AM, Ian Brooks <i....@sensewhere.com> wrote:

> Hi,
>
> I'm running snapshots on 7 tables every hour, the last run managed 15
> rounds of snapshots before starting to fail.
>
> I did notice the following while getting the success count
>
> 2014-09-10 11:01:55,453 DEBUG
> [rs(##############,16020,1410126483043)-snapshot-pool151-thread-1]
> snapshot.FlushSnapshotSubprocedure: Closing region operation on
> rawScans,250000000000,1408114528441.ddb14a3d8f90cc18b6dfd4f28df5c534.
> 2014-09-10 11:01:55,453 DEBUG
> [rs(##############,16020,1410126483043)-snapshot-pool151-thread-1]
> snapshot.FlushSnapshotSubprocedure: Starting region operation on
> rawScans,960000000000,1408114528443.0b3010a2712ff1b6acec0705f5d171d9.
> 2014-09-10 11:01:55,453 DEBUG
> [rs(##############,16020,1410126483043)-snapshot-pool151-thread-1]
> snapshot.FlushSnapshotSubprocedure: Flush Snapshotting region
> rawScans,960000000000,1408114528443.0b3010a2712ff1b6acec0705f5d171d9.
> started...
> 2014-09-10 11:01:55,454 WARN  [member:
> '##############,16020,1410126483043' subprocedure-pool1-thread-1]
> snapshot.RegionServerSnapshotManager: Rethrowing ForeignException from
> SnapshotSubprocedurePool
> java.util.concurrent.ExecutionException:
> org.apache.hadoop.hbase.errorhandling.TimeoutException via
> timer-java.util.Timer@45c34a8:org.apache.hadoop.hbase.errorhandling.TimeoutException:
> Timeout elapsed! Source:Timeout caused Foreign Exception
> Start:1410343254867, End:1410343314867, diff:60000, max:60000 ms
>         at java.util.concurrent.FutureTask.report(FutureTask.java:122)
>         at java.util.concurrent.FutureTask.get(FutureTask.java:188)
>         at
> org.apache.hadoop.hbase.regionserver.snapshot.RegionServerSnapshotManager$SnapshotSubprocedurePool.waitForOutstandingTasks(RegionServerSnapshotManager.java:295)
>         at
> org.apache.hadoop.hbase.regionserver.snapshot.FlushSnapshotSubprocedure.flushSnapshot(FlushSnapshotSubprocedure.java:118)
>         at
> org.apache.hadoop.hbase.regionserver.snapshot.FlushSnapshotSubprocedure.insideBarrier(FlushSnapshotSubprocedure.java:137)
>         at
> org.apache.hadoop.hbase.procedure.Subprocedure.call(Subprocedure.java:181)
>         at
> org.apache.hadoop.hbase.procedure.Subprocedure.call(Subprocedure.java:52)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>         at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.hadoop.hbase.errorhandling.TimeoutException via
> timer-java.util.Timer@45c34a8:org.apache.hadoop.hbase.errorhandling.TimeoutException:
> Timeout elapsed! Source:Timeout caused Foreign Exception
> Start:1410343254867, End:1410343314867, diff:60000, max:60000 ms
>         at
> org.apache.hadoop.hbase.errorhandling.ForeignExceptionDispatcher.rethrowException(ForeignExceptionDispatcher.java:83)
>         at
> org.apache.hadoop.hbase.regionserver.HRegion.addRegionToSnapshot(HRegion.java:2816)
>         at
> org.apache.hadoop.hbase.regionserver.snapshot.FlushSnapshotSubprocedure$RegionSnapshotTask.call(FlushSnapshotSubprocedure.java:83)
>         at
> org.apache.hadoop.hbase.regionserver.snapshot.FlushSnapshotSubprocedure$RegionSnapshotTask.call(FlushSnapshotSubprocedure.java:65)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>         at
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>         ... 4 more
> Caused by: org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout
> elapsed! Source:Timeout caused Foreign Exception Start:1410343254867,
> End:1410343314867, diff:60000, max:60000 ms
>         at
> org.apache.hadoop.hbase.errorhandling.TimeoutExceptionInjector$1.run(TimeoutExceptionInjector.java:67)
>         at java.util.TimerThread.mainLoop(Timer.java:555)
>         at java.util.TimerThread.run(Timer.java:505)
>
> Is there a configuration option that I need to increase to allow the
> snapshots to run for more than 60 seconds on a given region? That said, I'm
> guessing the above shouldn't cause the SubProcedure pool to fill up
> completely.
>
> -Ian
>
> On Tuesday 23 September 2014 04:33:39 Ted Yu wrote:
> > Subprocedure pool is full.
> >
> > How many snapshot requests did you submit before seeing the following ?
> >
> > Cheers
> >
> > On Sep 23, 2014, at 2:28 AM, Ian Brooks <i....@sensewhere.com> wrote:
> >
> > > Hi,
> > >
> > > I'm seeing an issue on our hbase cluster which is preventing snapshots
> from working. So far the only way i can get it working again is to restart
> all the regionservers which is not ideal.
> > >
> > > I'm running hbase 0.98.3 on hadoop 2.4.0, the stacktraces are
> > >
> > > From host given the snapshot command via hbase shell
> > >
> > > 2014-09-23 10:00:16,116 INFO  [regionserver16020-EventThread]
> procedure.ZKProcedureMemberRpcs: Received procedure start children changed
> event: /hbase/online-snapshot/acquired
> > > 2014-09-23 10:00:16,116 DEBUG [regionserver16020-EventThread]
> procedure.ZKProcedureMemberRpcs: Looking for new procedures under
> znode:'/hbase/online-snapshot/acquired'
> > > 2014-09-23 10:00:16,117 DEBUG [regionserver16020-EventThread]
> procedure.ZKProcedureMemberRpcs: Found procedure znode:
> /hbase/online-snapshot/acquired/offlineScans-20140923-100009
> > > 2014-09-23 10:00:16,117 DEBUG [regionserver16020-EventThread]
> procedure.ZKProcedureMemberRpcs: start proc data length is 59
> > > 2014-09-23 10:00:16,117 DEBUG [regionserver16020-EventThread]
> procedure.ZKProcedureMemberRpcs: Found data for
> znode:/hbase/online-snapshot/acquired/offlineScans-20140923-100009
> > > 2014-09-23 10:00:16,118 DEBUG [regionserver16020-EventThread]
> snapshot.RegionServerSnapshotManager: Launching subprocedure for snapshot
> offlineScans-20140923-100009 from table offlineScans
> > > 2014-09-23 10:00:16,118 DEBUG [regionserver16020-EventThread]
> procedure.ProcedureMember: Submitting new
> Subprocedure:offlineScans-20140923-100009
> > > 2014-09-23 10:00:16,120 DEBUG [member:
> '############,16020,1410126483043' subprocedure-pool1-thread-314]
> procedure.Subprocedure: Starting subprocedure
> 'offlineScans-20140923-100009' with timeout 60000ms
> > > 2014-09-23 10:00:16,120 DEBUG [member:
> '############,16020,1410126483043' subprocedure-pool1-thread-314]
> errorhandling.TimeoutExceptionInjector: Scheduling process timer to run in:
> 60000 ms
> > > 2014-09-23 10:00:16,121 DEBUG [member:
> '############,16020,1410126483043' subprocedure-pool1-thread-314]
> procedure.Subprocedure: Subprocedure 'offlineScans-20140923-100009'
> starting 'acquire' stage
> > > 2014-09-23 10:00:16,121 DEBUG [member:
> '############,16020,1410126483043' subprocedure-pool1-thread-314]
> procedure.Subprocedure: Subprocedure 'offlineScans-20140923-100009' locally
> acquired
> > > 2014-09-23 10:00:16,121 DEBUG [member:
> '############,16020,1410126483043' subprocedure-pool1-thread-314]
> procedure.ZKProcedureMemberRpcs: Member: '##########,16020,1410126483043'
> joining acquired barrier for procedure (offlineScans-20140923-100009) in zk
> > > 2014-09-23 10:00:16,123 DEBUG [member:
> '############,16020,1410126483043' subprocedure-pool1-thread-314]
> procedure.ZKProcedureMemberRpcs: Watch for global barrier
> reached:/hbase/online-snapshot/reached/offlineScans-20140923-100009
> > > 2014-09-23 10:00:16,124 DEBUG [member:
> '############,16020,1410126483043' subprocedure-pool1-thread-314]
> procedure.Subprocedure: Subprocedure 'offlineScans-20140923-100009'
> coordinator notified of 'acquire', waiting on 'reached' or 'abort' from
> coordinator
> > > 2014-09-23 10:00:16,133 INFO  [regionserver16020-EventThread]
> procedure.ZKProcedureMemberRpcs: Received created
> event:/hbase/online-snapshot/abort/offlineScans-20140923-100009
> > > 2014-09-23 10:00:16,134 DEBUG [regionserver16020-EventThread]
> procedure.ZKProcedureMemberRpcs: Aborting procedure member for znode
> /hbase/online-snapshot/abort/offlineScans-20140923-100009
> > > 2014-09-23 10:00:16,134 DEBUG [regionserver16020-EventThread]
> procedure.ProcedureMember: Request received to abort procedure
> offlineScans-20140923-100009
> > > org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable
> via
> ###############,16020,1408633802567:org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable:
> > >        at
> org.apache.hadoop.hbase.errorhandling.ForeignException.deserialize(ForeignException.java:171)
> > >        at
> org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.abort(ZKProcedureMemberRpcs.java:320)
> > >        at
> org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs$1.nodeCreated(ZKProcedureMemberRpcs.java:94)
> > >        at
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:327)
> > >        at
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
> > >        at
> org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
> > > Caused by:
> org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable:
> > >        at
> org.apache.hadoop.hbase.procedure.Subprocedure.cancel(Subprocedure.java:270)
> > >        at
> org.apache.hadoop.hbase.procedure.ProcedureMember.submitSubprocedure(ProcedureMember.java:171)
> > >        at
> org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.startNewSubprocedure(ZKProcedureMemberRpcs.java:214)
> > >        at
> org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.waitForNewProcedures(ZKProcedureMemberRpcs.java:172)
> > >        at
> org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.access$100(ZKProcedureMemberRpcs.java:55)
> > >        at
> org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs$1.nodeChildrenChanged(ZKProcedureMemberRpcs.java:107)
> > >        at
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:348)
> > >        ... 2 more
> > >
> > >
> > >
> > > From the regionserver it was trying to contact
> > >
> > >
> > > 2014-09-23 10:00:16,124 INFO  [regionserver16020-EventThread]
> procedure.ZKProcedureMemberRpcs: Received procedure start children changed
> event: /hbase/online-snapshot/acquired
> > > 2014-09-23 10:00:16,124 DEBUG [regionserver16020-EventThread]
> procedure.ZKProcedureMemberRpcs: Looking for new procedures under
> znode:'/hbase/online-snapshot/acquired'
> > > 2014-09-23 10:00:16,125 DEBUG [regionserver16020-EventThread]
> procedure.ZKProcedureMemberRpcs: Found procedure znode:
> /hbase/online-snapshot/acquired/offlineScans-20140923-100009
> > > 2014-09-23 10:00:16,127 DEBUG [regionserver16020-EventThread]
> procedure.ZKProcedureMemberRpcs: start proc data length is 59
> > > 2014-09-23 10:00:16,127 DEBUG [regionserver16020-EventThread]
> procedure.ZKProcedureMemberRpcs: Found data for
> znode:/hbase/online-snapshot/acquired/offlineScans-20140923-100009
> > > 2014-09-23 10:00:16,127 DEBUG [regionserver16020-EventThread]
> snapshot.RegionServerSnapshotManager: Launching subprocedure for snapshot
> offlineScans-20140923-100009 from table offlineScans
> > > 2014-09-23 10:00:16,127 DEBUG [regionserver16020-EventThread]
> procedure.ProcedureMember: Submitting new
> Subprocedure:offlineScans-20140923-100009
> > > 2014-09-23 10:00:16,128 ERROR [regionserver16020-EventThread]
> procedure.Subprocedure: Subprocedure pool is full!
> > > 2014-09-23 10:00:16,128 DEBUG [regionserver16020-EventThread]
> errorhandling.ForeignExceptionDispatcher: offlineScans-20140923-100009
> accepting received exception
> > > 2014-09-23 10:00:16,128 DEBUG [regionserver16020-EventThread]
> procedure.ZKProcedureMemberRpcs: Aborting procedure
> (offlineScans-20140923-100009) in zk
> > > 2014-09-23 10:00:16,134 DEBUG [regionserver16020-EventThread]
> procedure.ZKProcedureMemberRpcs: Finished creating abort
> znode:/hbase/online-snapshot/abort/offlineScans-20140923-100009
> > > 2014-09-23 10:00:16,134 ERROR [regionserver16020-EventThread]
> procedure.ProcedureMember: Failed to start subprocedure
> 'offlineScans-20140923-100009'
> > > 2014-09-23 10:00:16,134 INFO  [regionserver16020-EventThread]
> procedure.ZKProcedureMemberRpcs: Received created
> event:/hbase/online-snapshot/abort/offlineScans-20140923-100009
> > > 2014-09-23 10:00:16,134 DEBUG [regionserver16020-EventThread]
> procedure.ZKProcedureMemberRpcs: Aborting procedure member for znode
> /hbase/online-snapshot/abort/offlineScans-20140923-100009
> > > 2014-09-23 10:00:16,145 DEBUG [regionserver16020-EventThread]
> procedure.ProcedureMember: Request received to abort procedure
> offlineScans-20140923-100009
> > > org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable
> via
> ###########,16020,1408633802567:org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable:
> > >        at
> org.apache.hadoop.hbase.errorhandling.ForeignException.deserialize(ForeignException.java:171)
> > >        at
> org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.abort(ZKProcedureMemberRpcs.java:320)
> > >        at
> org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs$1.nodeCreated(ZKProcedureMemberRpcs.java:94)
> > >        at
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:327)
> > >        at
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
> > >        at
> org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
> > > Caused by:
> org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable:
> > >        at
> org.apache.hadoop.hbase.procedure.Subprocedure.cancel(Subprocedure.java:270)
> > >        at
> org.apache.hadoop.hbase.procedure.ProcedureMember.submitSubprocedure(ProcedureMember.java:171)
> > >        at
> org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.startNewSubprocedure(ZKProcedureMemberRpcs.java:214)
> > >        at
> org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.waitForNewProcedures(ZKProcedureMemberRpcs.java:172)
> > >        at
> org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.access$100(ZKProcedureMemberRpcs.java:55)
> > >        at
> org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs$1.nodeChildrenChanged(ZKProcedureMemberRpcs.java:107)
> > >        at
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:348)
> > >        ... 2 more
> > >
> > > Any ideas on how I can resolve this?
> > >
> > > -Ian
> > >
>

Re: Snapshot failure

Posted by Ian Brooks <i....@sensewhere.com>.
Hi,

I'm running snapshots on 7 tables every hour, the last run managed 15 rounds of snapshots before starting to fail.

I did notice the following while getting the success count

2014-09-10 11:01:55,453 DEBUG [rs(##############,16020,1410126483043)-snapshot-pool151-thread-1] snapshot.FlushSnapshotSubprocedure: Closing region operation on rawScans,250000000000,1408114528441.ddb14a3d8f90cc18b6dfd4f28df5c534.
2014-09-10 11:01:55,453 DEBUG [rs(##############,16020,1410126483043)-snapshot-pool151-thread-1] snapshot.FlushSnapshotSubprocedure: Starting region operation on rawScans,960000000000,1408114528443.0b3010a2712ff1b6acec0705f5d171d9.
2014-09-10 11:01:55,453 DEBUG [rs(##############,16020,1410126483043)-snapshot-pool151-thread-1] snapshot.FlushSnapshotSubprocedure: Flush Snapshotting region rawScans,960000000000,1408114528443.0b3010a2712ff1b6acec0705f5d171d9. started...
2014-09-10 11:01:55,454 WARN  [member: '##############,16020,1410126483043' subprocedure-pool1-thread-1] snapshot.RegionServerSnapshotManager: Rethrowing ForeignException from SnapshotSubprocedurePool
java.util.concurrent.ExecutionException: org.apache.hadoop.hbase.errorhandling.TimeoutException via timer-java.util.Timer@45c34a8:org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed! Source:Timeout caused Foreign Exception Start:1410343254867, End:1410343314867, diff:60000, max:60000 ms
        at java.util.concurrent.FutureTask.report(FutureTask.java:122)
        at java.util.concurrent.FutureTask.get(FutureTask.java:188)
        at org.apache.hadoop.hbase.regionserver.snapshot.RegionServerSnapshotManager$SnapshotSubprocedurePool.waitForOutstandingTasks(RegionServerSnapshotManager.java:295)
        at org.apache.hadoop.hbase.regionserver.snapshot.FlushSnapshotSubprocedure.flushSnapshot(FlushSnapshotSubprocedure.java:118)
        at org.apache.hadoop.hbase.regionserver.snapshot.FlushSnapshotSubprocedure.insideBarrier(FlushSnapshotSubprocedure.java:137)
        at org.apache.hadoop.hbase.procedure.Subprocedure.call(Subprocedure.java:181)
        at org.apache.hadoop.hbase.procedure.Subprocedure.call(Subprocedure.java:52)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.hbase.errorhandling.TimeoutException via timer-java.util.Timer@45c34a8:org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed! Source:Timeout caused Foreign Exception Start:1410343254867, End:1410343314867, diff:60000, max:60000 ms
        at org.apache.hadoop.hbase.errorhandling.ForeignExceptionDispatcher.rethrowException(ForeignExceptionDispatcher.java:83)
        at org.apache.hadoop.hbase.regionserver.HRegion.addRegionToSnapshot(HRegion.java:2816)
        at org.apache.hadoop.hbase.regionserver.snapshot.FlushSnapshotSubprocedure$RegionSnapshotTask.call(FlushSnapshotSubprocedure.java:83)
        at org.apache.hadoop.hbase.regionserver.snapshot.FlushSnapshotSubprocedure$RegionSnapshotTask.call(FlushSnapshotSubprocedure.java:65)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        ... 4 more
Caused by: org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed! Source:Timeout caused Foreign Exception Start:1410343254867, End:1410343314867, diff:60000, max:60000 ms
        at org.apache.hadoop.hbase.errorhandling.TimeoutExceptionInjector$1.run(TimeoutExceptionInjector.java:67)
        at java.util.TimerThread.mainLoop(Timer.java:555)
        at java.util.TimerThread.run(Timer.java:505)

Is there a configuration option that I need to increase to allow the snapshots to run for more than 60 seconds on a given region? That said, I'm guessing the above shouldn't cause the SubProcedure pool to fill up completely.

-Ian

On Tuesday 23 September 2014 04:33:39 Ted Yu wrote:
> Subprocedure pool is full. 
> 
> How many snapshot requests did you submit before seeing the following ?
> 
> Cheers
> 
> On Sep 23, 2014, at 2:28 AM, Ian Brooks <i....@sensewhere.com> wrote:
> 
> > Hi,
> > 
> > I'm seeing an issue on our hbase cluster which is preventing snapshots from working. So far the only way i can get it working again is to restart all the regionservers which is not ideal.
> > 
> > I'm running hbase 0.98.3 on hadoop 2.4.0, the stacktraces are
> > 
> > From host given the snapshot command via hbase shell 
> > 
> > 2014-09-23 10:00:16,116 INFO  [regionserver16020-EventThread] procedure.ZKProcedureMemberRpcs: Received procedure start children changed event: /hbase/online-snapshot/acquired
> > 2014-09-23 10:00:16,116 DEBUG [regionserver16020-EventThread] procedure.ZKProcedureMemberRpcs: Looking for new procedures under znode:'/hbase/online-snapshot/acquired'
> > 2014-09-23 10:00:16,117 DEBUG [regionserver16020-EventThread] procedure.ZKProcedureMemberRpcs: Found procedure znode: /hbase/online-snapshot/acquired/offlineScans-20140923-100009
> > 2014-09-23 10:00:16,117 DEBUG [regionserver16020-EventThread] procedure.ZKProcedureMemberRpcs: start proc data length is 59
> > 2014-09-23 10:00:16,117 DEBUG [regionserver16020-EventThread] procedure.ZKProcedureMemberRpcs: Found data for znode:/hbase/online-snapshot/acquired/offlineScans-20140923-100009
> > 2014-09-23 10:00:16,118 DEBUG [regionserver16020-EventThread] snapshot.RegionServerSnapshotManager: Launching subprocedure for snapshot offlineScans-20140923-100009 from table offlineScans
> > 2014-09-23 10:00:16,118 DEBUG [regionserver16020-EventThread] procedure.ProcedureMember: Submitting new Subprocedure:offlineScans-20140923-100009
> > 2014-09-23 10:00:16,120 DEBUG [member: '############,16020,1410126483043' subprocedure-pool1-thread-314] procedure.Subprocedure: Starting subprocedure 'offlineScans-20140923-100009' with timeout 60000ms
> > 2014-09-23 10:00:16,120 DEBUG [member: '############,16020,1410126483043' subprocedure-pool1-thread-314] errorhandling.TimeoutExceptionInjector: Scheduling process timer to run in: 60000 ms
> > 2014-09-23 10:00:16,121 DEBUG [member: '############,16020,1410126483043' subprocedure-pool1-thread-314] procedure.Subprocedure: Subprocedure 'offlineScans-20140923-100009' starting 'acquire' stage
> > 2014-09-23 10:00:16,121 DEBUG [member: '############,16020,1410126483043' subprocedure-pool1-thread-314] procedure.Subprocedure: Subprocedure 'offlineScans-20140923-100009' locally acquired
> > 2014-09-23 10:00:16,121 DEBUG [member: '############,16020,1410126483043' subprocedure-pool1-thread-314] procedure.ZKProcedureMemberRpcs: Member: '##########,16020,1410126483043' joining acquired barrier for procedure (offlineScans-20140923-100009) in zk
> > 2014-09-23 10:00:16,123 DEBUG [member: '############,16020,1410126483043' subprocedure-pool1-thread-314] procedure.ZKProcedureMemberRpcs: Watch for global barrier reached:/hbase/online-snapshot/reached/offlineScans-20140923-100009
> > 2014-09-23 10:00:16,124 DEBUG [member: '############,16020,1410126483043' subprocedure-pool1-thread-314] procedure.Subprocedure: Subprocedure 'offlineScans-20140923-100009' coordinator notified of 'acquire', waiting on 'reached' or 'abort' from coordinator
> > 2014-09-23 10:00:16,133 INFO  [regionserver16020-EventThread] procedure.ZKProcedureMemberRpcs: Received created event:/hbase/online-snapshot/abort/offlineScans-20140923-100009
> > 2014-09-23 10:00:16,134 DEBUG [regionserver16020-EventThread] procedure.ZKProcedureMemberRpcs: Aborting procedure member for znode /hbase/online-snapshot/abort/offlineScans-20140923-100009
> > 2014-09-23 10:00:16,134 DEBUG [regionserver16020-EventThread] procedure.ProcedureMember: Request received to abort procedure offlineScans-20140923-100009
> > org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable via ###############,16020,1408633802567:org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable: 
> >        at org.apache.hadoop.hbase.errorhandling.ForeignException.deserialize(ForeignException.java:171)
> >        at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.abort(ZKProcedureMemberRpcs.java:320)
> >        at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs$1.nodeCreated(ZKProcedureMemberRpcs.java:94)
> >        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:327)
> >        at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
> >        at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
> > Caused by: org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable: 
> >        at org.apache.hadoop.hbase.procedure.Subprocedure.cancel(Subprocedure.java:270)
> >        at org.apache.hadoop.hbase.procedure.ProcedureMember.submitSubprocedure(ProcedureMember.java:171)
> >        at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.startNewSubprocedure(ZKProcedureMemberRpcs.java:214)
> >        at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.waitForNewProcedures(ZKProcedureMemberRpcs.java:172)
> >        at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.access$100(ZKProcedureMemberRpcs.java:55)
> >        at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs$1.nodeChildrenChanged(ZKProcedureMemberRpcs.java:107)
> >        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:348)
> >        ... 2 more
> > 
> > 
> > 
> > From the regionserver it was trying to contact
> > 
> > 
> > 2014-09-23 10:00:16,124 INFO  [regionserver16020-EventThread] procedure.ZKProcedureMemberRpcs: Received procedure start children changed event: /hbase/online-snapshot/acquired
> > 2014-09-23 10:00:16,124 DEBUG [regionserver16020-EventThread] procedure.ZKProcedureMemberRpcs: Looking for new procedures under znode:'/hbase/online-snapshot/acquired'
> > 2014-09-23 10:00:16,125 DEBUG [regionserver16020-EventThread] procedure.ZKProcedureMemberRpcs: Found procedure znode: /hbase/online-snapshot/acquired/offlineScans-20140923-100009
> > 2014-09-23 10:00:16,127 DEBUG [regionserver16020-EventThread] procedure.ZKProcedureMemberRpcs: start proc data length is 59
> > 2014-09-23 10:00:16,127 DEBUG [regionserver16020-EventThread] procedure.ZKProcedureMemberRpcs: Found data for znode:/hbase/online-snapshot/acquired/offlineScans-20140923-100009
> > 2014-09-23 10:00:16,127 DEBUG [regionserver16020-EventThread] snapshot.RegionServerSnapshotManager: Launching subprocedure for snapshot offlineScans-20140923-100009 from table offlineScans
> > 2014-09-23 10:00:16,127 DEBUG [regionserver16020-EventThread] procedure.ProcedureMember: Submitting new Subprocedure:offlineScans-20140923-100009
> > 2014-09-23 10:00:16,128 ERROR [regionserver16020-EventThread] procedure.Subprocedure: Subprocedure pool is full!
> > 2014-09-23 10:00:16,128 DEBUG [regionserver16020-EventThread] errorhandling.ForeignExceptionDispatcher: offlineScans-20140923-100009 accepting received exception
> > 2014-09-23 10:00:16,128 DEBUG [regionserver16020-EventThread] procedure.ZKProcedureMemberRpcs: Aborting procedure (offlineScans-20140923-100009) in zk
> > 2014-09-23 10:00:16,134 DEBUG [regionserver16020-EventThread] procedure.ZKProcedureMemberRpcs: Finished creating abort znode:/hbase/online-snapshot/abort/offlineScans-20140923-100009
> > 2014-09-23 10:00:16,134 ERROR [regionserver16020-EventThread] procedure.ProcedureMember: Failed to start subprocedure 'offlineScans-20140923-100009'
> > 2014-09-23 10:00:16,134 INFO  [regionserver16020-EventThread] procedure.ZKProcedureMemberRpcs: Received created event:/hbase/online-snapshot/abort/offlineScans-20140923-100009
> > 2014-09-23 10:00:16,134 DEBUG [regionserver16020-EventThread] procedure.ZKProcedureMemberRpcs: Aborting procedure member for znode /hbase/online-snapshot/abort/offlineScans-20140923-100009
> > 2014-09-23 10:00:16,145 DEBUG [regionserver16020-EventThread] procedure.ProcedureMember: Request received to abort procedure offlineScans-20140923-100009
> > org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable via ###########,16020,1408633802567:org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable:
> >        at org.apache.hadoop.hbase.errorhandling.ForeignException.deserialize(ForeignException.java:171)
> >        at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.abort(ZKProcedureMemberRpcs.java:320)
> >        at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs$1.nodeCreated(ZKProcedureMemberRpcs.java:94)
> >        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:327)
> >        at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
> >        at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
> > Caused by: org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable:
> >        at org.apache.hadoop.hbase.procedure.Subprocedure.cancel(Subprocedure.java:270)
> >        at org.apache.hadoop.hbase.procedure.ProcedureMember.submitSubprocedure(ProcedureMember.java:171)
> >        at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.startNewSubprocedure(ZKProcedureMemberRpcs.java:214)
> >        at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.waitForNewProcedures(ZKProcedureMemberRpcs.java:172)
> >        at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.access$100(ZKProcedureMemberRpcs.java:55)
> >        at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs$1.nodeChildrenChanged(ZKProcedureMemberRpcs.java:107)
> >        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:348)
> >        ... 2 more
> > 
> > Any ideas on how I can resolve this?
> > 
> > -Ian
> > 

Re: Snapshot failure

Posted by Ted Yu <yu...@gmail.com>.
Subprocedure pool is full. 

How many snapshot requests did you submit before seeing the following ?

Cheers

On Sep 23, 2014, at 2:28 AM, Ian Brooks <i....@sensewhere.com> wrote:

> Hi,
> 
> I'm seeing an issue on our hbase cluster which is preventing snapshots from working. So far the only way i can get it working again is to restart all the regionservers which is not ideal.
> 
> I'm running hbase 0.98.3 on hadoop 2.4.0, the stacktraces are
> 
> From host given the snapshot command via hbase shell 
> 
> 2014-09-23 10:00:16,116 INFO  [regionserver16020-EventThread] procedure.ZKProcedureMemberRpcs: Received procedure start children changed event: /hbase/online-snapshot/acquired
> 2014-09-23 10:00:16,116 DEBUG [regionserver16020-EventThread] procedure.ZKProcedureMemberRpcs: Looking for new procedures under znode:'/hbase/online-snapshot/acquired'
> 2014-09-23 10:00:16,117 DEBUG [regionserver16020-EventThread] procedure.ZKProcedureMemberRpcs: Found procedure znode: /hbase/online-snapshot/acquired/offlineScans-20140923-100009
> 2014-09-23 10:00:16,117 DEBUG [regionserver16020-EventThread] procedure.ZKProcedureMemberRpcs: start proc data length is 59
> 2014-09-23 10:00:16,117 DEBUG [regionserver16020-EventThread] procedure.ZKProcedureMemberRpcs: Found data for znode:/hbase/online-snapshot/acquired/offlineScans-20140923-100009
> 2014-09-23 10:00:16,118 DEBUG [regionserver16020-EventThread] snapshot.RegionServerSnapshotManager: Launching subprocedure for snapshot offlineScans-20140923-100009 from table offlineScans
> 2014-09-23 10:00:16,118 DEBUG [regionserver16020-EventThread] procedure.ProcedureMember: Submitting new Subprocedure:offlineScans-20140923-100009
> 2014-09-23 10:00:16,120 DEBUG [member: '############,16020,1410126483043' subprocedure-pool1-thread-314] procedure.Subprocedure: Starting subprocedure 'offlineScans-20140923-100009' with timeout 60000ms
> 2014-09-23 10:00:16,120 DEBUG [member: '############,16020,1410126483043' subprocedure-pool1-thread-314] errorhandling.TimeoutExceptionInjector: Scheduling process timer to run in: 60000 ms
> 2014-09-23 10:00:16,121 DEBUG [member: '############,16020,1410126483043' subprocedure-pool1-thread-314] procedure.Subprocedure: Subprocedure 'offlineScans-20140923-100009' starting 'acquire' stage
> 2014-09-23 10:00:16,121 DEBUG [member: '############,16020,1410126483043' subprocedure-pool1-thread-314] procedure.Subprocedure: Subprocedure 'offlineScans-20140923-100009' locally acquired
> 2014-09-23 10:00:16,121 DEBUG [member: '############,16020,1410126483043' subprocedure-pool1-thread-314] procedure.ZKProcedureMemberRpcs: Member: '##########,16020,1410126483043' joining acquired barrier for procedure (offlineScans-20140923-100009) in zk
> 2014-09-23 10:00:16,123 DEBUG [member: '############,16020,1410126483043' subprocedure-pool1-thread-314] procedure.ZKProcedureMemberRpcs: Watch for global barrier reached:/hbase/online-snapshot/reached/offlineScans-20140923-100009
> 2014-09-23 10:00:16,124 DEBUG [member: '############,16020,1410126483043' subprocedure-pool1-thread-314] procedure.Subprocedure: Subprocedure 'offlineScans-20140923-100009' coordinator notified of 'acquire', waiting on 'reached' or 'abort' from coordinator
> 2014-09-23 10:00:16,133 INFO  [regionserver16020-EventThread] procedure.ZKProcedureMemberRpcs: Received created event:/hbase/online-snapshot/abort/offlineScans-20140923-100009
> 2014-09-23 10:00:16,134 DEBUG [regionserver16020-EventThread] procedure.ZKProcedureMemberRpcs: Aborting procedure member for znode /hbase/online-snapshot/abort/offlineScans-20140923-100009
> 2014-09-23 10:00:16,134 DEBUG [regionserver16020-EventThread] procedure.ProcedureMember: Request received to abort procedure offlineScans-20140923-100009
> org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable via ###############,16020,1408633802567:org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable: 
>        at org.apache.hadoop.hbase.errorhandling.ForeignException.deserialize(ForeignException.java:171)
>        at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.abort(ZKProcedureMemberRpcs.java:320)
>        at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs$1.nodeCreated(ZKProcedureMemberRpcs.java:94)
>        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:327)
>        at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
>        at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
> Caused by: org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable: 
>        at org.apache.hadoop.hbase.procedure.Subprocedure.cancel(Subprocedure.java:270)
>        at org.apache.hadoop.hbase.procedure.ProcedureMember.submitSubprocedure(ProcedureMember.java:171)
>        at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.startNewSubprocedure(ZKProcedureMemberRpcs.java:214)
>        at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.waitForNewProcedures(ZKProcedureMemberRpcs.java:172)
>        at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.access$100(ZKProcedureMemberRpcs.java:55)
>        at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs$1.nodeChildrenChanged(ZKProcedureMemberRpcs.java:107)
>        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:348)
>        ... 2 more
> 
> 
> 
> From the regionserver it was trying to contact
> 
> 
> 2014-09-23 10:00:16,124 INFO  [regionserver16020-EventThread] procedure.ZKProcedureMemberRpcs: Received procedure start children changed event: /hbase/online-snapshot/acquired
> 2014-09-23 10:00:16,124 DEBUG [regionserver16020-EventThread] procedure.ZKProcedureMemberRpcs: Looking for new procedures under znode:'/hbase/online-snapshot/acquired'
> 2014-09-23 10:00:16,125 DEBUG [regionserver16020-EventThread] procedure.ZKProcedureMemberRpcs: Found procedure znode: /hbase/online-snapshot/acquired/offlineScans-20140923-100009
> 2014-09-23 10:00:16,127 DEBUG [regionserver16020-EventThread] procedure.ZKProcedureMemberRpcs: start proc data length is 59
> 2014-09-23 10:00:16,127 DEBUG [regionserver16020-EventThread] procedure.ZKProcedureMemberRpcs: Found data for znode:/hbase/online-snapshot/acquired/offlineScans-20140923-100009
> 2014-09-23 10:00:16,127 DEBUG [regionserver16020-EventThread] snapshot.RegionServerSnapshotManager: Launching subprocedure for snapshot offlineScans-20140923-100009 from table offlineScans
> 2014-09-23 10:00:16,127 DEBUG [regionserver16020-EventThread] procedure.ProcedureMember: Submitting new Subprocedure:offlineScans-20140923-100009
> 2014-09-23 10:00:16,128 ERROR [regionserver16020-EventThread] procedure.Subprocedure: Subprocedure pool is full!
> 2014-09-23 10:00:16,128 DEBUG [regionserver16020-EventThread] errorhandling.ForeignExceptionDispatcher: offlineScans-20140923-100009 accepting received exception
> 2014-09-23 10:00:16,128 DEBUG [regionserver16020-EventThread] procedure.ZKProcedureMemberRpcs: Aborting procedure (offlineScans-20140923-100009) in zk
> 2014-09-23 10:00:16,134 DEBUG [regionserver16020-EventThread] procedure.ZKProcedureMemberRpcs: Finished creating abort znode:/hbase/online-snapshot/abort/offlineScans-20140923-100009
> 2014-09-23 10:00:16,134 ERROR [regionserver16020-EventThread] procedure.ProcedureMember: Failed to start subprocedure 'offlineScans-20140923-100009'
> 2014-09-23 10:00:16,134 INFO  [regionserver16020-EventThread] procedure.ZKProcedureMemberRpcs: Received created event:/hbase/online-snapshot/abort/offlineScans-20140923-100009
> 2014-09-23 10:00:16,134 DEBUG [regionserver16020-EventThread] procedure.ZKProcedureMemberRpcs: Aborting procedure member for znode /hbase/online-snapshot/abort/offlineScans-20140923-100009
> 2014-09-23 10:00:16,145 DEBUG [regionserver16020-EventThread] procedure.ProcedureMember: Request received to abort procedure offlineScans-20140923-100009
> org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable via ###########,16020,1408633802567:org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable:
>        at org.apache.hadoop.hbase.errorhandling.ForeignException.deserialize(ForeignException.java:171)
>        at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.abort(ZKProcedureMemberRpcs.java:320)
>        at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs$1.nodeCreated(ZKProcedureMemberRpcs.java:94)
>        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:327)
>        at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
>        at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
> Caused by: org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable:
>        at org.apache.hadoop.hbase.procedure.Subprocedure.cancel(Subprocedure.java:270)
>        at org.apache.hadoop.hbase.procedure.ProcedureMember.submitSubprocedure(ProcedureMember.java:171)
>        at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.startNewSubprocedure(ZKProcedureMemberRpcs.java:214)
>        at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.waitForNewProcedures(ZKProcedureMemberRpcs.java:172)
>        at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.access$100(ZKProcedureMemberRpcs.java:55)
>        at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs$1.nodeChildrenChanged(ZKProcedureMemberRpcs.java:107)
>        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:348)
>        ... 2 more
> 
> Any ideas on how I can resolve this?
> 
> -Ian
>