You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@hbase.apache.org by Ian Brooks <i....@sensewhere.com> on 2014/10/16 17:04:40 UTC

Re: Snapshot failure

Hi,

I got a change finally to apply the timeout changes and it lasted longer before it started failing, but its now just throwing the same errors on "procedure.Subprocedure: Subprocedure pool is full!"

Looking through the logs, it looks like the nodes may not correctly updating they completion state after finishing a snapshot,

eg.

2014-09-25 21:03:15,713 INFO  [regionserver16020-EventThread] procedure.ZKProcedureMemberRpcs: Received procedure start children changed event: /hbase/online-snapshot/acquired
2014-09-25 21:03:15,713 DEBUG [regionserver16020-EventThread] procedure.ZKProcedureMemberRpcs: Looking for new procedures under znode:'/hbase/online-snapshot/acquired'
2014-09-25 21:03:15,715 DEBUG [regionserver16020-EventThread] procedure.ZKProcedureMemberRpcs: Found procedure znode: /hbase/online-snapshot/acquired/requestData-20140925-210256
2014-09-25 21:03:15,718 DEBUG [regionserver16020-EventThread] procedure.ZKProcedureMemberRpcs: start proc data length is 57
2014-09-25 21:03:15,718 DEBUG [regionserver16020-EventThread] procedure.ZKProcedureMemberRpcs: Found data for znode:/hbase/online-snapshot/acquired/requestData-20140925-210256
2014-09-25 21:03:15,718 DEBUG [regionserver16020-EventThread] snapshot.RegionServerSnapshotManager: Launching subprocedure for snapshot requestData-20140925-210256 from table requestData
2014-09-25 21:03:15,718 DEBUG [regionserver16020-EventThread] procedure.ProcedureMember: Submitting new Subprocedure:requestData-20140925-210256
2014-09-25 21:03:15,719 DEBUG [member: '###########-005,16020,1411640680181' subprocedure-pool1-thread-1] procedure.Subprocedure: Starting subprocedure 'requestData-20140925-210256' with timeout 600000ms
2014-09-25 21:03:15,719 DEBUG [member: '###########-005,16020,1411640680181' subprocedure-pool1-thread-1] errorhandling.TimeoutExceptionInjector: Scheduling process timer to run in: 600000 ms
2014-09-25 21:03:15,720 DEBUG [member: '###########-005,16020,1411640680181' subprocedure-pool1-thread-1] procedure.Subprocedure: Subprocedure 'requestData-20140925-210256' starting 'acquire' stage
2014-09-25 21:03:15,720 DEBUG [member: '###########-005,16020,1411640680181' subprocedure-pool1-thread-1] procedure.Subprocedure: Subprocedure 'requestData-20140925-210256' locally acquired
2014-09-25 21:03:15,720 DEBUG [member: '###########-005,16020,1411640680181' subprocedure-pool1-thread-1] procedure.ZKProcedureMemberRpcs: Member: '###########-005,16020,1411640680181' joining acquired barrier fo
r procedure (requestData-20140925-210256) in zk
2014-09-25 21:03:15,724 DEBUG [member: '###########-005,16020,1411640680181' subprocedure-pool1-thread-1] procedure.ZKProcedureMemberRpcs: Watch for global barrier reached:/hbase/online-snapshot/reached/request
Data-20140925-210256
2014-09-25 21:03:15,739 DEBUG [member: '###########-005,16020,1411640680181' subprocedure-pool1-thread-1] procedure.Subprocedure: Subprocedure 'requestData-20140925-210256' coordinator notified of 'acquire', wa
iting on 'reached' or 'abort' from coordinator
2014-09-25 21:03:15,825 INFO  [regionserver16020-EventThread] procedure.ZKProcedureMemberRpcs: Received created event:/hbase/online-snapshot/reached/requestData-20140925-210256
2014-09-25 21:03:15,825 DEBUG [regionserver16020-EventThread] procedure.ZKProcedureMemberRpcs: Recieved reached global barrier:/hbase/online-snapshot/reached/requestData-20140925-210256
2014-09-25 21:03:15,825 DEBUG [member: '###########-005,16020,1411640680181' subprocedure-pool1-thread-1] procedure.Subprocedure: Subprocedure 'requestData-20140925-210256' received 'reached' from coordinator.

...

2014-09-25 21:03:38,026 DEBUG [member: '###########-005,16020,1411640680181' subprocedure-pool1-thread-1] snapshot.RegionServerSnapshotManager: Completed 5 local region snapshots.
2014-09-25 21:03:38,026 DEBUG [member: '###########-005,16020,1411640680181' subprocedure-pool1-thread-1] snapshot.RegionServerSnapshotManager: cancelling 0 tasks for snapshot ###########-005,16020,1411640680181
2014-09-25 21:03:38,026 DEBUG [member: '###########-005,16020,1411640680181' subprocedure-pool1-thread-1] procedure.Subprocedure: Subprocedure 'requestData-20140925-210256' locally completed
2014-09-25 21:03:38,026 DEBUG [member: '###########-005,16020,1411640680181' subprocedure-pool1-thread-1] procedure.ZKProcedureMemberRpcs: Marking procedure  'requestData-20140925-210256' completed for member '
###########-005,16020,1411640680181' in zk
2014-09-25 21:03:38,028 DEBUG [member: '###########-005,16020,1411640680181' subprocedure-pool1-thread-1] procedure.Subprocedure: Subprocedure 'requestData-20140925-210256' has notified controller of completion
2014-09-25 21:03:38,028 DEBUG [member: '###########-005,16020,1411640680181' subprocedure-pool1-thread-1] errorhandling.TimeoutExceptionInjector: Marking timer as complete - no error notifications will be recei
ved for this timer.
2014-09-25 21:03:38,028 DEBUG [member: '###########-005,16020,1411640680181' subprocedure-pool1-thread-1] procedure.Subprocedure: Subprocedure 'requestData-20140925-210256' completed.
2


Which suggests that "requestData-20140925-210256" completed happily, however, 10 minutes later the following comes up in the logs


2014-09-25 21:13:15,777 INFO  [regionserver16020-EventThread] procedure.ZKProcedureMemberRpcs: Received created event:/hbase/online-snapshot/abort/requestData-20140925-210256
2014-09-25 21:13:15,777 DEBUG [regionserver16020-EventThread] procedure.ZKProcedureMemberRpcs: Aborting procedure member for znode /hbase/online-snapshot/abort/requestData-20140925-210256
2014-09-25 21:13:15,778 DEBUG [regionserver16020-EventThread] procedure.ProcedureMember: Request received to abort procedure requestData-20140925-210256
org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable via timer-java.util.Timer@5cb99ea6:org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable: org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed! Source:Timeout caused Foreign Exception Start:1411675395727, End:1411675995727, diff:600000, max:600000 ms
        at org.apache.hadoop.hbase.errorhandling.ForeignException.deserialize(ForeignException.java:171)
        at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.abort(ZKProcedureMemberRpcs.java:320)
        at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs$1.nodeCreated(ZKProcedureMemberRpcs.java:94)
        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:327)
        at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
        at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
Caused by: org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable: org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed! Source:Timeout caused Foreign Exception Start:1411675395727, End:1411675995727, diff:600000, max:600000 ms
        at org.apache.hadoop.hbase.errorhandling.TimeoutExceptionInjector$1.run(TimeoutExceptionInjector.java:70)
        at java.util.TimerThread.mainLoop(Timer.java:555)
        at java.util.TimerThread.run(Timer.java:505)
2014-09-25 21:13:15,780 INFO  [regionserver16020-EventThread] procedure.ProcedureMember: Received abort on procedure with no local subprocedure requestData-20140925-210256, ignoring it.
org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable via timer-java.util.Timer@5cb99ea6:org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable: org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed! Source:Timeout caused Foreign Exception Start:1411675395727, End:1411675995727, diff:600000, max:600000 ms
        at org.apache.hadoop.hbase.errorhandling.ForeignException.deserialize(ForeignException.java:171)
        at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.abort(ZKProcedureMemberRpcs.java:320)
        at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs$1.nodeCreated(ZKProcedureMemberRpcs.java:94)
        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:327)
        at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
        at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
Caused by: org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable: org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed! Source:Timeout caused Foreign Exception Start:1411675395727, End:1411675995727, diff:600000, max:600000 ms
        at org.apache.hadoop.hbase.errorhandling.TimeoutExceptionInjector$1.run(TimeoutExceptionInjector.java:70)
        at java.util.TimerThread.mainLoop(Timer.java:555)
        at java.util.TimerThread.run(Timer.java:505)





-Ian Brooks


On Tuesday 23 September 2014 06:52:34 Ted Yu wrote:
> Here're the config parameters related to controlling snapshot timeout:
> 
>   <property>
>     <name>hbase.snapshot.master.timeoutMillis</name>
>     <!-- Change from default of 60s to 600s to allow for slow flushing of
> tables -->
>     <value>600000</value>
>     <description>
>       This is the time HBase master waits for the snapshot operation to
> complete.
>       Do not confuse this hbase.snapshot.master.timeout.millis, which
> although
>       sounding similar, serves a very different purpose.
>       Note: This property has a completely different meaning before hbase
> version
>       0.94.11 and should not enabled on a cluster using snapshots and
> running
>       a version before 0.94.11.
>     </description>
>   </property>
>   <property>
>     <name>hbase.snapshot.master.timeout.millis</name>
>     <!-- Change from default of 60s to 600s to allow for slow flushing of
> tables -->
>     <value>600000</value>
>     <description>
>       This is the timeout the master indicates the client to wait when it
> takes
>       the snapshot. The client actually waits longer than this due to
> exponential
>       backoff. See HBaseAdmin.snapshot for the exact algorithm.
>     </description>
>   </property>
>   <property>
>     <name>hbase.snapshot.region.timeout</name>
>     <!-- Change from default of 60s to 600s to allow for slow flushing of
> tables -->
>     <value>600000</value>
>     <description>
>       This is the time the regionserver waits to complete all of its
> activities
>       for a snapshot operation.
>     </description>
>   </property>
> 
> On Tue, Sep 23, 2014 at 6:05 AM, Ian Brooks <i....@sensewhere.com> wrote:
> 
> > Hi,
> >
> > I'm running snapshots on 7 tables every hour, the last run managed 15
> > rounds of snapshots before starting to fail.
> >
> > I did notice the following while getting the success count
> >
> > 2014-09-10 11:01:55,453 DEBUG
> > [rs(##############,16020,1410126483043)-snapshot-pool151-thread-1]
> > snapshot.FlushSnapshotSubprocedure: Closing region operation on
> > rawScans,250000000000,1408114528441.ddb14a3d8f90cc18b6dfd4f28df5c534.
> > 2014-09-10 11:01:55,453 DEBUG
> > [rs(##############,16020,1410126483043)-snapshot-pool151-thread-1]
> > snapshot.FlushSnapshotSubprocedure: Starting region operation on
> > rawScans,960000000000,1408114528443.0b3010a2712ff1b6acec0705f5d171d9.
> > 2014-09-10 11:01:55,453 DEBUG
> > [rs(##############,16020,1410126483043)-snapshot-pool151-thread-1]
> > snapshot.FlushSnapshotSubprocedure: Flush Snapshotting region
> > rawScans,960000000000,1408114528443.0b3010a2712ff1b6acec0705f5d171d9.
> > started...
> > 2014-09-10 11:01:55,454 WARN  [member:
> > '##############,16020,1410126483043' subprocedure-pool1-thread-1]
> > snapshot.RegionServerSnapshotManager: Rethrowing ForeignException from
> > SnapshotSubprocedurePool
> > java.util.concurrent.ExecutionException:
> > org.apache.hadoop.hbase.errorhandling.TimeoutException via
> > timer-java.util.Timer@45c34a8:org.apache.hadoop.hbase.errorhandling.TimeoutException:
> > Timeout elapsed! Source:Timeout caused Foreign Exception
> > Start:1410343254867, End:1410343314867, diff:60000, max:60000 ms
> >         at java.util.concurrent.FutureTask.report(FutureTask.java:122)
> >         at java.util.concurrent.FutureTask.get(FutureTask.java:188)
> >         at
> > org.apache.hadoop.hbase.regionserver.snapshot.RegionServerSnapshotManager$SnapshotSubprocedurePool.waitForOutstandingTasks(RegionServerSnapshotManager.java:295)
> >         at
> > org.apache.hadoop.hbase.regionserver.snapshot.FlushSnapshotSubprocedure.flushSnapshot(FlushSnapshotSubprocedure.java:118)
> >         at
> > org.apache.hadoop.hbase.regionserver.snapshot.FlushSnapshotSubprocedure.insideBarrier(FlushSnapshotSubprocedure.java:137)
> >         at
> > org.apache.hadoop.hbase.procedure.Subprocedure.call(Subprocedure.java:181)
> >         at
> > org.apache.hadoop.hbase.procedure.Subprocedure.call(Subprocedure.java:52)
> >         at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> >         at
> > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> >         at
> > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> >         at java.lang.Thread.run(Thread.java:745)
> > Caused by: org.apache.hadoop.hbase.errorhandling.TimeoutException via
> > timer-java.util.Timer@45c34a8:org.apache.hadoop.hbase.errorhandling.TimeoutException:
> > Timeout elapsed! Source:Timeout caused Foreign Exception
> > Start:1410343254867, End:1410343314867, diff:60000, max:60000 ms
> >         at
> > org.apache.hadoop.hbase.errorhandling.ForeignExceptionDispatcher.rethrowException(ForeignExceptionDispatcher.java:83)
> >         at
> > org.apache.hadoop.hbase.regionserver.HRegion.addRegionToSnapshot(HRegion.java:2816)
> >         at
> > org.apache.hadoop.hbase.regionserver.snapshot.FlushSnapshotSubprocedure$RegionSnapshotTask.call(FlushSnapshotSubprocedure.java:83)
> >         at
> > org.apache.hadoop.hbase.regionserver.snapshot.FlushSnapshotSubprocedure$RegionSnapshotTask.call(FlushSnapshotSubprocedure.java:65)
> >         at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> >         at
> > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
> >         ... 4 more
> > Caused by: org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout
> > elapsed! Source:Timeout caused Foreign Exception Start:1410343254867,
> > End:1410343314867, diff:60000, max:60000 ms
> >         at
> > org.apache.hadoop.hbase.errorhandling.TimeoutExceptionInjector$1.run(TimeoutExceptionInjector.java:67)
> >         at java.util.TimerThread.mainLoop(Timer.java:555)
> >         at java.util.TimerThread.run(Timer.java:505)
> >
> > Is there a configuration option that I need to increase to allow the
> > snapshots to run for more than 60 seconds on a given region? That said, I'm
> > guessing the above shouldn't cause the SubProcedure pool to fill up
> > completely.
> >
> > -Ian
> >
> > On Tuesday 23 September 2014 04:33:39 Ted Yu wrote:
> > > Subprocedure pool is full.
> > >
> > > How many snapshot requests did you submit before seeing the following ?
> > >
> > > Cheers
> > >
> > > On Sep 23, 2014, at 2:28 AM, Ian Brooks <i....@sensewhere.com> wrote:
> > >
> > > > Hi,
> > > >
> > > > I'm seeing an issue on our hbase cluster which is preventing snapshots
> > from working. So far the only way i can get it working again is to restart
> > all the regionservers which is not ideal.
> > > >
> > > > I'm running hbase 0.98.3 on hadoop 2.4.0, the stacktraces are
> > > >
> > > > From host given the snapshot command via hbase shell
> > > >
> > > > 2014-09-23 10:00:16,116 INFO  [regionserver16020-EventThread]
> > procedure.ZKProcedureMemberRpcs: Received procedure start children changed
> > event: /hbase/online-snapshot/acquired
> > > > 2014-09-23 10:00:16,116 DEBUG [regionserver16020-EventThread]
> > procedure.ZKProcedureMemberRpcs: Looking for new procedures under
> > znode:'/hbase/online-snapshot/acquired'
> > > > 2014-09-23 10:00:16,117 DEBUG [regionserver16020-EventThread]
> > procedure.ZKProcedureMemberRpcs: Found procedure znode:
> > /hbase/online-snapshot/acquired/offlineScans-20140923-100009
> > > > 2014-09-23 10:00:16,117 DEBUG [regionserver16020-EventThread]
> > procedure.ZKProcedureMemberRpcs: start proc data length is 59
> > > > 2014-09-23 10:00:16,117 DEBUG [regionserver16020-EventThread]
> > procedure.ZKProcedureMemberRpcs: Found data for
> > znode:/hbase/online-snapshot/acquired/offlineScans-20140923-100009
> > > > 2014-09-23 10:00:16,118 DEBUG [regionserver16020-EventThread]
> > snapshot.RegionServerSnapshotManager: Launching subprocedure for snapshot
> > offlineScans-20140923-100009 from table offlineScans
> > > > 2014-09-23 10:00:16,118 DEBUG [regionserver16020-EventThread]
> > procedure.ProcedureMember: Submitting new
> > Subprocedure:offlineScans-20140923-100009
> > > > 2014-09-23 10:00:16,120 DEBUG [member:
> > '############,16020,1410126483043' subprocedure-pool1-thread-314]
> > procedure.Subprocedure: Starting subprocedure
> > 'offlineScans-20140923-100009' with timeout 60000ms
> > > > 2014-09-23 10:00:16,120 DEBUG [member:
> > '############,16020,1410126483043' subprocedure-pool1-thread-314]
> > errorhandling.TimeoutExceptionInjector: Scheduling process timer to run in:
> > 60000 ms
> > > > 2014-09-23 10:00:16,121 DEBUG [member:
> > '############,16020,1410126483043' subprocedure-pool1-thread-314]
> > procedure.Subprocedure: Subprocedure 'offlineScans-20140923-100009'
> > starting 'acquire' stage
> > > > 2014-09-23 10:00:16,121 DEBUG [member:
> > '############,16020,1410126483043' subprocedure-pool1-thread-314]
> > procedure.Subprocedure: Subprocedure 'offlineScans-20140923-100009' locally
> > acquired
> > > > 2014-09-23 10:00:16,121 DEBUG [member:
> > '############,16020,1410126483043' subprocedure-pool1-thread-314]
> > procedure.ZKProcedureMemberRpcs: Member: '##########,16020,1410126483043'
> > joining acquired barrier for procedure (offlineScans-20140923-100009) in zk
> > > > 2014-09-23 10:00:16,123 DEBUG [member:
> > '############,16020,1410126483043' subprocedure-pool1-thread-314]
> > procedure.ZKProcedureMemberRpcs: Watch for global barrier
> > reached:/hbase/online-snapshot/reached/offlineScans-20140923-100009
> > > > 2014-09-23 10:00:16,124 DEBUG [member:
> > '############,16020,1410126483043' subprocedure-pool1-thread-314]
> > procedure.Subprocedure: Subprocedure 'offlineScans-20140923-100009'
> > coordinator notified of 'acquire', waiting on 'reached' or 'abort' from
> > coordinator
> > > > 2014-09-23 10:00:16,133 INFO  [regionserver16020-EventThread]
> > procedure.ZKProcedureMemberRpcs: Received created
> > event:/hbase/online-snapshot/abort/offlineScans-20140923-100009
> > > > 2014-09-23 10:00:16,134 DEBUG [regionserver16020-EventThread]
> > procedure.ZKProcedureMemberRpcs: Aborting procedure member for znode
> > /hbase/online-snapshot/abort/offlineScans-20140923-100009
> > > > 2014-09-23 10:00:16,134 DEBUG [regionserver16020-EventThread]
> > procedure.ProcedureMember: Request received to abort procedure
> > offlineScans-20140923-100009
> > > > org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable
> > via
> > ###############,16020,1408633802567:org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable:
> > > >        at
> > org.apache.hadoop.hbase.errorhandling.ForeignException.deserialize(ForeignException.java:171)
> > > >        at
> > org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.abort(ZKProcedureMemberRpcs.java:320)
> > > >        at
> > org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs$1.nodeCreated(ZKProcedureMemberRpcs.java:94)
> > > >        at
> > org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:327)
> > > >        at
> > org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
> > > >        at
> > org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
> > > > Caused by:
> > org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable:
> > > >        at
> > org.apache.hadoop.hbase.procedure.Subprocedure.cancel(Subprocedure.java:270)
> > > >        at
> > org.apache.hadoop.hbase.procedure.ProcedureMember.submitSubprocedure(ProcedureMember.java:171)
> > > >        at
> > org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.startNewSubprocedure(ZKProcedureMemberRpcs.java:214)
> > > >        at
> > org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.waitForNewProcedures(ZKProcedureMemberRpcs.java:172)
> > > >        at
> > org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.access$100(ZKProcedureMemberRpcs.java:55)
> > > >        at
> > org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs$1.nodeChildrenChanged(ZKProcedureMemberRpcs.java:107)
> > > >        at
> > org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:348)
> > > >        ... 2 more
> > > >
> > > >
> > > >
> > > > From the regionserver it was trying to contact
> > > >
> > > >
> > > > 2014-09-23 10:00:16,124 INFO  [regionserver16020-EventThread]
> > procedure.ZKProcedureMemberRpcs: Received procedure start children changed
> > event: /hbase/online-snapshot/acquired
> > > > 2014-09-23 10:00:16,124 DEBUG [regionserver16020-EventThread]
> > procedure.ZKProcedureMemberRpcs: Looking for new procedures under
> > znode:'/hbase/online-snapshot/acquired'
> > > > 2014-09-23 10:00:16,125 DEBUG [regionserver16020-EventThread]
> > procedure.ZKProcedureMemberRpcs: Found procedure znode:
> > /hbase/online-snapshot/acquired/offlineScans-20140923-100009
> > > > 2014-09-23 10:00:16,127 DEBUG [regionserver16020-EventThread]
> > procedure.ZKProcedureMemberRpcs: start proc data length is 59
> > > > 2014-09-23 10:00:16,127 DEBUG [regionserver16020-EventThread]
> > procedure.ZKProcedureMemberRpcs: Found data for
> > znode:/hbase/online-snapshot/acquired/offlineScans-20140923-100009
> > > > 2014-09-23 10:00:16,127 DEBUG [regionserver16020-EventThread]
> > snapshot.RegionServerSnapshotManager: Launching subprocedure for snapshot
> > offlineScans-20140923-100009 from table offlineScans
> > > > 2014-09-23 10:00:16,127 DEBUG [regionserver16020-EventThread]
> > procedure.ProcedureMember: Submitting new
> > Subprocedure:offlineScans-20140923-100009
> > > > 2014-09-23 10:00:16,128 ERROR [regionserver16020-EventThread]
> > procedure.Subprocedure: Subprocedure pool is full!
> > > > 2014-09-23 10:00:16,128 DEBUG [regionserver16020-EventThread]
> > errorhandling.ForeignExceptionDispatcher: offlineScans-20140923-100009
> > accepting received exception
> > > > 2014-09-23 10:00:16,128 DEBUG [regionserver16020-EventThread]
> > procedure.ZKProcedureMemberRpcs: Aborting procedure
> > (offlineScans-20140923-100009) in zk
> > > > 2014-09-23 10:00:16,134 DEBUG [regionserver16020-EventThread]
> > procedure.ZKProcedureMemberRpcs: Finished creating abort
> > znode:/hbase/online-snapshot/abort/offlineScans-20140923-100009
> > > > 2014-09-23 10:00:16,134 ERROR [regionserver16020-EventThread]
> > procedure.ProcedureMember: Failed to start subprocedure
> > 'offlineScans-20140923-100009'
> > > > 2014-09-23 10:00:16,134 INFO  [regionserver16020-EventThread]
> > procedure.ZKProcedureMemberRpcs: Received created
> > event:/hbase/online-snapshot/abort/offlineScans-20140923-100009
> > > > 2014-09-23 10:00:16,134 DEBUG [regionserver16020-EventThread]
> > procedure.ZKProcedureMemberRpcs: Aborting procedure member for znode
> > /hbase/online-snapshot/abort/offlineScans-20140923-100009
> > > > 2014-09-23 10:00:16,145 DEBUG [regionserver16020-EventThread]
> > procedure.ProcedureMember: Request received to abort procedure
> > offlineScans-20140923-100009
> > > > org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable
> > via
> > ###########,16020,1408633802567:org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable:
> > > >        at
> > org.apache.hadoop.hbase.errorhandling.ForeignException.deserialize(ForeignException.java:171)
> > > >        at
> > org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.abort(ZKProcedureMemberRpcs.java:320)
> > > >        at
> > org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs$1.nodeCreated(ZKProcedureMemberRpcs.java:94)
> > > >        at
> > org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:327)
> > > >        at
> > org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
> > > >        at
> > org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
> > > > Caused by:
> > org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable:
> > > >        at
> > org.apache.hadoop.hbase.procedure.Subprocedure.cancel(Subprocedure.java:270)
> > > >        at
> > org.apache.hadoop.hbase.procedure.ProcedureMember.submitSubprocedure(ProcedureMember.java:171)
> > > >        at
> > org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.startNewSubprocedure(ZKProcedureMemberRpcs.java:214)
> > > >        at
> > org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.waitForNewProcedures(ZKProcedureMemberRpcs.java:172)
> > > >        at
> > org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.access$100(ZKProcedureMemberRpcs.java:55)
> > > >        at
> > org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs$1.nodeChildrenChanged(ZKProcedureMemberRpcs.java:107)
> > > >        at
> > org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:348)
> > > >        ... 2 more
> > > >
> > > > Any ideas on how I can resolve this?
> > > >
> > > > -Ian
> > > >
> >