You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@hbase.apache.org by "Robert Yokota (JIRA)" <ji...@apache.org> on 2016/08/02 15:06:21 UTC
[jira] [Comment Edited] (HBASE-16270) When using region replicas, get thousands of occurrences of UnexpectedStateException: Current snapshot id is -1

    [ https://issues.apache.org/jira/browse/HBASE-16270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15403288#comment-15403288 ] 

Robert Yokota edited comment on HBASE-16270 at 8/2/16 3:06 PM:
---------------------------------------------------------------

First, I'm sorry but I won't be able to reproduce this for you.  When it happened, it caused disruption for external and internal customers.

Second, thank you for your patience with my musings as I am out of my depth in this matter.

My conjecture is that this is happening for primary regions (not secondary regions).  I'm also guessing that it is through concurrent calls to StoreFlusherImpl.commit().  For example, here is another line where HRegion.internalFlushcache() eventually calls StoreFlusherImpl.commit().  Note that it is called for a primary region, but only if region replicas are turned on.  However, I don't know if it is possible that another call to StoreFlusherImpl.commit() can be happening concurrently.

https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java#L1234-L1243


was (Author: rayokota):
First, I'm sorry but I won't be able to reproduce this for you.  It was only happening in our production environment and was causing disruption to our production users.

Second, thank you for your patience with my musings as I am out of my depth in this matter.

My conjecture is that this is happening for primary regions (not secondary regions).  I'm also guessing that it is through concurrent calls to StoreFlusherImpl.commit().  For example, here is another line where HRegion.internalFlushcache() eventually calls StoreFlusherImpl.commit().  Note that it is called for a primary region, but only if region replicas are turned on.  However, I don't know if it is possible that another call to StoreFlusherImpl.commit() can be happening concurrently.

https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java#L1234-L1243

> When using region replicas, get thousands of occurrences of UnexpectedStateException: Current snapshot id is -1
> ---------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-16270
>                 URL: https://issues.apache.org/jira/browse/HBASE-16270
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication
>    Affects Versions: 1.1.2
>            Reporter: Robert Yokota
>         Attachments: HBASE-16270-branch-1.2.patch, HBASE-16270-master.patch
>
>
> We have an HBase (1.1.2) production cluster with 58 region servers and a staging cluster with 6 region servers.
> For both clusters, we enabled region replicas with the following settings:
> hbase.regionserver.storefile.refresh.period = 0
> hbase.region.replica.replication.enabled = true
> hbase.region.replica.replication.memstore.enabled = true
> hbase.master.hfilecleaner.ttl = 3600000
> hbase.master.loadbalancer.class = org.apache.hadoop.hbase.master.balancer.StochasticLoadBalancer
> hbase.meta.replica.count = 3
> hbase.regionserver.meta.storefile.refresh.period = 30000
> hbase.region.replica.wait.for.primary.flush = true
> hbase.region.replica.storefile.refresh.memstore.multiplier = 4
> We then altered our HBase tables to have REGION_REPLICATION => 2
> Both clusters got into a state where a few region servers were spewing the following error below in the HBase logs.  In one instance the error occurred over 70K times.  At this time, these region servers would see 10x write traffic (the hadoop.HBase.RegionServer.Server.writeRequestCount metric) and in some instances would crash.
> At the same time, we would see secondary regions move and then go into the "reads are disabled" state for extended periods.  
> It appears there is a race condition where the DefaultMemStore::clearSnapshot method might be called more than once in succession. The first call would set snapshotId to -1, but the second call would throw an exception.  It seems the second call should just return if the snapshotId is already -1, rather than throwing an exception.
> Thu Jul 21 08:38:50 UTC 2016, RpcRetryingCaller{globalStartTime=1469090201543, pause=100, retries=35}, org.apache.hadoop.hbase.regionserver.UnexpectedStateException: org.apache.hadoop.hbase.regionserver.UnexpectedStateException: Current snapshot id is -1,passed 1469085004304
>         at org.apache.hadoop.hbase.regionserver.DefaultMemStore.clearSnapshot(DefaultMemStore.java:187)
>         at org.apache.hadoop.hbase.regionserver.HStore.updateStorefiles(HStore.java:1054)
>         at org.apache.hadoop.hbase.regionserver.HStore.access$500(HStore.java:128)
>         at org.apache.hadoop.hbase.regionserver.HStore$StoreFlusherImpl.replayFlush(HStore.java:2270)
>         at org.apache.hadoop.hbase.regionserver.HRegion.replayFlushInStores(HRegion.java:4487)
>         at org.apache.hadoop.hbase.regionserver.HRegion.replayWALFlushCommitMarker(HRegion.java:4388)
>         at org.apache.hadoop.hbase.regionserver.HRegion.replayWALFlushMarker(HRegion.java:4191)
>         at org.apache.hadoop.hbase.regionserver.RSRpcServices.doReplayBatchOp(RSRpcServices.java:776)
>         at org.apache.hadoop.hbase.regionserver.RSRpcServices.replay(RSRpcServices.java:1655)
>         at org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$2.callBlockingMethod(AdminProtos.java:22255)
>         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2114)
>         at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:101)
>         at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)
>         at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
>         at java.lang.Thread.run(Thread.java:745)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)