Posted to user@hbase.apache.org by Aaron Beppu <ab...@sift.com> on 2019/05/13 20:19:45 UTC

consistently observe "not seeked" RS exceptions after specific schema change

Hey HBase users,

I've been struggling with a weird issue. Our team has a table which
currently has a large number of versions per row, and we're looking to
apply a schema change that constrains both the number and the age of the
versions stored:
```
alter 'api_grains',
  {NAME => 'g',   MIN_VERSIONS => 5, VERSIONS => 500, TTL => 7257600},
  {NAME => 'isg', MIN_VERSIONS => 5, VERSIONS => 500, TTL => 7257600}
```
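
For reference, the applied attributes can be confirmed from the shell afterwards;
this is just the standard `describe` command with our table name:
```
# Show the column family attributes after the alter.
# MIN_VERSIONS is the number of versions kept per cell even past the TTL;
# VERSIONS is the maximum kept; TTL (7257600s = 84 days) ages out the rest.
describe 'api_grains'
```
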
When I apply this change to a large table on a 5.2.0 (CDH5) cluster, the
alter appears to be applied across all regions without problems, but almost
immediately after it finishes, the region servers consistently surface the
following error:

```

Unexpected throwable object
org.apache.hadoop.hbase.io.hfile.AbstractHFileReader$NotSeekedException:
Not seeked to a key/value
	at org.apache.hadoop.hbase.io.hfile.AbstractHFileReader$Scanner.assertSeeked(AbstractHFileReader.java:313)
	at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$ScannerV2.next(HFileReaderV2.java:878)
	at org.apache.hadoop.hbase.regionserver.StoreFileScanner.next(StoreFileScanner.java:181)
	at org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:108)
	at org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:588)
	at org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:147)
	at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.populateResult(HRegion.java:5775)
	at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextInternal(HRegion.java:5931)
	at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(HRegion.java:5709)
	at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.next(HRegion.java:5685)
	at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.next(HRegion.java:5671)
	at org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:6904)
	at org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:6862)
	at org.apache.hadoop.hbase.regionserver.RSRpcServices.get(RSRpcServices.java:2010)
	at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:33644)
	at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2191)
	at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:112)
	at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:183)
	at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:163)

```

i.e., the region servers seem not to have properly set up and positioned the
scanners over their own HFiles. This shows up in the logs of many RSs across
the cluster and recurs continuously, breaking the service which queries this
table. The issue is reproducible (I've triggered it about 8 times in our
preprod environments), and it is always resolved by restoring a snapshot
taken before the schema change.

During the period when the region servers throw these exceptions, I don't see
any other indications that HBase is in poor health: there are no regions in
transition, hbck doesn't report anything interesting, and other tables seem
unaffected.
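
(For reference, by hbck I just mean the standard check, run along these lines;
scoping it to the one table is optional:)
```
# Detailed consistency report, limited to the affected table; it comes back clean.
hbase hbck -details api_grains
```
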

Just to confirm that the issue is not actually about the HFiles themselves
being malformed, I took a snapshot of the table while it was in the
"broken" state. After exporting it to a different environment, I confirmed
that, at a minimum, Spark and Hadoop jobs can run over the files in the
snapshot without encountering any issues. So I believe the files themselves
are fine, since they're readable by HFile input formats.
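
To sketch that check (the snapshot name, destination URL, and file path below
are placeholders, not our real ones); the last command is the stock HFile
pretty-printer, which is another way to read a store file directly, outside of
any MR/Spark job:
```
# From the HBase shell: snapshot the table while it's in the "broken" state.
snapshot 'api_grains', 'api_grains_broken_snap'

# From the command line: copy the snapshot to the other environment
# with the stock ExportSnapshot tool.
hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot \
  -snapshot api_grains_broken_snap \
  -copy-to hdfs://other-cluster:8020/hbase \
  -mappers 16

# The HFile pretty-printer can also open one of the copied store files
# directly: -m prints the file metadata, -p scans every key/value.
hbase hfile -m -p -f hdfs://other-cluster:8020/hbase/archive/data/default/api_grains/<region>/g/<hfile>
```
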

A further source of confusion is that we have recently done extremely
similar `alter table ...` commands for other tables in the same cluster,
without issue.

If anyone can comment on how the region servers might get into such a state
(where they don't appropriately initialize and seek an HFile reader), or
how that state could be related to specific table admin operations, please
share any insights you may have.

I understand that, given the older version we're running, it may be tempting
to recommend that we upgrade to 2.1 and report back if our issue is still
unresolved. Please understand that we're running a large cluster which
supports high-throughput, customer-facing services, and that such a
migration is a substantial project. If you do make that recommendation,
please point to a specific issue or bug which has been resolved in more
recent versions.

Thanks,
Aaron

Re: consistently observe "not seeked" RS exceptions after specific schema change

Posted by Aaron Beppu <ab...@sift.com>.
I'd like to add a couple details which I've only recently uncovered:
- The part of the alter which causes the error is `MIN_VERSIONS`. If I
apply just the `VERSIONS` and `TTL` portions (see the snippet after this
list), I don't observe these errors, though that doesn't preserve some
behavior that I care about.
- The table in question has a somewhat large number of column qualifiers.
The tables to which we had previously applied very similar changes had only
a small, fixed set of qualifiers. In principle, I understand that this might
mean the RS has to do more work to enforce constraints on the number of
versions. But I don't understand why this would cause things to break for
`MIN_VERSIONS` but be fine for (max) `VERSIONS`, nor do I understand why it
would surface as "Not seeked" states.
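
For concreteness, the variant that does not trigger the errors is roughly the
same alter with `MIN_VERSIONS` omitted:
```
alter 'api_grains',
  {NAME => 'g',   VERSIONS => 500, TTL => 7257600},
  {NAME => 'isg', VERSIONS => 500, TTL => 7257600}
```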
