You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@hbase.apache.org by "Wellington Chevreuil (Jira)" <ji...@apache.org> on 2022/11/15 00:15:00 UTC
[jira] [Created] (HBASE-27484) FNFE on StoreFileScanner after a flush followed by a compaction

Wellington Chevreuil created HBASE-27484:
--------------------------------------------

             Summary: FNFE on StoreFileScanner after a flush followed by a compaction
                 Key: HBASE-27484
                 URL: https://issues.apache.org/jira/browse/HBASE-27484
             Project: HBase
          Issue Type: Bug
            Reporter: Wellington Chevreuil
            Assignee: Wellington Chevreuil


One of our customers was running SyncTable from a 1.2 based cluster, where SyncTable map tasks were open scanners on a 2.4 based cluster for comparing the two clusters. Few of the map tasks failed with a DoNotRetryException caused by a FileNotFoundException blowing all the way up to the client:
{noformat}
Error: org.apache.hadoop.hbase.DoNotRetryIOException: org.apache.hadoop.hbase.DoNotRetryIOException: java.io.FileNotFoundException: open s3a://xxxxxxxxx-xxxxxxxx-xxxxxxx/hbase/data/default/xxxxxx/4c53da8c2ab9b7d7a0d6046ef3bb701c/0/e2c58350e4e54c21b0f713a4c271b8c0 at 7225 on s3a://xxxxxxxxx-xxxxxxxx-xxxxxxx/hbase/data/default/xxxxxx/4c53da8c2ab9b7d7a0d6046ef3bb701c/0/e2c58350e4e54c21b0f713a4c271b8c0: com.amazonaws.services.s3.model.AmazonS3Exception: The specified key does not exist. (Service: Amazon S3; Status Code: 404; Error Code: NoSuchKey; Request ID: KBRNC67WZGCS4SCF; S3 Extended Request ID: wWEJm8tlFSj/8g+xFpmD1vgWzT/n7HzBcOFAZ8ayIqKMsDZGN/d2kEhdusPLhMd540h+QAPP1xw=; Proxy: null), S3 Extended Request ID: wWEJm8tlFSj/8g+xFpmD1vgWzT/n7HzBcOFAZ8ayIqKMsDZGN/d2kEhdusPLhMd540h+QAPP1xw=:NoSuchKey
        at org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:3712)
        at org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:45819)
        at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:392)
        at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:140)
        at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:359)
        at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:339)
Caused by: java.io.FileNotFoundException: open s3a://xxxxxxxxx-xxxxxxxx-xxxxxxx/hbase/data/default/xxxxxx/4c53da8c2ab9b7d7a0d6046ef3bb701c/0/e2c58350e4e54c21b0f713a4c271b8c0 at 7225 on s3a://xxxxxxxxx-xxxxxxxx-xxxxxxx/hbase/data/default/xxxxxx/4c53da8c2ab9b7d7a0d6046ef3bb701c/0/e2c58350e4e54c21b0f713a4c271b8c0: com.amazonaws.services.s3.model.AmazonS3Exception: The specified key does not exist. (Service: Amazon S3; Status Code: 404; Error Code: NoSuchKey; Request ID: KBRNC67WZGCS4SCF; S3 Extended Request ID: wWEJm8tlFSj/8g+xFpmD1vgWzT/n7HzBcOFAZ8ayIqKMsDZGN/d2kEhdusPLhMd540h+QAPP1xw=; Proxy: null), S3 Extended Request ID: wWEJm8tlFSj/8g+xFpmD1vgWzT/n7HzBcOFAZ8ayIqKMsDZGN/d2kEhdusPLhMd540h+QAPP1xw=:NoSuchKey
...
at org.apache.hadoop.hbase.io.hfile.HFileReaderImpl$HFileScannerImpl.seekTo(HFileReaderImpl.java:632)
        at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seekAtOrAfter(StoreFileScanner.java:315)
        at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seek(StoreFileScanner.java:216)
        at org.apache.hadoop.hbase.regionserver.StoreScanner.seekScanners(StoreScanner.java:417)
        at org.apache.hadoop.hbase.regionserver.StoreScanner.reopenAfterFlush(StoreScanner.java:1018)
        at org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:552)
        at org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:155)
        at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.populateResult(HRegion.java:7399)
        at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextInternal(HRegion.java:7567)
        at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(HRegion.java:7331)
        at org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:3373) {noformat}

We can see on the RS logs that the above file got recently create as an outcome of a memstore flush, then compaction is triggered shortly:
{noformat}
2022-11-11 22:16:50,322 INFO org.apache.hadoop.hbase.regionserver.DefaultStoreFlusher: Flushed memstore data size=208.15 KB at sequenceid=4949703 (bloomFilter=false), to=s3a://xxxxxxxxx-xxxxxxxx-xxxxxxx/hbase/data/default/xxxxxx/4c53da8c2ab9b7d7a0d6046ef3bb701c/0/e2c58350e4e54c21b0f713a4c271b8c0
2022-11-11 22:16:50,513 INFO org.apache.hadoop.hbase.regionserver.HStore: Added s3a://xxxxxxxxx-xxxxxxxx-xxxxxxx/hbase/data/default/xxxxxx/4c53da8c2ab9b7d7a0d6046ef3bb701c/0/e2c58350e4e54c21b0f713a4c271b8c0, entries=951, sequenceid=4949703, filesize=26.2 K
...
2022-11-11 22:16:50,791 INFO org.apache.hadoop.hbase.regionserver.HRegion: Starting compaction of 4c53da8c2ab9b7d7a0d6046ef3bb701c/0 in xxxxxx,IT001E90506702\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00,1618275339031.4c53da8c2ab9b7d7a0d6046ef3bb701c.
2022-11-11 22:16:50,791 INFO org.apache.hadoop.hbase.regionserver.HStore: Starting compaction of [s3a://xxxxxxxxx-xxxxxxxx-xxxxxxx/hbase/data/default/xxxxxx/4c53da8c2ab9b7d7a0d6046ef3bb701c/0/9a865e5e3ba74ab6b9597fe0e295fd8d, s3a://xxxxxxxxx-xxxxxxxx-xxxxxxx/hbase/data/default/xxxxxx/4c53da8c2ab9b7d7a0d6046ef3bb701c/0/adbe118d8c76482d8faed3170437be0b, s3a://xxxxxxxxx-xxxxxxxx-xxxxxxx/hbase/data/default/xxxxxx/4c53da8c2ab9b7d7a0d6046ef3bb701c/0/e2c58350e4e54c21b0f713a4c271b8c0] into tmpdir=s3a://xxxxxxxxx-xxxxxxxx-xxxxxxx/hbase/data/default/xxxxxx/4c53da8c2ab9b7d7a0d6046ef3bb701c/.tmp, totalSize=27.1 M
...
2022-11-11 22:16:54,469 INFO org.apache.hadoop.hbase.regionserver.HStore: Completed compaction of 3 file(s) in 4c53da8c2ab9b7d7a0d6046ef3bb701c/0 of 4c53da8c2ab9b7d7a0d6046ef3bb701c into bc44bfd06a03429fb331ab4dc10dbb8d(size=27.1 M), total size for store is 4.9 G. This selection was in queue for 0sec, and took 3sec to execute.
{noformat}

I believe this is an unlucky scenario where the compaction discharger moved the compacted away files while the StoreFileScanner was getting created but had not updated the refCounter on the file reader yet. We couldn't reproduce this on a real cluster, but I could emulate it with a UT and artificially inducing a delay in the StoreFileScanner creation when creating scans not for compactions. One possible fix is to update the reader refCounter as soon we get the files for the StoreFileScanner instances we are creating.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)