Posted to issues@hbase.apache.org by "Bryan Beaudreault (Jira)" <ji...@apache.org> on 2022/07/08 01:41:00 UTC

[jira] [Comment Edited] (HBASE-27053) IOException during caching of uncompressed block to the block cache.

    [ https://issues.apache.org/jira/browse/HBASE-27053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17564026#comment-17564026 ] 

Bryan Beaudreault edited comment on HBASE-27053 at 7/8/22 1:40 AM:
-------------------------------------------------------------------

I've been digging into this. Relaying some of my findings from comments on the PR:

The problem here is that in HFileBlock#allocateBuffer we include space for the checksums in the unpacked buffer. But then, when we call HFileBlockDecodingContext#prepareDecoding, we pass getUncompressedSizeWithoutHeader() as the uncompressedSize, which does not include the checksum length. This mismatch leaves every unpacked HFileBlock buffer with a section of never-written bytes at the end.
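
To make the mismatch concrete, here is a minimal standalone sketch of the arithmetic. The names mirror the fields discussed above, but the sizes are made up and this is not the real HFileBlock code:
{code:java}
public class UnpackSizeMismatchSketch {
  public static void main(String[] args) {
    // Hypothetical sizes, for illustration only.
    int headerSize = 33;                       // block header
    int uncompressedSizeWithoutHeader = 65536; // what prepareDecoding is asked to fill
    int cksumBytes = 16;                       // checksum space that allocateBuffer also reserves

    int allocated = headerSize + uncompressedSizeWithoutHeader + cksumBytes;
    int written = headerSize + uncompressedSizeWithoutHeader;

    // The tail of the unpacked buffer is never written, so it keeps whatever the
    // pooled buffer held before it was reused.
    System.out.println("never-written tail bytes: " + (allocated - written)); // 16
  }
}
{code}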

When using BucketCache, all HFileBlock buffers are allocated through the ByteBuffAllocator, which pools them for reuse. When a buffer is pulled from the pool, clear() is called on it, but that does not zero out the bytes; it only resets the position and limit. As a result, if you pull a buffer from the pool that previously had data in the region that now holds the never-written checksum, you end up with junk in the checksum portion.
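
The clear()-does-not-zero behaviour is easy to see with a plain java.nio.ByteBuffer, which is the same semantics the pooled ByteBuffs expose:
{code:java}
import java.nio.ByteBuffer;

public class ClearDoesNotZero {
  public static void main(String[] args) {
    ByteBuffer buf = ByteBuffer.allocate(8);
    buf.put(new byte[] { 1, 2, 3, 4, 5, 6, 7, 8 }); // "previous" contents of a pooled buffer
    buf.clear();                                    // only resets position and limit
    System.out.println(buf.get(7));                 // prints 8, not 0 -- the old bytes survive
  }
}
{code}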

This happens for all HFileBlock buffers when BucketCache is enabled, but it typically does not matter, because the checksum is not used after the block has been read from disk. In fact, most users of the HFileBlock buffer get it through HFileBlock#getBufferWithoutHeader(), which excludes the checksum section.
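
Roughly speaking, "without header" and "without checksum" are just position/limit arithmetic on a duplicate of the block's buffer. A simplified sketch using plain NIO buffers (not the real accessor, and the sizes are made up):
{code:java}
import java.nio.ByteBuffer;

public class WithoutHeaderSketch {
  static ByteBuffer withoutHeader(ByteBuffer block, int headerSize, int cksumBytes, boolean withChecksum) {
    ByteBuffer dup = block.duplicate();
    dup.position(headerSize);
    dup.limit(block.limit() - (withChecksum ? 0 : cksumBytes));
    return dup.slice(); // callers that skip the checksum never see the junk bytes
  }

  public static void main(String[] args) {
    ByteBuffer block = ByteBuffer.allocate(33 + 100 + 16); // header + data + checksum
    System.out.println(withoutHeader(block, 33, 16, false).remaining()); // 100
    System.out.println(withoutHeader(block, 33, 16, true).remaining());  // 116
  }
}
{code}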

The reason it matters in this case is that when a split occurs, the split key can fall in the middle of a block, so that block is loaded by both daughter regions on open. This happens in two separate threads, so a race condition occurs. That would not be a problem if the two blocks were identical, but here each one has different junk at the end of its buffer, in the checksum area. This race condition was detected and mitigated in HBASE-20447 by validating the contents of the conflicting block. Since the checksum area contains junk, the contents differ often enough to cause frequent IOExceptions on split.
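
For context, the HBASE-20447 mitigation boils down to comparing the incoming block against the already-cached copy byte for byte. A rough standalone sketch of that check; the real logic lives in BlockCacheUtil#validateBlockAddition and operates on serialized blocks, and the cache key below is made up:
{code:java}
import java.util.Arrays;

public class ValidateBlockAdditionSketch {
  static void validate(byte[] cached, byte[] incoming, String cacheKey) {
    if (!Arrays.equals(cached, incoming)) {
      throw new RuntimeException(
        "Cached block contents differ, which should not have happened.cacheKey:" + cacheKey);
    }
  }

  public static void main(String[] args) {
    // Same data, but different junk in the trailing "checksum" byte.
    byte[] cached   = { 1, 2, 3, 9 };
    byte[] incoming = { 1, 2, 3, 7 };
    validate(cached, incoming, "examplefile_0"); // throws, just like the stack trace in this issue
  }
}
{code}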

I think the best fix here would be to make HFileBlock#allocateBuffer not include the checksum bytes in the capacity. The only challenge is that we then need to handle that case in getBufferWithoutHeader() so we set the limit properly. Currently getBufferWithoutHeader() always passes {{false}} into {{getBufferWithoutHeader(boolean withChecksum)}}. I think this is just a matter of changing {{false}} to {{!isUnpacked()}}, since packed blocks always have the checksum available and unpacked blocks never do.

So my suggestion ends up being a two-line change (a rough sketch of the second piece follows this list):
 * In HFileBlock#allocateBuffer(), remove {{cksumBytes}} from the capacity
 * In HFileBlock#getBufferWithoutHeader(), pass {{!isUnpacked()}} instead of {{false}}
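
For illustration only, the second bullet would look roughly like this; a sketch against the accessor pair described above, not an actual patch:
{code:java}
  // Sketch only: delegate to the existing two-argument variant. Packed blocks still
  // carry their on-disk checksum; once allocateBuffer stops reserving cksumBytes,
  // unpacked blocks simply have no checksum section to expose.
  public ByteBuff getBufferWithoutHeader() {
    return getBufferWithoutHeader(!isUnpacked());
  }
{code}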

We can validate this functionality with a new test in TestHFileBlockUnpack:
{code:java}
  @Test
  public void itUnpacksIdenticallyEachTime() throws IOException {
    Path path = new Path(TEST_UTIL.getDataTestDir(), name.getMethodName());
    int totalSize = createTestBlock(path);

    // Allocate a bunch of random buffers, so we can be sure that unpack will only have "dirty"
    // buffers to choose from when allocating itself.
    Random random = new Random();
    byte[] temp = new byte[HConstants.DEFAULT_BLOCKSIZE];
    List<ByteBuff> buffs = new ArrayList<>();
    for (int i = 0; i < 10; i++) {
      ByteBuff buff = allocator.allocate(HConstants.DEFAULT_BLOCKSIZE);
      random.nextBytes(temp);
      buff.put(temp);
      buffs.add(buff);
    }

    buffs.forEach(ByteBuff::release);

    // Read the same block twice. We expect the underlying buffers retrieved below
    // to be identical each time.
    HFileBlock blockOne = readBlock(path, totalSize);
    HFileBlock blockTwo = readBlock(path, totalSize);

    ByteBuff bufferOne = blockOne.getBufferWithoutHeader(true);
    ByteBuff bufferTwo = blockTwo.getBufferWithoutHeader(true);

    // This assertion should succeed, but it fails on master. It will succeed once
    // cksumBytes is removed from HFileBlock#allocateBuffer.
    assertEquals(0, ByteBuff.compareTo(bufferOne, 0, bufferOne.limit(), bufferTwo, 0, bufferTwo.limit()));
  } {code}
This test basically ensures that if you read the same block twice, the two reads produce identical underlying buffers. I think this is a good invariant to ensure.


> IOException during caching of uncompressed block to the block cache.
> --------------------------------------------------------------------
>
>                 Key: HBASE-27053
>                 URL: https://issues.apache.org/jira/browse/HBASE-27053
>             Project: HBase
>          Issue Type: Bug
>          Components: BlockCache
>    Affects Versions: 2.4.12
>            Reporter: Sergey Soldatov
>            Assignee: Sergey Soldatov
>            Priority: Major
>
> When prefetch to block cache is enabled and blocks are compressed, caching sometimes fails with the exception:
> {noformat}
> 2022-05-18 21:37:29,597 ERROR [RS_OPEN_REGION-regionserver/x1:16020-2] regionserver.HRegion: Could not initialize all stores for the region=cluster_test,66666666,1652935047946.a57ca5f9e7bebb4855a44523063f79c7.
> 2022-05-18 21:37:29,598 WARN  [RS_OPEN_REGION-regionserver/x1:16020-2] regionserver.HRegion: Failed initialize of region= cluster_test,66666666,1652935047946.a57ca5f9e7bebb4855a44523063f79c7., starting to roll back memstore
> java.io.IOException: java.io.IOException: java.lang.RuntimeException: Cached block contents differ, which should not have happened.cacheKey:19307adf1c2248ebb5675116ea640712.c3a21f2005abf308e4a8c9759d4e05fe_0
>   at org.apache.hadoop.hbase.regionserver.HRegion.initializeStores(HRegion.java:1149)
>   at org.apache.hadoop.hbase.regionserver.HRegion.initializeStores(HRegion.java:1092)
>   at org.apache.hadoop.hbase.regionserver.HRegion.initializeRegionInternals(HRegion.java:996)
>   at org.apache.hadoop.hbase.regionserver.HRegion.initialize(HRegion.java:946)
>   at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:7240)
>   at org.apache.hadoop.hbase.regionserver.HRegion.openHRegionFromTableDir(HRegion.java:7199)
>   at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:7175)
>   at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:7134)
>   at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:7090)
>   at org.apache.hadoop.hbase.regionserver.handler.AssignRegionHandler.process(AssignRegionHandler.java:147)
>   at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:100)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.IOException: java.lang.RuntimeException: Cached block contents differ, which should not have happened.cacheKey:19307adf1c2248ebb5675116ea640712.c3a21f2005abf308e4a8c9759d4e05fe_0
>   at org.apache.hadoop.hbase.regionserver.StoreEngine.openStoreFiles(StoreEngine.java:294)
>   at org.apache.hadoop.hbase.regionserver.StoreEngine.initialize(StoreEngine.java:344)
>   at org.apache.hadoop.hbase.regionserver.HStore.<init>(HStore.java:294)
>   at org.apache.hadoop.hbase.regionserver.HRegion.instantiateHStore(HRegion.java:6375)
>   at org.apache.hadoop.hbase.regionserver.HRegion$1.call(HRegion.java:1115)
>   at org.apache.hadoop.hbase.regionserver.HRegion$1.call(HRegion.java:1112)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   ... 3 more
> Caused by: java.lang.RuntimeException: Cached block contents differ, which should not have happened.cacheKey:19307adf1c2248ebb5675116ea640712.c3a21f2005abf308e4a8c9759d4e05fe_0
>   at org.apache.hadoop.hbase.io.hfile.BlockCacheUtil.validateBlockAddition(BlockCacheUtil.java:199)
>   at org.apache.hadoop.hbase.io.hfile.BlockCacheUtil.shouldReplaceExistingCacheBlock(BlockCacheUtil.java:231)
>   at org.apache.hadoop.hbase.io.hfile.bucket.BucketCache.shouldReplaceExistingCacheBlock(BucketCache.java:447)
>   at org.apache.hadoop.hbase.io.hfile.bucket.BucketCache.cacheBlockWithWait(BucketCache.java:432)
>   at org.apache.hadoop.hbase.io.hfile.bucket.BucketCache.cacheBlock(BucketCache.java:418)
>   at org.apache.hadoop.hbase.io.hfile.CombinedBlockCache.cacheBlock(CombinedBlockCache.java:60)
>   at org.apache.hadoop.hbase.io.hfile.HFileReaderImpl.lambda$readBlock$2(HFileReaderImpl.java:1319)
>   at java.util.Optional.ifPresent(Optional.java:159)
>   at org.apache.hadoop.hbase.io.hfile.HFileReaderImpl.readBlock(HFileReaderImpl.java:1317)
>   at org.apache.hadoop.hbase.io.hfile.HFileReaderImpl$HFileScannerImpl.readAndUpdateNewBlock(HFileReaderImpl.java:942)
>   at org.apache.hadoop.hbase.io.hfile.HFileReaderImpl$HFileScannerImpl.seekTo(HFileReaderImpl.java:931)
>   at org.apache.hadoop.hbase.io.HalfStoreFileReader$1.seekTo(HalfStoreFileReader.java:171)
>   at org.apache.hadoop.hbase.io.HalfStoreFileReader.getFirstKey(HalfStoreFileReader.java:321)
>   at org.apache.hadoop.hbase.regionserver.HStoreFile.open(HStoreFile.java:477)
>   at org.apache.hadoop.hbase.regionserver.HStoreFile.initReader(HStoreFile.java:490)
>   at org.apache.hadoop.hbase.regionserver.StoreEngine.createStoreFileAndReader(StoreEngine.java:231)
>   at org.apache.hadoop.hbase.regionserver.StoreEngine.lambda$openStoreFiles$0(StoreEngine.java:272)
>   ... 6 more
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)