Posted to issues@hbase.apache.org by GitBox <gi...@apache.org> on 2022/07/06 14:13:38 UTC

[GitHub] [hbase] bbeaudreault commented on pull request #4453: HBASE-27053 IOException during caching of uncompressed block to the b…

bbeaudreault commented on PR #4453:
URL: https://github.com/apache/hbase/pull/4453#issuecomment-1176274853

   @joshelser @Apache9 @ss77892 I decided to take a look at this, as we're trying to push forward on BucketCache in prod and this is an issue we see every time a region is split. I believe I have the real root cause and a suggested fix, detailed below:
   
   First question: why are we even getting to the point where a read does not get a cache hit, but then sees a conflict when caching its result from disk? This is covered by the javadoc in `BlockCacheUtils#shouldReplaceExistingCacheBlock` (a minimal sketch of the idea follows the quote):
   
   >   * Because of the region splitting, it's possible that the split key locate in the middle of a
   >   * block. So it's possible that both the daughter regions load the same block from their parent
   >   * HFile. When pread, we don't force the read to read all of the next block header. So when two
   >   * threads try to cache the same block, it's possible that one thread read all of the next block
   >   * header but the other one didn't. if the already cached block hasn't next block header but the
   >   * new block to cache has, then we can replace the existing block with the new block for better
   >   * performance.(HBASE-20447)
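
   To make that scenario concrete, here is a minimal sketch of the replacement decision the javadoc describes. This is illustrative only -- the helper name and signature are hypothetical, not the actual `BlockCacheUtils` code:

   ```java
   // Hypothetical sketch of the idea in the javadoc above (not the real HBase code):
   // if the newly read copy of the block also pulled in the next block's header, its
   // serialized form is longer, so it is worth replacing the already cached copy.
   static boolean shouldReplaceExisting(int cachedSerializedLength, int newSerializedLength) {
     return newSerializedLength > cachedSerializedLength;
   }
   ```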
   
   Next question: why are we supposedly only _sometimes_ not reading the checksum? In actuality, we are _never_ reading the checksums, but the buffer always includes space for them. The unpacked HFileBlock's buffer always has random data at the end, since we allocate space for the checksum but don't fill it. This only becomes a problem in the split case, because otherwise we never try to cache the same block twice and we never reference the checksum once a block is cached. See the analysis below:
   
   1. The checksums are used when reading from disk, in `HFileBlock#readBlockDataInternal`. Beyond that, I don't see much use of checksums. 
   2. In `HFileBlock#unpack`, we allocate a buffer for the decompression with capacity `headerSize + uncompressedSizeWithoutHeader + cksumBytes`. Before passing it to the decoder, we put the header onto the buffer.
   3. Next, we pass `getUncompressedSizeWithoutHeader()` to `HFileBlockDecodingContext#prepareDecoding`. The javadoc for that parameter says "the uncompressed size of data part (header and **checksum excluded**)". This value is used in `Compression.decompress` to determine how many bytes to `readFully` into our destination buffer.
   4. At this point we immediately see a discrepancy -- the allocated buffer includes space for the checksum, but we only write `headerSize` and `uncompressedSizeWithoutHeader` into that buffer (a minimal sketch of this mismatch follows this list).
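
   Here is a minimal, self-contained sketch of that mismatch. The sizes are made up, and plain `java.nio.ByteBuffer` stands in for the pooled `ByteBuff`; the point is only to show that the trailing checksum region of the destination buffer is never written:

   ```java
   import java.nio.ByteBuffer;
   import java.util.Arrays;

   // Illustrative only: allocate room for header + uncompressed data + checksum,
   // but fill only header + uncompressed data, the way unpack() effectively does.
   public class UnpackSizeMismatchSketch {
     public static void main(String[] args) {
       int headerSize = 33;
       int uncompressedSizeWithoutHeader = 1024;
       int cksumBytes = 16;

       // Simulate a recycled buffer that still contains old ("dirty") data.
       byte[] backing = new byte[headerSize + uncompressedSizeWithoutHeader + cksumBytes];
       Arrays.fill(backing, (byte) 0x7f);
       ByteBuffer dst = ByteBuffer.wrap(backing);

       dst.put(new byte[headerSize]);                    // header is copied in
       dst.put(new byte[uncompressedSizeWithoutHeader]); // decompressed payload is copied in

       // The trailing cksumBytes are never written, so they keep the stale contents.
       System.out.println("last byte = " + backing[backing.length - 1]); // prints 127, not 0
     }
   }
   ```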
   
   Let's validate that analysis:
   
   1. On the write side, take a look at `HFileBlock#finishBlockAndWriteHeaderAndData`: the `onDiskBlockBytesWithHeader` is written out first, and then the checksums are written out separately.
   2. The `onDiskBlockBytesWithHeader` is the blob which includes the header and the compressed bytes.
   3. On reading, the `uncompressedSizeWithoutHeader` is read from that header using `buf.getInt(Header.UNCOMPRESSED_SIZE_WITHOUT_HEADER_INDEX)` (index 12; see the sketch after this list).
   4. Looking back at the writing side, `HFileBlock#finishBlock` passes `baosInMemory.size()` as the uncompressed size to `putHeader`. `baosInMemory` does not include the checksums, which are calculated _after_ the header is written in `finishBlock`. So the value read in step 3 above does not include the checksum length.
   5. So at this point I'm confident the javadoc mentioned above is accurate: `uncompressedSizeWithoutHeader` does not include the checksum.
   6. However, as mentioned in the original analysis, we're allocating a buffer that includes `cksumBytes`, and we're passing the decoder `unpacked.getBufferWithoutHeader(true)`, which includes the checksum part of that buffer.
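
   A quick illustration of steps 3 and 4: the writer puts the uncompressed size (which never accounts for checksum bytes) into the header, and the reader gets it back from the same offset. Index 12 and the 33-byte header size are taken from the analysis above; everything else is made up:

   ```java
   import java.nio.ByteBuffer;

   // Illustrative only: the uncompressed-size-without-header field round-trips
   // through offset 12 of the block header and never includes checksum bytes.
   public class HeaderFieldSketch {
     public static void main(String[] args) {
       ByteBuffer header = ByteBuffer.allocate(33); // header size with checksums enabled
       header.putInt(12, 1024);                     // writer: putHeader(..., baosInMemory.size(), ...)
       System.out.println(header.getInt(12));       // reader: prints 1024, checksum excluded
     }
   }
   ```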
   
   OK, so at this point we can see that `Compression.decompress` is being passed an `uncompressedSize` which doesn't include the checksum, and a buffer which includes space for the checksum. The `uncompressedSize` dictates how many bytes are read into that buffer, which means the checksum section at the end is left untouched.
   
   If BucketCache is enabled, all DATA block reads go through `ByteBuffAllocator`, which re-uses buffers of 65k capacity. When a buffer is pulled from the pool, `clear()` is called, which doesn't actually zero out the bytes. It just resets the pointers.
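
   As a quick illustration of that last point, using a plain `java.nio.ByteBuffer` (which behaves the same way in this respect), `clear()` only resets the position and limit:

   ```java
   import java.nio.ByteBuffer;

   // clear() does not zero the backing memory; it only resets position and limit.
   public class ClearDoesNotZero {
     public static void main(String[] args) {
       ByteBuffer buf = ByteBuffer.allocate(8);
       buf.put(new byte[] { 1, 2, 3, 4, 5, 6, 7, 8 });
       buf.clear();                    // "recycle" the buffer
       System.out.println(buf.get(7)); // prints 8 -- the old bytes are still there
     }
   }
   ```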
   
   So if you allocate a buffer and don't fill the entire requested capacity, you will have junk left over. It's important to allocate only what you need, and always fill what you allocate. Otherwise we hit the problem we're facing in unpack.
   
   The fix here is simple -- change `HFileBlock#allocateBuffer` to not include `cksumBytes`. This will also save CPU cycles over the current fix in this PR, because we don't need to unnecessarily zero out the extra bytes.
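
   A rough sketch of the sizing change, assuming the capacity computation in `unpack` looks like what's described above (this is not the actual patch; the real `HFileBlock#allocateBuffer` has more going on):

   ```java
   // Illustrative only -- the point is simply to stop asking the allocator for
   // cksumBytes, since unpack() never writes checksum bytes into this buffer anyway.
   static int unpackedBufferCapacity(int headerSize, int uncompressedSizeWithoutHeader) {
     // before: headerSize + uncompressedSizeWithoutHeader + cksumBytes
     return headerSize + uncompressedSizeWithoutHeader;
   }
   ```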
   
   I verified all of this with a new test method in `TestHFileBlockUnpack`, which I recently added for HBASE-27170. All you need to do is write a block, then read that same block twice and expect the underlying buffers to be identical. See below:
   
   ```java
   @Test
   public void itUnpacksIdenticallyEachTime() throws IOException {
     Path path = new Path(TEST_UTIL.getDataTestDir(), name.getMethodName());
     int totalSize = createTestBlock(path);

     // Allocate a bunch of random buffers, so we can be sure that unpack will only have "dirty"
     // buffers to choose from when allocating itself.
     Random random = new Random();
     byte[] temp = new byte[HConstants.DEFAULT_BLOCKSIZE];
     List<ByteBuff> buffs = new ArrayList<>();
     for (int i = 0; i < 10; i++) {
       ByteBuff buff = allocator.allocate(HConstants.DEFAULT_BLOCKSIZE);
       random.nextBytes(temp);
       buff.put(temp);
       buffs.add(buff);
     }

     buffs.forEach(ByteBuff::release);

     // Read the same block twice. We should expect the underlying buffer below to
     // be identical each time.
     HFileBlock blockOne = readBlock(path, totalSize);
     HFileBlock blockTwo = readBlock(path, totalSize);

     ByteBuff bufferOne = blockOne.getBufferWithoutHeader(true);
     ByteBuff bufferTwo = blockTwo.getBufferWithoutHeader(true);

     // This assertion should succeed, but it fails on master. It will succeed once
     // cksumBytes is removed from HFileBlock#allocateBuffer.
     assertEquals(0,
       ByteBuff.compareTo(bufferOne, 0, bufferOne.limit(), bufferTwo, 0, bufferTwo.limit()));
   }
   ```

