Posted to issues@hbase.apache.org by "Xiaolin Ha (Jira)" <ji...@apache.org> on 2021/08/12 09:25:00 UTC

[jira] [Resolved] (HBASE-26155) JVM crash when scan

     [ https://issues.apache.org/jira/browse/HBASE-26155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiaolin Ha resolved HBASE-26155.
--------------------------------
    Fix Version/s: 3.0.0-alpha-1
                   2.3.7
                   2.4.6
                   2.5.0
       Resolution: Fixed

Merged to master and branch-2.3+, thanks [~stack] [~zhangduo] for reviewing.

> JVM crash when scan
> -------------------
>
>                 Key: HBASE-26155
>                 URL: https://issues.apache.org/jira/browse/HBASE-26155
>             Project: HBase
>          Issue Type: Bug
>          Components: Scanners
>    Affects Versions: 3.0.0-alpha-1
>            Reporter: Xiaolin Ha
>            Assignee: Xiaolin Ha
>            Priority: Major
>             Fix For: 2.5.0, 2.4.6, 2.3.7, 3.0.0-alpha-1
>
>         Attachments: scan-error.png
>
>
> We have seen regionserver JVM coredumps caused by scanner close on our production clusters.
> {code:java}
> Stack: [0x00007fca4b0cc000,0x00007fca4b1cd000],  sp=0x00007fca4b1cb0d8,  free space=1020k
> Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
> V  [libjvm.so+0x7fd314]
> J 2810  sun.misc.Unsafe.copyMemory(Ljava/lang/Object;JLjava/lang/Object;JJ)V (0 bytes) @ 0x00007fdae55a9e61 [0x00007fdae55a9d80+0xe1]
> j  org.apache.hadoop.hbase.util.UnsafeAccess.unsafeCopy(Ljava/lang/Object;JLjava/lang/Object;JJ)V+36
> j  org.apache.hadoop.hbase.util.UnsafeAccess.copy(Ljava/nio/ByteBuffer;I[BII)V+69
> j  org.apache.hadoop.hbase.util.ByteBufferUtils.copyFromBufferToArray([BLjava/nio/ByteBuffer;III)V+39
> j  org.apache.hadoop.hbase.CellUtil.copyQualifierTo(Lorg/apache/hadoop/hbase/Cell;[BI)I+31
> j  org.apache.hadoop.hbase.KeyValueUtil.appendKeyTo(Lorg/apache/hadoop/hbase/Cell;[BI)I+43
> J 14724 C2 org.apache.hadoop.hbase.regionserver.StoreScanner.shipped()V (51 bytes) @ 0x00007fdae6a298d0 [0x00007fdae6a29780+0x150]
> J 21387 C2 org.apache.hadoop.hbase.regionserver.RSRpcServices$RegionScannerShippedCallBack.run()V (53 bytes) @ 0x00007fdae622bab8 [0x00007fdae622acc0+0xdf8]
> J 26353 C2 org.apache.hadoop.hbase.ipc.ServerCall.setResponse(Lorg/apache/hbase/thirdparty/com/google/protobuf/Message;Lorg/apache/hadoop/hbase/CellScanner;Ljava/lang/Throwable;Ljava/lang/String;)V (384 bytes) @ 0x00007fdae7f139d8 [0x00007fdae7f12980+0x1058]
> J 26226 C2 org.apache.hadoop.hbase.ipc.CallRunner.run()V (1554 bytes) @ 0x00007fdae959f68c [0x00007fdae959e400+0x128c]
> J 19598% C2 org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(Ljava/util/concurrent/BlockingQueue;Ljava/util/concurrent/atomic/AtomicInteger;)V (338 bytes) @ 0x00007fdae81c54d4 [0x00007fdae81c53e0+0xf4]
> {code}
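> The top frames show Unsafe.copyMemory copying cell bytes out of a cached block during StoreScanner.shipped(), which is consistent with reading block memory that has already been freed and possibly reused. Purely as an illustration of that failure mode (plain sun.misc.Unsafe, not HBase code), reading an address after freeMemory is the same kind of undefined behavior and can return garbage or take the JVM down:
> {code:java}
> import java.lang.reflect.Field;
> import sun.misc.Unsafe;
>
> public class UseAfterFreeDemo {
>   public static void main(String[] args) throws Exception {
>     // Grab the Unsafe singleton reflectively (works on JDK 8).
>     Field f = Unsafe.class.getDeclaredField("theUnsafe");
>     f.setAccessible(true);
>     Unsafe unsafe = (Unsafe) f.get(null);
>
>     long addr = unsafe.allocateMemory(8);
>     unsafe.putLong(addr, 0x1122334455667788L);
>     System.out.println(Long.toHexString(unsafe.getLong(addr))); // valid read
>
>     unsafe.freeMemory(addr);
>     // Undefined behavior from here on: may print garbage or crash the JVM
>     // with SIGSEGV, just like copying cells from a freed bucket.
>     System.out.println(Long.toHexString(unsafe.getLong(addr)));
>   }
> }
> {code}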
> There are also scan RPC errors at the handler when the coredump happens,
> !scan-error.png|width=585,height=235!
> I found a clue in the logs: a cached block may be replaced when its nextBlockOnDiskSize is less than that of the newly read block, in the following method:
>  
> {code:java}
> public static boolean shouldReplaceExistingCacheBlock(BlockCache blockCache,
>     BlockCacheKey cacheKey, Cacheable newBlock) {
>   if (cacheKey.toString().indexOf(".") != -1) { // reference file
>     LOG.warn("replace existing cached block, cache key is : " + cacheKey);
>     return true;
>   }
>   Cacheable existingBlock = blockCache.getBlock(cacheKey, false, false, false);
>   if (existingBlock == null) {
>     return true;
>   }
>   try {
>     int comparison = BlockCacheUtil.validateBlockAddition(existingBlock, newBlock, cacheKey);
>     if (comparison < 0) {
>       LOG.warn("Cached block contents differ by nextBlockOnDiskSize, the new block has "
>           + "nextBlockOnDiskSize set. Caching new block.");
>       return true;
> ......{code}
>  
> And the block will be replaced if it is not in the RAMCache but is already in the BucketCache.
> When using 
>  
> {code:java}
> private void putIntoBackingMap(BlockCacheKey key, BucketEntry bucketEntry) {
>   BucketEntry previousEntry = backingMap.put(key, bucketEntry);
>   if (previousEntry != null && previousEntry != bucketEntry) {
>     ReentrantReadWriteLock lock = offsetLock.getLock(previousEntry.offset());
>     lock.writeLock().lock();
>     try {
>       blockEvicted(key, previousEntry, false);
>     } finally {
>       lock.writeLock().unlock();
>     }
>   }
> }
> {code}
> to replace the old block, the previous bucket entry is force-released to avoid leaking it, regardless of any RPC references to it.
>  
> {code:java}
> void blockEvicted(BlockCacheKey cacheKey, BucketEntry bucketEntry, boolean decrementBlockNumber) {
>   bucketAllocator.freeBlock(bucketEntry.offset());
>   realCacheSize.add(-1 * bucketEntry.getLength());
>   blocksByHFile.remove(cacheKey);
>   if (decrementBlockNumber) {
>     this.blockNumber.decrement();
>   }
> }
> {code}
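> In other words, if a scanner RPC still holds a reference to the old bucket entry when the block is replaced, bucketAllocator.freeBlock() hands its bucket back to the allocator and the same offset can be reused for another block while shipped() is still copying cells out of it. A simplified, standalone sketch of that race (hypothetical Entry/bucketMemory names, not the real BucketCache classes):
> {code:java}
> import java.util.concurrent.atomic.AtomicInteger;
>
> /** Simplified model: the cache replaces an entry and frees its memory
>  *  while an RPC (scanner) still reads through a reference to it. */
> public class BucketReplaceRaceSketch {
>
>   /** Stand-in for a BucketEntry: an offset into shared memory plus a ref count. */
>   static class Entry {
>     final int offset;
>     final AtomicInteger refCnt = new AtomicInteger(1); // 1 = held by the cache itself
>     Entry(int offset) { this.offset = offset; }
>     boolean isRpcRef() { return refCnt.get() > 1; }    // extra references come from RPCs
>   }
>
>   static final byte[] bucketMemory = new byte[16];     // stand-in for the IOEngine memory
>
>   public static void main(String[] args) {
>     // Cache a block at offset 0 and let a scanner take an RPC reference to it.
>     Entry oldEntry = new Entry(0);
>     bucketMemory[oldEntry.offset] = 42;                // contents the scanner expects
>     oldEntry.refCnt.incrementAndGet();                 // scanner RPC reference
>
>     // Unconditional replacement (the buggy path): free the old entry's bucket
>     // even though isRpcRef() is true, then reuse the same offset for a new block.
>     Entry newEntry = new Entry(0);
>     bucketMemory[newEntry.offset] = 7;                 // allocator reused offset 0
>
>     // The scanner now ships its cells and reads stale memory: it sees 7, not 42.
>     // With real off-heap buckets this is a use-after-free and can SIGSEGV.
>     System.out.println("scanner read " + bucketMemory[oldEntry.offset]
>         + " (expected 42), rpcRef=" + oldEntry.isRpcRef());
>   }
> }
> {code}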
> I added a check of the RPC reference before replacing the bucket entry, and it works; there have been no coredumps since.
>  
> That is:
> {code:java}
> public void cacheBlockWithWait(BlockCacheKey cacheKey, Cacheable cachedItem, boolean inMemory,
>     boolean wait) {
>   if (cacheEnabled) {
>     if (backingMap.containsKey(cacheKey) || ramCache.containsKey(cacheKey)) {
>       if (BlockCacheUtil.shouldReplaceExistingCacheBlock(this, cacheKey, cachedItem)) {
>         BucketEntry bucketEntry = backingMap.get(cacheKey);
>         if (bucketEntry != null && bucketEntry.isRpcRef()) {
>           // avoid replace when there are RPC refs for the bucket entry in bucket cache
>           return;
>         }
>         cacheBlockWithWaitInternal(cacheKey, cachedItem, inMemory, wait);
>       }
>     } else {
>       cacheBlockWithWaitInternal(cacheKey, cachedItem, inMemory, wait);
>     }
>   }
> }
> {code}
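> Note that with this check the newly read block is simply not cached while an RPC still references the existing bucket entry; the old entry stays in place and can be replaced on a later read once the reference is released, so the cache never frees a bucket that a scanner is still reading from.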
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)