Posted to issues@hbase.apache.org by "Michael Stack (Jira)" <ji...@apache.org> on 2021/07/03 00:15:00 UTC
[jira] [Comment Edited] (HBASE-26062) SIGSEGV in AsyncFSWAL consume

    [ https://issues.apache.org/jira/browse/HBASE-26062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17373806#comment-17373806 ] 

Michael Stack edited comment on HBASE-26062 at 7/3/21, 12:14 AM:
-----------------------------------------------------------------

Made this an issue (was a sub-issue of HBASE-26042). I don't think it is related to HBASE-26042 now. I think it is something else.

TODO: See if there were rpc timeouts around the time of this crash. If so, try to see if the inbound rpc had a trailing cellblock sidecar. If so, perhaps this code in ServerRpcConnection is suspect:
{code:java}
if (header.hasCellBlockMeta()) {
  buf.position(offset);
  ByteBuff dup = buf.duplicate();
  dup.limit(offset + header.getCellBlockMeta().getLength());
  cellScanner = this.rpcServer.cellBlockBuilder.createCellScannerReusingBuffers(
      this.codec, this.compressionCodec, dup);
} {code}
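
Why the duplicate is a concern, as a minimal sketch: it uses plain java.nio.ByteBuffer rather than HBase's ByteBuff (an assumption, to keep it self-contained), and the class and method names are hypothetical. A duplicate has its own position and limit but reads the same memory, so whatever happens to the original buffer's bytes is visible through the duplicate.

```java
import java.nio.ByteBuffer;

public class DuplicateSharesMemory {
  /**
   * Mirrors the position/duplicate/limit steps of the snippet above with a
   * plain java.nio.ByteBuffer (assumption: stands in for HBase's ByteBuff).
   */
  public static ByteBuffer sliceCellBlock(ByteBuffer buf, int offset, int length) {
    buf.position(offset);
    ByteBuffer dup = buf.duplicate(); // independent position/limit, same bytes
    dup.limit(offset + length);
    return dup;
  }

  public static void main(String[] args) {
    ByteBuffer buf = ByteBuffer.allocateDirect(16);
    for (int i = 0; i < 16; i++) {
      buf.put(i, (byte) i);
    }
    ByteBuffer cellBlock = sliceCellBlock(buf, 4, 8);
    System.out.println(cellBlock.get(cellBlock.position())); // prints 4

    // Simulate the original buffer being recycled for another request.
    for (int i = 0; i < 16; i++) {
      buf.put(i, (byte) 0x7f);
    }
    // The duplicate now sees the new bytes; it never had a private copy.
    System.out.println(cellBlock.get(cellBlock.position())); // prints 127
  }
}
```

With heap or still-allocated direct memory this only yields stale data. If the backing native memory were actually freed while a reader still held the duplicate, a crash like the one reported would be plausible, but that is still speculation.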
 

Update: took a look back at one of the crashes. In the server-side logs at least, it is just a struggling server: lots of slow syncs and, a couple of minutes earlier, a GC pause of three seconds; otherwise nothing untoward.
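
For the timed-out-call idea in the TODO, here is a rough sketch of the kind of guard that would make a late consume safe. Everything in it is hypothetical (RefCountedBuffer is not an HBase class, and this is not a claim about how the WAL or RPC code manages buffers): the consumer takes its own reference before the edit is queued, so the RPC side dropping its reference on timeout does not free the memory out from under it.

```java
import java.nio.ByteBuffer;
import java.util.concurrent.atomic.AtomicInteger;

public class RefCountedBufferSketch {
  /** Hypothetical wrapper for illustration only; not an HBase class. */
  static final class RefCountedBuffer {
    private final ByteBuffer buf;
    private final AtomicInteger refCnt = new AtomicInteger(1);

    RefCountedBuffer(ByteBuffer buf) {
      this.buf = buf;
    }

    /** Takes an extra reference; fails if the buffer was already released. */
    RefCountedBuffer retain() {
      int c;
      do {
        c = refCnt.get();
        if (c <= 0) {
          throw new IllegalStateException("already released");
        }
      } while (!refCnt.compareAndSet(c, c + 1));
      return this;
    }

    /** Drops one reference; returns true when the last one is gone. */
    boolean release() {
      int c = refCnt.decrementAndGet();
      if (c < 0) {
        throw new IllegalStateException("released too many times");
      }
      return c == 0;
    }

    /** Guarded read: fails loudly instead of touching released memory. */
    byte get(int index) {
      if (refCnt.get() <= 0) {
        throw new IllegalStateException("use after release");
      }
      return buf.get(index);
    }
  }

  public static void main(String[] args) {
    RefCountedBuffer rpcBuf = new RefCountedBuffer(ByteBuffer.wrap(new byte[] {1, 2, 3}));
    rpcBuf.retain();                      // consumer takes its own reference before queueing
    boolean freed = rpcBuf.release();     // RPC side times out and drops its reference
    System.out.println(freed);            // false: the consumer still holds a reference
    System.out.println(rpcBuf.get(0));    // 1: the late consume is still safe
    System.out.println(rpcBuf.release()); // true: only now may the memory be recycled
  }
}
```

The check in get() is not atomic with the read, so it is a debugging aid rather than a real fix; the point is only that whoever reads late needs to hold a reference of its own.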


was (Author: stack):
Made this an issue. I don't think it related to HBASE-26042. I think it something else.

TODO: See if rpc timeouts around time of this crash. If so, try and see if the inbound rpc had a trailing cellblock sidecar. If so, perhaps this code suspect in ServerRpcConnection:
{code:java}
if (header.hasCellBlockMeta()) {
  buf.position(offset);
  ByteBuff dup = buf.duplicate();
  dup.limit(offset + header.getCellBlockMeta().getLength());
  cellScanner = this.rpcServer.cellBlockBuilder.createCellScannerReusingBuffers(
      this.codec, this.compressionCodec, dup);
} {code}

> SIGSEGV in AsyncFSWAL consume
> -----------------------------
>
>                 Key: HBASE-26062
>                 URL: https://issues.apache.org/jira/browse/HBASE-26062
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Michael Stack
>            Priority: Major
>
> Seems related to the parent issue. It's happened a few times on one of our clusters here. Below are two examples. Need more detail, but perhaps the call has timed out and the buffer has thus been freed, but the late consume on the other side of the ringbuffer doesn't know that and goes ahead (just speculation).
>  
> {code:java}
> #  SIGSEGV (0xb) at pc=0x00007f8b3ef5b77c, pid=37631, tid=0x00007f61560ed700
> RAX=0x00000000ffffdf6e is an unknown value
> RBX=0x00007f8a38d7b6f8 is an oop
> java.nio.DirectByteBuffer
>  - klass: 'java/nio/DirectByteBuffer'
> RCX=0x00007f60e2767898 is pointing into metadata
> RDX=0x0000000000000de7 is an unknown value
> RSP=0x00007f61560ec6f0 is pointing into the stack for thread: 0x00007f8b3017b800
> RBP=[error occurred during error reporting (printing register info), id 0xb]
> Stack: [0x00007f6155fed000,0x00007f61560ee000],  sp=0x00007f61560ec6f0,  free space=1021k
> Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
> J 23901 C2 java.util.stream.MatchOps$1MatchSink.accept(Ljava/lang/Object;)V (44 bytes) @ 0x00007f8b3ef5b77c [0x00007f8b3ef5b640+0x13c]
> J 16165 C2 java.util.ArrayList$ArrayListSpliterator.tryAdvance(Ljava/util/function/Consumer;)Z (79 bytes) @ 0x00007f8b3d67b344 [0x00007f8b3d67b2c0+0x84]
> J 16160 C2 java.util.stream.MatchOps$MatchOp.evaluateSequential(Ljava/util/stream/PipelineHelper;Ljava/util/Spliterator;)Ljava/lang/Object; (7 bytes) @ 0x00007f8b3d67bc9c [0x00007f8b3d67b900+0x39c]
> J 17729 C2 org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALActionListener.visitLogEntryBeforeWrite(Lorg/apache/hadoop/hbase/wal/WALKey;Lorg/apache/hadoop/hbase/wal/WALEdit;)V (10 bytes) @ 0x00007f8b3fc39010 [0x00007f8b3fc388a0+0x770]
> J 29991 C2 org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.appendAndSync()V (261 bytes) @ 0x00007f8b3fd03d90 [0x00007f8b3fd039e0+0x3b0]
> J 20773 C2 org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.consume()V (474 bytes) @ 0x00007f8b40283728 [0x00007f8b40283480+0x2a8]
> J 15191 C2 org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL$$Lambda$76.run()V (8 bytes) @ 0x00007f8b3ed69ecc [0x00007f8b3ed69ea0+0x2c]
> J 17383% C2 java.util.concurrent.ThreadPoolExecutor.runWorker(Ljava/util/concurrent/ThreadPoolExecutor$Worker;)V (225 bytes) @ 0x00007f8b3d9423f8 [0x00007f8b3d942260+0x198]
> j  java.util.concurrent.ThreadPoolExecutor$Worker.run()V+5
> j  java.lang.Thread.run()V+11
> v  ~StubRoutines::call_stub
> V  [libjvm.so+0x66b9ba]  JavaCalls::call_helper(JavaValue*, methodHandle*, JavaCallArguments*, Thread*)+0xe1a
> V  [libjvm.so+0x669073]  JavaCalls::call_virtual(JavaValue*, KlassHandle, Symbol*, Symbol*, JavaCallArguments*, Thread*)+0x263
> V  [libjvm.so+0x669647]  JavaCalls::call_virtual(JavaValue*, Handle, KlassHandle, Symbol*, Symbol*, Thread*)+0x57
> V  [libjvm.so+0x6aaa4c]  thread_entry(JavaThread*, Thread*)+0x6c
> V  [libjvm.so+0xa224cb]  JavaThread::thread_main_inner()+0xdb
> V  [libjvm.so+0xa22816]  JavaThread::run()+0x316
> V  [libjvm.so+0x8c4202]  java_start(Thread*)+0x102
> C  [libpthread.so.0+0x76ba]  start_thread+0xca
> {code}
>  
> This one is from a month previous and has a deeper stack... we're trying to read a Cell...
>  
> {code:java}
> Stack: [0x00007fa1d5fb8000,0x00007fa1d60b9000],  sp=0x00007fa1d60b7660,  free space=1021k
> Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
> J 30665 C2 org.apache.hadoop.hbase.PrivateCellUtil.matchingFamily(Lorg/apache/hadoop/hbase/Cell;[BII)Z (59 bytes) @ 0x00007fcc2d29eeb2 [0x00007fcc2d29e7c0+0x6f2]
> J 25816 C2 org.apache.hadoop.hbase.CellUtil.matchingFamily(Lorg/apache/hadoop/hbase/Cell;[B)Z (28 bytes) @ 0x00007fcc2a0430f8 [0x00007fcc2a0430e0+0x18]
> J 17236 C2 org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALActionListener$$Lambda$254.test(Ljava/lang/Object;)Z (8 bytes) @ 0x00007fcc2b40bc68 [0x00007fcc2b40bc20+0x48]
> J 13735 C2 java.util.ArrayList$ArrayListSpliterator.tryAdvance(Ljava/util/function/Consumer;)Z (79 bytes) @ 0x00007fcc2b7d936c [0x00007fcc2b7d92c0+0xac]
> J 17162 C2 java.util.stream.MatchOps$MatchOp.evaluateSequential(Ljava/util/stream/PipelineHelper;Ljava/util/Spliterator;)Ljava/lang/Object; (7 bytes) @ 0x00007fcc29bc05e8 [0x00007fcc29bbfe80+0x768]
> J 16934 C2 org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALActionListener.visitLogEntryBeforeWrite(Lorg/apache/hadoop/hbase/wal/WALKey;Lorg/apache/hadoop/hbase/wal/WALEdit;)V (10 bytes) @ 0x00007fcc2bb313f8 [0x00007fcc2bb30c60+0x798]
> J 30732 C2 org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.appendAndSync()V (261 bytes) @ 0x00007fcc2ae5a420 [0x00007fcc2ae59d60+0x6c0]
> J 22203 C2 org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.consume()V (474 bytes) @ 0x00007fcc2a987420 [0x00007fcc2a987200+0x220]
> J 16857 C2 org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL$$Lambda$126.run()V (8 bytes) @ 0x00007fcc2b0bf28c [0x00007fcc2b0bf260+0x2c]
> J 13721% C2 java.util.concurrent.ThreadPoolExecutor.runWorker(Ljava/util/concurrent/ThreadPoolExecutor$Worker;)V (225 bytes) @ 0x00007fcc2b7d77c0 [0x00007fcc2b7d7240+0x580]
> j  java.util.concurrent.ThreadPoolExecutor$Worker.run()V+5
> j  java.lang.Thread.run()V+11
> v  ~StubRoutines::call_stub
> {code}


