Posted to issues@hbase.apache.org by "Michael Stack (Jira)" <ji...@apache.org> on 2021/07/15 18:14:00 UTC

[jira] [Commented] (HBASE-26092) JVM core dump in the replication path

    [ https://issues.apache.org/jira/browse/HBASE-26092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17381512#comment-17381512 ] 

Michael Stack commented on HBASE-26092:
---------------------------------------

With replication enabled on a ~700-node cluster, we'd lose an RS every day or so, with crashes that were variants of the trace below (hit while building the cellblock):
{code:java}
Stack: [0x00007edc2b215000,0x00007edc2b316000],  sp=0x00007edc2b314480,  free space=1021k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
J 12332 C2 org.apache.hadoop.hbase.codec.KeyValueCodecWithTags$KeyValueEncoder.write(Lorg/apache/hadoop/hbase/Cell;)V (27 bytes) @ 0x00007f065ada3047 [0x00007f065ada2c40+0x407]
J 16249 C2 org.apache.hadoop.hbase.ipc.CellBlockBuilder.encodeCellsTo(Ljava/io/OutputStream;Lorg/apache/hadoop/hbase/CellScanner;Lorg/apache/hadoop/hbase/codec/Codec;Lorg/apache/hadoop/io/compress/CompressionCodec;)V (138 bytes) @ 0x00007f065b716550 [0x00007f065b716380+0x1d0]
J 6822 C2 org.apache.hadoop.hbase.ipc.CellBlockBuilder.buildCellBlock(Lorg/apache/hadoop/hbase/codec/Codec;Lorg/apache/hadoop/io/compress/CompressionCodec;Lorg/apache/hadoop/hbase/CellScanner;Lorg/apache/hadoop/hbase/ipc/CellBlockBuilder$OutputStreamSupplier;)Z (113 bytes) @ 0x00007f0659917424 [0x00007f0659916fc0+0x464]
J 6824 C2 org.apache.hadoop.hbase.ipc.NettyRpcDuplexHandler.writeRequest(Lorg/apache/hbase/thirdparty/io/netty/channel/ChannelHandlerContext;Lorg/apache/hadoop/hbase/ipc/Call;Lorg/apache/hbase/thirdparty/io/netty/channel/ChannelPromise;)V (370 bytes) @ 0x00007f065a4041f4 [0x00007f065a403fc0+0x234]
J 6823 C2 org.apache.hadoop.hbase.ipc.NettyRpcDuplexHandler.write(Lorg/apache/hbase/thirdparty/io/netty/channel/ChannelHandlerContext;Ljava/lang/Object;Lorg/apache/hbase/thirdparty/io/netty/channel/ChannelPromise;)V (30 bytes) @ 0x00007f065962d414 [0x00007f065962d3e0+0x34]
J 5492 C2 org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.write(Ljava/lang/Object;ZLorg/apache/hbase/thirdparty/io/netty/channel/ChannelPromise;)V (149 bytes) @ 0x00007f0659f04f48 [0x00007f0659f04c60+0x2e8]
J 6996 C2 org.apache.hadoop.hbase.ipc.NettyRpcConnection$6$1.run()V (22 bytes) @ 0x00007f06599d4eec [0x00007f06599d4c80+0x26c]
J 27396 C2 org.apache.hbase.thirdparty.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(J)Z (106 bytes) @ 0x00007f065c15e660 [0x00007f065c15e400+0x260]
J 21998% C2 org.apache.hbase.thirdparty.io.netty.channel.epoll.EpollEventLoop.run()V (461 bytes) @ 0x00007f0659de9570 [0x00007f0659de9000+0x570]
j  org.apache.hbase.thirdparty.io.netty.util.concurrent.SingleThreadEventExecutor$4.run()V+44
j  org.apache.hbase.thirdparty.io.netty.util.internal.ThreadExecutorMap$2.run()V+11
j  org.apache.hbase.thirdparty.io.netty.util.concurrent.FastThreadLocalRunnable.run()V+4
{code}

> JVM core dump in the replication path
> -------------------------------------
>
>                 Key: HBASE-26092
>                 URL: https://issues.apache.org/jira/browse/HBASE-26092
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication
>    Affects Versions: 2.3.5
>            Reporter: Huaxiang Sun
>            Priority: Critical
>
> When replication is turned on, we found the following core dump in the region server.
> I looked into the core dump and have an idea of the cause. During replication, when an RS receives WAL edits from the remote cluster, it has to forward them to the final RS. That forwarding goes through NettyRpcConnection, where calls are queued while they still refer to a ByteBuffer that belongs to the context of the replicationHandler; that buffer is returned to the pool as soon as the handler returns. The core dump happens because the ByteBuffer has already been reused by the time the queued call is written. This asynchronous processing needs reference counting.
>  
> Feel free to take it, otherwise, I will try to work on a patch later.
>  
>  
> {code:java}
> Stack: [0x00007fb1bf039000,0x00007fb1bf13a000],  sp=0x00007fb1bf138560,  free space=1021k
> Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
> J 28175 C2 org.apache.hadoop.hbase.ByteBufferKeyValue.write(Ljava/io/OutputStream;Z)I (21 bytes) @ 0x00007fdbbbb2663c [0x00007fdbbbb263c0+0x27c]
> J 14912 C2 org.apache.hadoop.hbase.ipc.NettyRpcDuplexHandler.writeRequest(Lorg/apache/hbase/thirdparty/io/netty/channel/ChannelHandlerContext;Lorg/apache/hadoop/hbase/ipc/Call;Lorg/apache/hbase/thirdparty/io/netty/channel/ChannelPromise;)V (370 bytes) @ 0x00007fdbbb94b590 [0x00007fdbbb949c00+0x1990]
> J 14911 C2 org.apache.hadoop.hbase.ipc.NettyRpcDuplexHandler.write(Lorg/apache/hbase/thirdparty/io/netty/channel/ChannelHandlerContext;Ljava/lang/Object;Lorg/apache/hbase/thirdparty/io/netty/channel/ChannelPromise;)V (30 bytes) @ 0x00007fdbb972d1d4 [0x00007fdbb972d1a0+0x34]
> J 30476 C2 org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.write(Ljava/lang/Object;ZLorg/apache/hbase/thirdparty/io/netty/channel/ChannelPromise;)V (149 bytes) @ 0x00007fdbbd4e7084 [0x00007fdbbd4e6900+0x784]
> J 14914 C2 org.apache.hadoop.hbase.ipc.NettyRpcConnection$6$1.run()V (22 bytes) @ 0x00007fdbbb9344ec [0x00007fdbbb934280+0x26c]
> J 23528 C2 org.apache.hbase.thirdparty.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(J)Z (106 bytes) @ 0x00007fdbbcbb0efc [0x00007fdbbcbb0c40+0x2bc]
> J 15987% C2 org.apache.hbase.thirdparty.io.netty.channel.epoll.EpollEventLoop.run()V (461 bytes) @ 0x00007fdbbbaf1580 [0x00007fdbbbaf1360+0x220]
> j  org.apache.hbase.thirdparty.io.netty.util.concurrent.SingleThreadEventExecutor$4.run()V+44
> j  org.apache.hbase.thirdparty.io.netty.util.internal.ThreadExecutorMap$2.run()V+11
> j  org.apache.hbase.thirdparty.io.netty.util.concurrent.FastThreadLocalRunnable.run()V+4
> {code}
>  
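
To illustrate the reference-counting idea from the description above, here is a minimal sketch using only the JDK. It is not HBase or Netty code and not the fix for this issue: the names BufferPool, RefCountedBuffer and RefCountSketch are hypothetical, and a single-thread executor stands in for the Netty event loop. The point it shows is that the queued call takes its own reference before the handler returns, so the buffer only goes back to the pool once the asynchronous write has also released it.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

public class RefCountSketch {

  /** Hypothetical pool: hands out buffers and takes them back for reuse. */
  static final class BufferPool {
    private final ArrayDeque<ByteBuffer> free = new ArrayDeque<>();

    synchronized RefCountedBuffer acquire(int size) {
      ByteBuffer b = free.isEmpty() ? ByteBuffer.allocate(size) : free.pop();
      b.clear();
      return new RefCountedBuffer(b, this);
    }

    synchronized void giveBack(ByteBuffer b) {
      free.push(b);
    }

    synchronized int freeCount() {
      return free.size();
    }
  }

  /** Hypothetical wrapper: the buffer returns to the pool when the count hits zero. */
  static final class RefCountedBuffer {
    private final ByteBuffer buf;
    private final BufferPool pool;
    private final AtomicInteger refs = new AtomicInteger(1); // creator holds one reference

    RefCountedBuffer(ByteBuffer buf, BufferPool pool) {
      this.buf = buf;
      this.pool = pool;
    }

    ByteBuffer buffer() {
      if (refs.get() <= 0) {
        throw new IllegalStateException("buffer already released");
      }
      return buf;
    }

    RefCountedBuffer retain() {
      for (;;) { // CAS loop so a fully released buffer is never resurrected
        int n = refs.get();
        if (n <= 0) {
          throw new IllegalStateException("buffer already released");
        }
        if (refs.compareAndSet(n, n + 1)) {
          return this;
        }
      }
    }

    void release() {
      int n = refs.decrementAndGet();
      if (n == 0) {
        pool.giveBack(buf);
      } else if (n < 0) {
        throw new IllegalStateException("released more often than retained");
      }
    }
  }

  /** The handler queues an async write, returns first, and the write still sees its data. */
  static String demo() {
    BufferPool pool = new BufferPool();
    ExecutorService eventLoop = Executors.newSingleThreadExecutor(); // stand-in for the event loop
    try {
      RefCountedBuffer cell = pool.acquire(64);
      cell.buffer().put("row-1".getBytes(StandardCharsets.UTF_8)).flip();

      cell.retain(); // reference held on behalf of the queued call
      CountDownLatch handlerReturned = new CountDownLatch(1);
      Future<String> written = eventLoop.submit(() -> {
        handlerReturned.await(); // force the write to run after the handler has returned
        try {
          ByteBuffer view = cell.buffer().duplicate();
          byte[] out = new byte[view.remaining()];
          view.get(out);
          return new String(out, StandardCharsets.UTF_8);
        } finally {
          cell.release(); // last reference: only now may the pool reuse the buffer
        }
      });

      cell.release(); // handler returns; the buffer must NOT be back in the pool yet
      if (pool.freeCount() != 0) {
        throw new IllegalStateException("buffer was recycled while a call was still queued");
      }
      handlerReturned.countDown();
      return written.get();
    } catch (Exception e) {
      throw new RuntimeException(e);
    } finally {
      eventLoop.shutdown();
    }
  }

  public static void main(String[] args) {
    System.out.println(demo());
  }
}
```

Without the retain() call in this sketch, the handler's release() would drop the count to zero and hand the buffer back to the pool while the write is still queued, which is the reuse-before-write pattern the description blames for the crash.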


