Posted to issues@bookkeeper.apache.org by GitBox <gi...@apache.org> on 2021/06/08 10:11:22 UTC

[GitHub] [bookkeeper] hangc0276 opened a new issue #2729: 【BUG】BookKeeper Netty chennel OOM

hangc0276 opened a new issue #2729:
URL: https://github.com/apache/bookkeeper/issues/2729


   **BUG REPORT**
   
   ***Describe the bug***
   When running a BookKeeper cluster with hundreds of bookies, an OOM is thrown when the auditor is triggered to check ledgers:
   ```
   18:12:03.200 [bookkeeper-io-44-2] ERROR org.apache.bookkeeper.common.allocator.impl.ByteBufAllocatorImpl - Unable to allocate memory
   java.lang.OutOfMemoryError: Direct buffer memory
           at java.nio.Bits.reserveMemory(Bits.java:187) ~[?:?]
           at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123) ~[?:?]
           at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:310) ~[?:?]
           at io.netty.buffer.PoolArena$DirectArena.allocateDirect(PoolArena.java:755) ~[io.netty-netty-buffer-4.1.51.Final.jar:4.1.51.Final]
           at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:731) ~[io.netty-netty-buffer-4.1.51.Final.jar:4.1.51.Final]
           at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:247) ~[io.netty-netty-buffer-4.1.51.Final.jar:4.1.51.Final]
           at io.netty.buffer.PoolArena.allocate(PoolArena.java:215) ~[io.netty-netty-buffer-4.1.51.Final.jar:4.1.51.Final]
           at io.netty.buffer.PoolArena.allocate(PoolArena.java:147) ~[io.netty-netty-buffer-4.1.51.Final.jar:4.1.51.Final]
           at io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:356) ~[io.netty-netty-buffer-4.1.51.Final.jar:4.1.51.Final]
           at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:187) ~[io.netty-netty-buffer-4.1.51.Final.jar:4.1.51.Final]
           at org.apache.bookkeeper.common.allocator.impl.ByteBufAllocatorImpl.newDirectBuffer(ByteBufAllocatorImpl.java:164) [org.apache.bookkeeper-bookkeeper-common-allocator-4.12.0.jar:4.12.0]
           at org.apache.bookkeeper.common.allocator.impl.ByteBufAllocatorImpl.newDirectBuffer(ByteBufAllocatorImpl.java:158) [org.apache.bookkeeper-bookkeeper-common-allocator-4.12.0.jar:4.12.0]
           at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:187) [io.netty-netty-buffer-4.1.51.Final.jar:4.1.51.Final]
           at org.apache.bookkeeper.proto.BookieProtoEncoding.serializeProtobuf(BookieProtoEncoding.java:366) [org.apache.bookkeeper-bookkeeper-server-4.12.0.jar:4.12.0]
           at org.apache.bookkeeper.proto.BookieProtoEncoding.access$100(BookieProtoEncoding.java:54) [org.apache.bookkeeper-bookkeeper-server-4.12.0.jar:4.12.0]
           at org.apache.bookkeeper.proto.BookieProtoEncoding$RequestEnDecoderV3.encode(BookieProtoEncoding.java:328) [org.apache.bookkeeper-bookkeeper-server-4.12.0.jar:4.12.0]
           at org.apache.bookkeeper.proto.BookieProtoEncoding$RequestEncoder.write(BookieProtoEncoding.java:400) [org.apache.bookkeeper-bookkeeper-server-4.12.0.jar:4.12.0]
           at io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:717) [io.netty-netty-transport-4.1.51.Final.jar:4.1.51.Final]
           at io.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:764) [io.netty-netty-transport-4.1.51.Final.jar:4.1.51.Final]
           at io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:790) [io.netty-netty-transport-4.1.51.Final.jar:4.1.51.Final]
           at io.netty.channel.AbstractChannelHandlerContext.writeAndFlush(AbstractChannelHandlerContext.java:758) [io.netty-netty-transport-4.1.51.Final.jar:4.1.51.Final]
           at io.netty.channel.AbstractChannelHandlerContext.writeAndFlush(AbstractChannelHandlerContext.java:808) [io.netty-netty-transport-4.1.51.Final.jar:4.1.51.Final]
           at org.apache.bookkeeper.proto.AuthHandler$ClientSideHandler$AuthHandshakeCompleteCallback.operationComplete(AuthHandler.java:437) [org.apache.bookkeeper-bookkeeper-server-4.12.0.jar:4.12.0]
           at org.apache.bookkeeper.proto.AuthHandler$ClientSideHandler$AuthHandshakeCompleteCallback.operationComplete(AuthHandler.java:423) [org.apache.bookkeeper-bookkeeper-server-4.12.0.jar:4.12.0]
           at org.apache.bookkeeper.auth.AuthProviderFactoryFactory$NullClientAuthProviderFactory.newProvider(AuthProviderFactoryFactory.java:102) [org.apache.bookkeeper-bookkeeper-server-4.12.0.jar:4.12.0]
           at org.apache.bookkeeper.proto.AuthHandler$ClientSideHandler.channelActive(AuthHandler.java:258) [org.apache.bookkeeper-bookkeeper-server-4.12.0.jar:4.12.0]
           at io.netty.channel.AbstractChannelHandlerContext.invokeChannelActive(AbstractChannelHandlerContext.java:230) [io.netty-netty-transport-4.1.51.Final.jar:4.1.51.Final]
           at io.netty.channel.AbstractChannelHandlerContext.invokeChannelActive(AbstractChannelHandlerContext.java:216) [io.netty-netty-transport-4.1.51.Final.jar:4.1.51.Final]
           at io.netty.channel.AbstractChannelHandlerContext.fireChannelActive(AbstractChannelHandlerContext.java:209) [io.netty-netty-transport-4.1.51.Final.jar:4.1.51.Final]
           at io.netty.channel.DefaultChannelPipeline$HeadContext.channelActive(DefaultChannelPipeline.java:1398) [io.netty-netty-transport-4.1.51.Final.jar:4.1.51.Final]
           at io.netty.channel.AbstractChannelHandlerContext.invokeChannelActive(AbstractChannelHandlerContext.java:230) [io.netty-netty-transport-4.1.51.Final.jar:4.1.51.Final]
           at io.netty.channel.AbstractChannelHandlerContext.invokeChannelActive(AbstractChannelHandlerContext.java:216) [io.netty-netty-transport-4.1.51.Final.jar:4.1.51.Final]
           at io.netty.channel.DefaultChannelPipeline.fireChannelActive(DefaultChannelPipeline.java:895) [io.netty-netty-transport-4.1.51.Final.jar:4.1.51.Final]
           at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.fulfillConnectPromise(AbstractEpollChannel.java:620) [io.netty-netty-transport-native-epoll-4.1.51.Final.jar:4.1.51.Final]
           at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.finishConnect(AbstractEpollChannel.java:653) [io.netty-netty-transport-native-epoll-4.1.51.Final.jar:4.1.51.Final]
           at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.epollOutReady(AbstractEpollChannel.java:529) [io.netty-netty-transport-native-epoll-4.1.51.Final.jar:4.1.51.Final]
           at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:465) [io.netty-netty-transport-native-epoll-4.1.51.Final.jar:4.1.51.Final]
           at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378) [io.netty-netty-transport-native-epoll-4.1.51.Final.jar:4.1.51.Final]
           at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989) [io.netty-netty-common-4.1.51.Final.jar:4.1.51.Final]
           at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [io.netty-netty-common-4.1.51.Final.jar:4.1.51.Final]
           at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [io.netty-netty-common-4.1.51.Final.jar:4.1.51.Final]
           at java.lang.Thread.run(Thread.java:844) [?:?]
   
   ```
   
   The code was introduced by #2441, which changed `serializeProtobuf` to serialize the message into direct memory instead of heap memory. The code is as follows:
   ```Java
   private static ByteBuf serializeProtobuf(MessageLite msg, ByteBufAllocator allocator) {
           int size = msg.getSerializedSize();
           // Protobuf serialization is the last step of the netty pipeline. We used to allocate
           // a heap buffer while serializing and pass it down to netty library.
           // In AbstractChannel#filterOutboundMessage(), netty copies that data to a direct buffer if
           // it is currently in heap (otherwise skips it and uses it directly).
           // Allocating a direct buffer reduces unnecessary CPU cycles for buffer copies in BK client
           // and also helps alleviate pressure off the GC, since there is less memory churn.
           // Bookies aren't usually CPU bound. This change improves READ_ENTRY code paths by a small factor as well.
           ByteBuf buf = allocator.directBuffer(size, size);
   
           try {
               msg.writeTo(CodedOutputStream.newInstance(buf.nioBuffer(buf.readerIndex(), size)));
           } catch (IOException e) {
               // This is in-memory serialization, should not fail
               throw new RuntimeException(e);
           }
   
           // Advance writer idx
           buf.writerIndex(buf.capacity());
           return buf;
      }
   ```
   
   I suspect this may cause a direct memory leak because the buffer is never released.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [bookkeeper] hangc0276 commented on issue #2729: 【BUG】BookKeeper Netty chennel OOM

Posted by GitBox <gi...@apache.org>.
hangc0276 commented on issue #2729:
URL: https://github.com/apache/bookkeeper/issues/2729#issuecomment-856753209


   @karanmehta93 Do you have any ideas?





[GitHub] [bookkeeper] hangc0276 commented on issue #2729: 【BUG】BookKeeper Netty chennel OOM

Posted by GitBox <gi...@apache.org>.
hangc0276 commented on issue #2729:
URL: https://github.com/apache/bookkeeper/issues/2729#issuecomment-856642765


   @merlimat @eolivelli  @sijie  PTAL, thanks.





[GitHub] [bookkeeper] hangc0276 commented on issue #2729: 【BUG】BookKeeper Netty chennel OOM

Posted by GitBox <gi...@apache.org>.
hangc0276 commented on issue #2729:
URL: https://github.com/apache/bookkeeper/issues/2729#issuecomment-856717441


   > @hangc0276 Do you have a chance to run with `-Dio.netty.leakDetectionLevel=advanced -Dio.netty.leakDetection.targetRecords=30` in the JVM options for the BookKeeper server? That would enable the Netty leak detector and print out log entries that would help locate the possible leak.
   > 
   > It's also possible that there isn't a memory leak and it's simply caused by high memory requirements when there's high load. When using Pulsar, the default configuration for BookKeeper doesn't enable the specific backpressure features that are available in the BookKeeper server (#1409) and client (#1086). There's a separate enhancement request in Pulsar to make use of these features. It is filed as [apache/pulsar#10439](https://github.com/apache/pulsar/issues/10439).
   
   @lhotari Thanks for your feedback. Our production BookKeeper cluster has hundreds of bookie instances, and I would need to do a rolling restart of all of them to add those JVM options. The OOM only occurs when the auditor starts to scan all ledgers; the JVM direct memory monitor shows no direct memory in use, and the bookie instance is under low load.
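
   As a point of reference for what a "JVM direct memory monitor" typically reads, here is a minimal sketch that queries the JVM's `direct` buffer pool through the standard `BufferPoolMXBean`. The class name `DirectMemoryProbe` is hypothetical and not part of BookKeeper. Note the assumption: this pool only accounts for buffers obtained through `ByteBuffer.allocateDirect`; whether a given allocator configuration reports all of its direct memory here should be verified separately.

   ```java
   import java.lang.management.BufferPoolMXBean;
   import java.lang.management.ManagementFactory;
   import java.nio.ByteBuffer;

   // Hypothetical helper, for illustration only.
   final class DirectMemoryProbe {
       /** Returns the bytes used by the JVM's "direct" buffer pool, or -1 if the pool is not found. */
       static long usedDirectBytes() {
           for (BufferPoolMXBean pool : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
               if ("direct".equals(pool.getName())) {
                   return pool.getMemoryUsed();
               }
           }
           return -1L;
       }

       public static void main(String[] args) {
           long before = usedDirectBytes();
           // Allocate 1 MiB of direct memory the same way the stack trace above does.
           ByteBuffer buf = ByteBuffer.allocateDirect(1 << 20);
           long after = usedDirectBytes();
           System.out.println("direct pool before=" + before
                   + " after=" + after + " capacity=" + buf.capacity());
       }
   }
   ```

   Sampling this value on the auditor bookie while the ledger check runs, and comparing it with the configured direct memory limit, may help tell a steady leak apart from a short allocation spike.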





[GitHub] [bookkeeper] lhotari commented on issue #2729: 【BUG】BookKeeper Netty chennel OOM

Posted by GitBox <gi...@apache.org>.
lhotari commented on issue #2729:
URL: https://github.com/apache/bookkeeper/issues/2729#issuecomment-856719830


   > Our production BookKeeper cluster has hundreds of bookie instances, and I would need to do a rolling restart of all of them to add those JVM options. The OOM only occurs when the auditor starts to scan all ledgers; the JVM direct memory monitor shows no direct memory in use, and the bookie instance is under low load.
   
   The `-Dio.netty.leakDetectionLevel=advanced -Dio.netty.leakDetection.targetRecords=30` settings might have a major performance impact, so applying them to a production cluster is not recommended. It's better to first test the impact of enabling the leak detector in a test environment before using it in production.
   





[GitHub] [bookkeeper] lhotari commented on issue #2729: 【BUG】BookKeeper Netty chennel OOM

Posted by GitBox <gi...@apache.org>.
lhotari commented on issue #2729:
URL: https://github.com/apache/bookkeeper/issues/2729#issuecomment-856728501


   @hangc0276 Can you share more information about the BookKeeper version and your environment?
   What are your current JVM options? What type of deployment is it: k8s or bare metal?





[GitHub] [bookkeeper] karanmehta93 commented on issue #2729: 【BUG】BookKeeper Netty chennel OOM

Posted by GitBox <gi...@apache.org>.
karanmehta93 commented on issue #2729:
URL: https://github.com/apache/bookkeeper/issues/2729#issuecomment-856869881


   @hangc0276 In my understanding, this change shouldn't increase/decrease direct memory usage from before. 
   
   Earlier, when the heap buffer was allocated, netty lib used to allocate direct memory and then copy contents there. Now that is bypassed.
   
   Netty lib takes the responsibility of releasing memory. 
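
   The difference between the two paths described above can be sketched with plain `java.nio` buffers. This is only an illustration under stated assumptions: it does not use Netty or BookKeeper classes, and the class and method names (`HeapVsDirectSketch`, `viaHeapCopy`, `directOnly`) are hypothetical. In both paths exactly one direct buffer of the payload size ends up being allocated; the second path just skips the intermediate heap buffer and the copy.

   ```java
   import java.nio.ByteBuffer;
   import java.nio.charset.StandardCharsets;

   // Hypothetical illustration; not the Netty or BookKeeper implementation.
   final class HeapVsDirectSketch {
       /** Earlier path as described: serialize to heap, then copy into a direct buffer. */
       static ByteBuffer viaHeapCopy(byte[] payload) {
           ByteBuffer heap = ByteBuffer.allocate(payload.length);
           heap.put(payload);
           heap.flip();
           // The transport layer needs a direct buffer, so it allocates one and copies.
           ByteBuffer direct = ByteBuffer.allocateDirect(heap.remaining());
           direct.put(heap);
           direct.flip();
           return direct;
       }

       /** Current path as described: serialize straight into a direct buffer, no copy. */
       static ByteBuffer directOnly(byte[] payload) {
           ByteBuffer direct = ByteBuffer.allocateDirect(payload.length);
           direct.put(payload);
           direct.flip();
           return direct;
       }

       public static void main(String[] args) {
           byte[] payload = "serialized request".getBytes(StandardCharsets.UTF_8);
           ByteBuffer a = viaHeapCopy(payload);
           ByteBuffer b = directOnly(payload);
           System.out.println("same contents: " + a.equals(b)
                   + ", direct bytes per request: " + b.capacity());
       }
   }
   ```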





[GitHub] [bookkeeper] hangc0276 commented on issue #2729: 【BUG】BookKeeper Netty chennel OOM

Posted by GitBox <gi...@apache.org>.
hangc0276 commented on issue #2729:
URL: https://github.com/apache/bookkeeper/issues/2729#issuecomment-856722183


   > > Our production BookKeeper cluster has hundreds of bookie instances, and I would need to do a rolling restart of all of them to add those JVM options. The OOM only occurs when the auditor starts to scan all ledgers; the JVM direct memory monitor shows no direct memory in use, and the bookie instance is under low load.
   > 
   > The `-Dio.netty.leakDetectionLevel=advanced -Dio.netty.leakDetection.targetRecords=30` settings might have a major performance impact, so applying them to a production cluster is not recommended. It's better to first test the impact of enabling the leak detector in a test environment before using it in production.
   
   It may occur only in large-scale BookKeeper clusters, and it only occurs on the bookie instance running the Auditor. I will try enabling those JVM options first. Do you have any other ideas about this OOM?





[GitHub] [bookkeeper] lhotari commented on issue #2729: 【BUG】BookKeeper Netty chennel OOM

Posted by GitBox <gi...@apache.org>.
lhotari commented on issue #2729:
URL: https://github.com/apache/bookkeeper/issues/2729#issuecomment-856726505


   > Do you have any other ideas about this OOM?
   
   Well, there are a lot of ideas for how to debug it, but debugging in production is something where you have to be very cautious. :)





[GitHub] [bookkeeper] dlg99 commented on issue #2729: 【BUG】BookKeeper Netty chennel OOM

Posted by GitBox <gi...@apache.org>.
dlg99 commented on issue #2729:
URL: https://github.com/apache/bookkeeper/issues/2729#issuecomment-856921213


   @hangc0276 It is more than likely that there is no leak at all and the application simply runs out of (direct) memory under a specific load (when the auditor/autorecovery is working).
   
   I think your options are:
   
   1. Increase direct memory (JVM option).
   2. Set `autoRecoveryDaemonEnabled=false` on bookies and [run autorecovery as separate instance(s)](https://bookkeeper.apache.org/docs/latest/admin/autorecovery/). I think this is the best option in general; autorecovery won't interfere with the bookie.
   3. Figure out a way to throttle the autorecovery (I don't remember off the top of my head what's going on there or which config params could help). This can be helpful for option 2 as well.
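
   To make the throttling idea in the last option concrete, here is a minimal, generic sketch of bounding the number of in-flight operations with a `Semaphore`, so that peak buffer demand is capped by the limit rather than by the number of ledgers being checked. This is not BookKeeper's auditor code or one of its configuration parameters; `BoundedInFlight`, `simulate`, and the chosen limits are hypothetical names and values used only for illustration.

   ```java
   import java.util.concurrent.ExecutorService;
   import java.util.concurrent.Executors;
   import java.util.concurrent.Semaphore;
   import java.util.concurrent.TimeUnit;
   import java.util.concurrent.atomic.AtomicInteger;
   import java.util.concurrent.locks.LockSupport;

   // Hypothetical limiter, for illustration only.
   final class BoundedInFlight {
       private final Semaphore permits;
       private final AtomicInteger inFlight = new AtomicInteger();
       private final AtomicInteger maxObserved = new AtomicInteger();

       BoundedInFlight(int maxInFlight) {
           this.permits = new Semaphore(maxInFlight);
       }

       /** Blocks until a permit is free, runs the check, then releases the permit. */
       void run(Runnable check) throws InterruptedException {
           permits.acquire();
           try {
               int now = inFlight.incrementAndGet();
               maxObserved.accumulateAndGet(now, Math::max);
               check.run();
           } finally {
               inFlight.decrementAndGet();
               permits.release();
           }
       }

       int maxObserved() {
           return maxObserved.get();
       }

       /** Runs simulated checks on several threads and returns the peak concurrency seen. */
       static int simulate(int tasks, int threads, int limit) {
           BoundedInFlight limiter = new BoundedInFlight(limit);
           ExecutorService pool = Executors.newFixedThreadPool(threads);
           for (int i = 0; i < tasks; i++) {
               pool.execute(() -> {
                   try {
                       // Stand-in for one ledger check; parks for about 1 ms.
                       limiter.run(() -> LockSupport.parkNanos(1_000_000L));
                   } catch (InterruptedException e) {
                       Thread.currentThread().interrupt();
                   }
               });
           }
           pool.shutdown();
           try {
               pool.awaitTermination(30, TimeUnit.SECONDS);
           } catch (InterruptedException e) {
               Thread.currentThread().interrupt();
               throw new IllegalStateException("interrupted while waiting for checks", e);
           }
           return limiter.maxObserved();
       }

       public static void main(String[] args) {
           System.out.println("peak in-flight checks: " + simulate(50, 8, 3));
       }
   }
   ```

   With a limit of 3, at most three checks hold buffers at the same time regardless of how many worker threads or pending checks exist, which is the property a throttle for autorecovery would need.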
   





[GitHub] [bookkeeper] hangc0276 commented on issue #2729: 【BUG】BookKeeper Netty chennel OOM

Posted by GitBox <gi...@apache.org>.
hangc0276 commented on issue #2729:
URL: https://github.com/apache/bookkeeper/issues/2729#issuecomment-856733212


   This OOM doesn't cause the bookie to break down, and reading/writing entries works normally. When I restart the bookie that hit the OOM, the auditor is elected on another bookie instance, and that instance also hits the OOM when the auditor thread starts to check all ledgers. I have configured the regionAware policy for BookKeeper.





[GitHub] [bookkeeper] lhotari commented on issue #2729: 【BUG】BookKeeper Netty chennel OOM

Posted by GitBox <gi...@apache.org>.
lhotari commented on issue #2729:
URL: https://github.com/apache/bookkeeper/issues/2729#issuecomment-856687753


   @hangc0276 Do you have a chance to run with `-Dio.netty.leakDetectionLevel=advanced -Dio.netty.leakDetection.targetRecords=30` in the JVM options for the BookKeeper server? That would enable the Netty leak detector and print out log entries that would help locate the possible leak.
   
   It's also possible that there isn't a memory leak and it's simply caused by high memory requirements when there's high load. When using Pulsar, the default configuration for BookKeeper doesn't enable the specific backpressure features that are available in the BookKeeper server (#1409) and client (#1086). There's a separate enhancement request in Pulsar to make use of these features. It is filed as https://github.com/apache/pulsar/issues/10439 .
   

