Posted to issues@bookkeeper.apache.org by GitBox <gi...@apache.org> on 2022/07/16 09:28:02 UTC

[GitHub] [bookkeeper] gaozhangmin opened a new issue, #3408: AutoRecovery caused DirectMemory OOM error.

gaozhangmin opened a new issue, #3408:
URL: https://github.com/apache/bookkeeper/issues/3408

   Our production environment failed last week: every bookie was killed by a direct memory OOM. It started after one bookie's disk broke and we tried to take that bookie offline. After auditBookie was triggered, direct memory usage on all bookies kept increasing, which looks like a memory leak.
   
   The ReplicationWorker log follows; x.x.x.x is the IP of the lost bookie:
   ```
   2022-07-14 21:50:41.721 [ReplicationWorker] ERROR org.apache.bookkeeper.proto.PerChannelBookieClient - Could not connect to bookie: null/x.x.x.x:3181, current state CONNECTING : 
   2022-07-14 21:50:41.723 [BookKeeperClientWorker-OrderedExecutor-8-0] INFO  org.apache.bookkeeper.client.PendingReadOp - Error: Bookie handle is not available while reading L61080496 E1502 from bookie: x.x.x.x:3181
   2022-07-14 21:50:41.724 [ReplicationWorker] ERROR org.apache.bookkeeper.proto.PerChannelBookieClient - Cannot connect to x.x.x.x:3181 as endpoint resolution failed (probably bookie is down) err org.apache.bookkeeper.proto.BookieAddressResolver$BookieIdNotResolvedException: Cannot resolve bookieId x.x.x.x:3181, bookie does not exist or it is not running
   2022-07-14 21:50:41.724 [BookKeeperClientWorker-OrderedExecutor-8-0] INFO  org.apache.bookkeeper.client.PendingReadOp - Error: Bookie handle is not available while reading L61080496 E1506 from bookie: x.x.x.x:3181
   2022-07-14 21:50:41.724 [ReplicationWorker] ERROR org.apache.bookkeeper.proto.PerChannelBookieClient - Cannot connect to x.x.x.x:3181 as endpoint resolution failed (probably bookie is down) err org.apache.bookkeeper.proto.BookieAddressResolver$BookieIdNotResolvedException: Cannot resolve bookieId x.x.x.x:3181, bookie does not exist or it is not running
   2022-07-14 21:50:41.724 [BookKeeperClientWorker-OrderedExecutor-8-0] INFO  org.apache.bookkeeper.client.PendingReadOp - Error: Bookie handle is not available while reading L61080496 E1510 from bookie: x.x.x.x:3181
   2022-07-14 21:50:41.725 [ReplicationWorker] ERROR org.apache.bookkeeper.proto.PerChannelBookieClient - Cannot connect to x.x.x.x:3181 as endpoint resolution failed (probably bookie is down) err org.apache.bookkeeper.proto.BookieAddressResolver$BookieIdNotResolvedException: Cannot resolve bookieId x.x.x.x:3181, bookie does not exist or it is not running
   2022-07-14 21:50:41.725 [BookKeeperClientWorker-OrderedExecutor-8-0] INFO  org.apache.bookkeeper.client.PendingReadOp - Error: Bookie handle is not available while reading L61080496 E1514 from bookie: x.x.x.x:3181
   2022-07-14 21:50:41.725 [ReplicationWorker] ERROR org.apache.bookkeeper.proto.PerChannelBookieClient - Cannot connect to x.x.x.x:3181 as endpoint resolution failed (probably bookie is down) err org.apache.bookkeeper.proto.BookieAddressResolver$BookieIdNotResolvedException: Cannot resolve bookieId x.x.x.x:3181, bookie does not exist or it is not running
   2022-07-14 21:50:41.725 [BookKeeperClientWorker-OrderedExecutor-8-0] INFO  org.apache.bookkeeper.client.PendingReadOp - Error: Bookie handle is not available while reading L61080496 E1518 from bookie: x.x.x.x:3181
   2022-07-14 21:50:41.725 [ReplicationWorker] ERROR org.apache.bookkeeper.proto.PerChannelBookieClient - Cannot connect to x.x.x.x:3181 as endpoint resolution failed (probably bookie is down) err org.apache.bookkeeper.proto.BookieAddressResolver$BookieIdNotResolvedException: Cannot resolve bookieId x.x.x.x:3181, bookie does not exist or it is not running
   
   2022-07-14 21:50:42.403 [BookKeeperClientWorker-OrderedExecutor-8-0] INFO  org.apache.bookkeeper.client.PendingReadOp - Error: Bookie handle is not available while reading L61080496 E1974 from bookie: x.x.x.x:3181
   2022-07-14 21:50:42.440 [BookKeeperClientWorker-OrderedExecutor-8-0] INFO  org.apache.bookkeeper.client.PendingReadOp - Error: Bookie handle is not available while reading L61080496 E1978 from bookie: x.x.x.x:3181
   2022-07-14 21:50:42.525 [BookKeeperClientWorker-OrderedExecutor-8-0] INFO  org.apache.bookkeeper.client.PendingReadOp - Error: Bookie handle is not available while reading L61080496 E1982 from bookie: x.x.x.x:3181
   2022-07-14 21:50:42.593 [BookKeeperClientWorker-OrderedExecutor-8-0] INFO  org.apache.bookkeeper.client.PendingReadOp - Error: Bookie handle is not available while reading L61080496 E1986 from bookie: x.x.x.x:3181
   2022-07-14 21:50:42.665 [BookKeeperClientWorker-OrderedExecutor-8-0] INFO  org.apache.bookkeeper.client.PendingReadOp - Error: Bookie handle is not available while reading L61080496 E1990 from bookie: x.x.x.x:3181
   2022-07-14 21:50:42.706 [BookKeeperClientWorker-OrderedExecutor-8-0] INFO  org.apache.bookkeeper.client.PendingReadOp - Error: Bookie handle is not available while reading L61080496 E1994 from bookie: x.x.x.x:3181
   2022-07-14 21:50:42.776 [BookKeeperClientWorker-OrderedExecutor-8-0] INFO  org.apache.bookkeeper.client.PendingReadOp - Error: Bookie handle is not available while reading L61080496 E1998 from bookie: x.x.x.x:3181
   2022-07-14 21:50:44.271 [BookKeeperClientWorker-OrderedExecutor-8-0] ERROR org.apache.bookkeeper.proto.PerChannelBookieClient - Cannot connect to x.x.x.x:3181 as endpoint resolution failed (probably bookie is down) err org.apache.bookkeeper.proto.BookieAddressResolver$BookieIdNotResolvedException: Cannot resolve bookieId x.x.x.x:3181, bookie does not exist or it is not running
   2022-07-14 21:50:44.271 [BookKeeperClientWorker-OrderedExecutor-8-0] ERROR org.apache.bookkeeper.proto.PerChannelBookieClient - Cannot connect to x.x.x.x:3181 as endpoint resolution failed (probably bookie is down) err org.apache.bookkeeper.proto.BookieAddressResolver$BookieIdNotResolvedException: Cannot resolve bookieId x.x.x.x:3181, bookie does not exist or it is not running
   2022-07-14 21:50:44.271 [BookKeeperClientWorker-OrderedExecutor-8-0] ERROR org.apache.bookkeeper.proto.PerChannelBookieClient - Cannot connect to x.x.x.x:3181 as endpoint resolution failed (probably bookie is down) err org.apache.bookkeeper.proto.BookieAddressResolver$BookieIdNotResolvedException: Cannot resolve bookieId x.x.x.x:3181, bookie does not exist or it is not running
   2022-07-14 21:50:44.271 [BookKeeperClientWorker-OrderedExecutor-8-0] ERROR org.apache.bookkeeper.proto.PerChannelBookieClient - Cannot connect to x.x.x.x:3181 as endpoint resolution failed (probably bookie is down) err org.apache.bookkeeper.proto.BookieAddressResolver$BookieIdNotResolvedException: Cannot resolve bookieId x.x.x.x:3181, bookie does not exist or it is not running
   2022-07-14 21:50:44.271 [BookKeeperClientWorker-OrderedExecutor-8-0] ERROR org.apache.bookkeeper.proto.PerChannelBookieClient - Cannot connect to x.x.x.x:3181 as endpoint resolution failed (probably bookie is down) err org.apache.bookkeeper.proto.BookieAddressResolver$BookieIdNotResolvedException: Cannot resolve bookieId x.x.x.x:3181, bookie does not exist or it is not running
   2022-07-14 21:50:44.271 [BookKeeperClientWorker-OrderedExecutor-8-0] ERROR org.apache.bookkeeper.proto.PerChannelBookieClient - Cannot connect to x.x.x.x:3181 as endpoint resolution failed (probably bookie is down) err org.apache.bookkeeper.proto.BookieAddressResolver$BookieIdNotResolvedException: Cannot resolve bookieId x.x.x.x:3181, bookie does not exist or it is not running
   
   2022-07-14 22:10:19.830 [BookKeeperClientWorker-OrderedExecutor-41-0] INFO  org.apache.bookkeeper.client.PendingReadOp - Error: Bookie operation timeout while reading L60558419 E359 from bookie: 10.71.168.13:3181
   2022-07-14 22:10:19.830 [BookKeeperClientWorker-OrderedExecutor-41-0] ERROR org.apache.bookkeeper.client.LedgerFragmentReplicator - BK error reading ledger entry: 434
   2022-07-14 22:10:19.831 [BookKeeperClientWorker-OrderedExecutor-41-0] ERROR org.apache.bookkeeper.proto.BookkeeperInternalCallbacks - Error in multi callback : -23
   
   
   is (-1, rc = null)
   2022-07-14 22:10:19.830 [BookKeeperClientWorker-OrderedExecutor-41-0] INFO  org.apache.bookkeeper.client.PendingReadOp - Error: Bookie operation timeout while reading L60558419 E378 from bookie: 1.1.1.1:3181
   2022-07-14 22:10:19.830 [BookKeeperClientWorker-OrderedExecutor-41-0] ERROR org.apache.bookkeeper.client.PendingReadOp - Read of ledger entry failed: L60558419 E378-E378, Sent to [x.x.x.x:3181, 1.1.1.1:3181], Heard from [] : bitset = {}, Error = 'Bookie operation timeout'. First unread entry is (-1, rc = null)
   
   
   ```
   
   Brokers also kept quarantining bookies, and eventually all bookies crashed.
   
   
   ![image](https://user-images.githubusercontent.com/9278488/179348752-21cbf405-1f80-4ecc-917e-8524f2742bb9.png)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@bookkeeper.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [bookkeeper] StevenLuMT commented on issue #3408: AutoRecovery caused DirectMemory OOM error.

Posted by GitBox <gi...@apache.org>.
StevenLuMT commented on issue #3408:
URL: https://github.com/apache/bookkeeper/issues/3408#issuecomment-1201450172

   > @horizonzy I also hit OutOfDirectMemoryError during a performance test. I send very large writes, and the bookie runs out of memory and restarts all the time. But in my case, replica recovery is disabled. If this is caused by allocation outpacing release, why is the direct memory still not released after a long time, even after I stopped the producer and the entry rate dropped to 0? Here is a snapshot of direct memory usage. <img alt="Screen_Shot_2022-07-31_at_4_30_24_PM" width="469" src="https://user-images.githubusercontent.com/20301740/182018073-8c4243ad-c60c-43fc-83fc-011b228a4006.png">
   
   Your scenario is different; I will discuss the details with you offline, @yapxue.




[GitHub] [bookkeeper] jimmycxm commented on issue #3408: AutoRecovery caused DirectMemory OOM error.

Posted by GitBox <gi...@apache.org>.
jimmycxm commented on issue #3408:
URL: https://github.com/apache/bookkeeper/issues/3408#issuecomment-1252054489

   > @horizonzy I also hit OutOfDirectMemoryError during a performance test. I send very large writes, and the bookie runs out of memory and restarts all the time. But in my case, replica recovery is disabled. If this is caused by allocation outpacing release, why is the direct memory still not released after a long time, even after I stopped the producer and the entry rate dropped to 0? Here is a snapshot of direct memory usage. <img alt="Screen_Shot_2022-07-31_at_4_30_24_PM" width="469" src="https://user-images.githubusercontent.com/20301740/182018073-8c4243ad-c60c-43fc-83fc-011b228a4006.png">
   
   I hit the same issue. Has the problem been resolved?




[GitHub] [bookkeeper] gaozhangmin commented on issue #3408: AutoRecovery caused DirectMemory OOM error.

Posted by GitBox <gi...@apache.org>.
gaozhangmin commented on issue #3408:
URL: https://github.com/apache/bookkeeper/issues/3408#issuecomment-1188520204

   @dlg99 It is set to 500. Is the problem caused by this setting?




[GitHub] [bookkeeper] horizonzy commented on issue #3408: AutoRecovery caused DirectMemory OOM error.

Posted by GitBox <gi...@apache.org>.
horizonzy commented on issue #3408:
URL: https://github.com/apache/bookkeeper/issues/3408#issuecomment-1200383527

   `ByteBuf.release` may not free the `PoolChunk`. If the `ByteBuf` is cached by the `PoolThreadCache`, the direct memory cannot be released.
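
To illustrate the point, here is a toy model of a pool with a cache in front of it. It is not Netty's implementation: `ToyPool` is a hypothetical class that uses heap arrays, and it only shows why releasing a buffer does not have to shrink the memory the process has reserved.

```java
import java.util.ArrayDeque;
import java.util.Deque;

final class ToyPool {
    // Stands in for a per-thread cache of released buffers (hypothetical, simplified).
    private final Deque<byte[]> cache = new ArrayDeque<>();
    // Memory taken from the system so far; this toy never gives it back.
    private long reservedBytes;

    /** Reuse a cached buffer of the right size if there is one; otherwise reserve new memory. */
    byte[] allocate(int size) {
        byte[] cached = cache.peekFirst();
        if (cached != null && cached.length == size) {
            return cache.pollFirst();
        }
        reservedBytes += size;
        return new byte[size];
    }

    /** Releasing parks the buffer in the cache; nothing is returned to the system. */
    void release(byte[] buf) {
        cache.addFirst(buf);
    }

    long reservedBytes() {
        return reservedBytes;
    }
}
```

In this model the reserved total stays flat after every release, which would match a usage graph that does not drop once traffic stops.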




[GitHub] [bookkeeper] StevenLuMT commented on issue #3408: AutoRecovery caused DirectMemory OOM error.

Posted by GitBox <gi...@apache.org>.
StevenLuMT commented on issue #3408:
URL: https://github.com/apache/bookkeeper/issues/3408#issuecomment-1189716555

   Can you provide your bookie configuration? I can help you analyze it, @gaozhangmin.




[GitHub] [bookkeeper] horizonzy commented on issue #3408: AutoRecovery caused DirectMemory OOM error.

Posted by GitBox <gi...@apache.org>.
horizonzy commented on issue #3408:
URL: https://github.com/apache/bookkeeper/issues/3408#issuecomment-1196586078

   After a long investigation, we found that the problem is on the bookie side: the `ReplicationWorker`s send too many requests.
   
   The bookie that was shut down held many ledgers, so when it went down the `Auditor` marked many ledgers as under-replicated.
   Many `ReplicationWorker`s then replicate those ledgers. `rereplicationEntryBatchSize` is configured as 500, so every `ReplicationWorker` sends 500 read requests to the bookie servers at a time. The bookie servers receive a large number of requests and allocate direct memory for each of them.
   
   Releases cannot keep up with allocations, so the number of `PoolChunk`s keeps growing until usage reaches maxDirectMemory.
   
   @gaozhangmin supplied two heap dump files: `less` was dumped when replication started, and `more` was dumped after replication had been running for a while.
   
   [less.hprof.zip](https://github.com/apache/bookkeeper/files/9197159/less.hprof.zip)
   [more.hprof.zip](https://github.com/apache/bookkeeper/files/9197161/more.hprof.zip)
   
   I found 244 `PoolChunk` instances in `more` and 120 in `less`. Each `PoolChunk` holds 4M of direct memory in the bookie, so `more` uses 124 * 4M more direct memory than `less`.
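
The chunk arithmetic can be checked with a short sketch. It assumes "4M" means 4 MiB per `PoolChunk`; the class and method names are hypothetical and only illustrate the calculation.

```java
final class PoolChunkGrowth {
    /** Direct memory (bytes) added between two heap dumps, given the PoolChunk count in each. */
    static long growthBytes(int chunksBefore, int chunksAfter, long chunkSizeBytes) {
        return (long) (chunksAfter - chunksBefore) * chunkSizeBytes;
    }

    public static void main(String[] args) {
        long chunkSize = 4L * 1024 * 1024;              // assumption: "4M" is 4 MiB per PoolChunk
        long growth = growthBytes(120, 244, chunkSize); // counts from the `less` and `more` dumps
        System.out.println((growth / (1024 * 1024)) + " MiB"); // prints "496 MiB"
    }
}
```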
   
   We found another issue as well: if the user configures `DbLedgerStorage`, it takes 1/2 of the direct memory at startup for the readCache and writeCache. That memory is unpooled but still occupies direct memory.
   
   That leaves the direct memory pool only 1/2 of the direct memory to allocate from, which makes an OOM more likely.
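
To make the halving concrete, here is a rough sketch of how much room the pooled allocator has left. The 1/2 share for the caches and the 4M chunk size come from this comment; the 4 GiB limit is a made-up example value, and `PoolBudget` is a hypothetical helper, not BookKeeper code.

```java
final class PoolBudget {
    /** Bytes left for the pooled allocator after the read and write caches take their share. */
    static long pooledBudgetBytes(long maxDirectMemoryBytes) {
        long cacheBytes = maxDirectMemoryBytes / 2; // readCache + writeCache, per the comment
        return maxDirectMemoryBytes - cacheBytes;
    }

    /** Number of whole chunks that fit in the remaining budget. */
    static long maxChunks(long maxDirectMemoryBytes, long chunkSizeBytes) {
        return pooledBudgetBytes(maxDirectMemoryBytes) / chunkSizeBytes;
    }

    public static void main(String[] args) {
        long maxDirect = 4L << 30; // hypothetical: 4 GiB maxDirectMemory
        long chunkSize = 4L << 20; // assumption: 4 MiB per PoolChunk
        System.out.println(maxChunks(maxDirect, chunkSize) + " chunks"); // prints "512 chunks"
    }
}
```

Under this assumed limit, the 244 chunks counted in the `more` dump would already take up almost half of everything the pool could ever hold.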
   
    




[GitHub] [bookkeeper] horizonzy commented on issue #3408: AutoRecovery caused DirectMemory OOM error.

Posted by GitBox <gi...@apache.org>.
horizonzy commented on issue #3408:
URL: https://github.com/apache/bookkeeper/issues/3408#issuecomment-1198857079

   So this is expected behavior. You can reduce the `rereplicationEntryBatchSize` value to lower the rate of requests sent to the bookie servers.
   If there are no further questions, this issue can be closed.
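
As a rough illustration of why a smaller batch helps, the sketch below estimates the worst-case number of read requests in flight and the buffer memory they would need. It assumes every worker has a full batch outstanding at once; the worker count and entry size are hypothetical values, and `ReplicationLoad` is not a BookKeeper class.

```java
final class ReplicationLoad {
    /** Worst case: every worker has a full batch of reads outstanding. */
    static long inFlightReads(int workers, int batchSize) {
        return (long) workers * batchSize;
    }

    /** Buffer bytes needed if each outstanding read holds one entry-sized buffer. */
    static long inFlightBytes(int workers, int batchSize, long entrySizeBytes) {
        return inFlightReads(workers, batchSize) * entrySizeBytes;
    }

    public static void main(String[] args) {
        int workers = 10;          // hypothetical number of ReplicationWorkers
        long entrySize = 1L << 20; // hypothetical 1 MiB entries
        for (int batch : new int[] {500, 10}) {
            long mib = inFlightBytes(workers, batch, entrySize) >> 20;
            System.out.println("batch=" + batch + " -> " + mib + " MiB in flight");
        }
    }
}
```

With these example numbers, going from 500 to 10 cuts the worst case from 5000 MiB to 100 MiB of buffers in flight.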




[GitHub] [bookkeeper] dlg99 commented on issue #3408: AutoRecovery caused DirectMemory OOM error.

Posted by GitBox <gi...@apache.org>.
dlg99 commented on issue #3408:
URL: https://github.com/apache/bookkeeper/issues/3408#issuecomment-1188095338

   Do you have rereplicationEntryBatchSize set to a large value? Try setting it back to the default (10).
   




[GitHub] [bookkeeper] yapxue commented on issue #3408: AutoRecovery caused DirectMemory OOM error.

Posted by GitBox <gi...@apache.org>.
yapxue commented on issue #3408:
URL: https://github.com/apache/bookkeeper/issues/3408#issuecomment-1200379520

   @horizonzy I also hit OutOfDirectMemoryError during a performance test. I send very large writes, and the bookie runs out of memory and restarts all the time. But in my case, replica recovery is disabled.
   If this were caused by allocation outpacing release, the memory should eventually be freed. But I can see that the direct memory is still not released after a long time, even after I stopped the producer and the entry rate dropped to 0.
   <img width="469" alt="Screen_Shot_2022-07-31_at_4_30_24_PM" src="https://user-images.githubusercontent.com/20301740/182018073-8c4243ad-c60c-43fc-83fc-011b228a4006.png">
   
   




[GitHub] [bookkeeper] horizonzy commented on issue #3408: AutoRecovery caused DirectMemory OOM error.

Posted by GitBox <gi...@apache.org>.
horizonzy commented on issue #3408:
URL: https://github.com/apache/bookkeeper/issues/3408#issuecomment-1197638689

   We also found that the `io.netty.buffer.PooledByteBufAllocator#directArenas` array has 80 entries, which also makes an OOM more likely.
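
For context, the arena count is normally derived from the core count and the direct memory limit. The sketch below mirrors what is believed to be Netty's default rule (two arenas per core, capped by maxDirectMemory / chunkSize / 2 / 3); treat the formula as an assumption to verify against the Netty version in use. The 40-core machine and 8 GiB limit are made-up values chosen to reproduce the 80 arenas reported here, and `ArenaEstimate` is a hypothetical helper.

```java
final class ArenaEstimate {
    /** Assumed default rule: min(2 * cores, maxDirectMemory / chunkSize / 2 / 3). */
    static int defaultDirectArenas(int cores, long maxDirectMemoryBytes, long chunkSizeBytes) {
        long byCores = 2L * cores;
        long byMemory = maxDirectMemoryBytes / chunkSizeBytes / 2 / 3;
        return (int) Math.max(0, Math.min(byCores, byMemory));
    }

    /** Direct memory held once every arena has allocated a single chunk. */
    static long floorBytes(int arenas, long chunkSizeBytes) {
        return arenas * chunkSizeBytes;
    }

    public static void main(String[] args) {
        long chunkSize = 4L << 20;                                    // assumption: 4 MiB per PoolChunk
        int arenas = defaultDirectArenas(40, 8L << 30, chunkSize);    // hypothetical 40 cores, 8 GiB
        System.out.println(arenas + " arenas");                       // prints "80 arenas"
        System.out.println((floorBytes(arenas, chunkSize) >> 20) + " MiB"); // prints "320 MiB"
    }
}
```

Because each arena allocates its own chunks, 80 arenas can hold 80 * 4M = 320M as soon as every arena has a single chunk, before any real load builds up.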

