Posted to commits@bookkeeper.apache.org by "dlg99 (via GitHub)" <gi...@apache.org> on 2023/01/31 01:09:49 UTC

[GitHub] [bookkeeper] dlg99 commented on issue #3734: RocksDB: segfault in org.rocksdb.WriteBatch::delete called from org.apache.bookkeeper.bookie.storage.ldb.EntryLocationIndex#removeOffsetFromDeletedLedgers

dlg99 commented on issue #3734:
URL: https://github.com/apache/bookkeeper/issues/3734#issuecomment-1409599047

   @hangc0276 Thank you for looking at this problem!
   
   > I suggest reverting the PR https://github.com/apache/bookkeeper/pull/3653 on branch-4.14 and branch-4.15. For the master branch, we keep the PR and try to upgrade the RocksDB version to 7.8+ to see if the segfault issue is resolved.
   
   That pushes confirmation of the fix far into the future; Pulsar 2.10/2.11 use BK 4.15, IIRC.
   
   I think we should still try to upgrade RocksDB. I'd be OK with backporting the upgraded RocksDB to 4.14/4.15 if we can guarantee a safe downgrade.
   
   We've currently downgraded BK in prod, so the problem is no longer happening. Unfortunately, that means I don't have any logs or dumps, and it only happened once.
   
   I've spent some time experimenting with the code and injecting errors.
   
   With this:
   ```diff
   diff --git a/bookkeeper-server/src/main/java/org/apache/bookkeeper/bookie/storage/ldb/EntryLocationIndex.java b/bookkeeper-server/src/main/java/org/apache/bookkeeper/bookie/storage/ldb/EntryLocationIndex.java
   index 3f6d1ae55b..03acfecc87 100644
   --- a/bookkeeper-server/src/main/java/org/apache/bookkeeper/bookie/storage/ldb/EntryLocationIndex.java
   +++ b/bookkeeper-server/src/main/java/org/apache/bookkeeper/bookie/storage/ldb/EntryLocationIndex.java
   @@ -26,6 +26,8 @@ import java.io.IOException;
    import java.util.Map.Entry;
    import java.util.Set;
    import java.util.concurrent.TimeUnit;
   +
   +import lombok.SneakyThrows;
    import org.apache.bookkeeper.bookie.Bookie;
    import org.apache.bookkeeper.bookie.EntryLocation;
    import org.apache.bookkeeper.bookie.storage.ldb.KeyValueStorage.Batch;
   @@ -189,6 +191,7 @@ public class EntryLocationIndex implements Closeable {
            deletedLedgers.add(ledgerId);
        }
    
   +    @SneakyThrows
        public void removeOffsetFromDeletedLedgers() throws IOException {
            LongPairWrapper firstKeyWrapper = LongPairWrapper.get(-1, -1);
            LongPairWrapper lastKeyWrapper = LongPairWrapper.get(-1, -1);
   @@ -202,6 +205,7 @@ public class EntryLocationIndex implements Closeable {
            log.info("Deleting indexes for ledgers: {}", ledgersToDelete);
            long startTime = System.nanoTime();
    
   +        locationsDb.close();
            try (Batch batch = locationsDb.newBatch()) {
                for (long ledgerId : ledgersToDelete) {
                    if (log.isDebugEnabled()) {
   @@ -213,7 +217,6 @@ public class EntryLocationIndex implements Closeable {
    
                    batch.deleteRange(firstKeyWrapper.array, lastKeyWrapper.array);
                }
   -
                batch.flush();
                for (long ledgerId : ledgersToDelete) {
                    deletedLedgers.remove(ledgerId);
   ```
   
   I got a RocksDB segfault:
   ```
   ---------------  T H R E A D  ---------------
   
   Current thread (0x00007f9dc800d000):  JavaThread "main" [_thread_in_native, id=6147, stack(0x0000700003b4f000,0x0000700003c4f000)]
   
   Stack: [0x0000700003b4f000,0x0000700003c4f000],  sp=0x0000700003c4d2c0,  free space=1016k
   Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
   C  [librocksdbjni13563433824350328902.jnilib+0x22e1c]  Java_org_rocksdb_RocksDB_write0+0x1c
   j  org.rocksdb.RocksDB.write0(JJJ)V+0
   #
   ```
   with [this dump](https://gist.github.com/dlg99/0459323e8a6fa0d47ac2215349e866b4)
   
   This does not look exactly like the original case; it is more similar to https://github.com/apache/bookkeeper/pull/3043. The question remains, though: is it possible that some other RocksDB calls should not run concurrently, such as an index update on a deleted range?
   I've tried injecting a few other errors/cases, but so far without additional success.
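
For illustration only: the injected `locationsDb.close()` in the diff above makes the batch write hit an already-closed database. A rough sketch of one way to turn that kind of use-after-close into a Java exception instead of a native crash is a read/write lock around the handle. `GuardedStore` and its methods are hypothetical names, not BookKeeper or RocksDB APIs, and the in-memory list merely stands in for the native database.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/**
 * Hypothetical guard around a native-backed store (not BookKeeper code).
 * Operations share the read lock; close() takes the write lock, so no
 * operation can be inside the native layer while the handle is released.
 */
class GuardedStore implements AutoCloseable {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    // Stand-in for the native database; records the operations applied.
    private final List<String> appliedOps = new ArrayList<>();
    private boolean closed = false; // read and written only under the lock

    /** Applies a delete-range, or fails with IOException if the store is closed. */
    void deleteRange(long firstKey, long lastKey) throws IOException {
        lock.readLock().lock();
        try {
            if (closed) {
                throw new IOException("store is closed");
            }
            // Several readers may hold the read lock at once, so the
            // stand-in list needs its own synchronization.
            synchronized (appliedOps) {
                appliedOps.add("deleteRange[" + firstKey + "," + lastKey + "]");
            }
        } finally {
            lock.readLock().unlock();
        }
    }

    /** Number of operations that reached the stand-in database. */
    int appliedCount() {
        synchronized (appliedOps) {
            return appliedOps.size();
        }
    }

    @Override
    public void close() {
        // Waits for in-flight operations to finish before marking closed.
        lock.writeLock().lock();
        try {
            closed = true; // a real wrapper would release the native handle here
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```

With a guard like this, calling `deleteRange` after `close()` fails with an `IOException` rather than reaching native code. It only addresses use-after-close; it would not by itself explain or prevent the original segfault.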

