You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@bookkeeper.apache.org by GitBox <gi...@apache.org> on 2022/12/07 14:51:06 UTC

[GitHub] [bookkeeper] RaulGracia commented on issue #3466: Missing ledgers when inducing network packet loss in Bookkeeper 4.15

RaulGracia commented on issue #3466:
URL: https://github.com/apache/bookkeeper/issues/3466#issuecomment-1341079895

   Hi all, I think we could have another reproduction of this problem, this time using Bookkeeper 4.14.2 client/server (therefore, the issue reported here may be present in other versions of Bookkeeper, not only in Bookkeeper 4.15).
   
   This is the information I have been able to recollect so far (the missing ledger this time is `3538`):
   
   1. Ledger is created correctly: 
   ```
   2022-12-05 15:21:11,541 8491908 [core-23] INFO  i.p.s.s.i.bookkeeper.BookKeeperLog - Log[29]: Created Ledger 3538.
   ``` 
   
   2. Pravega can write and process properly up to entry `52`:
   ```
   2022-12-05 15:26:53,197 8833564 [BookKeeperClientWorker-OrderedExecutor-2-0] DEBUG o.a.bookkeeper.client.PendingAddOp - Submit callback (lid:3538, eid: 52). rc:0
   2022-12-05 15:26:53,197 8833564 [core-17] DEBUG i.p.s.s.logs.OperationProcessor - OperationProcessor[29]: CommitSuccess (TxnId = 51, LastFullySerializedSN = 505834, LastStartedSN = 505834, Address = Sequence = 17179869235, LedgerId = 3538, EntryId = 51, Length = 35244).
   2022-12-05 15:26:53,197 8833564 [core-28] DEBUG i.p.s.s.logs.OperationProcessor - OperationProcessor[29]: CommitSuccess (TxnId = 52, LastFullySerializedSN = 506120, LastStartedSN = 506120, Address = Sequence = 17179869236, LedgerId = 3538, EntryId = 52, Length = 783197).
   ```
   
   3. Post that, we see that, probably due to the network fault-injection test induced, there is a ensamble change that impacts entry `53` onwards for that ledger:
   ```
   2022-12-05 15:27:07,801 8848168 [BookKeeperClientWorker-OrderedExecutor-2-0] DEBUG o.a.bookkeeper.client.LedgerHandle - [EnsembleChange(ledger:3538, change-id:0000000001)][attempt:1] changing ensemble from: [bookkeeper-bookie-2-13057, bookkeeper-bookie-1-21445] to: [bookkeeper-bookie-2-13057, bookkeeper-bookie-0-27781] starting at entry: 53
   ```
   
   4. We keep getting some errors and acknowledgements writing entries from `53` onwards, until we get this error:
   ```
   2022-12-05 15:27:15,431 8855798 [ZKC-connect-executor-0-EventThread] ERROR o.a.b.client.MetadataUpdateLoop - UpdateLoop(ledgerId=3538,loopId=1ec0500f) Error writing metadata to store
     Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /pravega/pravega/bookkeeper/ledgers/00/0000/L3538
   2022-12-05 15:27:15,432 8855799 [BookKeeperClientWorker-OrderedExecutor-2-0] WARN  o.a.bookkeeper.client.LedgerHandle - [EnsembleChange(ledger:3538, change-id:0000000001)][attempt:1] Exception changing ensemble
     Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /pravega/pravega/bookkeeper/ledgers/00/0000/L3538
   2022-12-05 15:27:15,433 8855800 [BookKeeperClientWorker-OrderedExecutor-2-0] ERROR o.a.bookkeeper.client.LedgerHandle - Closing ledger 3538 due to ZKException: Error while using ZooKeeper
   ```
   
   5. After that, the zNode for that very same ledger `3538` seems to be deleted:
   ```
   2022-12-05 15:27:19,429 8859796 [ZKC-connect-executor-0-SendThread(zookeeper-client:2181)] DEBUG org.apache.zookeeper.ClientCnxn - Got WatchedEvent state:SyncConnected type:NodeDeleted path:/pravega/pravega/bookkeeper/ledgers/00/0000/L3538 for session id 0x3004cfb266c0003
   2022-12-05 15:27:19,429 8859796 [ZKC-connect-executor-0-EventThread] DEBUG o.a.b.meta.AbstractZkLedgerManager - Received watched event WatchedEvent state:SyncConnected type:NodeDeleted path:/pravega/pravega/bookkeeper/ledgers/00/0000/L3538 from zookeeper based ledger manager.
   2022-12-05 15:27:19,429 8859796 [ZKC-connect-executor-0-EventThread] DEBUG o.a.b.meta.AbstractZkLedgerManager - No ledger metadata listeners to remove from ledger 3538 after it's deleted.
   ```
   
   6. And it seems deleted because, after that, we just see `NoNodeException` when we try to access the Zookeeper metadata of this ledger:
   ```
   2022-12-05 15:27:23,218 8863585 [ZKC-connect-executor-0-EventThread] DEBUG o.a.b.meta.AbstractZkLedgerManager - No such ledger: 3538
   segmentstore_2022-12-05_15.640.log.gz:org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /pravega/pravega/bookkeeper/ledgers/00/0000/L3538
   2022-12-05 15:27:23,218 8863585 [ZKC-connect-executor-0-EventThread] ERROR o.a.b.client.MetadataUpdateLoop - UpdateLoop(ledgerId=3538,loopId=4e9e5246) Failed to read metadata from store
   2022-12-05 15:27:23,219 8863586 [core-19] ERROR i.p.s.s.i.bookkeeper.BookKeeperLog - Log[29]: Unable to close LedgerHandle for Ledger 3538.
      io.pravega.segmentstore.storage.DurableDataLogException: Unable to close ledger 3538.
   ```
   
   Once the issue was reported, I tried to dig a little more with the Pravega and Bookkeeper CLIs:
   
   - From Pravega side, if we inspect the `BookkeeperLog[29]`, we clearly see that ledger `3538` is not accessible:
   ```
   > bk details 29
   {
     "key": "log_summary",
     "value": {
       "logId": 29,
       "epoch": 9,
       "version": 511,
       "enabled": true,
       "ledgers": 4,
       "truncation": "Sequence = 12884902323, LedgerId = 3491, EntryId = 435"
     }
   }
   {
     "key": "ledger_details",
     "value": {
       "ledger": 3491,
       "seq": 3,
       "status": "NotEmpty",
       "lac": 971,
       "length": 0,
       "numBookies": 3,
       "numFragments": 2,
       "ensembleSize": 2,
       "writeQuorumSize": 2,
       "ackQuorumSize": 2,
       "ensembles": "0: [bookkeeper-bookie-0-27781,bookkeeper-bookie-1-21445],972: [bookkeeper-bookie-3-9007,bookkeeper-bookie-1-21445]"
     }
   }
   Exception executing BK details command: Unable to open-read ledger 3538.
   	Ledger 3538: Seq = 4, Status = NotEmpty. BK: Unable to open-read ledger 3538.
   {
     "key": "ledger_details",
     "value": {
       "ledger": 3720,
       "seq": 8,
       "status": "Empty",
       "lac": -1,
       "length": 0,
       "numBookies": 2,
       "numFragments": 1,
       "ensembleSize": 2,
       "writeQuorumSize": 2,
       "ackQuorumSize": 2,
       "ensembles": "0: [bookkeeper-bookie-2-13057,bookkeeper-bookie-3-9007]"
     }
   }
   {
     "key": "ledger_details",
     "value": {
       "ledger": 3738,
       "seq": 9,
       "status": "Unknown",
       "lac": -1,
       "length": 0,
       "numBookies": 2,
       "numFragments": 1,
       "ensembleSize": 2,
       "writeQuorumSize": 2,
       "ackQuorumSize": 2,
       "ensembles": "0: [bookkeeper-bookie-0-27781,bookkeeper-bookie-3-9007]"
     }
   }
   ```
   
   - Using the Bookkeeper CLI, if we list all the ledgers in the system, we can see ledger `3538`:
   ```
   bin/bookkeeper shell listledgers
   ...
   17:39:21,474 INFO  ledgerID: 3537
   17:39:21,474 INFO  ledgerID: 3538     <<<<<< HERE IT IS
   17:39:21,474 INFO  ledgerID: 3539
   ...
   ```
   - However, it is not possible to get its contents or metadata:
   
   ```
   ===== LEDGER: 3538 =====
   master key  : 850bf1071c5e3d8c24235676f8816ae0cbe2f14f
   size         : 0
   entries      : 0
   isFenced     : true
   ===== LEDGER: 3538 =====
   ```
   
   ```
   bin/bookkeeper shell readledger -ledgerid 3538
   
   17:37:18,747 INFO  Successfully connected to bookie: bookkeeper-bookie-1-21445 [id: 0x45c77aa8, L:/10.128.0.116:33870 - R:10.129.0.171/10.129.0.171:3181]
   17:37:18,748 INFO  connection [id: 0x45c77aa8, L:/10.128.0.116:33870 - R:10.129.0.171/10.129.0.171:3181] authenticated as BookKeeperPrincipal{ANONYMOUS}
   17:37:18,755 ERROR Read of ledger entry failed: L3538 E0-E0, Sent to [bookkeeper-bookie-2-13057, bookkeeper-bookie-1-21445], Heard from [] : bitset = {}, Error = 'No such ledger exists on Bookies'. First unread entry is (-1, rc = null)
   17:37:18,761 ERROR Error reading entry 0 from ledger 3538
   org.apache.bookkeeper.client.BKException$BKNoSuchLedgerExistsException: No such ledger exists on Bookies
   	at org.apache.bookkeeper.client.SyncCallbackUtils.finish(SyncCallbackUtils.java:83) ~[org.apache.bookkeeper-bookkeeper-server-4.14.3-build-223.jar:4.14.3-build-223]
   ```
   
   ```
   bin/bookkeeper shell ledgermetadata -ledgerid 3538
   
    17:42:00,323 WARN  An exception was thrown while closing send thread for session 0x2004d6096890377.
   org.apache.zookeeper.ClientCnxn$EndOfStreamException: Unable to read additional data from server sessionid 0x2004d6096890377, likely server has closed socket
   	at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:77) ~[org.apache.zookeeper-zookeeper-3.6.2.jar:3.6.2]
   	at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:350) ~[org.apache.zookeeper-zookeeper-3.6.2.jar:3.6.2]
   	at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1275) [org.apache.zookeeper-zookeeper-3.6.2.jar:3.6.2]
   Exception in thread "main" com.google.common.util.concurrent.UncheckedExecutionException: java.util.concurrent.ExecutionException: org.apache.bookkeeper.client.BKException$BKNoSuchLedgerExistsOnMetadataServerException: No such ledger exists on Metadata Server
   ```
   
   How inducing delays at the network level in the pods leads very rarely to this situation is something that I still don't know. I'm also attaching the relevant log lines from Pravega Segment Store for ledger `3538` ([ledger-3538.log](https://github.com/apache/bookkeeper/files/10177104/ledger-3538.log)).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@bookkeeper.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org