You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@bookkeeper.apache.org by "davtex (via GitHub)" <gi...@apache.org> on 2023/04/05 17:47:40 UTC

[GitHub] [bookkeeper] davtex opened a new issue, #3905: Autorecovery stuck when ensemble equals total bookies

davtex opened a new issue, #3905:
URL: https://github.com/apache/bookkeeper/issues/3905

   **BUG REPORT**
   
   To be completely honest I am not sure if this is a bug or I just seriously misunderstand/misconfigure something.
   
   - running Pulsar 2.9.3 in Openshift platform in AWS.
   - 3 bookies running
   - ledgers configured as 3:3:2
   
   We lose one of the bookies (bookie-2), all data is lost. We delete cookie, reconnect bookie (with same BookieID). Now we have bookie-0 and bookie-1 with ledger intact, bookie-2 without ledger, and ledger metadata in Zookeeper pointing to all 3 of them. Autorecovery runs and confirms above state. Now I would expect autorecovery to replicate missing data from bookie-0/bookie-1 to bookie-2. But what happens is that autorecovery removes all bookies from list of available bookies, including bookie-2, which does not hold the ledger, it just happens to still be listed in ledger's metadata. I do not understand why this is happening -> I would expect bookie-2 to be considered for replication after it replies with "NOLEDGER". Only way to fix this issue for me is to start 4th bookie. Is this how this should work? Do I need to keep ensemble lower than my total number of running bookies simply to have a spare for autorecovery?
   
   ***To Reproduce***
   
   Steps to reproduce the behavior:
   1. Run 3 bookies (0,1,2)
   2. Create test data
   ```
   shell simpletest -ensemble 3 -writeQuorum 3 -ackQuorum 2 -numEntries 100
   ```
   3. Shut down bookie-2 and delete all data from bookie-2
   4. Remove cookie of bookie-2
   5. Start bookie-2
   6. Enable autorecovery, watch it fail with "Not enough non-faulty bookies available"
   ```
   2023-04-05T16:54:54,050+0000 [BookKeeperClientWorker-OrderedExecutor-0-0] DEBUG org.apache.bookkeeper.proto.PerChannelBookieClient - Got Read response from bookie:prod01-neuron-pulsar-bookie-1.prod01-neuron-pulsar-bookie.prod01-neuron-pulsar.svc.cluster.local:3181 rc:EOK, ledger:74:entry:0:entryLength:136
   2023-04-05T16:54:54,051+0000 [BookKeeperClientWorker-OrderedExecutor-0-0] DEBUG org.apache.bookkeeper.proto.PerChannelBookieClient - Got Read response from bookie:prod01-neuron-pulsar-bookie-1.prod01-neuron-pulsar-bookie.prod01-neuron-pulsar.svc.cluster.local:3181 rc:EOK, ledger:74:entry:99:entryLength:136
   2023-04-05T16:54:54,052+0000 [BookKeeperClientWorker-OrderedExecutor-0-0] DEBUG org.apache.bookkeeper.proto.PerChannelBookieClient - Got Read response from bookie:prod01-neuron-pulsar-bookie-0.prod01-neuron-pulsar-bookie.prod01-neuron-pulsar.svc.cluster.local:3181 rc:EOK, ledger:74:entry:0:entryLength:136
   2023-04-05T16:54:54,053+0000 [BookKeeperClientWorker-OrderedExecutor-0-0] DEBUG org.apache.bookkeeper.proto.PerChannelBookieClient - Got Read response from bookie:prod01-neuron-pulsar-bookie-0.prod01-neuron-pulsar-bookie.prod01-neuron-pulsar.svc.cluster.local:3181 rc:EOK, ledger:74:entry:99:entryLength:136
   2023-04-05T16:54:54,055+0000 [BookKeeperClientWorker-OrderedExecutor-0-0] DEBUG org.apache.bookkeeper.proto.PerChannelBookieClient - Got Read response from bookie:prod01-neuron-pulsar-bookie-2.prod01-neuron-pulsar-bookie.prod01-neuron-pulsar.svc.cluster.local:3181 rc:ENOLEDGER, ledger:74:entry:0:entryLength:0
   2023-04-05T16:54:54,056+0000 [ReplicationWorker] DEBUG org.apache.bookkeeper.replication.ReplicationWorker - Founds fragments [Fragment(LedgerID: 74, FirstEntryID: 0[0], LastKnownEntryID: 99[99], Host: [prod01-neuron-pulsar-bookie-2.prod01-neuron-pulsar-bookie.prod01-neuron-pulsar.svc.cluster.local:3181], Closed: true)] for replication from ledger: 74
   2023-04-05T16:54:54,056+0000 [ReplicationWorker] DEBUG org.apache.bookkeeper.client.RackawareEnsemblePlacementPolicyImpl - Try to choose a new bookie to replace prod01-neuron-pulsar-bookie-2.prod01-neuron-pulsar-bookie.prod01-neuron-pulsar.svc.cluster.local:3181 from ensemble [<Bookie:prod01-neuron-pulsar-bookie-0.prod01-neuron-pulsar-bookie.prod01-neuron-pulsar.svc.cluster.local:3181>, <Bookie:prod01-neuron-pulsar-bookie-1.prod01-neuron-pulsar-bookie.prod01-neuron-pulsar.svc.cluster.local:3181>], excluding [<Bookie:prod01-neuron-pulsar-bookie-0.prod01-neuron-pulsar-bookie.prod01-neuron-pulsar.svc.cluster.local:3181>, <Bookie:prod01-neuron-pulsar-bookie-1.prod01-neuron-pulsar-bookie.prod01-neuron-pulsar.svc.cluster.local:3181>, <Bookie:prod01-neuron-pulsar-bookie-2.prod01-neuron-pulsar-bookie.prod01-neuron-pulsar.svc.cluster.local:3181>].
   2023-04-05T16:54:54,057+0000 [BookKeeperClientWorker-OrderedExecutor-0-0] DEBUG org.apache.bookkeeper.proto.PerChannelBookieClient - Got Read response from bookie:prod01-neuron-pulsar-bookie-2.prod01-neuron-pulsar-bookie.prod01-neuron-pulsar.svc.cluster.local:3181 rc:ENOLEDGER, ledger:74:entry:99:entryLength:0
   2023-04-05T16:54:54,057+0000 [ReplicationWorker] WARN  org.apache.bookkeeper.client.RackawareEnsemblePlacementPolicyImpl - Failed to find 1 bookies : excludeBookies [<Bookie:prod01-neuron-pulsar-bookie-0.prod01-neuron-pulsar-bookie.prod01-neuron-pulsar.svc.cluster.local:3181>, <Bookie:prod01-neuron-pulsar-bookie-1.prod01-neuron-pulsar-bookie.prod01-neuron-pulsar.svc.cluster.local:3181>, <Bookie:prod01-neuron-pulsar-bookie-2.prod01-neuron-pulsar-bookie.prod01-neuron-pulsar.svc.cluster.local:3181>], allBookies [<Bookie:prod01-neuron-pulsar-bookie-2.prod01-neuron-pulsar-bookie.prod01-neuron-pulsar.svc.cluster.local:3181>, <Bookie:prod01-neuron-pulsar-bookie-0.prod01-neuron-pulsar-bookie.prod01-neuron-pulsar.svc.cluster.local:3181>, <Bookie:prod01-neuron-pulsar-bookie-1.prod01-neuron-pulsar-bookie.prod01-neuron-pulsar.svc.cluster.local:3181>].
   2023-04-05T16:54:54,057+0000 [ReplicationWorker] WARN  org.apache.bookkeeper.client.RackawareEnsemblePlacementPolicyImpl - Failed to choose a bookie: excluded [<Bookie:prod01-neuron-pulsar-bookie-0.prod01-neuron-pulsar-bookie.prod01-neuron-pulsar.svc.cluster.local:3181>, <Bookie:prod01-neuron-pulsar-bookie-1.prod01-neuron-pulsar-bookie.prod01-neuron-pulsar.svc.cluster.local:3181>, <Bookie:prod01-neuron-pulsar-bookie-2.prod01-neuron-pulsar-bookie.prod01-neuron-pulsar.svc.cluster.local:3181>], fallback to choose bookie randomly from the cluster.
   2023-04-05T16:54:54,057+0000 [ReplicationWorker] WARN  org.apache.bookkeeper.client.RackawareEnsemblePlacementPolicyImpl - Failed to find 1 bookies : excludeBookies [<Bookie:prod01-neuron-pulsar-bookie-0.prod01-neuron-pulsar-bookie.prod01-neuron-pulsar.svc.cluster.local:3181>, <Bookie:prod01-neuron-pulsar-bookie-1.prod01-neuron-pulsar-bookie.prod01-neuron-pulsar.svc.cluster.local:3181>, <Bookie:prod01-neuron-pulsar-bookie-2.prod01-neuron-pulsar-bookie.prod01-neuron-pulsar.svc.cluster.local:3181>], allBookies [<Bookie:prod01-neuron-pulsar-bookie-2.prod01-neuron-pulsar-bookie.prod01-neuron-pulsar.svc.cluster.local:3181>, <Bookie:prod01-neuron-pulsar-bookie-0.prod01-neuron-pulsar-bookie.prod01-neuron-pulsar.svc.cluster.local:3181>, <Bookie:prod01-neuron-pulsar-bookie-1.prod01-neuron-pulsar-bookie.prod01-neuron-pulsar.svc.cluster.local:3181>].
   2023-04-05T16:54:54,058+0000 [ReplicationWorker] WARN  org.apache.bookkeeper.replication.ReplicationWorker - BKNotEnoughBookiesException while replicating the fragment
   org.apache.bookkeeper.client.BKException$BKNotEnoughBookiesException: Not enough non-faulty bookies available
           at org.apache.bookkeeper.client.RackawareEnsemblePlacementPolicyImpl.selectRandomInternal(RackawareEnsemblePlacementPolicyImpl.java:781) ~[org.apache.bookkeeper-bookkeeper-server-4.14.5.jar:4.14.5]
           at org.apache.bookkeeper.client.RackawareEnsemblePlacementPolicyImpl.selectRandom(RackawareEnsemblePlacementPolicyImpl.java:698) ~[org.apache.bookkeeper-bookkeeper-server-4.14.5.jar:4.14.5]
           at org.apache.bookkeeper.client.RackawareEnsemblePlacementPolicyImpl.selectFromNetworkLocation(RackawareEnsemblePlacementPolicyImpl.java:587) ~[org.apache.bookkeeper-bookkeeper-server-4.14.5.jar:4.14.5]
           at org.apache.bookkeeper.client.RackawareEnsemblePlacementPolicy.selectFromNetworkLocation(RackawareEnsemblePlacementPolicy.java:206) ~[org.apache.bookkeeper-bookkeeper-server-4.14.5.jar:4.14.5]
           at org.apache.bookkeeper.client.RackawareEnsemblePlacementPolicyImpl.selectFromNetworkLocation(RackawareEnsemblePlacementPolicyImpl.java:547) ~[org.apache.bookkeeper-bookkeeper-server-4.14.5.jar:4.14.5]
           at org.apache.bookkeeper.client.RackawareEnsemblePlacementPolicy.selectFromNetworkLocation(RackawareEnsemblePlacementPolicy.java:227) ~[org.apache.bookkeeper-bookkeeper-server-4.14.5.jar:4.14.5]
           at org.apache.bookkeeper.client.RackawareEnsemblePlacementPolicyImpl.replaceBookie(RackawareEnsemblePlacementPolicyImpl.java:475) ~[org.apache.bookkeeper-bookkeeper-server-4.14.5.jar:4.14.5]
           at org.apache.bookkeeper.client.RackawareEnsemblePlacementPolicy.replaceBookie(RackawareEnsemblePlacementPolicy.java:120) ~[org.apache.bookkeeper-bookkeeper-server-4.14.5.jar:4.14.5]
           at org.apache.bookkeeper.client.BookKeeperAdmin.getReplacementBookiesByIndexes(BookKeeperAdmin.java:1080) ~[org.apache.bookkeeper-bookkeeper-server-4.14.5.jar:4.14.5]
           at org.apache.bookkeeper.client.BookKeeperAdmin.replicateLedgerFragment(BookKeeperAdmin.java:1129) ~[org.apache.bookkeeper-bookkeeper-server-4.14.5.jar:4.14.5]
           at org.apache.bookkeeper.replication.ReplicationWorker.rereplicate(ReplicationWorker.java:377) [org.apache.bookkeeper-bookkeeper-server-4.14.5.jar:4.14.5]
           at org.apache.bookkeeper.replication.ReplicationWorker.rereplicate(ReplicationWorker.java:269) [org.apache.bookkeeper-bookkeeper-server-4.14.5.jar:4.14.5]
           at org.apache.bookkeeper.replication.ReplicationWorker.run(ReplicationWorker.java:230) [org.apache.bookkeeper-bookkeeper-server-4.14.5.jar:4.14.5]
           at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [io.netty-netty-common-4.1.77.Final.jar:4.1.77.Final]
           at java.lang.Thread.run(Thread.java:829) [?:?]
   2023-04-05T16:54:54,059+0000 [ReplicationWorker] DEBUG org.apache.bookkeeper.client.LedgerChecker - Checking fragment Fragment(LedgerID: 74, FirstEntryID: 0[0], LastKnownEntryID: 99[99], Host: [prod01-neuron-pulsar-bookie-0.prod01-neuron-pulsar-bookie.prod01-neuron-pulsar.svc.cluster.local:3181, prod01-neuron-pulsar-bookie-1.prod01-neuron-pulsar-bookie.prod01-neuron-pulsar.svc.cluster.local:3181, prod01-neuron-pulsar-bookie-2.prod01-neuron-pulsar-bookie.prod01-neuron-pulsar.svc.cluster.local:3181], Closed: true)
   2023-04-05T16:54:54,060+0000 [BookieReadThreadPool-OrderedExecutor-2-0] DEBUG org.apache.bookkeeper.proto.ReadEntryProcessor - Received new read request: Op(2)[Ledger:74,Entry:0]
   2023-04-05T16:54:54,060+0000 [BookieReadThreadPool-OrderedExecutor-2-0] DEBUG org.apache.bookkeeper.bookie.storage.ldb.SingleDirectoryDbLedgerStorage - Get Entry: 74@0
   2023-04-05T16:54:54,060+0000 [BookieReadThreadPool-OrderedExecutor-2-0] DEBUG org.apache.bookkeeper.proto.ReadEntryProcessor - ##### Read entry ##### 136 -- ref-count: 1
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@bookkeeper.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [bookkeeper] davtex commented on issue #3905: Autorecovery stuck when ensemble equals total bookies

Posted by "davtex (via GitHub)" <gi...@apache.org>.
davtex commented on issue #3905:
URL: https://github.com/apache/bookkeeper/issues/3905#issuecomment-1500898566

   Problem is that I am using hostnames instead of IPs. So removing pod to get new IP does nothing in my case.
   
   Let's say I modify this and switch to using IPs. What will happen when I restart all 3 bookies and all of them get new IPs?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@bookkeeper.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [bookkeeper] gaozhangmin commented on issue #3905: Autorecovery stuck when ensemble equals total bookies

Posted by "gaozhangmin (via GitHub)" <gi...@apache.org>.
gaozhangmin commented on issue #3905:
URL: https://github.com/apache/bookkeeper/issues/3905#issuecomment-1499863235

   That's because bookie only recognizes bookie's ip address in ensemble. Currently, there's no way but changing ip. Or you can just wait until these data expired


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@bookkeeper.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [bookkeeper] 7hong commented on issue #3905: Autorecovery stuck when ensemble equals total bookies

Posted by "7hong (via GitHub)" <gi...@apache.org>.
7hong commented on issue #3905:
URL: https://github.com/apache/bookkeeper/issues/3905#issuecomment-1498835230

   I had a similar problem when trying pulsar, I only had 3 bookie. persistencec I used was (2,2,2), and when I stop a bookie(simulating a bookie down), autorecover was executed incorrectly: Not enough non-faulty bookies available


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@bookkeeper.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [bookkeeper] davtex commented on issue #3905: Autorecovery stuck when ensemble equals total bookies

Posted by "davtex (via GitHub)" <gi...@apache.org>.
davtex commented on issue #3905:
URL: https://github.com/apache/bookkeeper/issues/3905#issuecomment-1498916598

   I would either need to force Autorecovery to reuse bookie-2 (bookie with missing data) OR somehow clean-up metadata in Zookeeper and remove bookie-2 from each ledger's metadata.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@bookkeeper.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org