Posted to commits@pulsar.apache.org by GitBox <gi...@apache.org> on 2022/05/27 14:19:42 UTC

[GitHub] [pulsar] lgxbslgx commented on issue #15776: bookie failed to decommission successfully

lgxbslgx commented on issue #15776:
URL: https://github.com/apache/pulsar/issues/15776#issuecomment-1139665002

   I can reproduce this bug locally by using the following steps:
   
   1. Deploy a cluster according to the [document](https://pulsar.apache.org/docs/next/deploy-bare-metal). The cluster has 3 ZooKeeper nodes, 3 BookKeeper nodes (bookies) and 3 brokers, the same as in the document.
   2. Use the command `bin/bookkeeper shell simpletest --ensemble 3 --writeQuorum 3 --ackQuorum 3 --numEntries 3` to test the bookkeeper. **Note: this step is important.**
   3. Produce and consume several times.
   4. Delete the `journalDirectories` and `ledgerDirectories` directories of one bookie, named `BK1`. (Same as @yebai1105's step 1.)
   5. Shut down the bookie `BK1`.
   6. Run the command `bin/bookkeeper shell listunderreplicated` on bookie node `BK1`. (Same as @yebai1105's step 2, but @yebai1105 didn't say on which bookie node the command was run.)
   7. Run the command `bin/bookkeeper shell decommissionbookie` on bookie node `BK1`. (Same as @yebai1105's step 3, but @yebai1105 didn't say on which bookie node the command was run.)
   
   Then the same error message occurs. This is because the command `bin/bookkeeper shell simpletest --ensemble 3 --writeQuorum 3 --ackQuorum 3 --numEntries 3` creates a ledger whose ensemble size equals its write quorum size, which in turn equals the total number of bookies (also 3). With `BK1` gone, no bookie outside the ensemble is left to take over its data, so this ledger can't be re-replicated until a new bookie node is added.
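   
   The arithmetic behind this can be sketched as a small shell check. It is a simplified model, not BookKeeper's actual placement logic: `can_rereplicate` is a hypothetical helper, it assumes every lost bookie belonged to the ledger's ensemble, and it ignores rack-aware placement policies.
   
   ```shell
   # Hypothetical helper (not part of BookKeeper): can a ledger be re-replicated?
   #   $1 = total bookies in the cluster
   #   $2 = ensemble size of the ledger
   #   $3 = lost bookies (assumed to all be members of the ensemble)
   # Each lost bookie must be replaced by a live bookie that is not already in
   # the ensemble, which works out to: remaining live bookies >= ensemble size.
   can_rereplicate() {
     remaining=$(( $1 - $3 ))
     [ "$remaining" -ge "$2" ]
   }
   
   # The setup from the steps above: 3 bookies, ensemble of 3, BK1 lost.
   if can_rereplicate 3 3 1; then echo "replicable"; else echo "stuck"; fi
   
   # Same ledger after one extra bookie has been added to the cluster.
   if can_rereplicate 4 3 1; then echo "replicable"; else echo "stuck"; fi
   ```
   
   Under these assumptions the first check prints `stuck` and the second prints `replicable`, which matches the behaviour described above.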
   
   Now I need to confirm with @yebai1105: did you run a similar command, like `bin/bookkeeper shell simpletest --ensemble 4 --writeQuorum 4 --ackQuorum 4 --numEntries 4`, to test your cluster when you deployed it?
   
   If you don't remember whether you ran this test when you deployed your cluster, you can use the commands below to get the bookie nodes of an under-replicated ledger (such as `396606`, which your log shows):
   
   > 2022-05-25 16:40:17.0035 [main] INFO  org.apache.bookkeeper.tools.cli.commands.autorecovery.ListUnderReplicatedCommand - 396606
   > 2022-05-25 16:40:17.0035 [main] INFO  org.apache.bookkeeper.tools.cli.commands.autorecovery.ListUnderReplicatedCommand - 396606
   > 2022-05-25 16:40:17.0035 [main] INFO  org.apache.bookkeeper.tools.cli.commands.autorecovery.ListUnderReplicatedCommand -        Ctime : 1651199961381
   > 2022-05-25 16:40:17.0036 [main] INFO  org.apache.bookkeeper.tools.cli.commands.autorecovery.ListUnderReplicatedCommand - 112963
   > 2022-05-25 16:40:17.0036 [main] INFO  org.apache.bookkeeper.tools.cli.commands.autorecovery.ListUnderReplicatedCommand -        Ctime : 1650363984734
   
   ```shell
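   # open the ZooKeeper shell
   $ bin/pulsar zookeeper-shell -timeout 5000 -server <zk-ip/zk-domain>:<zk-port>
   
   # inside the ZooKeeper shell: get the metadata of ledger 396606, which is under-replicated
   # (if the node is not found, note that the default hierarchical layout is usually
   #  /ledgers/00/0039/L6606, i.e. root `/ledgers` and an `L` prefix on the last part)
   $ get /ledger/00/0039/6606
   
   # another example: ledger 112963
   $ get /ledger/00/0011/2963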
   ```
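   
   For other ledger IDs, the znode path can be derived from the ID. The sketch below uses a hypothetical helper, `ledger_znode_path`, and assumes BookKeeper's hierarchical ledger manager: the ID is zero-padded to 10 digits, split 2/4/4, and the last part carries an `L` prefix under a root that defaults here to `/ledgers`. If your cluster uses a different `zkLedgersRootPath` or ledger manager, the result will differ.
   
   ```shell
   # Hypothetical helper: print the metadata znode path for a ledger ID.
   #   $1 = ledger ID (assumed to have at most 10 digits)
   #   $2 = ledgers root path (optional, assumed default: /ledgers)
   ledger_znode_path() {
     # zero-pad the ID to 10 digits, e.g. 396606 -> 0000396606
     padded=$(printf '%010d' "$1")
     # split the padded ID into 2 + 4 + 4 digits
     l1=$(printf '%s' "$padded" | cut -c1-2)
     l2=$(printf '%s' "$padded" | cut -c3-6)
     l3=$(printf '%s' "$padded" | cut -c7-10)
     printf '%s/%s/%s/L%s\n' "${2:-/ledgers}" "$l1" "$l2" "$l3"
   }
   
   ledger_znode_path 396606   # prints /ledgers/00/0039/L6606
   ledger_znode_path 112963   # prints /ledgers/00/0011/L2963
   ```
   
   The printed path can then be passed to `get` inside the ZooKeeper shell.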
   
   Then count the bookie nodes listed for such a ledger. If the ledger has 4 bookie nodes in your cluster, my assumption is right.

