Posted to commits@pulsar.apache.org by GitBox <gi...@apache.org> on 2022/05/25 08:58:30 UTC

[GitHub] [pulsar] yebai1105 opened a new issue, #15776: bookie failed to decommission successfully

yebai1105 opened a new issue, #15776:
URL: https://github.com/apache/pulsar/issues/15776

   **Describe the bug**
   A clear and concise description of what the bug is.
   
   **To Reproduce**
   Steps to reproduce the behavior:
   1. Go to '...'
   2. Click on '....'
   3. Scroll down to '....'
   4. See error
   
   **Expected behavior**
   A clear and concise description of what you expected to happen.
   
   **Screenshots**
   If applicable, add screenshots to help explain your problem.
   
   **Desktop (please complete the following information):**
    - OS: [e.g. iOS]
   
   **Additional context**
   Add any other context about the problem here.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [pulsar] yebai1105 commented on issue #15776: bookie failed to decommission successfully

Posted by GitBox <gi...@apache.org>.

yebai1105 commented on issue #15776:
URL: https://github.com/apache/pulsar/issues/15776#issuecomment-1143121913

   In addition, we would like to know: to handle the scenario where ledger data has been lost, can the parameter journalWriteData be set to false?



[GitHub] [pulsar] vitalii-buchyn-exa commented on issue #15776: bookie failed to decommission successfully

Posted by "vitalii-buchyn-exa (via GitHub)" <gi...@apache.org>.

vitalii-buchyn-exa commented on issue #15776:
URL: https://github.com/apache/pulsar/issues/15776#issuecomment-1727631005

   Hello guys,
   
   Having a similar issue.
   
   We've lost several bookies and need to clean up everything connected with them.
   
   We've tried `bookkeeper shell decommissionbookie`, but it runs indefinitely with a message like `Count of Ledgers which need to be rereplicated: 16`.
   We've also tried cleaning up /ledgers/cookies in ZooKeeper and restarting the brokers, bookies, and ZooKeeper, but we still see connection errors in the bookie logs like:
   ```
   2023-09-20 11:42:55,992 - ERROR - [BookKeeperClientScheduler-OrderedScheduler-0-0:PerChannelBookieClient@534] - Cannot connect to pulsar-boo-bookie-4.pulsar-boo-bookie-headless.infrastructure.svc.cluster.local:3181 as endpoint resolution failed (probably bookie is down) err org.apache.bookkeeper.proto.BookieAddressResolver$BookieIdNotResolvedException: Cannot resolve bookieId pulsar-boo-bookie-4.pulsar-boo-bookie-headless.infrastructure.svc.cluster.local:3181, bookie does not exist or it is not running
   ``` 
   
   We can see those 16 underreplicated ledgers, like:
   ```
   [zk: localhost:2181(CONNECTED) 13] ls /ledgers/underreplication/ledgers/0000/0000
   [000f, 0013, 0015, 0016, 0019, 001a, 001b, 001f, 0020, 0025, 0028, 002c, 002f, 0035, 0036, 0039]
   ```
   
   Is it safe to clean those up in zookeeper and let decommission go?
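
To tell a stuck decommission from a slow one, it can help to track whether the count in that message ever drops. The following is only a sketch: `rereplication_count` is a hypothetical helper, not a BookKeeper command, and it assumes the tool keeps printing the message in exactly the form quoted above.

```shell
# Hypothetical helper: read decommission output on stdin and print the most
# recent "Count of Ledgers which need to be rereplicated" value.
# Assumes the message format quoted in the comment above.
rereplication_count() {
    sed -n 's/.*Count of Ledgers which need to be rereplicated: \([0-9][0-9]*\).*/\1/p' |
        tail -n 1
}

# Example with the line quoted in the comment above.
echo 'Count of Ledgers which need to be rereplicated: 16' | rereplication_count   # prints 16
```

If the value never changes between samples, re-replication is probably blocked rather than slow.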



[GitHub] [pulsar] lgxbslgx commented on issue #15776: bookie failed to decommission successfully

Posted by GitBox <gi...@apache.org>.

lgxbslgx commented on issue #15776:
URL: https://github.com/apache/pulsar/issues/15776#issuecomment-1148749883

   @yebai1105 I read the log you provided and have no idea yet. I agree with your opinion that Pulsar should be able to recover from such a situation. Maybe we need other, more experienced developers to fix it.



[GitHub] [pulsar] yebai1105 commented on issue #15776: bookie failed to decommission successfully

Posted by GitBox <gi...@apache.org>.

yebai1105 commented on issue #15776:
URL: https://github.com/apache/pulsar/issues/15776#issuecomment-1143117746

   Below are answers to some of your questions and our findings:
   1. Your questions:
   1.1 We never used this command: bin/bookkeeper shell simpletest --ensemble 4 --writeQuorum 4 --ackQuorum 4 --numEntries 4
   1.2 The commands 'bin/bookkeeper shell listunderreplicated' and 'bin/bookkeeper shell decommissionbookie' were executed on the faulty machine 10.101.129.75.
   1.3 The persistence policies of all namespaces in our cluster are as follows:
   ```
   {
     "bookkeeperEnsemble" : 3,
     "bookkeeperWriteQuorum" : 3,
     "bookkeeperAckQuorum" : 2,
     "managedLedgerMaxMarkDeleteRate" : 0.0
   }
   ```
   1.4 We opened ZooKeeper and tried to get ledger 396606, but didn't find it:
   ```
   [zk: localhost:2181(CONNECTED) 5] get /pulsar_dev2/ledgers/00/0
   0000   0001   0004   0006   0011   0012   0029   0030   0031   0032   0034   0040   0041   0042   0043   0044   0045   0046   0047   0049   0050   0051   0052   0053   0054   0055   0056   0057   0058   0059   
   0060   0061   0062   0063   0064   0065   0066   0067   0068   0069   0070   0071   0072   0073   0074   0075   0076   0077   0078   0079   0080   0081   0082   0083   0084   0085   0086   0087   0088   0089   
   0090   0091   0092   0093   0094   0095   0096   0097   0098   0099   0100   0101   0102   0103   0104   0105   0106   0107   0108   0109   0110   0111   0112   0113   0114   0115   0116   0117   0118   0119   
   0120   0121   0122   0123   0124   0125   0126   0127   0128   0129   0130   0131   0133   0137   0138   0139   0140   0142   0143   0144   
   ```
   2. Our findings
   I have four bookies. The process of one bookie, 10.101.129.75, went down because its disk was full, and two other bookies also went down, leaving only one surviving bookie. I don't know why these three bookies went down, because I can't find the earlier logs. I suspect that the successive outages of the three bookies resulted in data loss, which is why the bookie cannot be decommissioned.
   Below are my tests:
   2.1 I listed the ledgers with missing replicas and the corresponding machines, and found that ledger 396606 was missing on all 4 machines. See the log at the link below:
   https://tva1.sinaimg.cn/large/e6c9d24egy1h2smsws1u8j225s0rj1c4.jpg
   2.2 I used the command 'bin/bookkeeper shell readledger -ledgerid 396606' to read the ledger and got an error when reading entry 867; the request was actually being sent to the faulty machine 10.101.129.75. See the log at the link below:
   https://tva1.sinaimg.cn/large/e6c9d24egy1h2sn01kac1j21de0u0apv.jpg
   2.3 I used the command 'bin/bookkeeper shell ledgermetadata -ledgerid 396606' to view the metadata of ledger 396606 and found that the replicas starting from entry 845 are allocated on the machine 10.101.129.75. See the log at the link below:
   https://tva1.sinaimg.cn/large/e6c9d24egy1h2sn21dcobj21gr0u0e22.jpg
   When we debugged the node decommissioning, we found that the source code reads the ledger that is missing a replica, and the error 'no entry' is reported here:
   https://tva1.sinaimg.cn/large/e6c9d24egy1h2snbxx8xvj212y0np79x.jpg
   
   Some doubts:
   I think the data loss was caused by the three bookie machines going down within a short time while our cluster parameter journalWriteData is set to false (we don't want to enable write-ahead journaling of data, and we accept some data loss). But I have some doubts: why does the loss of data cause such a big problem that the machine cannot be decommissioned? In our later tests we even found that the loss of the ledger causes the producer to fail to send data. Maybe the data-loss scenario should be considered here, with countermeasures in place.
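
The suspicion above can be checked with simple arithmetic. This is a rough sketch, assuming that an entry is acknowledged once `bookkeeperAckQuorum` bookies have confirmed it and that every bookie that goes down loses its copy (the situation the comment suspects). `surviving_copies_worst_case` is a hypothetical helper, not part of BookKeeper.

```shell
# Hypothetical helper: worst-case number of surviving copies of an
# acknowledged entry, assuming each lost bookie held (and lost) one copy.
surviving_copies_worst_case() {
    ack_quorum=$1     # bookkeeperAckQuorum from the namespace policy
    lost_bookies=$2   # bookies that went down and lost their data
    remaining=$((ack_quorum - lost_bookies))
    if [ "$remaining" -lt 0 ]; then
        remaining=0
    fi
    echo "$remaining"
}

surviving_copies_worst_case 2 1   # one bookie lost: prints 1
surviving_copies_worst_case 2 3   # three bookies lost, as in this report: prints 0
```

Under these assumptions, losing three bookies with an ack quorum of 2 can leave zero copies of some entries, which would match the 'no entry' error seen during decommissioning.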



[GitHub] [pulsar] github-actions[bot] commented on issue #15776: bookie failed to decommission successfully

Posted by GitBox <gi...@apache.org>.

github-actions[bot] commented on issue #15776:
URL: https://github.com/apache/pulsar/issues/15776#issuecomment-1178461935

   The issue had no activity for 30 days, mark with Stale label.



[GitHub] [pulsar] lgxbslgx commented on issue #15776: bookie failed to decommission successfully

Posted by GitBox <gi...@apache.org>.

lgxbslgx commented on issue #15776:
URL: https://github.com/apache/pulsar/issues/15776#issuecomment-1139665002

   I can reproduce this bug locally by using the following steps:
   
   1. Deploy a cluster according to the [document](https://pulsar.apache.org/docs/next/deploy-bare-metal). The cluster has 3 zookeeper nodes, 3 bookkeeper nodes and 3 brokers, the same as in the document.
   2. Use the command `bin/bookkeeper shell simpletest --ensemble 3 --writeQuorum 3 --ackQuorum 3 --numEntries 3` to test the bookkeeper. **Note: this step is important.**
   3. Produce and consume several times.
   4. Delete the journalDirectories and ledgerDirectories directories of one bookie, named `BK1`. (Same as @yebai1105's step 1.)
   5. Shut down the bookie `BK1`.
   6. Run the command `bin/bookkeeper shell listunderreplicated` on bookie node `BK1`. (Same as @yebai1105's step 2, but @yebai1105 didn't indicate on which bookie node to run this command.)
   7. Run the command `bin/bookkeeper shell decommissionbookie` on bookie node `BK1`. (Same as @yebai1105's step 3, but @yebai1105 didn't indicate on which bookie node to run this command.)
   
   Then the same error message occurs. This is because the command `bin/bookkeeper shell simpletest --ensemble 3 --writeQuorum 3 --ackQuorum 3 --numEntries 3` creates a ledger whose ensemble size equals its write quorum size and also equals the total number of bookies (3). So this ledger can't be re-replicated until a new bookie node is added.
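
The reasoning above boils down to one comparison: a lost ensemble member can only be replaced if the bookies still alive can hold a full ensemble. A minimal sketch, where `can_rereplicate` is a hypothetical helper and not a BookKeeper command:

```shell
# Hypothetical helper: can a ledger be re-replicated after some bookies
# are lost? It needs at least ensemble_size live bookies.
can_rereplicate() {
    total_bookies=$1    # bookies in the cluster
    failed_bookies=$2   # bookies that are down or being decommissioned
    ensemble_size=$3    # ensemble size of the ledger
    live=$((total_bookies - failed_bookies))
    if [ "$live" -ge "$ensemble_size" ]; then
        echo yes
    else
        echo no
    fi
}

can_rereplicate 3 1 3   # the reproduction above: 2 live bookies, ensemble of 3: prints no
can_rereplicate 4 1 3   # a fourth bookie leaves room for a replacement: prints yes
```

This check only covers placement. It says nothing about whether the surviving bookies still hold the data to copy from.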
   
   Now I need to confirm with @yebai1105: did you use a similar command, like `bin/bookkeeper shell simpletest --ensemble 4 --writeQuorum 4 --ackQuorum 4 --numEntries 4`, for testing when you deployed your cluster?
   
   If you don't remember whether you ran this test when you deployed your cluster, you can use the following commands to get the nodes of an under-replicated ledger (such as `396606`, which your log shows).
   
   > 2022-05-25 16:40:17.0035 [main] INFO  org.apache.bookkeeper.tools.cli.commands.autorecovery.ListUnderReplicatedCommand - 396606
   > 2022-05-25 16:40:17.0035 [main] INFO  org.apache.bookkeeper.tools.cli.commands.autorecovery.ListUnderReplicatedCommand - 396606
   > 2022-05-25 16:40:17.0035 [main] INFO  org.apache.bookkeeper.tools.cli.commands.autorecovery.ListUnderReplicatedCommand -        Ctime : 1651199961381
   > 2022-05-25 16:40:17.0036 [main] INFO  org.apache.bookkeeper.tools.cli.commands.autorecovery.ListUnderReplicatedCommand - 112963
   > 2022-05-25 16:40:17.0036 [main] INFO  org.apache.bookkeeper.tools.cli.commands.autorecovery.ListUnderReplicatedCommand -        Ctime : 1650363984734
   
   ```shell
   # open the zookeeper shell
   $ bin/pulsar zookeeper-shell -timeout 5000 -server <zk-ip/zk-domain>:<zk-port>
   
   # get the ledger 396606 which is under replicated
   $ get /ledger/00/0039/6606
   
   # another example 112963
   $ get /ledger/00/0011/2963
   ```
   
   You can count the number of bookie nodes of such a ledger. If the number is 4 in your cluster, it means my assumption is right.
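
The exact znode path depends on the cluster's metadata root and ledger manager, so the paths in the commands above may need adjusting. The sketch below assumes BookKeeper's hierarchical layout, in which the ledger ID is zero-padded to ten digits and split 2-4-4 with an `L` prefix on the last part, under a root such as `/ledgers` (the reporter's listing uses `/pulsar_dev2/ledgers`). `ledger_znode_path` is a hypothetical helper, and the layout is an assumption to verify against your own ZooKeeper tree. It only handles IDs of up to ten digits.

```shell
# Hypothetical helper: print the assumed znode path of a ledger,
# <root>/<2 digits>/<4 digits>/L<4 digits>.
ledger_znode_path() {
    ledger_id=$1
    root=${2:-/ledgers}   # assumed default metadata root
    printf '%s/%02d/%04d/L%04d\n' "$root" \
        "$((ledger_id / 100000000))" \
        "$(((ledger_id / 10000) % 10000))" \
        "$((ledger_id % 10000))"
}

ledger_znode_path 396606                        # prints /ledgers/00/0039/L6606
ledger_znode_path 112963 /pulsar_dev2/ledgers   # prints /pulsar_dev2/ledgers/00/0011/L2963
```

The printed path can then be passed to `get` inside the zookeeper shell.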



Re: [I] bookie failed to decommission successfully [pulsar]

Posted by "truong-hua (via GitHub)" <gi...@apache.org>.

truong-hua commented on issue #15776:
URL: https://github.com/apache/pulsar/issues/15776#issuecomment-1838430972

   I have the same problem: my bookie is stuck waiting for replication. We have an ack quorum of 2, and a write quorum and ensemble size of 3. We have 5 bookies, and there had been no problems with the servers from the time they were started until I ran the first decommission command.

