Posted to commits@pulsar.apache.org by GitBox <gi...@apache.org> on 2022/05/25 08:58:30 UTC
[GitHub] [pulsar] yebai1105 opened a new issue, #15776: bookie failed to decommission successfully
yebai1105 opened a new issue, #15776:
URL: https://github.com/apache/pulsar/issues/15776
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [pulsar] yebai1105 commented on issue #15776: bookie failed to decommission successfully
Posted by GitBox <gi...@apache.org>.
yebai1105 commented on issue #15776:
URL: https://github.com/apache/pulsar/issues/15776#issuecomment-1143121913
In addition, we would also like to know the following: to handle the scenario where ledger data has been lost, can the parameter journalWriteData be set to false?
[GitHub] [pulsar] vitalii-buchyn-exa commented on issue #15776: bookie failed to decommission successfully
Posted by "vitalii-buchyn-exa (via GitHub)" <gi...@apache.org>.
vitalii-buchyn-exa commented on issue #15776:
URL: https://github.com/apache/pulsar/issues/15776#issuecomment-1727631005
Hello guys,
Having a similar issue.
We've lost several bookies and need to clean up everything connected with them.
We've tried `bookkeeper shell decommissionbookie`, but it runs indefinitely with a message like `Count of Ledgers which need to be rereplicated: 16`
We've tried cleaning up /ledgers/cookies in zookeeper and restarting brokers, bookies, and zookeeper, but we still see connection errors in the bookie logs like:
```
2023-09-20 11:42:55,992 - ERROR - [BookKeeperClientScheduler-OrderedScheduler-0-0:PerChannelBookieClient@534] - Cannot connect to pulsar-boo-bookie-4.pulsar-boo-bookie-headless.infrastructure.svc.cluster.local:3181 as endpoint resolution failed (probably bookie is down) err org.apache.bookkeeper.proto.BookieAddressResolver$BookieIdNotResolvedException: Cannot resolve bookieId pulsar-boo-bookie-4.pulsar-boo-bookie-headless.infrastructure.svc.cluster.local:3181, bookie does not exist or it is not running
```
We can see those 16 underreplicated ledgers, like
```
[zk: localhost:2181(CONNECTED) 13] ls /ledgers/underreplication/ledgers/0000/0000
[000f, 0013, 0015, 0016, 0019, 001a, 001b, 001f, 0020, 0025, 0028, 002c, 002f, 0035, 0036, 0039]
```
Is it safe to clean those up in zookeeper and let the decommission proceed?
[GitHub] [pulsar] lgxbslgx commented on issue #15776: bookie failed to decommission successfully
Posted by GitBox <gi...@apache.org>.
lgxbslgx commented on issue #15776:
URL: https://github.com/apache/pulsar/issues/15776#issuecomment-1148749883
@yebai1105 I read the log you provided, but I have no idea for now. I agree with your opinion that such a situation should be recoverable by Pulsar. Maybe we need other, more experienced developers to fix it.
[GitHub] [pulsar] yebai1105 commented on issue #15776: bookie failed to decommission successfully
Posted by GitBox <gi...@apache.org>.
yebai1105 commented on issue #15776:
URL: https://github.com/apache/pulsar/issues/15776#issuecomment-1143117746
Below are answers to some of your questions, followed by our findings:
1. Your questions:
1.1 We never used this command: `bin/bookkeeper shell simpletest --ensemble 4 --writeQuorum 4 --ackQuorum 4 --numEntries 4`
1.2 The commands `bin/bookkeeper shell listunderreplicated` and `bin/bookkeeper shell decommissionbookie` were executed on the faulty machine, 10.101.129.75.
1.3 The persistence policies of all namespaces in our cluster are as follows:
```
{
"bookkeeperEnsemble" : 3,
"bookkeeperWriteQuorum" : 3,
"bookkeeperAckQuorum" : 2,
"managedLedgerMaxMarkDeleteRate" : 0.0
}
```
1.4 We opened the ZooKeeper shell and tried to get ledger 396606, but could not find it:
```
[zk: localhost:2181(CONNECTED) 5] get /pulsar_dev2/ledgers/00/0
0000 0001 0004 0006 0011 0012 0029 0030 0031 0032 0034 0040 0041 0042 0043 0044 0045 0046 0047 0049 0050 0051 0052 0053 0054 0055 0056 0057 0058 0059
0060 0061 0062 0063 0064 0065 0066 0067 0068 0069 0070 0071 0072 0073 0074 0075 0076 0077 0078 0079 0080 0081 0082 0083 0084 0085 0086 0087 0088 0089
0090 0091 0092 0093 0094 0095 0096 0097 0098 0099 0100 0101 0102 0103 0104 0105 0106 0107 0108 0109 0110 0111 0112 0113 0114 0115 0116 0117 0118 0119
0120 0121 0122 0123 0124 0125 0126 0127 0128 0129 0130 0131 0133 0137 0138 0139 0140 0142 0143 0144
```
2. Our findings:
We have four bookies. The bookie process on 10.101.129.75 went down because its disk was full, and two other bookies also went down, leaving only one surviving bookie. I don't know why these three bookies went down, because I can't find the earlier logs. I suspect that the successive outage of the three bookies resulted in data loss, which is why the bookie cannot be decommissioned.
Below is my test:
2.1 I listed the under-replicated ledgers together with the machines missing their replicas, and found that ledger 396606 was missing on all 4 machines. See the log at the link below:
https://tva1.sinaimg.cn/large/e6c9d24egy1h2smsws1u8j225s0rj1c4.jpg
2.2 I tried to read the ledger with the command `bin/bookkeeper shell readledger -ledgerid 396606` and got an error when reading entry 867; the request was actually being sent to the faulty machine 10.101.129.75. See the log at the link below:
https://tva1.sinaimg.cn/large/e6c9d24egy1h2sn01kac1j21de0u0apv.jpg
2.3 I used the command `bin/bookkeeper shell ledgermetadata -ledgerid 396606` to view the metadata of ledger 396606 and found that the replicas starting from entry 845 are assigned to this machine, 10.101.129.75. See the log at the link below:
https://tva1.sinaimg.cn/large/e6c9d24egy1h2sn21dcobj21gr0u0e22.jpg
When we debugged the decommission of the node, we found that the source code reads the under-replicated ledger, and the error 'no entry' is reported here:
https://tva1.sinaimg.cn/large/e6c9d24egy1h2snbxx8xvj212y0np79x.jpg
Some doubts:
I think the data loss was caused by the three bookie machines going down within a short time while our cluster parameter journalWriteData is set to false (we don't want write-ahead journaling of the data, and we accept some data loss). But I have some doubts: why does the loss of data cause such a big problem that the machine cannot be decommissioned? In our later tests we even found that the loss of a ledger causes the producer to fail to send data. Maybe the data-loss scenario should be considered here, with countermeasures in place.
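The data-loss suspicion above can be sketched as simple arithmetic. This is a rough illustration, not BookKeeper code: `entry_may_be_lost` is a hypothetical helper, and it assumes that an acknowledged entry is only guaranteed to exist on `bookkeeperAckQuorum` bookies, so it may disappear once that many bookies of its write set lose their data.

```shell
#!/usr/bin/env bash
# Hypothetical helper, for illustration only (not a BookKeeper command).
# Assumption: an acknowledged entry is guaranteed on only <ack_quorum> bookies,
# so it may be lost once that many bookies of its write set lose their data.
# usage: entry_may_be_lost <ack_quorum> <bookies_with_lost_data>
entry_may_be_lost() {
  local ack_quorum="$1" lost="$2"
  [ "$lost" -ge "$ack_quorum" ]
}

# Policy from 1.3 (ack quorum 2) with three bookies failing close together:
if entry_may_be_lost 2 3; then
  echo "ack quorum 2, 3 bookies lost: acknowledged entries may be lost"
fi
# The same policy with a single failed bookie:
if ! entry_may_be_lost 2 1; then
  echo "ack quorum 2, 1 bookie lost: at least one copy should survive"
fi
```

With journalWriteData set to false the real window for loss may be wider than this model suggests, since it only counts bookies and says nothing about what had reached disk.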
[GitHub] [pulsar] github-actions[bot] commented on issue #15776: bookie failed to decommission successfully
Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on issue #15776:
URL: https://github.com/apache/pulsar/issues/15776#issuecomment-1178461935
The issue had no activity for 30 days, mark with Stale label.
[GitHub] [pulsar] lgxbslgx commented on issue #15776: bookie failed to decommission successfully
Posted by GitBox <gi...@apache.org>.
lgxbslgx commented on issue #15776:
URL: https://github.com/apache/pulsar/issues/15776#issuecomment-1139665002
I can reproduce this bug locally by using the following steps:
1. Deploy a cluster according to the [document](https://pulsar.apache.org/docs/next/deploy-bare-metal). The cluster has 3 zookeeper nodes, 3 bookkeeper nodes and 3 brokers, the same as in the document.
2. Use the command `bin/bookkeeper shell simpletest --ensemble 3 --writeQuorum 3 --ackQuorum 3 --numEntries 3` to test the bookkeeper. **Note: this step is important.**
3. Produce and consume several times.
4. Delete the journalDirectories and ledgerDirectories directories of one bookie, named `BK1`. (Same as @yebai1105's step 1)
5. Shut down the bookie `BK1`.
6. Run the command `bin/bookkeeper shell listunderreplicated` on bookie node `BK1`. (Same as @yebai1105's step 2, but @yebai1105 didn't indicate on which bookie node the command was run.)
7. Run the command `bin/bookkeeper shell decommissionbookie` on bookie node `BK1`. (Same as @yebai1105's step 3, but @yebai1105 didn't indicate on which bookie node the command was run.)
Then the same error message occurs. This is because the command `bin/bookkeeper shell simpletest --ensemble 3 --writeQuorum 3 --ackQuorum 3 --numEntries 3` creates a ledger whose ensemble size equals the write quorum size and also equals the total number of bookies (3). So this ledger can't be re-replicated until a new bookie node is added.
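A rough sketch of this reasoning, assuming the only constraint is that enough live bookies must remain to rebuild a full ensemble (rack-aware placement and similar policies are ignored). The function name `can_rereplicate` is hypothetical and not a BookKeeper command.

```shell
#!/usr/bin/env bash
# Hypothetical helper, for illustration only (not a BookKeeper command).
# usage: can_rereplicate <ensemble_size> <total_bookies> <unavailable_bookies>
# Succeeds when enough live bookies remain to rebuild a full ensemble.
can_rereplicate() {
  local ensemble="$1" total="$2" unavailable="$3"
  local live=$(( total - unavailable ))
  [ "$live" -ge "$ensemble" ]
}

# 3 bookies, ensemble 3 (the simpletest ledger), BK1 shut down:
if ! can_rereplicate 3 3 1; then
  echo "ensemble 3 on 3 bookies: stuck until a new bookie is added"
fi
# Same ledger after a fourth bookie joins the cluster:
if can_rereplicate 3 4 1; then
  echo "ensemble 3 on 4 bookies: re-replication can proceed"
fi
```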
Now I need to confirm with @yebai1105: did you use a similar command, like `bin/bookkeeper shell simpletest --ensemble 4 --writeQuorum 4 --ackQuorum 4 --numEntries 4`, to test when you deployed your cluster?
If you don't remember whether you ran this test when you deployed your cluster, you can use the following commands to get the bookie nodes of an under-replicated ledger (such as `396606`, which your log shows).
> 2022-05-25 16:40:17.0035 [main] INFO org.apache.bookkeeper.tools.cli.commands.autorecovery.ListUnderReplicatedCommand - 396606
> 2022-05-25 16:40:17.0035 [main] INFO org.apache.bookkeeper.tools.cli.commands.autorecovery.ListUnderReplicatedCommand - 396606
> 2022-05-25 16:40:17.0035 [main] INFO org.apache.bookkeeper.tools.cli.commands.autorecovery.ListUnderReplicatedCommand - Ctime : 1651199961381
> 2022-05-25 16:40:17.0036 [main] INFO org.apache.bookkeeper.tools.cli.commands.autorecovery.ListUnderReplicatedCommand - 112963
> 2022-05-25 16:40:17.0036 [main] INFO org.apache.bookkeeper.tools.cli.commands.autorecovery.ListUnderReplicatedCommand - Ctime : 1650363984734
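```shell
# open the zookeeper shell
$ bin/pulsar zookeeper-shell -timeout 5000 -server <zk-ip/zk-domain>:<zk-port>
# inside the zookeeper shell: get the under-replicated ledger 396606
# (path assumes the default /ledgers root and the hierarchical ledger layout;
#  adjust the root to your cluster, e.g. /pulsar_dev2/ledgers)
get /ledgers/00/0039/L6606
# another example: ledger 112963
get /ledgers/00/0011/L2963
```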
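The znode path passed to `get` can be derived from the ledger ID. A minimal sketch, assuming the hierarchical ledger manager layout, in which the zero-padded 10-digit ID is split as `<root>/00/0039/L6606`, and assuming `/ledgers` as the default root. `ledger_znode_path` is a hypothetical helper, and the exact layout should be checked against your own cluster.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of BookKeeper): print the metadata znode path
# for a decimal ledger ID of at most 10 digits, assuming the layout
# <root>/<2 digits>/<4 digits>/L<4 digits> over the zero-padded ID.
# usage: ledger_znode_path <ledger_id> [ledgers_root]
ledger_znode_path() {
  local ledger_id="$1"
  local root="${2:-/ledgers}"   # assumed default root
  local padded
  padded=$(printf '%010d' "$ledger_id")
  printf '%s/%s/%s/L%s\n' "$root" "${padded:0:2}" "${padded:2:4}" "${padded:6:4}"
}

ledger_znode_path 396606                        # default root
ledger_znode_path 112963 /pulsar_dev2/ledgers   # custom metadata root
```

The optional second argument covers clusters with a non-default metadata root, such as the `/pulsar_dev2/ledgers` prefix seen earlier in this thread.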
You can then count the number of bookie nodes in that ledger's ensemble. If the number is 4 in your cluster, it means my assumption is right.
Re: [I] bookie failed to decommission successfully [pulsar]
Posted by "truong-hua (via GitHub)" <gi...@apache.org>.
truong-hua commented on issue #15776:
URL: https://github.com/apache/pulsar/issues/15776#issuecomment-1838430972
I have the same problem: my bookie is stuck waiting for replication. We use an ack quorum of 2, and a write quorum and ensemble size of 3. We have 5 bookies, and the servers had no problems from startup until I ran the first decommission command.