Posted to commits@pulsar.apache.org by GitBox <gi...@apache.org> on 2022/06/01 04:57:00 UTC
[GitHub] [pulsar] yebai1105 commented on issue #15776: bookie failed to decommission successfully
yebai1105 commented on issue #15776:
URL: https://github.com/apache/pulsar/issues/15776#issuecomment-1143117746
Below are answers to some of your questions and our findings:
1. Your questions:
1.1 We never ran this command: 'bin/bookkeeper shell simpletest --ensemble 4 --writeQuorum 4 --ackQuorum 4 --numEntries 4'.
1.2 The commands 'bin/bookkeeper shell listunderreplicated' and 'bin/bookkeeper shell decommissionbookie' were both executed on the faulty machine 10.101.129.75; a sketch of the sequence is below.
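For reference, a minimal sketch of the decommission sequence, assuming the bookie listens on port 3181 (when run on the bookie being removed, -bookieid can be omitted):
```
# Stop the bookie process on the node being removed first, then:
bin/bookkeeper shell decommissionbookie -bookieid 10.101.129.75:3181

# Afterwards, check whether any ledgers still have missing replicas:
bin/bookkeeper shell listunderreplicated
```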
1.3 All namespaces in our cluster use the following persistence policy (the matching pulsar-admin command follows the JSON):
```
{
  "bookkeeperEnsemble" : 3,
  "bookkeeperWriteQuorum" : 3,
  "bookkeeperAckQuorum" : 2,
  "managedLedgerMaxMarkDeleteRate" : 0.0
}
```
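For completeness, this policy corresponds to the following pulsar-admin invocation (my-tenant/my-namespace is a placeholder):
```
# Set ensemble=3, write quorum=3, ack quorum=2, no mark-delete rate limit:
bin/pulsar-admin namespaces set-persistence my-tenant/my-namespace \
  --bookkeeper-ensemble 3 \
  --bookkeeper-write-quorum 3 \
  --bookkeeper-ack-quorum 2 \
  --ml-mark-delete-max-rate 0

# Verify the policy:
bin/pulsar-admin namespaces get-persistence my-tenant/my-namespace
```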
1.4 We opened a ZooKeeper shell and tried to fetch ledger 396606, but could not find it:
```
[zk: localhost:2181(CONNECTED) 5] get /pulsar_dev2/ledgers/00/0
0000 0001 0004 0006 0011 0012 0029 0030 0031 0032 0034 0040 0041 0042 0043 0044 0045 0046 0047 0049 0050 0051 0052 0053 0054 0055 0056 0057 0058 0059
0060 0061 0062 0063 0064 0065 0066 0067 0068 0069 0070 0071 0072 0073 0074 0075 0076 0077 0078 0079 0080 0081 0082 0083 0084 0085 0086 0087 0088 0089
0090 0091 0092 0093 0094 0095 0096 0097 0098 0099 0100 0101 0102 0103 0104 0105 0106 0107 0108 0109 0110 0111 0112 0113 0114 0115 0116 0117 0118 0119
0120 0121 0122 0123 0124 0125 0126 0127 0128 0129 0130 0131 0133 0137 0138 0139 0140 0142 0143 0144
```
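This is consistent with the listing above: assuming the default hierarchical ledger manager layout, ledger 396606 is zero-padded to 0000396606 and split 2/4/4, so its metadata znode would be /pulsar_dev2/ledgers/00/0039/L6606, and no 0039 node appears among the children listed above. A hypothetical check from the ZooKeeper shell:
```
# Ledger 396606 -> 0000396606 -> 00 / 0039 / L6606 (hierarchical layout)
ls /pulsar_dev2/ledgers/00/0039          # fails if the parent node is gone
get /pulsar_dev2/ledgers/00/0039/L6606   # would contain the ledger metadata
```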
2. Our findings
We have four bookies. The bookie process on 10.101.129.75 went down because its disk filled up, and two other bookies went down as well, leaving only one surviving bookie. I don't know why those three bookies went down, because the earlier logs are no longer available. I suspect that the extended outage of the three bookies caused data loss, which in turn is why the bookie cannot be decommissioned.
Below are my tests:
2.1 I listed the ledgers with missing replicas and the machines they correspond to, and found that ledger 396606 was reported missing on all 4 machines. Log: see the link below; a sketch of the command follows.
https://tva1.sinaimg.cn/large/e6c9d24egy1h2smsws1u8j225s0rj1c4.jpg
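The listing can also be narrowed to the faulty bookie with flags like these (exact flag names may vary by BookKeeper version; check 'bin/bookkeeper shell listunderreplicated' usage on yours):
```
# List under-replicated ledgers together with the bookies whose replicas are missing:
bin/bookkeeper shell listunderreplicated -printmissingreplica

# Or restrict the listing to ledgers missing a replica on the faulty bookie:
bin/bookkeeper shell listunderreplicated -missingreplica 10.101.129.75:3181
```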
2.2 I tried to read the ledger with 'bin/bookkeeper shell readledger -ledgerid 396606' and hit an error at entry 867; the read was actually sending a request to the faulty machine 10.101.129.75. Log: see the link below; a sketch of a narrower read follows.
https://tva1.sinaimg.cn/large/e6c9d24egy1h2sn01kac1j21de0u0apv.jpg
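To reproduce just the failing range, a narrower read like this should work (the entry-range flags follow the readledger usage string; verify them against your BookKeeper version):
```
# Read only the entries around the failure point (entry 867):
bin/bookkeeper shell readledger -ledgerid 396606 -firstentryid 860 -lastentryid 870
```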
2.3 I used 'bin/bookkeeper shell ledgermetadata -ledgerid 396606' to inspect the metadata of ledger 396606 and found that the replicas for entries from 845 onward are allocated on the faulty machine 10.101.129.75. Log: see the link below; an illustrative metadata fragment follows.
https://tva1.sinaimg.cn/large/e6c9d24egy1h2sn21dcobj21gr0u0e22.jpg
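For readers without the screenshot, the relevant part of the output is the ensembles map, which records which bookies hold each entry range. An ensemble change at entry 845 that includes the faulty bookie would look roughly like this (all addresses except 10.101.129.75 are made up for illustration):
```
bin/bookkeeper shell ledgermetadata -ledgerid 396606
# Illustrative fragment of the output (not our actual log):
#   ensembles: {
#     0   => [10.101.129.72:3181, 10.101.129.73:3181, 10.101.129.74:3181],
#     845 => [10.101.129.75:3181, 10.101.129.73:3181, 10.101.129.74:3181]
#   }
```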
When we debugged the node decommission, we found that the source code reads the ledgers that are missing replicas, and the 'no entry' error is reported there. See the link below:
https://tva1.sinaimg.cn/large/e6c9d24egy1h2snbxx8xvj212y0np79x.jpg
Some doubts:
I think the data loss happened because the three bookie machines went down within a short window while our cluster parameter journalWriteData is set to false (we deliberately skip the journal write-ahead and accept some data loss). But I have some doubts: why should lost data cause a problem as serious as a machine that cannot be decommissioned? In our later tests we even found that a lost ledger can cause the producer to fail to send data. Perhaps the data-loss case should be considered here, with countermeasures.
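One workaround we are considering, assuming the data in ledger 396606 is confirmed unrecoverable, is to drop the ledger's metadata so that replication and decommission stop retrying it. This permanently discards the ledger, so it is only a sketch for a ledger that is already lost:
```
# DESTRUCTIVE: removes the ledger and its metadata from the cluster.
# Only for a ledger whose data is already confirmed lost.
bin/bookkeeper shell deleteledger -ledgerid 396606 -force
```
After that, 'listunderreplicated' should no longer report the ledger, and the decommission can be retried.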
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org