Posted to commits@pulsar.apache.org by GitBox <gi...@apache.org> on 2022/06/01 04:57:00 UTC

[GitHub] [pulsar] yebai1105 commented on issue #15776: bookie failed to decommission successfully

yebai1105 commented on issue #15776:
URL: https://github.com/apache/pulsar/issues/15776#issuecomment-1143117746

   Below are answers to your questions, followed by our findings.
   1. Your questions:
   1.1 We never ran this command: 'bin/bookkeeper shell simpletest --ensemble 4 --writeQuorum 4 --ackQuorum 4 --numEntries 4'
   1.2 The commands 'bin/bookkeeper shell listunderreplicated' and 'bin/bookkeeper shell decommissionbookie' were executed on the faulty machine, 10.101.129.75.
   1.3 The persistence policies of all namespaces in our cluster are as follows:
   ```
   {
     "bookkeeperEnsemble" : 3,
     "bookkeeperWriteQuorum" : 3,
     "bookkeeperAckQuorum" : 2,
     "managedLedgerMaxMarkDeleteRate" : 0.0
   }
   ```
   1.4 We connected to ZooKeeper and tried to get ledger 396606, but could not find it:
   ```
   [zk: localhost:2181(CONNECTED) 5] get /pulsar_dev2/ledgers/00/0
   0000   0001   0004   0006   0011   0012   0029   0030   0031   0032   0034   0040   0041   0042   0043   0044   0045   0046   0047   0049   0050   0051   0052   0053   0054   0055   0056   0057   0058   0059   
   0060   0061   0062   0063   0064   0065   0066   0067   0068   0069   0070   0071   0072   0073   0074   0075   0076   0077   0078   0079   0080   0081   0082   0083   0084   0085   0086   0087   0088   0089   
   0090   0091   0092   0093   0094   0095   0096   0097   0098   0099   0100   0101   0102   0103   0104   0105   0106   0107   0108   0109   0110   0111   0112   0113   0114   0115   0116   0117   0118   0119   
   0120   0121   0122   0123   0124   0125   0126   0127   0128   0129   0130   0131   0133   0137   0138   0139   0140   0142   0143   0144   
   ```
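To make the lookup in 1.4 easier to follow, here is a minimal sketch of how a ledger ID could map to a znode path. It assumes the cluster uses the hierarchical ledger manager with its 2-4-4 layout (the ID is zero-padded to 10 digits and split into three parts, with an `L` prefix on the last part); that assumption is consistent with the `/pulsar_dev2/ledgers/00/NNNN` children listed above but is not confirmed elsewhere in this thread. The function name is hypothetical.

```python
def hierarchical_ledger_path(ledgers_root: str, ledger_id: int) -> str:
    """Build the znode path for a ledger, assuming a 2-4-4 hierarchical layout."""
    if not 0 <= ledger_id < 10**10:
        # Assumption: only IDs that fit in 10 decimal digits use this layout.
        raise ValueError("ledger_id does not fit the 10-digit layout")
    padded = f"{ledger_id:010d}"  # e.g. 396606 -> "0000396606"
    # Split into 2 + 4 + 4 digits; the leaf znode carries an "L" prefix.
    return f"{ledgers_root}/{padded[:2]}/{padded[2:6]}/L{padded[6:]}"


print(hierarchical_ledger_path("/pulsar_dev2/ledgers", 396606))
```

Under that assumption, ledger 396606 would live below the `0039` child, and `0039` does not appear between `0034` and `0040` in the listing above.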
   2. Our findings
   We have four bookies. The bookie process on 10.101.129.75 went down because its disk was full, and two other bookies also went down, leaving only one surviving bookie. I don't know why these three bookies went down, because I can't find the earlier logs. I suspect that the successive outages of the three bookies caused data loss, which in turn is why the bookie cannot be decommissioned.
   Here is what I tested:
   2.1 I listed the ledgers with missing replicas and the corresponding machines, and found that ledger 396606 was missing on all 4 machines. See the log at the link below:
   https://tva1.sinaimg.cn/large/e6c9d24egy1h2smsws1u8j225s0rj1c4.jpg
   2.2 I used the command 'bin/bookkeeper shell readledger -ledgerid 396606' to read the ledger and got an error when reading entry 867; the request was actually being sent to the faulty machine, 10.101.129.75. See the log at the link below:
   https://tva1.sinaimg.cn/large/e6c9d24egy1h2sn01kac1j21de0u0apv.jpg
   2.3 I used the command 'bin/bookkeeper shell ledgermetadata -ledgerid 396606' to view the metadata of ledger 396606 and found that the replicas starting from entry 845 are allocated on the machine 10.101.129.75. See the log at the link below:
   https://tva1.sinaimg.cn/large/e6c9d24egy1h2sn21dcobj21gr0u0e22.jpg
   When we debugged the decommissioning of the node, we found that the source code reads the ledgers that lack a replica, and the error 'no entry' is reported here:
   https://tva1.sinaimg.cn/large/e6c9d24egy1h2snbxx8xvj212y0np79x.jpg
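Findings 2.2 and 2.3 fit together if ledger metadata is read as a map from "first entry ID of a fragment" to the ensemble serving that fragment. The sketch below illustrates that lookup; it is a simplified model, not BookKeeper's actual code. Only the address 10.101.129.75 and the entry numbers 845 and 867 come from the findings above; the other bookie names and the port are hypothetical placeholders.

```python
from bisect import bisect_right


def ensemble_for_entry(ensembles: dict[int, list[str]], entry_id: int) -> list[str]:
    """Return the ensemble of the fragment that contains entry_id."""
    starts = sorted(ensembles)                 # first entry ID of each fragment
    idx = bisect_right(starts, entry_id) - 1   # last fragment starting at or before entry_id
    if idx < 0:
        raise KeyError(f"no fragment covers entry {entry_id}")
    return ensembles[starts[idx]]


# Hypothetical metadata: the fragment starting at entry 845 includes the faulty bookie.
ensembles = {
    0: ["bookie-a:3181", "bookie-b:3181", "bookie-c:3181"],
    845: ["10.101.129.75:3181", "bookie-b:3181", "bookie-c:3181"],
}

print(ensemble_for_entry(ensembles, 867))
```

In this model, any read of an entry at or after 845, such as entry 867, is directed at an ensemble that contains 10.101.129.75.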
   
   Some doubts:
   I think the data loss happened because the three bookie machines went down within a short time and our cluster parameter journalWriteData is set to false (we don't want write-ahead journaling of data, and we accept some data loss). But I don't understand why losing data causes a problem so serious that the machine cannot be decommissioned. In our later tests, the loss of the ledger even prevented the producer from sending data. Perhaps data loss should be treated as an expected situation here, with countermeasures in place.
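As a rough illustration of why three outages can exceed what the persistence policy in 1.3 protects against, the sketch below models the quorum arithmetic. It is a simplified model under stated assumptions: entries are striped round-robin over the ensemble, and an acknowledged entry is only guaranteed to exist on as many bookies as the ack quorum. The function names are hypothetical, and with journalWriteData set to false an acknowledgement may be an even weaker durability guarantee than this model suggests.

```python
def write_set(entry_id: int, ensemble_size: int, write_quorum: int) -> list[int]:
    """Ensemble indices that receive this entry, assuming round-robin striping."""
    return [(entry_id + i) % ensemble_size for i in range(write_quorum)]


def guaranteed_survivable_failures(ack_quorum: int) -> int:
    """Bookie losses an acknowledged entry is guaranteed to survive.

    An acknowledged entry is only guaranteed on ack_quorum bookies, so in the
    worst case it survives the loss of ack_quorum - 1 of them.
    """
    return ack_quorum - 1


# Values from the persistence policy in 1.3.
ensemble_size, write_quorum, ack_quorum = 3, 3, 2
failed_bookies = 3  # bookies that went down, per the findings above

print(write_set(867, ensemble_size, write_quorum))
print(failed_bookies > guaranteed_survivable_failures(ack_quorum))
```

With ensemble 3 and write quorum 3, every entry targets all three bookies of its ensemble, but with ack quorum 2 only a single bookie failure is covered in the worst case.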


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org