Posted to commits@pulsar.apache.org by "ThomasTaketurns (via GitHub)" <gi...@apache.org> on 2024/01/29 14:19:40 UTC

[I] [Bug] Bookkeeper autorecovery ReplicationWorker not rereplicating under replicated ledgers. [pulsar]

ThomasTaketurns opened a new issue, #21987:
URL: https://github.com/apache/pulsar/issues/21987

   ### Search before asking
   
   - [X] I searched in the [issues](https://github.com/apache/pulsar/issues) and found nothing similar.
   
   
   ### Version
   
   3.0.0
   
   ### Minimal reproduce step
   
   - Stop a bookie
   - Some ledgers become under-replicated, as shown in the pulsar-recovery-0 logs
   - The ledgers are not rereplicated
   
   ### What did you expect to see?
   
   I would expect the ReplicationWorker thread on the pulsar-recovery-0 pod to trigger rereplication of the under-replicated ledgers.
   
   ### What did you see instead?
   
   I do not see any log from the ReplicationWorker thread.
   I can see the Auditor thread being able to list under replicated ledgers though.
   
   
   ### Anything else?
   
   Please find attached the logs from taketurns-pulsar-recovery-0 when I stop a bookie.
   [recoveryLogs.txt](https://github.com/apache/pulsar/files/14085300/recoveryLogs.txt)
   
   Please also notice that the recovery process is triggered when I redeploy taketurns-pulsar-recovery-0 on my K8S cluster.
   
   2024-01-29T13:34:29,525+0000 [ReplicationWorker] INFO  org.apache.bookkeeper.replication.ReplicationWorker - Ledger replicated successfully. ledger id is: 548080
   
   I may be missing some configuration here.
   I would like to understand why the ReplicationWorker thread does not start rereplicating ledgers once the auditor has identified some as under-replicated, and how I could make this automatic without needing to restart the recovery pod.
   
   Thanks for your precious help,
   
   Thomas @ Taketurns
   
   ### Are you willing to submit a PR?
   
   - [ ] I'm willing to submit a PR!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] [Bug] Bookkeeper autorecovery ReplicationWorker not rereplicating under replicated ledgers. [pulsar]

Posted by "Technoboy- (via GitHub)" <gi...@apache.org>.
Technoboy- commented on issue #21987:
URL: https://github.com/apache/pulsar/issues/21987#issuecomment-1914924090

   @horizonzy Could you help with this issue?




Re: [I] [Bug] Bookkeeper autorecovery ReplicationWorker not rereplicating under replicated ledgers. [pulsar]

Posted by "horizonzy (via GitHub)" <gi...@apache.org>.
horizonzy commented on issue #21987:
URL: https://github.com/apache/pulsar/issues/21987#issuecomment-1916048610

   @ThomasTaketurns Hello, Thomas. Could you set the log level to debug and upload more logs? That would help us locate the problem.
   We recently found two deadlocks related to AutoRecovery (#21159, #21010). I'm not sure which Pulsar version you use, or whether your problem is related to those issues.
   
   Could you upgrade AutoRecovery to the newest version and test again? Remember to set the log level to debug. Thanks.
   




Re: [I] [Bug] Bookkeeper autorecovery ReplicationWorker not rereplicating under replicated ledgers. [pulsar]

Posted by "horizonzy (via GitHub)" <gi...@apache.org>.
horizonzy commented on issue #21987:
URL: https://github.com/apache/pulsar/issues/21987#issuecomment-1948807278

   @ThomasTaketurns Hi, Thomas. I reproduced the problem you described. It is indeed a bug, and we have already fixed it on the BookKeeper side: https://github.com/apache/bookkeeper/pull/4058.
   
   This discussion explains why the replication worker didn't work:
    https://lists.apache.org/thread/1xl3hr2cpyd5xh9kozbx5xlfsjsg3f4h
   
   BookKeeper 4.16.3 contains the fix, and Pulsar 3.0.2 upgraded its BookKeeper version to 4.16.3, so you can upgrade Pulsar to 3.0.2.
   




Re: [I] [Bug] Bookkeeper autorecovery ReplicationWorker not rereplicating under replicated ledgers. [pulsar]

Posted by "horizonzy (via GitHub)" <gi...@apache.org>.
horizonzy commented on issue #21987:
URL: https://github.com/apache/pulsar/issues/21987#issuecomment-1936146542

   Hi, Thomas. Sorry for the late reply.
   I checked the recovery JVM stack and found that the ReplicationWorker hadn't found any under-replicated ledger; it is waiting for incoming under-replicated ledgers. The metadata store executor isn't blocked, so if the Auditor finds an under-replicated ledger, the ReplicationWorker should be notified and then fetch that ledger to replicate it.
   
   But your listunderreplicated result shows 1885 under-replicated ledgers, so I guess the ReplicationWorker is using a wrong ZK URL and therefore can't find them. Could you check that and provide a heap dump file? Thanks.
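
A quick way to run the same kind of check on a thread dump is sketched below. It is only an illustration: it assumes the dump is in the usual HotSpot `jstack` text layout (a quoted thread name on the header line, followed by a `java.lang.Thread.State:` line), and the helper name `thread_state` is made up for this note rather than taken from Pulsar or BookKeeper.

```python
import re
from typing import Optional


def thread_state(dump: str, thread_name: str) -> Optional[str]:
    """Return the Thread.State of the first thread whose name contains thread_name.

    Assumes jstack-style text: a header line starting with the quoted thread
    name, with the state line within the next two lines. Returns None if the
    thread or its state line is not found.
    """
    lines = dump.splitlines()
    for i, line in enumerate(lines):
        # Thread headers start with the thread name in double quotes.
        if line.startswith('"') and thread_name in line.split('"')[1]:
            for follow in lines[i + 1:i + 3]:
                m = re.search(r"java\.lang\.Thread\.State:\s*(\S+)", follow)
                if m:
                    return m.group(1)
            return None
    return None
```

For example, `thread_state(open("td00.txt").read(), "ReplicationWorker")` could be run on a saved dump such as the one attached to this issue. A `WAITING` or `TIMED_WAITING` result would be consistent with a worker that is idle and waiting for work, while `BLOCKED` would point toward lock contention instead.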




Re: [I] [Bug] Bookkeeper autorecovery ReplicationWorker not rereplicating under replicated ledgers. [pulsar]

Posted by "ThomasTaketurns (via GitHub)" <gi...@apache.org>.
ThomasTaketurns commented on issue #21987:
URL: https://github.com/apache/pulsar/issues/21987#issuecomment-1966074531

   Hi @horizonzy ,
   
   I can confirm that I can no longer reproduce the issue with Pulsar 3.0.2.
   
   Thanks again,
   
   Thomas




Re: [I] [Bug] Bookkeeper autorecovery ReplicationWorker not rereplicating under replicated ledgers. [pulsar]

Posted by "dao-jun (via GitHub)" <gi...@apache.org>.
dao-jun closed issue #21987: [Bug] Bookkeeper autorecovery ReplicationWorker not rereplicating under replicated ledgers.
URL: https://github.com/apache/pulsar/issues/21987




Re: [I] [Bug] Bookkeeper autorecovery ReplicationWorker not rereplicating under replicated ledgers. [pulsar]

Posted by "ThomasTaketurns (via GitHub)" <gi...@apache.org>.
ThomasTaketurns commented on issue #21987:
URL: https://github.com/apache/pulsar/issues/21987#issuecomment-1943314893

   Hi @horizonzy , 
   
   Thanks for your reply.
   
   I reproduced the issue again.
   
   - Here are the logs of pulsar-recovery pod when I delete a bookie from my cluster.
   [auditorLogs.txt](https://github.com/apache/pulsar/files/14277091/auditorLogs.txt)
   
   - As well as a thread dump
   [tdump.txt](https://github.com/apache/pulsar/files/14277149/tdump.txt)
   
   - And a heap dump of the jvm
   [application_heap_dump.zip](https://github.com/apache/pulsar/files/14277156/application_heap_dump.zip)
   
   From what I see in the heap dump, there does not seem to be any issue with the ZK URL.
   
   - I also attach the bookkeeper.conf file used by the recovery pod.
   [bookkeeper.conf.txt](https://github.com/apache/pulsar/files/14277198/bookkeeper.conf.txt)
   
   Thanks for your help,
   
   Thomas @ TakeTurns
   
   




Re: [I] [Bug] Bookkeeper autorecovery ReplicationWorker not rereplicating under replicated ledgers. [pulsar]

Posted by "ThomasTaketurns (via GitHub)" <gi...@apache.org>.
ThomasTaketurns commented on issue #21987:
URL: https://github.com/apache/pulsar/issues/21987#issuecomment-1963528487

   Hi @horizonzy ,
   
   Thank you for your help !




Re: [I] [Bug] Bookkeeper autorecovery ReplicationWorker not rereplicating under replicated ledgers. [pulsar]

Posted by "ThomasTaketurns (via GitHub)" <gi...@apache.org>.
ThomasTaketurns commented on issue #21987:
URL: https://github.com/apache/pulsar/issues/21987#issuecomment-1916768936

   Hello @horizonzy ,
   
   Thanks for your quick answer.
   
   Here is what I just did : 
   
   1.  I set the recovery pod's log level via an env variable in my Helm chart template autorecovery-statefulset.yaml:
   
           env:
             - name: BOOKIE_EXTRA_OPTS
               value: "-Dpulsar.log.level=debug -Dpulsar.log.root.level=debug"
   
   2. I redeployed the statefulset and can see the DEBUG logs. At this step, the ReplicationWorker is working as I would expect.
   
   3. Then I verified that I do not have any underreplicated ledger.
   
   ./bin/bookkeeper shell listunderreplicated
   2024-01-30T11:44:22,039+0000 [main] INFO org.apache.bookkeeper.tools.cli.commands.autorecovery.ListUnderReplicatedCommand - Under replicated ledger count: 0
   
   4.  When I kill one bookie, I can see the auditor identifying newly under replicated ledgers.
   
   [auditorLogs.txt](https://github.com/apache/pulsar/files/14097243/auditorLogs.txt)
   
   ./bin/bookkeeper shell listunderreplicated
   2024-01-30T11:58:39,785+0000 [main] INFO  org.apache.bookkeeper.tools.cli.commands.autorecovery.ListUnderReplicatedCommand - Under replicated ledger count: 1885
   
   
   5.  Nothing seems to happen on the ReplicationWorker side; I cannot find any log from that thread. I will attach the other logs from the recovery pod though.
   
   [recoveryPodLogs.txt](https://github.com/apache/pulsar/files/14097593/recoveryPodLogs.txt)
   
   6. I also attach a thread dump of the recovery JVM.
   
   [td00.txt](https://github.com/apache/pulsar/files/14097645/td00.txt)
   
   Pulsar version : 3.0.0
   
   It is complicated to upgrade to the latest version for now, since there are other people working on the same K8S cluster I use for my tests. If we cannot investigate further, I will need to find a way to deploy a brand new cluster.
   
   Sincerely,
   
   Thomas @ TakeTurns
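
The before/after comparison in steps 3 and 4 above can be scripted. The sketch below is a hedged illustration, not part of BookKeeper: it assumes the `listunderreplicated` summary line keeps the wording quoted in this comment, and the helper names `underreplicated_count` and `is_stalled` are hypothetical.

```python
import re
from typing import List, Optional

# Matches the summary line printed by `bookkeeper shell listunderreplicated`,
# using the wording quoted in this thread (an assumption for other versions).
_COUNT_RE = re.compile(r"Under replicated ledger count:\s*(\d+)")


def underreplicated_count(output: str) -> Optional[int]:
    """Return the last under-replicated ledger count found in the output, or None."""
    matches = _COUNT_RE.findall(output)
    return int(matches[-1]) if matches else None


def is_stalled(counts: List[int]) -> bool:
    """True if the latest count is non-zero and not lower than the first sample."""
    return len(counts) >= 2 and counts[-1] > 0 and counts[-1] >= counts[0]
```

Collecting the count a few minutes apart and passing the samples to `is_stalled` gives a rough signal: a count that stays at the same non-zero value suggests that rereplication has not started, while a falling count suggests the ReplicationWorker is making progress.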

