Posted to commits@pulsar.apache.org by GitBox <gi...@apache.org> on 2021/01/22 16:38:13 UTC

[GitHub] [pulsar] srouthu1 opened a new issue #9288: Consumer getting stuck when Qw bookies down though publisher continued to write to other available bookies

srouthu1 opened a new issue #9288:
URL: https://github.com/apache/pulsar/issues/9288


   #### Expected behavior
   We have 6 bookies, with Ensemble=3, Qw=2, Qa=2. While the consumer was consuming messages, two bookies that were part of the write quorum (Qw) were brought down.
   The broker should dispatch to the consumer the messages that are available in the new ensemble. If the bookies come back up after some time, the broker should then dispatch the messages stored on those bookies, as the consumer can't afford message loss.
   
   #### Actual behavior
   The publisher is able to continue because a new ensemble is formed from the other available bookies. But the consumer got stuck indefinitely, waiting for messages stored on the bookies that are down. We have autoSkipNonRecoverableData=true, which did not help. We also restarted the owner broker, but that did not help either.
   The consumer resumed when we brought back the bookies holding the messages.
   The consumer also resumed when we ran the reset-cursor command, but this is not a feasible solution with thousands of topics.
   
   #### Steps to reproduce
   Ensemble=3, Qw=2, Qa=2. Total bookies=5 or 6.
   Continuously publish to and consume from a topic. Ensure the consumer is slower than the publisher.
   Bring down 2 bookies at the same time.
   Verify whether the consumer is able to resume consumption.
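
To illustrate why only part of the backlog becomes unreadable in this setup, here is a minimal sketch of round-robin striping. It assumes entry `i` of a ledger is written to Qw consecutive bookies of the ensemble, starting at index `i mod E`. The bookie names and both helper functions are hypothetical; this is a model for reasoning, not BookKeeper code.

```python
def write_set(entry_id, ensemble, write_quorum):
    """Bookies that hold a copy of entry_id under round-robin striping."""
    size = len(ensemble)
    return [ensemble[(entry_id + k) % size] for k in range(write_quorum)]


def unreadable_entries(num_entries, ensemble, write_quorum, down):
    """Entry ids whose every replica sits on a bookie that is down."""
    return [
        e for e in range(num_entries)
        if all(b in down for b in write_set(e, ensemble, write_quorum))
    ]


ensemble = ["bookie-0", "bookie-1", "bookie-2"]  # hypothetical names, E=3
down = {"bookie-0", "bookie-1"}                  # two bookies stopped together

lost = unreadable_entries(9, ensemble, write_quorum=2, down=down)
print(lost)  # [0, 3, 6]
```

Under these assumptions, every third entry has both replicas on the two stopped bookies. A reader that must deliver entries in order would stall at the first such entry even though later entries are still readable, which is consistent with the stall described above.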
   
   #### System configuration
   Pulsar version: 2.6
   We are running on AWS. 
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] codelipenghui commented on issue #9288: Consumer getting stuck when Qw bookies down though publisher continued to write to other available bookies

Posted by GitBox <gi...@apache.org>.
codelipenghui commented on issue #9288:
URL: https://github.com/apache/pulsar/issues/9288#issuecomment-767303356


   Currently, `autoSkipNonRecoverableData=true` only skips cases where the broker can't read data from a ledger, such as a LedgerNotFoundException or LedgerMetadataDoesNotExists exception. When multiple bookies are shut down, the broker instead encounters LedgerHandleNotAvaliableException, which `autoSkipNonRecoverableData=true` does not handle.
   
   From the Pulsar perspective, crashed bookies do not prove that the data is unrecoverable, whereas the specific LedgerNotFoundException or LedgerMetadataDoesNotExists errors do. So `autoSkipNonRecoverableData=true` only skips cases where the data truly can't be recovered.
   
   To let consumers continue when multiple bookies in the cluster are unavailable, we should introduce a new config, `autoSkipNonAvailableBookies`. What do you think, @merlimat @sijie @jiazhai?
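
The distinction drawn in this comment can be sketched as a small decision function. This is only an illustration of the proposal, not broker code: the error names are copied as written in the comment, `should_skip` and its parameters are hypothetical, and `autoSkipNonAvailableBookies` is the proposed setting, which does not exist in Pulsar.

```python
# Error names as written in the comment above; treated here as plain labels.
NON_RECOVERABLE = {"LedgerNotFoundException", "LedgerMetadataDoesNotExists"}
BOOKIES_UNAVAILABLE = {"LedgerHandleNotAvaliableException"}


def should_skip(error_name,
                auto_skip_non_recoverable_data=False,
                auto_skip_non_available_bookies=False):
    """Return True if a failed read should be skipped instead of retried."""
    if error_name in NON_RECOVERABLE:
        # The data is known to be gone, so the existing flag applies.
        return auto_skip_non_recoverable_data
    if error_name in BOOKIES_UNAVAILABLE:
        # The data may come back with the bookies; only the proposed flag
        # would allow skipping here.
        return auto_skip_non_available_bookies
    return False


# With only the existing flag set, an unavailable-bookie error is not skipped.
print(should_skip("LedgerHandleNotAvaliableException",
                  auto_skip_non_recoverable_data=True))
```

Keeping the two flags separate means that enabling the existing one never causes data on temporarily unavailable bookies to be skipped.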
   





[GitHub] [pulsar] rdhabalia commented on issue #9288: Consumer getting stuck when Qw bookies down though publisher continued to write to other available bookies

Posted by GitBox <gi...@apache.org>.
rdhabalia commented on issue #9288:
URL: https://github.com/apache/pulsar/issues/9288#issuecomment-772902471


   If none of the replicas are available and the data is not recoverable, then the user can always skip the messages on that cursor using the skip-message admin API. I think handling such data loss manually (using the admin API) is better than skipping it automatically without involving the user.
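
For clusters with many affected topics, the manual step could be scripted. The sketch below only builds command lines and does not run them. It assumes the `pulsar-admin topics skip` command with `--count` and `--subscription` options; check these against your Pulsar version before use. The topic names, subscription name, and count are hypothetical.

```python
import shlex


def skip_commands(topics, subscription, count):
    """Build one `pulsar-admin topics skip` command line per topic."""
    return [
        shlex.join([
            "pulsar-admin", "topics", "skip",
            "--count", str(count),
            "--subscription", subscription,
            topic,
        ])
        for topic in topics
    ]


# Hypothetical topic list; in practice this would come from an inventory.
topics = [f"persistent://public/default/topic-{i}" for i in range(3)]
for line in skip_commands(topics, "my-sub", 100):
    print(line)
```

Printing the commands first leaves an operator in the loop to review them, which matches the preference for manual handling expressed in this comment.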





[GitHub] [pulsar] codelipenghui closed issue #9288: Consumer getting stuck when Qw bookies down though publisher continued to write to other available bookies

Posted by GitBox <gi...@apache.org>.
codelipenghui closed issue #9288:
URL: https://github.com/apache/pulsar/issues/9288


   





[GitHub] [pulsar] codelipenghui commented on issue #9288: Consumer getting stuck when Qw bookies down though publisher continued to write to other available bookies

Posted by GitBox <gi...@apache.org>.
codelipenghui commented on issue #9288:
URL: https://github.com/apache/pulsar/issues/9288#issuecomment-788921288


   @rdhabalia Makes sense. I will close this issue for now.

