Posted to commits@pulsar.apache.org by GitBox <gi...@apache.org> on 2019/06/21 18:21:52 UTC

[GitHub] [pulsar] sschepens edited a comment on issue #981: Cannot connect to a topic because of ManagedLedgerException

sschepens edited a comment on issue #981: Cannot connect to a topic because of ManagedLedgerException
URL: https://github.com/apache/pulsar/issues/981#issuecomment-504523616
 
 
   @sijie we've been experiencing this too. We're running the same Pulsar version as the one reported, but with BookKeeper 4.7.2.
   
   We're seeing these logs on the broker side:
   ```
   2019-06-20 18:49:15,644 - [level:ERROR] [class:ManagedLedgerImpl$1] [line:261] [thread:bookkeeper-ml-workers-33-1] - [TOPIC] Failed to open ledger 16956902: Error while recovering ledger
   2019-06-20 18:49:15,642 - [level:ERROR] [class:ReadLastConfirmedOp] [line:127] 2019-06-20 18:49:17,922 - [level:ERROR] [class:ReadLastConfirmedOp] [line:127] [thread:BookKeeperClientWorker-20-1] - While readLastConfirmed ledger: 16956902 did not hear success responses from all quorums
   ```
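
To list which ledgers are affected, the IDs can be pulled out of log lines like the ones above. This is a minimal sketch, assuming the logs are available as plain strings; the helper name `failing_ledger_ids` and the regular expression are illustrative and not part of Pulsar or BookKeeper, and the INFO line in the sample is made up.

```python
import re

# Matches "ledger 16956902" and "ledger: 16956902", case-insensitively.
# The pattern is an assumption based only on the log lines quoted above.
LEDGER_ID = re.compile(r"ledger:?\s+(\d+)", re.IGNORECASE)


def failing_ledger_ids(lines):
    """Return sorted, de-duplicated ledger IDs found in ERROR or WARN lines."""
    ids = set()
    for line in lines:
        # Skip lines that are neither errors nor warnings.
        if "ERROR" not in line and "WARN" not in line:
            continue
        ids.update(int(match) for match in LEDGER_ID.findall(line))
    return sorted(ids)


sample = [
    # Broker line copied from the logs above.
    "2019-06-20 18:49:15,644 - [level:ERROR] [class:ManagedLedgerImpl$1] "
    "[line:261] [thread:bookkeeper-ml-workers-33-1] - [TOPIC] Failed to open "
    "ledger 16956902: Error while recovering ledger",
    # Hypothetical INFO line, included only to show that it is ignored.
    "2019-06-20 18:49:16,000 - [level:INFO] opened ledger 42",
]

affected = failing_ledger_ids(sample)
print(affected)
```

The resulting list can then be checked ledger by ledger against the ledger metadata.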
   
   On the BookKeeper side we don't see much, only these logs:
   ```
   18:49:12.240 [BookieHighPriorityThread-3181-OrderedExecutor-6-0] WARN  org.apache.bookkeeper.proto.ReadEntryProcessor - Ledger: 16956902  fenced by: /10.64.165.192:35262
   ```
   We have ensembleSize: 3, writeQuorum: 3, ackQuorum: 3.
   When we inspect these ledgers, they appear as `length: 0, lastEntryId: -1, state: IN_RECOVERY`.
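
The stuck pattern described above can be expressed as a simple filter. This is a minimal sketch, assuming the ledger metadata has already been exported into dictionaries keyed by ledger ID; the field names mirror the values quoted above, while the helper `is_stuck_in_recovery` and the second sample ledger are hypothetical.

```python
def is_stuck_in_recovery(meta):
    """True if the metadata matches the empty, never-recovered pattern."""
    return (
        meta.get("state") == "IN_RECOVERY"
        and meta.get("length") == 0
        and meta.get("lastEntryId") == -1
    )


ledgers = {
    # Values as reported for the failing ledger.
    16956902: {"length": 0, "lastEntryId": -1, "state": "IN_RECOVERY"},
    # Hypothetical healthy ledger, for contrast.
    1: {"length": 1024, "lastEntryId": 9, "state": "CLOSED"},
}

stuck = sorted(lid for lid, meta in ledgers.items() if is_stuck_in_recovery(meta))
print(stuck)
```

How the metadata is obtained is deliberately left out, since that depends on the tooling in use.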
   
   This seems to happen when our bookies are highly loaded or flapping, but we don't see the ledgers recover on their own.
   I assume some operation failed initially and that ledger recovery is now failing as well.
   
   A couple of questions:
   - Is this normal? Should these errors recover automatically?
   - Do we need all 3 bookies in the ensemble up and running to recover from these errors? If so, what happens if we lose one bookie forever?
   - On the broker side, is there anything we can do to handle these errors better? We're left with a bunch of consumers or topics that are unable to open.
   
   Edit: I should add that we sometimes see these ledgers with `length: 0, lastEntryId: -1` stuck as underreplicated and never recovering. Once they are deleted, the replication process continues without issues.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services