Posted to dev@bookkeeper.apache.org by "Ivan Kelly (JIRA)" <ji...@apache.org> on 2012/12/13 17:32:13 UTC
[jira] [Comment Edited] (BOOKKEEPER-355) Ledger recovery will mark
ledger as closed with -1, in case of slow bookie is added to ensemble
during recovery add
[ https://issues.apache.org/jira/browse/BOOKKEEPER-355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13531155#comment-13531155 ]
Ivan Kelly edited comment on BOOKKEEPER-355 at 12/13/12 4:30 PM:
-----------------------------------------------------------------
BOOKKEEPER-356 fixed this problem in one way, but it's still possible to hit the root cause, and I've included a test case that does hit it (testLedgerRecoveryWithRollingRestart). It's a corner case, but one that will kill your ledger.
The buggy sequence of events is:
You have a ledger with ensemble (B1, B2, B3)
# B1 brought down for maintenance
# Ledger recovery started
# B2 answers read last confirmed.
# B1 replaced in ensemble by B4
# Write to B4 fails for some reason
# B1 comes back up.
# B2 goes down for maintenance.
# Ledger recovery starts (ledger is now unavailable)
The core of the issue is that recovery updates the ensemble for a ledger before writing anything. Recovery only needs to update ensembles when closing the ledger, and the patch does exactly that.
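The sequence above can be replayed in a small toy model. This is only a sketch under stated assumptions, not BookKeeper code: the names `usable_bookies` and `run_scenario` are hypothetical, bookies are plain strings, and real quorum rules are not modelled. It just counts which bookies in the recorded ensemble are both reachable and hold the ledger's entries when the second recovery starts.

```python
def usable_bookies(ensemble, up, has_data):
    """Bookies in the recorded ensemble that are reachable and hold the ledger's entries."""
    return [b for b in ensemble if b in up and b in has_data]


def run_scenario(update_ensemble_before_write):
    """Replay the rolling-restart sequence (toy model, hypothetical names)."""
    ensemble = ["B1", "B2", "B3"]      # ensemble recorded in the ledger metadata
    has_data = {"B1", "B2", "B3"}      # bookies that actually store the entries
    up = {"B1", "B2", "B3"}            # bookies currently reachable

    up.discard("B1")                   # B1 brought down for maintenance
    # First recovery starts; B2 answers read last confirmed; B1 is to be replaced by B4.
    if update_ensemble_before_write:
        ensemble[0] = "B4"             # metadata updated before any write is acknowledged
    # The write to B4 fails, so B4 never stores the entries and this recovery aborts.
    # If the ensemble is only updated when the ledger is closed, the metadata is untouched.
    up.add("B1")                       # B1 comes back up
    up.discard("B2")                   # B2 goes down for maintenance
    # Second recovery starts with whatever the metadata says.
    return ensemble, usable_bookies(ensemble, up, has_data)


print(run_scenario(update_ensemble_before_write=True))
print(run_scenario(update_ensemble_before_write=False))
```

In this model, updating the ensemble up front leaves the second recovery with only B3 holding data, because the metadata now points at B4 instead of B1. Deferring the update keeps both B1 and B3 usable.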
> Ledger recovery will mark ledger as closed with -1, in case of slow bookie is added to ensemble during recovery add
> --------------------------------------------------------------------------------------------------------------------
>
> Key: BOOKKEEPER-355
> URL: https://issues.apache.org/jira/browse/BOOKKEEPER-355
> Project: Bookkeeper
> Issue Type: Bug
> Components: bookkeeper-server
> Affects Versions: 4.1.0, 4.2.0
> Reporter: Vinay
> Assignee: Vinay
> Fix For: 4.2.0
>
> Attachments: 0001-BOOKKEEPER-355-Ledger-recovery-will-mark-ledger-as-c.patch, BOOKKEEPER-355.patch, BOOKKEEPER-355.patch
>
>
> Scenario:
> ------------
> 1. A ledger is created with ensemble size and quorum size both 2, and one entry is written.
> 2. The first bookie in the ensemble is brought down.
> 3. Another client fences the ledger and tries to recover it.
> 4. During this recovery an ensemble change happens and a new bookie is added, but a connection to this bookie cannot be established.
> 5. This recovery fails.
> 7. The previously added bookie now comes up.
> 8. Another client tries to recover the same ledger.
> 9. Since the new bookie is first in the ensemble, doRecoveryRead() reads from that bookie, gets NoSuchLedgerException, and closes the ledger with -1,
> i.e. it marks the ledger as empty, even though the first client had successfully written one entry.
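The failure in step 9 of the scenario can be illustrated with a deliberately simplified model. This is a sketch only: the source does not show the real doRecoveryRead() logic, so `read_last_entry`, `naive_recover`, the `NO_SUCH_LEDGER` marker and the bookie names A, B, C are all hypothetical stand-ins. It assumes a recovery that consults only the first bookie of the recorded ensemble and treats "no such ledger" as "ledger is empty".

```python
NO_SUCH_LEDGER = object()  # stand-in for the bookie's "no such ledger" error


def read_last_entry(bookie_store):
    """Return the highest entry id stored on a bookie, or NO_SUCH_LEDGER if it holds nothing."""
    if not bookie_store:
        return NO_SUCH_LEDGER
    return max(bookie_store)


def naive_recover(ensemble, stores):
    """Ask only the first bookie; treat 'no such ledger' as an empty ledger (last entry -1)."""
    answer = read_last_entry(stores[ensemble[0]])
    return -1 if answer is NO_SUCH_LEDGER else answer


# The ledger was written to A and B with one entry (id 0). A went down and was
# replaced in the metadata by C, which never received the entry.
stores = {"A": {0: b"entry-0"}, "B": {0: b"entry-0"}, "C": {}}
ensemble_after_change = ["C", "B"]

print(naive_recover(ensemble_after_change, stores))  # closes with -1
print(read_last_entry(stores["B"]))                  # yet B still holds entry 0
```

Under these assumptions the ledger is closed as empty even though B, the second bookie in the ensemble, still stores the entry the first client wrote.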