Posted to dev@bookkeeper.apache.org by "Ivan Kelly (JIRA)" <ji...@apache.org> on 2014/04/07 16:22:20 UTC

[jira] [Comment Edited] (BOOKKEEPER-742) Fix for empty ledgers losing quorum.

    [ https://issues.apache.org/jira/browse/BOOKKEEPER-742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13961871#comment-13961871 ] 

Ivan Kelly edited comment on BOOKKEEPER-742 at 4/7/14 2:22 PM:
---------------------------------------------------------------

{quote}
So in an ideal cluster it's not expected to see multiple failures at the same time. If multiple failures happen, it would again enter the loop, as opening the ledger wouldn't succeed.
{quote}
The problem before this patch is that no corrective action is taken after the first failure occurs. Then, when the second failure does occur, we get into the bad situation. The two failures may be weeks or months apart. In fact, these weren't even failures, but boxes being taken out of rotation. The whole process stalled because it looked like a ledger had been lost.
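
To make the sequence above concrete, here is a minimal sketch in Java. It is a toy model, not BookKeeper code: the class and method names are hypothetical, "quorum" is simplified to a minimum count of live bookies, and the corrective action is assumed to be restoring the ensemble to full strength after each loss (the actual patch may do this differently).

```java
// Toy model only. All names here are hypothetical and not part of BookKeeper.
class EmptyLedgerQuorumSketch {

    /**
     * Returns how many bookies of the ensemble are still live after
     * `losses` bookies have gone away one at a time.
     */
    static int liveAfterLosses(int ensembleSize, int losses, boolean repairAfterEachLoss) {
        int live = ensembleSize;
        for (int i = 0; i < losses; i++) {
            live--; // a bookie crashes or is taken out of rotation
            if (repairAfterEachLoss) {
                // Assumed corrective action: a replacement bookie joins the
                // ensemble, even though the empty ledger has no entries to copy.
                live++;
            }
        }
        return live;
    }

    /** Simplified rule: the ledger can be opened if enough bookies are live. */
    static boolean canOpen(int liveBookies, int quorum) {
        return liveBookies >= quorum;
    }

    public static void main(String[] args) {
        int ensembleSize = 3;
        int quorum = 2;

        // No repair after the first loss: the second loss leaves one live bookie.
        int unrepaired = liveAfterLosses(ensembleSize, 2, false);
        System.out.println("without repair: live=" + unrepaired
                + " canOpen=" + canOpen(unrepaired, quorum));

        // Repair after each loss: the ensemble is back to full strength each time.
        int repaired = liveAfterLosses(ensembleSize, 2, true);
        System.out.println("with repair: live=" + repaired
                + " canOpen=" + canOpen(repaired, quorum));
    }
}
```

The point of the sketch is that the two losses do not need to be simultaneous. Without a repair step in between, the gap between them can be arbitrarily long and the outcome is the same.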

{quote}
 2) I fail to see the reason why the following sync block is removed in the patch?
{quote}
Because it serves no purpose. What resource is the sync block protecting (rather, the two sync blocks; the method is also synchronized)?
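
As a general illustration of why such a block can be redundant, here is a small Java sketch. It is not the code from the patch: the class below is hypothetical, and it assumes the removed blocks locked on the same monitor as the synchronized method (if they locked on a different object, this reasoning would not apply). Java's intrinsic locks are reentrant, so a nested block on a monitor the thread already holds adds no protection.

```java
// Hypothetical class for illustration; not taken from the BookKeeper patch.
class RedundantSyncSketch {
    private int counter = 0;

    /** A synchronized method already holds this object's monitor on entry. */
    synchronized boolean lockHeldOnEntry() {
        return Thread.holdsLock(this);
    }

    /** Nested block on the same monitor: legal, but it protects nothing extra. */
    synchronized int incrementWithNestedBlock() {
        synchronized (this) { // re-enters the monitor the method already holds
            counter++;
        }
        return counter;
    }

    /** Equivalent behaviour without the inner block. */
    synchronized int incrementPlain() {
        counter++;
        return counter;
    }

    public static void main(String[] args) {
        RedundantSyncSketch s = new RedundantSyncSketch();
        System.out.println("lock held inside synchronized method: " + s.lockHeldOnEntry());
        System.out.println("after nested-block increment: " + s.incrementWithNestedBlock());
        System.out.println("after plain increment: " + s.incrementPlain());
    }
}
```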

I'll take a look at BOOKKEEPER-733.


> Fix for empty ledgers losing quorum.
> ------------------------------------
>
>                 Key: BOOKKEEPER-742
>                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-742
>             Project: Bookkeeper
>          Issue Type: Bug
>          Components: bookkeeper-auto-recovery
>            Reporter: Ivan Kelly
>            Assignee: Ivan Kelly
>             Fix For: 4.3.0, 4.2.3
>
>         Attachments: 0001-Fix-for-empty-ledgers-using-quorum.trunk.patch, 0003-Fix-for-empty-ledgers-using-quorum.branch4.2.patch
>
>
> If a ledger is open and empty, when a bookie in the ensemble crashes, no recovery will take place (because there's nothing to recover). This open, empty, unrepaired ledger can persist for a long time. If it loses another bookie, it can lose quorum. At this point it's impossible for the bookie to know that it's an empty ledger, and the admin gets notified of missing data.


