Posted to dev@bookkeeper.apache.org by "Sijie Guo (JIRA)" <ji...@apache.org> on 2013/03/12 07:07:13 UTC

[jira] [Created] (BOOKKEEPER-584) Data loss when ledger metadata is overwritten

Sijie Guo created BOOKKEEPER-584:
------------------------------------

             Summary: Data loss when ledger metadata is overwritten
                 Key: BOOKKEEPER-584
                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-584
             Project: Bookkeeper
          Issue Type: Bug
          Components: bookkeeper-client
    Affects Versions: 4.2.0
            Reporter: Sijie Guo
            Assignee: Sijie Guo
            Priority: Critical
             Fix For: 4.3.0


This issue was introduced by the fix for BOOKKEEPER-337. That fix replaced the original #resolveConflicts logic with a check of only the ledger state and the current ensemble, in order to handle multiple bookies changing in the same ensemble.

The issue can be reproduced by a test case with the following steps:

1. Ledger L writes several entries to the ensemble A, B, C.
2. C succeeds, B fails with slow responses, and A fails with an unrecoverable issue.
3. L fails all the pending add ops and closes the ledger with lastEntryId = -1 (since no add operation succeeded).
4. The ownership of this ledger is released and transferred to another machine (this is the normal use case for Hedwig).
5. The new owner tries to open Ledger L and recover the ensemble. Suppose A and B are back to normal by this time, so L is closed with a lastEntryId that is not -1.
6. The old owner, although it has closed the ledger, does not block the responses for the already-failed pending add ops. The failures from B therefore trigger an ensemble change, and since the ledger metadata has already been changed by the new owner, the old owner has to resolve the conflict and updates the ledger metadata with lastEntryId = -1 again. We thus get different lastEntryId values at different times, which causes inconsistency and data loss.
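
The interleaving in steps 3 to 6 can be sketched with a toy model. This is only an illustration under stated assumptions: Meta, Store, naiveUpdate, guardedUpdate and replay are hypothetical names, not BookKeeper classes or methods; the store is assumed to accept a write only when the writer's version matches the stored one; and the value 2 for the recovered lastEntryId is made up. The guarded variant shows one possible check, not necessarily the fix that goes into 4.3.0.

```java
/** Toy model of the race; none of these types are real BookKeeper classes. */
class LedgerMetadataRace {

    /** Hypothetical, simplified stand-in for ledger metadata. */
    static final class Meta {
        final boolean closed;
        final long lastEntryId;
        final int version;

        Meta(boolean closed, long lastEntryId, int version) {
            this.closed = closed;
            this.lastEntryId = lastEntryId;
            this.version = version;
        }
    }

    /** Hypothetical metadata store with versioned compare-and-set writes. */
    static final class Store {
        private Meta current = new Meta(false, -1L, 0);

        synchronized Meta read() {
            return current;
        }

        /** Writes only if expectedVersion matches; returns false on a version conflict. */
        synchronized boolean write(Meta m, int expectedVersion) {
            if (current.version != expectedVersion) {
                return false;
            }
            current = new Meta(m.closed, m.lastEntryId, expectedVersion + 1);
            return true;
        }
    }

    /** Conflict handling that adopts the stored version but keeps the stale local lastEntryId. */
    static void naiveUpdate(Store store, Meta local) {
        while (!store.write(local, local.version)) {
            Meta stored = store.read();
            local = new Meta(local.closed, local.lastEntryId, stored.version);
        }
    }

    /** Assumed guard: never overwrite a stored close that disagrees on lastEntryId. */
    static boolean guardedUpdate(Store store, Meta local) {
        while (!store.write(local, local.version)) {
            Meta stored = store.read();
            if (stored.closed && stored.lastEntryId != local.lastEntryId) {
                return false; // give up instead of clobbering the recovered close
            }
            local = new Meta(local.closed, local.lastEntryId, stored.version);
        }
        return true;
    }

    /** Replays steps 3 to 6 and returns the lastEntryId left in the store. */
    static long replay(boolean guarded) {
        Store store = new Store();
        int v0 = store.read().version;
        // Step 3: the old owner's in-memory view is "closed, lastEntryId = -1".
        Meta oldOwnerView = new Meta(true, -1L, v0);
        // Step 5: the new owner recovers and closes the ledger at entry 2 (made-up value).
        store.write(new Meta(true, 2L, v0), v0);
        // Step 6: a late failure from B makes the old owner write its stale view.
        if (guarded) {
            guardedUpdate(store, oldOwnerView);
        } else {
            naiveUpdate(store, oldOwnerView);
        }
        return store.read().lastEntryId;
    }

    public static void main(String[] args) {
        System.out.println("naive:   " + replay(false)); // -1, the recovered close is lost
        System.out.println("guarded: " + replay(true));  // 2, the recovered close survives
    }
}
```

In this model the naive path ends with lastEntryId = -1 even though the new owner had closed the ledger at entry 2, which is the overwrite described in step 6.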

A test case would describe the details of this sequence more clearly.
