
Posted to dev@bookkeeper.apache.org by Jack Vanlightly <jv...@splunk.com.INVALID> on 2021/11/01 10:14:44 UTC

Use case for storage expansion

Hi all,

I thought I'd test the PR https://github.com/apache/bookkeeper/pull/2871 as
I hadn't used storage expansion at all. It seemed to work but I ran a
correctness test just in case and found that it "lost" 50% of my ledgers.

Looking at the code, to my surprise, it does not repartition the data across
the directories, which explains why 50% of the ledgers were "gone". I
expanded from one to two ledger dirs, so all the even ledger ids were fine,
but read operations for odd ledger ids got routed to the new directory, which
of course was empty. All the ledger data was still in the original
ledger directory.
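
To make the routing problem concrete, here is a minimal Python sketch. It
assumes the directory is picked purely by ledger id modulo the number of
configured ledger directories. The name `dir_index` and the in-memory dicts
are hypothetical and are not BookKeeper code.

```python
def dir_index(ledger_id: int, num_dirs: int) -> int:
    """Pick a ledger directory by modulo (assumed routing rule)."""
    return ledger_id % num_dirs


# Ledgers 0..9 written while the bookie had a single ledger directory.
ledger_ids = list(range(10))
stored_in = {lid: dir_index(lid, 1) for lid in ledger_ids}

# After expanding to two directories, reads are routed with the new count,
# but nothing has moved the data on disk.
routed_to = {lid: dir_index(lid, 2) for lid in ledger_ids}

# Ledgers whose read is routed to a directory that does not hold them.
unreachable = [lid for lid in ledger_ids if stored_in[lid] != routed_to[lid]]
print(unreachable)
```

Under these assumptions every odd ledger id ends up in `unreachable`, which
matches the 50% I saw in the test.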

So either I am not understanding the use case for storage expansion (i.e.
you can only do it on an empty bookie) or this feature is majorly flawed.

Please confirm either way. If it is indeed flawed, I'll create an issue.

Jack

Re: Use case for storage expansion

Posted by Jack Vanlightly <jv...@splunk.com.INVALID>.

Hi Hang,

The thing is that the BookKeeper replication protocol doesn't tolerate a
bookie losing entries that it has said are stored safely. Ledger recovery
can end up truncating ledgers, leading to unrecoverable data loss that not
even the auditor check can repair. So this shrink and expand is
fundamentally unsafe.

Ivan Kelly and I have worked on making BK run without the journal, which
can also lead to a bookie losing entries it said it had stored safely. This
required some changes to make it safe from ledger truncation during ledger
recovery and also to allow bookies to repair themselves. I will start
submitting PRs for this work this week. Once those changes are in,
we could look at utilising them to make the expand/shrink operations safe.

The alternative is to do the ledger rewriting to ensure that existing
ledgers are placed in the correct directories before the bookie completes
its boot process.
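
As a rough sketch of what that rewriting step would have to work out, the
Python below computes which ledgers sit in the wrong directory for a new
directory count. It assumes modulo routing on the ledger id. `plan_moves`
and its input map are hypothetical, and the sketch only plans the moves, it
does not touch any files.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def plan_moves(
    current_dir_by_ledger: Dict[int, int], new_num_dirs: int
) -> Dict[Tuple[int, int], List[int]]:
    """Group ledgers that must move by (current dir, target dir).

    Assumes the target directory is ledger_id % new_num_dirs.
    """
    moves: Dict[Tuple[int, int], List[int]] = defaultdict(list)
    for ledger_id, current in sorted(current_dir_by_ledger.items()):
        target = ledger_id % new_num_dirs
        if target != current:
            moves[(current, target)].append(ledger_id)
    return dict(moves)


# Six ledgers, all written while there was only directory 0.
current = {lid: 0 for lid in range(6)}

# Plan for an expansion to two directories.
moves = plan_moves(current, 2)
print(moves)
```

The bookie would only start serving reads once every planned move had been
applied, so that routing and on-disk placement agree again.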

Jack

Re: Use case for storage expansion

Posted by Hang Chen <ch...@apache.org>.

Hi Jack,
     Currently, if we use multiple directories for the journal or ledgers in
one bookie, each ledger is stored in a target directory chosen by
`ledgerId % numberOfLedgers`. If we expand or shrink the ledger or
journal directories, the hash result changes, so some ledgers can no longer
find their target storage directory instance and reads of those ledgers
fail. This case can be addressed by the auditor check.
     In a production BookKeeper cluster, if we use multiple directories for
the journal or ledgers in one bookie and a disk error occurs, the bookie
shuts down and cannot start up again unless we remove the failed disk
from the configuration. After the failed disk comes back, we should add
the disk back to the bookie.
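
The shrink-then-expand sequence described above can be traced with a small
Python sketch. It assumes modulo routing on the ledger id, that directory 0
keeps its index after the shrink, and that the failed disk returns with its
old data intact. The helpers `route`, `write` and `misrouted` are
hypothetical and are not BookKeeper code.

```python
from typing import Dict, List


def route(ledger_id: int, num_dirs: int) -> int:
    """Assumed routing rule: ledger id modulo directory count."""
    return ledger_id % num_dirs


def write(location: Dict[int, int], ledger_id: int, num_dirs: int) -> None:
    """Record the directory a ledger is written to under the current config."""
    location[ledger_id] = route(ledger_id, num_dirs)


def misrouted(location: Dict[int, int], num_dirs: int) -> List[int]:
    """Ledgers whose reads would go to a directory that does not hold them."""
    return sorted(lid for lid, d in location.items() if route(lid, num_dirs) != d)


location: Dict[int, int] = {}

# Two directories configured: ledgers 0..3 are spread across dirs 0 and 1.
for lid in range(4):
    write(location, lid, 2)

# Disk 1 fails and the config is shrunk to one directory.
after_shrink = misrouted(location, 1)

# New ledgers 4 and 5 are written while only directory 0 is configured.
for lid in (4, 5):
    write(location, lid, 1)

# The disk comes back and the config is expanded to two directories again.
after_expand = misrouted(location, 2)
print(after_shrink, after_expand)
```

In this trace the shrink makes the ledgers on the removed disk unreadable,
and the later expand makes an odd ledger written during the shrink
unreadable, so both steps change which reads succeed.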

Thanks,
Hang
