Posted to dev@bookkeeper.apache.org by Hang Chen <ch...@apache.org> on 2023/03/27 04:26:27 UTC

Bookkeeper decommission may be blocked by ledgers that cannot be replicated

Hi guys, I found that BookKeeper decommission may be blocked by
ledgers that cannot be replicated.

The current bookie decommission process:
  - Step 1: Run `bin/bookkeeper shell listunderreplicated` to check
whether any ledgers are in the under-replicated state
  - Step 2: Once no ledgers remain under-replicated, stop the bookie
and run `bin/bookkeeper shell decommissionbookie -bookieid
<bookieaddress>` to trigger the decommission
  - Step 3: Wait for all of the bookie's ledgers to be replicated;
the decommission then completes

However, there is a bug in the decommissioning process.

In Step 1, ledgers are marked under-replicated by the following
checks:
  - Auditor lost-bookie check: triggered in two cases: a) a bookie is
detected as lost, after the `lostBookieRecoveryDelay` grace period;
b) periodically, every `auditorPeriodicBookieCheckInterval` (24 hours
by default).
  - Auditor all-ledgers check: triggered every
`auditorPeriodicCheckInterval` (7 days by default). It checks every
ledger's fragments with the following steps:
    - For each fragment, compute the entries to read according to
`auditorLedgerVerificationPercentage`. The default is `0`, which means
only the first and last entries of the fragment are checked.
    - Read those entries from all the bookies in the fragment's
ensemble. If any read fails, mark the ledger under-replicated.
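To make the sampling rule concrete, here is a rough Python sketch of
the entry selection (illustrative only, not the actual Auditor code;
function and variable names are made up):

```python
def entries_to_check(first_entry, last_entry, verification_percentage):
    """Pick which entries of a fragment the auditor reads.

    With verification_percentage == 0 (the default), only the
    fragment's first and last entries are checked.
    """
    if first_entry == last_entry:
        return [first_entry]
    if verification_percentage <= 0:
        return [first_entry, last_entry]
    total = last_entry - first_entry + 1
    sample = max(2, total * verification_percentage // 100)
    step = max(1, total // sample)
    picked = list(range(first_entry, last_entry + 1, step))
    if picked[-1] != last_entry:
        picked.append(last_entry)  # always include the boundary entry
    return picked
```

With the default percentage of 0, a fragment of 100 entries gets only
2 reads, so any loss strictly between the boundaries is invisible to
this check.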


So when we run the `bin/bookkeeper shell listunderreplicated` command,
its output only reflects replicas that were already missing at the
time of the last check. The lost-bookie check may have run up to 24
hours ago, and the all-ledgers check up to seven days ago; replicas
lost since then have not been marked. Suppose we set EnsembleSize=3,
WriteQuorumSize=2, and AckQuorumSize=1, and decommission one bookie
with the current process. Some ledgers may then be impossible to
replicate, because their only surviving replica lives on the bookie
we just shut down.
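The failure mode follows from how entries are spread across the
ensemble. A minimal Python sketch, modeled on BookKeeper's round-robin
distribution schedule (a simplification, not the real scheduler):

```python
def write_set(entry_id, ensemble, write_quorum):
    # Round-robin placement: entry e is written to write_quorum
    # consecutive bookies, starting at index e % len(ensemble).
    n = len(ensemble)
    return [ensemble[(entry_id + i) % n] for i in range(write_quorum)]

ensemble = ["bookie-1", "bookie-2", "bookie-3"]  # E=3, Wq=2, Aq=1
print(write_set(0, ensemble, 2))  # ['bookie-1', 'bookie-2']
```

Entry 0 has copies only on bookie-1 and bookie-2. If bookie-2's copy
was lost after the last audit ran, the sole surviving replica sits on
bookie-1, and shutting bookie-1 down before decommissioning makes the
entry unrecoverable.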

Moreover, the all-ledgers check only reads the first and last entries
of each fragment by default. If a bookie ran with journal writes
disabled and lost some entries in the middle of a fragment while the
first and last entries survived, the checker won't detect the loss.
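A tiny sketch of why a mid-fragment loss slips past the default check
(illustrative code, not the Auditor implementation):

```python
def fragment_looks_ok(stored_entries, first_entry, last_entry):
    # Default behaviour (auditorLedgerVerificationPercentage = 0):
    # only the fragment's boundary entries are read.
    return first_entry in stored_entries and last_entry in stored_entries

# Fragment covers entries 0..9, but entry 5 was lost (e.g. the bookie
# crashed before flushing it while journal writes were disabled).
stored = {0, 1, 2, 3, 4, 6, 7, 8, 9}
print(fragment_looks_ok(stored, 0, 9))  # True: the loss goes unnoticed
```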

### Options
There are two options to improve the decommissioning process.

1. Trigger a full check of all ledgers before Step 1. It has the
following disadvantages:
   - It costs a lot of cluster resources
   - By default it still only checks the first and last entries of
each fragment, so it can't cover every entry

2. Switch the bookie to read-only mode instead of shutting it down
before running the `bin/bookkeeper shell decommissionbookie -bookieid
<bookieaddress>` command to trigger the decommission. While
replicating the ledgers located on the decommissioning bookie,
replication can succeed as long as at least one replica is still
readable.

I suggest choosing the second option to tune the current bookie
decommission process. Do you have any suggestions?

Thanks,
Hang

Re: Bookkeeper decommission may be blocked by ledgers that cannot be replicated

Posted by Hang Chen <ch...@apache.org>.
Hi Andrey,
    Sorry for the late reply. I double-checked the code and found
that the `recover` command can solve the problem, but it has a
performance issue.

When we use the `bin/bookkeeper shell recover` command to decommission
a bookie, the ledger replication runs on the node where we invoke the
recover command, not on the auto-recovery pods. If the bookie holds
4 TB of ledger data to be replicated, the replication can't be
parallelized by adding more auto-recovery instances.
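Back-of-envelope numbers (the throughput figure below is a
hypothetical assumption, not a measurement):

```python
# Single-node recovery of 4 TB at an assumed sustained 200 MB/s.
data_mb = 4 * 1024 * 1024      # 4 TB expressed in MB
throughput_mb_s = 200          # hypothetical replication throughput
hours = data_mb / throughput_mb_s / 3600
print(round(hours, 1))  # roughly 5.8 hours on the single recover node
```

Even under optimistic assumptions this is hours of wall-clock time on
one node, whereas auditor-driven replication can spread the same work
across auto-recovery instances.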

IMO, we need another way to decommission a bookie instead of the
`recover` command.

Thanks,
Hang

Andrey Yegorov <an...@datastax.com> wrote on Thursday, March 30, 2023 at 05:54:
>
> Hi,
>
> You can use "recover" command instead.
>
> Switch bookie to read-only (via REST API)
> bin/bookkeeper shell recover ..
> recover command also has a flag to delete the cookie in ZK.
> As an additional benefit, this way you can decomm bookie with ledgers
> created with write quorum = 1.
>
> HTH.
>
> On Sun, Mar 26, 2023 at 9:27 PM Hang Chen <ch...@apache.org> wrote:
> --
> Andrey Yegorov

Re: Bookkeeper decommission may be blocked by ledgers that cannot be replicated

Posted by Andrey Yegorov <an...@datastax.com>.
Hi,

You can use the `recover` command instead.

Switch the bookie to read-only (via the REST API)
bin/bookkeeper shell recover ..
The recover command also has a flag to delete the cookie in ZK.
As an additional benefit, this way you can decommission a bookie with
ledgers created with write quorum = 1.

HTH.

On Sun, Mar 26, 2023 at 9:27 PM Hang Chen <ch...@apache.org> wrote:



-- 
Andrey Yegorov