Posted to dev@bookkeeper.apache.org by Ivan Kelly <iv...@apache.org> on 2018/08/04 08:49:42 UTC

Usefulness of ensemble change during recovery

Hi folks,

Recently I've been working to make the ledger metadata on the client
immutable, with the goal of making client metadata management more
understandable. The basic idea is that the metadata the client uses
should reflect what is in zookeeper. So if a client wants to modify
the metadata, it makes a copy, modifies it, writes it to zookeeper and
then starts using it. This gets rid of all the conflictsWith and merge
operations.
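
As a rough sketch of that copy-modify-write flow (the names here, like
store.read and copyBuilder, are illustrative rather than the actual
client API):

    // Illustrative only: read, copy, modify, then CAS on the znode version.
    LedgerMetadata current = store.read(ledgerId);      // mirrors zookeeper
    LedgerMetadata modified = current.copyBuilder()     // immutable copy
            .withState(State.IN_RECOVERY)
            .build();
    try {
        store.write(ledgerId, modified, current.getVersion());
        // only start using `modified` once the zookeeper write succeeds
    } catch (VersionConflictException e) {
        // someone else won the race: re-read and retry from the top
    }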

There is only one case where this doesn't work. When we recover a
ledger, we read the LAC from all bookies, then read forward entry by
entry, rewriting each entry, until we reach the end. If a bookie fails
during the rewrite, we replace it in the ensemble, but we don't write
that back to zookeeper until the end.
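
For anyone not familiar with the flow, a simplified sketch of the
recovery loop (method names invented for illustration, not the real
client internals):

    long lac = maxLacFromBookies(ensemble);  // highest LAC any bookie saw
    long entryId = lac + 1;
    while (entryExistsOnSomeBookie(entryId)) {
        LedgerEntry e = readEntry(entryId);
        rewrite(e);          // re-add so the tail reaches full durability;
        entryId++;           // a bookie failure here is what triggers the
                             // in-memory ensemble replacement above
    }
    closeLedgerAt(entryId - 1);  // the single metadata write, at the end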

I was banging my head against this yesterday, trying to find a nice way
to fit this in (there are loads of nasty ways), when I came to the
conclusion that handling bookie failure during recovery isn't actually
useful.

Recovery operates on a few seconds of data (from the last LAC written
to the end of the ledger, call this LLAC). Take a ledger with 3:2:2
configuration. If the writer crashes, and one bookie crashes, when we
recover we currently replace that crashed bookie, so that if another
bookie crashes the data is still available. But, and this is why I
don't think it's useful, if another bookie crashes, the recovered data
may be available, but everything before the LLAC in the ledger will
not be available.
IMO, this kind of thing should be handled by rereplication, not
ensemble change (as an aside, we should have a hint system to trigger
rereplication ASAP on this ledger).
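
To make the counting concrete, here is a hypothetical layout for that
3:2:2 case (entry placement is illustrative, not taken from a real run):

    bookies:          B1   B2   B3   B4 (replacement)
    entries <= LLAC:  x    x    .    .     <- write quorum 2, on B1/B2
    entries  > LLAC:  .    x    x    x     <- rewritten during recovery

If B1 and B2 then crash, the recovered tail survives on B3/B4, but the
entries before the LLAC are gone regardless; replacing the bookie only
protected the tail.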

Anyhow, I'd like to hear other opinions on this before proceeding.
Recovery with ensemble changes can work. Rather than modifying the
ledger metadata, create a shadow ensemble list, and give entries from that to
the writers, but with the current entanglement in the client, this is
a bit nasty.
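
For the record, a very rough sketch of what that shadow list could look
like (entirely hypothetical; none of this is implemented):

    // Recovery keeps its own ensemble list and folds it into the real
    // metadata only in the single zookeeper write at close time.
    List<List<BookieSocketAddress>> shadow =
            new ArrayList<>(currentEnsembles(metadata));
    // on bookie failure during the rewrite:
    replaceInLastEnsemble(shadow, failedBookie, replacementBookie);
    // PendingAddOps would take their ensemble from `shadow`, not from the
    // ledger handle, and close() would write metadata plus shadow in one go.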

Cheers,
Ivan

Re: Usefulness of ensemble change during recovery

Posted by Sijie Guo <gu...@gmail.com>.
On Sun, Aug 5, 2018 at 11:46 PM Sijie Guo <gu...@gmail.com> wrote:

>
>
> On Sat, Aug 4, 2018 at 1:49 AM Ivan Kelly <iv...@apache.org> wrote:
>
>> Hi folks,
>>
>> Recently I've been working to make the ledger metadata on the client
>> immutable, with the goal of making client metadata management more
>> understandable. The basic idea is that the metadata the client uses
>> should reflect what is in zookeeper. So if a client wants to modify
>> the metadata, it makes a copy, modifies it, writes it to zookeeper and
>> then starts using it. This gets rid of all the conflictsWith and merge
>> operations.
>>
>> There is only one case where this doesn't work. When we recover a
>> ledger, we read the LAC from all bookies, then read forward entry by
>> entry, rewriting each entry, until we reach the end. If a bookie fails
>> during the rewrite, we replace it in the ensemble, but we don't write
>> that back to zookeeper until the end.
>>
>> I was banging my head against this yesterday, trying to find a nice way
>> to fit this in (there are loads of nasty ways), when I came to the
>> conclusion that handling bookie failure during recovery isn't actually
>> useful.
>>
>
>
>> Recovery operates on a few seconds of data (from the last LAC written
>> to the end of the ledger, call this LLAC).
>
>
> the data during this duration can be very large if the traffic of the
> ledger is large. That has been observed in production at Twitter. So
> when we are talking about "a few seconds of data", we can't assume the
> amount of data is small. That means the recovery can take more time
> than we expect, so if we don't handle failures during recovery, how
> are we able to ensure we have enough copies of the data during
> recovery?
>
> I am not sure "make ledger metadata immutable" == "getting rid of
> merging ledger metadata", because I don't think these are the same
> thing. Making ledger metadata immutable will make the code much
> clearer and simpler, because the ledger metadata is immutable.
> However, getting rid of merging ledger metadata is a different
> thing; when you make ledger metadata immutable, it will help make
> merging ledger metadata on conflicts clearer.
>
> In the ledger recovery case, it is actually okay to merge ledger
> metadata. Let's assume the LAC is L at the time of recovery, and the
> ledger metadata M is the copy before recovery. The client that attempts
> to recover the ledger will first set the ledger to IN_RECOVERY before
> recovering the ledger, so the conflicts will only come from the clients
> (there can be many) that attempt to recover, and the AutoRecovery
> daemon. The resolution of this conflict is simpler:
>
> when we fail to write ledger metadata (version conflict), read back the
> ledger metadata. If the state has changed to CLOSED, it means it was
> updated by another client that also recovered the ledger, and we
> discard our ensembles; if the state has been changed, that means it was
> modified by AutoRecovery. AutoRecovery doesn't add ensembles,
>

sorry for typo => "if the state has not been changed"


> so we can simply take the ensembles before L from zookeeper and our
> ensembles after L and merge them together.
>
>
>> Take a ledger with 3:2:2
>> configuration. If the writer crashes, and one bookie crashes, when we
>> recover we currently replace that crashed bookie, so that if another
>> bookie crashes the data is still available. But, and this is why I
>> don't think it's useful, if another bookie crashes, the recovered data
>> may be available, but everything before the LLAC in the ledger will
>> not be available.
>>
>> IMO, this kind of thing should be handled by rereplication, not
>> ensemble change (as an aside, we should have a hint system to trigger
>> rereplication ASAP on this ledger).
>
>
>> Anyhow, I'd like to hear other opinions on this before proceeding.
>> Recovery with ensemble changes can work. Rather than modifying the
>> ledger metadata, create a shadow ensemble list, and give entries from that to
>> the writers, but with the current entanglement in the client, this is
>> a bit nasty.
>>
>> Cheers,
>> Ivan
>>
>

Re: Usefulness of ensemble change during recovery

Posted by Ivan Kelly <iv...@apache.org>.
Yup, we had already concluded we need the ensemble change for some
cases. The code didn't turn out as messy as I'd feared, though (I don't
think I've pushed it yet).

-Ivan

On Mon, Aug 13, 2018 at 8:29 PM, Sam Just <sj...@salesforce.com> wrote:
> To flesh out JV's point a bit more, suppose we've got a 5/5/4 ledger which
> needs to be recovery opened.  In such a scenario, suppose the last entries
> on each of the 5 bookies (no holes) are 10,10,10,10,19.  Any entry in [10,19]
> is valid as the end of the ledger, but the safest answer for the end of the
> ledger is really 10 here -- 11-19 cannot have been ack'd to the client and
> we have 5 copies of 0-10, but only 1 of 11-19.  Currently, a client
> performing a recovery open on this ledger which is able to talk to all 5
> bookies will read and rewrite up to 19 ensuring that at least 4 bookies end
> up with 11-19.  I'd argue that rewriting the entries in that case is
> important if we want to let 19 be the end of the ledger because once we
> permit a client to read 19, losing that single copy would genuinely be data
> loss.  In that case, it happens that we have enough information to mark 10
> as the end of the ledger, but if the client performing recovery open has
> access only to bookies 3 and 4, it would be forced to conclude that 19
> could be the end of the ledger.  In that case, if we want to avoid exposing
> entries which have never been written to at least aQ bookies, we really
> do have to either
> 1) do an ensemble change and write out the tail entries of the ledger to a
> healthy ensemble
> 2) fail the recovery open
>
> I'd therefore argue that repairing the tail of the ledger -- with an
> ensemble change if necessary -- is actually required to allow readers to
> access the ledger.
> -Sam
>
> On Mon, Aug 6, 2018 at 9:27 AM Venkateswara Rao Jujjuri <ju...@gmail.com>
> wrote:
>
>> I don't think it's a good idea to leave the tail to replication.
>> This could lead to the perception of data loss, and it's more evident in
>> the case of larger WQ and disparity with AQ.
>> If we determine LLAC based on having 'a copy', which is never acknowledged
>> to the client, and if that bookie goes down (or crashes and burns)
>> before the replication worker gets a chance, it gives the illusion of data
>> loss. Moreover, we have no way to distinguish real data loss from
>> this scenario where we have never acknowledged the client.
>>
>>
>> On Mon, Aug 6, 2018 at 12:32 AM, Sijie Guo <gu...@gmail.com> wrote:
>>
>> > On Mon, Aug 6, 2018 at 12:08 AM Ivan Kelly <iv...@apache.org> wrote:
>> >
>> > > >> Recovery operates on a few seconds of data (from the last LAC
>> > > >> written to the end of the ledger, call this LLAC).
>> > > >
>> > > > the data during this duration can be very large if the traffic of
>> > > > the ledger is large. That has been observed in production at
>> > > > Twitter. So when we are talking about "a few seconds of data", we
>> > > > can't assume the amount of data is small. That means the recovery
>> > > > can take more time than we expect,
>> > >
>> > > Yes, it can be large, but still it is only a few seconds worth of
>> > > data. It is the amount of data that can be transmitted in the period
>> > > of one roundtrip, as the next roundtrip will update the LAC.
>> >
>> >
>> > > I didn't mean to imply the data was small. I was implying that the
>> > > data was small in comparison to the overall size of that ledger.
>> >
>> >
>> > > > so if we don't handle failures during recovery, how are we able
>> > > > to ensure we have enough copies of the data during recovery?
>> > >
>> > > Consider an e3w3a2 ledger; there are two cases where you can lose a
>> > > bookie during recovery.
>> > >
>> > > Case one: one bookie is lost. You can still recover, as ack=2 is
>> > > available.
>> > > Case two: two bookies are lost. You can't recover, but the ledger is
>> > > unavailable anyhow, since any entry in the ledger may only have been
>> > > replicated to 2.
>> > >
>> > > However, with e3w3a3 I guess you wouldn't be able to recover at all,
>> > > and we have to handle that case.
>> > >
>> > > > I am not sure "make ledger metadata immutable" == "getting rid of
>> > > > merging ledger metadata", because I don't think these are the same
>> > > > thing. Making ledger metadata immutable will make the code much
>> > > > clearer and simpler, because the ledger metadata is immutable.
>> > > > However, getting rid of merging ledger metadata is a different
>> > > > thing; when you make ledger metadata immutable, it will help make
>> > > > merging ledger metadata on conflicts clearer.
>> > >
>> > > I wouldn't call it merging in this case.
>> >
>> >
>> > That's fine.
>> >
>> >
>> > > Merging implies taking two
>> > > valid pieces of metadata and getting another usable, valid metadata
>> > > from them.
>> > > What happens with immutable metadata is that you are taking one valid
>> > > metadata and applying operations to it. So in the failure-during-recovery
>> > > case, we would have a list of AddEnsemble operations which
>> > > we apply when we try to close.
>> > >
>> > > In theory this is perfectly valid and clean. It just can look messy in
>> > > the code, due to how the PendingAddOp reaches back into the ledger
>> > > handle to get the current ensemble.
>> > >
>> >
>> > That's okay since it is a reality we have to face anyway. But the most
>> > important thing is that we can't get rid of ensemble changes during
>> > ledger recovery.
>> >
>> >
>> > >
>> > > So, in conclusion, I will keep the handling.
>> >
>> >
>> > Thank you.
>> >
>> >
>> > > In any case, these
>> > > changes are all still blocked on
>> > > https://github.com/apache/bookkeeper/pull/1577.
>> > >
>> > > -Ivan
>> > >
>> >
>>
>>
>>
>> --
>> Jvrao
>> ---
>> First they ignore you, then they laugh at you, then they fight you, then
>> you win. - Mahatma Gandhi
>>
>
>

Re: Usefulness of ensemble change during recovery

Posted by Sam Just <sj...@salesforce.com>.
To flesh out JV's point a bit more, suppose we've got a 5/5/4 ledger which
needs to be recovery opened.  In such a scenario, suppose the last entries
on each of the 5 bookies (no holes) are 10,10,10,10,19.  Any entry in [10,19]
is valid as the end of the ledger, but the safest answer for the end of the
ledger is really 10 here -- 11-19 cannot have been ack'd to the client and
we have 5 copies of 0-10, but only 1 of 11-19.  Currently, a client
performing a recovery open on this ledger which is able to talk to all 5
bookies will read and rewrite up to 19 ensuring that at least 4 bookies end
up with 11-19.  I'd argue that rewriting the entries in that case is
important if we want to let 19 be the end of the ledger because once we
permit a client to read 19, losing that single copy would genuinely be data
loss.  In that case, it happens that we have enough information to mark 10
as the end of the ledger, but if the client performing recovery open has
access only to bookies 3 and 4, it would be forced to conclude that 19
could be the end of the ledger.  In that case, if we want to avoid exposing
entries which have never been written to at least aQ bookies, we really
do have to either
1) do an ensemble change and write out the tail entries of the ledger to a
healthy ensemble
2) fail the recovery open
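
To put the arithmetic above in code form (just an illustration of the
reasoning, not the recovery implementation):

    // Last entry seen on each of the 5 bookies; aQ = 4.
    long[] lastEntry = {10, 10, 10, 10, 19};
    java.util.Arrays.sort(lastEntry);               // {10, 10, 10, 10, 19}
    // An entry can only have been ack'd if at least aQ bookies hold it, so
    // the highest possibly-ack'd entry is the aQ-th largest value seen:
    long safeEnd = lastEntry[lastEntry.length - 4]; // = 10
    // A client that reaches only bookies 3 and 4 sees {10, 19}; it cannot
    // rule 19 out, so it must repair up to 19 or fail the recovery open.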

I'd therefore argue that repairing the tail of the ledger -- with an
ensemble change if necessary -- is actually required to allow readers to
access the ledger.
-Sam

On Mon, Aug 6, 2018 at 9:27 AM Venkateswara Rao Jujjuri <ju...@gmail.com>
wrote:

> I don't think it's a good idea to leave the tail to replication.
> This could lead to the perception of data loss, and it's more evident in
> the case of larger WQ and disparity with AQ.
> If we determine LLAC based on having 'a copy', which is never acknowledged
> to the client, and if that bookie goes down (or crashes and burns)
> before the replication worker gets a chance, it gives the illusion of data
> loss. Moreover, we have no way to distinguish real data loss from
> this scenario where we have never acknowledged the client.
>
>
> On Mon, Aug 6, 2018 at 12:32 AM, Sijie Guo <gu...@gmail.com> wrote:
>
> > On Mon, Aug 6, 2018 at 12:08 AM Ivan Kelly <iv...@apache.org> wrote:
> >
> > > >> Recovery operates on a few seconds of data (from the last LAC
> > > >> written to the end of the ledger, call this LLAC).
> > > >
> > > > the data during this duration can be very large if the traffic of
> > > > the ledger is large. That has been observed in production at
> > > > Twitter. So when we are talking about "a few seconds of data", we
> > > > can't assume the amount of data is small. That means the recovery
> > > > can take more time than we expect,
> > >
> > > Yes, it can be large, but still it is only a few seconds worth of
> > > data. It is the amount of data that can be transmitted in the period
> > > of one roundtrip, as the next roundtrip will update the LAC.
> >
> >
> > > I didn't mean to imply the data was small. I was implying that the
> > > data was small in comparison to the overall size of that ledger.
> >
> >
> > > > so if we don't handle failures during recovery, how are we able to
> > > > ensure we have enough copies of the data during recovery?
> > >
> > > Consider an e3w3a2 ledger; there are two cases where you can lose a
> > > bookie during recovery.
> > >
> > > Case one: one bookie is lost. You can still recover, as ack=2 is
> > > available.
> > > Case two: two bookies are lost. You can't recover, but the ledger is
> > > unavailable anyhow, since any entry in the ledger may only have been
> > > replicated to 2.
> > >
> > > However, with e3w3a3 I guess you wouldn't be able to recover at all,
> > > and we have to handle that case.
> > >
> > > > I am not sure "make ledger metadata immutable" == "getting rid of
> > > > merging ledger metadata", because I don't think these are the same
> > > > thing. Making ledger metadata immutable will make the code much
> > > > clearer and simpler, because the ledger metadata is immutable.
> > > > However, getting rid of merging ledger metadata is a different
> > > > thing; when you make ledger metadata immutable, it will help make
> > > > merging ledger metadata on conflicts clearer.
> > >
> > > I wouldn't call it merging in this case.
> >
> >
> > That's fine.
> >
> >
> > > Merging implies taking two
> > > valid pieces of metadata and getting another usable, valid metadata
> > > from them.
> > > What happens with immutable metadata is that you are taking one valid
> > > metadata and applying operations to it. So in the failure-during-recovery
> > > case, we would have a list of AddEnsemble operations which
> > > we apply when we try to close.
> > >
> > > In theory this is perfectly valid and clean. It just can look messy in
> > > the code, due to how the PendingAddOp reaches back into the ledger
> > > handle to get the current ensemble.
> > >
> >
> > That's okay since it is a reality we have to face anyway. But the most
> > important thing is that we can't get rid of ensemble changes during
> > ledger recovery.
> >
> >
> > >
> > > So, in conclusion, I will keep the handling.
> >
> >
> > Thank you.
> >
> >
> > > In any case, these
> > > changes are all still blocked on
> > > https://github.com/apache/bookkeeper/pull/1577.
> > >
> > > -Ivan
> > >
> >
>
>
>
> --
> Jvrao
> ---
> First they ignore you, then they laugh at you, then they fight you, then
> you win. - Mahatma Gandhi
>



Re: Usefulness of ensemble change during recovery

Posted by Venkateswara Rao Jujjuri <ju...@gmail.com>.
I don't think it's a good idea to leave the tail to replication.
This could lead to the perception of data loss, and it's more evident in
the case of larger WQ and disparity with AQ.
If we determine LLAC based on having 'a copy', which is never acknowledged
to the client, and if that bookie goes down (or crashes and burns)
before the replication worker gets a chance, it gives the illusion of data
loss. Moreover, we have no way to distinguish real data loss from
this scenario where we have never acknowledged the client.


On Mon, Aug 6, 2018 at 12:32 AM, Sijie Guo <gu...@gmail.com> wrote:

> On Mon, Aug 6, 2018 at 12:08 AM Ivan Kelly <iv...@apache.org> wrote:
>
> > >> Recovery operates on a few seconds of data (from the last LAC written
> > >> to the end of the ledger, call this LLAC).
> > >
> > > the data during this duration can be very large if the traffic of
> > > the ledger is large. That has been observed in production at
> > > Twitter. So when we are talking about "a few seconds of data", we
> > > can't assume the amount of data is small. That means the recovery
> > > can take more time than we expect,
> >
> > Yes, it can be large, but still it is only a few seconds worth of
> > data. It is the amount of data that can be transmitted in the period
> > of one roundtrip, as the next roundtrip will update the LAC.
>
>
> > I didn't mean to imply the data was small. I was implying that the
> > data was small in comparison to the overall size of that ledger.
>
>
> > > so if we don't handle failures during recovery, how are we able to
> > > ensure we have enough copies of the data during recovery?
> >
> > Consider an e3w3a2 ledger; there are two cases where you can lose a
> > bookie during recovery.
> >
> > Case one: one bookie is lost. You can still recover, as ack=2 is
> > available.
> > Case two: two bookies are lost. You can't recover, but the ledger is
> > unavailable anyhow, since any entry in the ledger may only have been
> > replicated to 2.
> >
> > However, with e3w3a3 I guess you wouldn't be able to recover at all,
> > and we have to handle that case.
> >
> > > I am not sure "make ledger metadata immutable" == "getting rid of
> > > merging ledger metadata", because I don't think these are the same
> > > thing. Making ledger metadata immutable will make the code much
> > > clearer and simpler, because the ledger metadata is immutable.
> > > However, getting rid of merging ledger metadata is a different
> > > thing; when you make ledger metadata immutable, it will help make
> > > merging ledger metadata on conflicts clearer.
> >
> > I wouldn't call it merging in this case.
>
>
> That's fine.
>
>
> > Merging implies taking two
> > valid pieces of metadata and getting another usable, valid metadata
> > from them.
> > What happens with immutable metadata is that you are taking one valid
> > metadata and applying operations to it. So in the failure-during-recovery
> > case, we would have a list of AddEnsemble operations which
> > we apply when we try to close.
> >
> > In theory this is perfectly valid and clean. It just can look messy in
> > the code, due to how the PendingAddOp reaches back into the ledger
> > handle to get the current ensemble.
> >
>
> That's okay since it is a reality we have to face anyway. But the most
> important thing is that we can't get rid of ensemble changes during
> ledger recovery.
>
>
> >
> > So, in conclusion, I will keep the handling.
>
>
> Thank you.
>
>
> > In any case, these
> > changes are all still blocked on
> > https://github.com/apache/bookkeeper/pull/1577.
> >
> > -Ivan
> >
>



-- 
Jvrao
---
First they ignore you, then they laugh at you, then they fight you, then
you win. - Mahatma Gandhi

Re: Usefulness of ensemble change during recovery

Posted by Sijie Guo <gu...@gmail.com>.
On Mon, Aug 6, 2018 at 12:08 AM Ivan Kelly <iv...@apache.org> wrote:

> >> Recovery operates on a few seconds of data (from the last LAC written
> >> to the end of the ledger, call this LLAC).
> >
> > the data during this duration can be very large if the traffic of
> > the ledger is large. That has been observed in production at
> > Twitter. So when we are talking about "a few seconds of data", we
> > can't assume the amount of data is small. That means the recovery
> > can take more time than we expect,
>
> Yes, it can be large, but still it is only a few seconds worth of
> data. It is the amount of data that can be transmitted in the period
> of one roundtrip, as the next roundtrip will update the LAC.


> I didn't mean to imply the data was small. I was implying that the
> data was small in comparison to the overall size of that ledger.


> > so if we don't handle failures during recovery, how are we able to
> > ensure we have enough copies of the data during recovery?
>
> Consider an e3w3a2 ledger; there are two cases where you can lose a
> bookie during recovery.
>
> Case one: one bookie is lost. You can still recover, as ack=2 is
> available.
> Case two: two bookies are lost. You can't recover, but the ledger is
> unavailable anyhow, since any entry in the ledger may only have been
> replicated to 2.
>
> However, with e3w3a3 I guess you wouldn't be able to recover at all,
> and we have to handle that case.
>
> > I am not sure "make ledger metadata immutable" == "getting rid of
> > merging ledger metadata", because I don't think these are the same
> > thing. Making ledger metadata immutable will make the code much
> > clearer and simpler, because the ledger metadata is immutable.
> > However, getting rid of merging ledger metadata is a different
> > thing; when you make ledger metadata immutable, it will help make
> > merging ledger metadata on conflicts clearer.
>
> I wouldn't call it merging in this case.


That's fine.


> Merging implies taking two
> valid pieces of metadata and getting another usable, valid metadata
> from them.
> What happens with immutable metadata is that you are taking one valid
> metadata and applying operations to it. So in the failure-during-recovery
> case, we would have a list of AddEnsemble operations which
> we apply when we try to close.
>
> In theory this is perfectly valid and clean. It just can look messy in
> the code, due to how the PendingAddOp reaches back into the ledger
> handle to get the current ensemble.
>

That's okay since it is a reality we have to face anyway. But the most
important thing is that we can't get rid of ensemble changes during
ledger recovery.


>
> So, in conclusion, I will keep the handling.


Thank you.


> In any case, these
> changes are all still blocked on
> https://github.com/apache/bookkeeper/pull/1577.
>
> -Ivan
>

Re: Usefulness of ensemble change during recovery

Posted by Ivan Kelly <iv...@apache.org>.
>> Recovery operates on a few seconds of data (from the last LAC written
>> to the end of the ledger, call this LLAC).
>
> the data during this duration can be very large if the traffic of the
> ledger is large. That has been observed in production at Twitter. So
> when we are talking about "a few seconds of data", we can't assume the
> amount of data is small. That means the recovery can take more time
> than we expect,

Yes, it can be large, but still it is only a few seconds worth of
data. It is the amount of data that can be transmitted in the period
of one roundtrip, as the next roundtrip will update the LAC.

I didn't mean to imply the data was small. I was implying that the
data was small in comparison to the overall size of that ledger.

> so if we don't handle failures during recovery, how are we able to
> ensure we have enough copies of the data during recovery?

Consider an e3w3a2 ledger; there are two cases where you can lose a
bookie during recovery.

Case one: one bookie is lost. You can still recover, as ack=2 is available.
Case two: two bookies are lost. You can't recover, but the ledger is
unavailable anyhow, since any entry in the ledger may only have been
replicated to 2.

However, with e3w3a3 I guess you wouldn't be able to recover at all,
and we have to handle that case.
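
The two cases can be summarised as a pair of small predicates (a sketch,
assuming every entry reached at least ackQuorum bookies and no
replacement bookie is added):

    // Can the recovery rewrite proceed with the surviving bookies?
    static boolean canRecover(int ensembleSize, int ackQuorum, int failures) {
        return ensembleSize - failures >= ackQuorum;
    }
    // Are all existing entries certainly still readable? Worst case an
    // entry lives on exactly ackQuorum bookies.
    static boolean entriesCertainlySurvive(int ackQuorum, int failures) {
        return failures < ackQuorum;
    }
    // e3w3a2, 1 failure:  canRecover=true,  entriesCertainlySurvive=true
    // e3w3a2, 2 failures: canRecover=false, entriesCertainlySurvive=false
    // e3w3a3, 1 failure:  canRecover=false  (the case we must still handle)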

> I am not sure "make ledger metadata immutable" == "getting rid of
> merging ledger metadata", because I don't think these are the same
> thing. Making ledger metadata immutable will make the code much
> clearer and simpler, because the ledger metadata is immutable.
> However, getting rid of merging ledger metadata is a different
> thing; when you make ledger metadata immutable, it will help make
> merging ledger metadata on conflicts clearer.

I wouldn't call it merging in this case. Merging implies taking two
valid pieces of metadata and getting another usable, valid metadata
from them.
What happens with immutable metadata is that you are taking one valid
metadata and applying operations to it. So in the failure-during-recovery
case, we would have a list of AddEnsemble operations which
we apply when we try to close.

In theory this is perfectly valid and clean. It just can look messy in
the code, due to how the PendingAddOp reaches back into the ledger
handle to get the current ensemble.
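
A sketch of that deferred-operations idea (helper names invented; uses
java.util.function.UnaryOperator):

    // Collect metadata transformations during recovery instead of mutating.
    List<UnaryOperator<LedgerMetadata>> deferredOps = new ArrayList<>();
    // on bookie failure at entry `failedAtEntry` during the rewrite:
    deferredOps.add(md -> md.withReplacedEnsemble(failedAtEntry, newEnsemble));
    // at close, apply them all to one base metadata and do one CAS write:
    LedgerMetadata md = baseMetadata;
    for (UnaryOperator<LedgerMetadata> op : deferredOps) {
        md = op.apply(md);
    }
    writeToZookeeper(md.closed(lastEntryId), baseMetadata.getVersion());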

So, in conclusion, I will keep the handling. In any case, these
changes are all still blocked on
https://github.com/apache/bookkeeper/pull/1577.

-Ivan

Re: Usefulness of ensemble change during recovery

Posted by Sijie Guo <gu...@gmail.com>.
On Sat, Aug 4, 2018 at 1:49 AM Ivan Kelly <iv...@apache.org> wrote:

> Hi folks,
>
> Recently I've been working to make the ledger metadata on the client
> immutable, with the goal of making client metadata management more
> understandable. The basic idea is that the metadata the client uses
> should reflect what is in zookeeper. So if a client wants to modify
> the metadata, it makes a copy, modifies it, writes it to zookeeper and
> then starts using it. This gets rid of all the conflictsWith and merge
> operations.
>
> There is only one case where this doesn't work. When we recover a
> ledger, we read the LAC from all bookies, then read forward entry by
> entry, rewriting each entry, until we reach the end. If a bookie fails
> during the rewrite, we replace it in the ensemble, but we don't write
> that back to zookeeper until the end.
>
> I was banging my head against this yesterday, trying to find a nice way
> to fit this in (there are loads of nasty ways), when I came to the
> conclusion that handling bookie failure during recovery isn't actually
> useful.
>


> Recovery operates on a few seconds of data (from the last LAC written
> to the end of the ledger, call this LLAC).


the data during this duration can be very large if the traffic of the
ledger is large. That has been observed in production at Twitter. So
when we are talking about "a few seconds of data", we can't assume the
amount of data is small. That means the recovery can take more time
than we expect, so if we don't handle failures during recovery, how are
we able to ensure we have enough copies of the data during recovery?

I am not sure "make ledger metadata immutable" == "getting rid of merging
ledger metadata", because I don't think these are the same thing. Making
ledger metadata immutable will make the code much clearer and simpler,
because the ledger metadata is immutable. However, getting rid of merging
ledger metadata is a different thing; when you make ledger metadata
immutable, it will help make merging ledger metadata on conflicts clearer.

In the ledger recovery case, it is actually okay to merge ledger metadata.
Let's assume the LAC is L at the time of recovery, and the ledger metadata
M is the copy before recovery. The client that attempts to recover the
ledger will first set the ledger to IN_RECOVERY before recovering the
ledger, so the conflicts will only come from the clients (there can be
many) that attempt to recover, and the AutoRecovery daemon. The resolution
of this conflict is simpler:

when we fail to write ledger metadata (version conflict), read back the
ledger metadata. If the state has changed to CLOSED, it means it was
updated by another client that also recovered the ledger, and we discard
our ensembles; if the state has not changed, that means it was modified by
AutoRecovery. AutoRecovery doesn't add ensembles, so we can simply take
the ensembles before L from zookeeper and our ensembles after L and merge
them together.
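
That resolution could look roughly like this (a sketch; the metadata
accessors are invented for illustration, `ours` is the metadata we built
during recovery, and L is the LAC above):

    // On a zookeeper version conflict while closing the recovered ledger:
    LedgerMetadata theirs = readFromZookeeper(ledgerId);
    if (theirs.getState() == State.CLOSED) {
        // another recovering client already closed it: drop our ensembles
        return;
    }
    // otherwise the conflicting write came from AutoRecovery, which does
    // not add ensembles: keep its ensembles up to L, ours after L, retry.
    LedgerMetadata merged =
            theirs.withEnsemblesAfter(L, ours.ensemblesAfter(L));
    writeToZookeeper(ledgerId, merged, theirs.getVersion());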


> Take a ledger with 3:2:2
> configuration. If the writer crashes, and one bookie crashes, when we
> recover we currently replace that crashed bookie, so that if another
> bookie crashes the data is still available. But, and this is why I
> don't think it's useful, if another bookie crashes, the recovered data
> may be available, but everything before the LLAC in the ledger will
> not be available.

> IMO, this kind of thing should be handled by rereplication, not
> ensemble change (as an aside, we should have a hint system to trigger
> rereplication ASAP on this ledger).


> Anyhow, I'd like to hear other opinions on this before proceeding.
> Recovery with ensemble changes can work. Rather than modifying the
> ledger metadata, create a shadow ensemble list, and give entries from that to
> the writers, but with the current entanglement in the client, this is
> a bit nasty.
>
> Cheers,
> Ivan
>