Posted to dev@bookkeeper.apache.org by Ivan Kelly <iv...@apache.org> on 2014/03/05 14:14:53 UTC

Problem in rereplication algorithm

Hi folks,

We've come across a problem in autorecovery, which I've been banging
my head against for the last day so I decided to open it up to
everyone to see if a solution is any clearer.

The problem was observed in production, and while it doesn't cause
data loss, it does appear to the admin as if entries have been lost.

= Problem scenario =

You have a ledger L1. There is one segment in the ledger with quorum
2, ensemble 3, starting at entry 0. This segment is on bookies B1,
B2 & B3, so the metadata looks like

0: B1, B2, B3

No data has been written to the ledger.

B3 crashes. The auditor notes that L1 contains a segment with B3, so
it schedules the ledger to be checked. A recovery worker opens the
ledger without fencing. The recovery worker sees that the segment is
still open and that the lastAddConfirmed is less than the segment start
id, so it reads forward. Ultimately it ends up with a lastAddConfirmed
which is less than the segment start id, as all bookies in the quorum
[B1,B2] respond with NoSuchEntry for entry 0. The recovery worker
therefore sees that there are no underreplicated fragments, so there's
nothing to recover. So far, so good.
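
For illustration, here is a rough sketch in plain Java of how the write
set for an entry follows from the ensemble, assuming the usual
round-robin striping; it is not the actual client code. With ensemble
[B1, B2, B3] and write quorum 2, entry 0 lands on [B1, B2], which is why
only B1 and B2 are asked about entry 0 above.

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    public class WriteSetSketch {
        // Round-robin striping: entry e is written to writeQuorum bookies,
        // starting at position (e % ensembleSize) in the ensemble.
        static List<String> writeSet(List<String> ensemble, int writeQuorum, long entryId) {
            List<String> set = new ArrayList<>();
            int start = (int) (entryId % ensemble.size());
            for (int i = 0; i < writeQuorum; i++) {
                set.add(ensemble.get((start + i) % ensemble.size()));
            }
            return set;
        }

        public static void main(String[] args) {
            List<String> ensemble = Arrays.asList("B1", "B2", "B3");
            System.out.println(writeSet(ensemble, 2, 0)); // [B1, B2]
            System.out.println(writeSet(ensemble, 2, 1)); // [B2, B3]
            System.out.println(writeSet(ensemble, 2, 2)); // [B3, B1]
        }
    }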

But now consider if B2 crashes. L1 will be scheduled to be checked
again. A recovery worker will try to open with fencing. It won't be
able to reach all quorums; [B2, B3] is now unavailable. Open will
fail. 
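
A rough way to see why the fenced open cannot complete here, assuming
the usual rule that fencing has to reach enough bookies that every
possible write quorum contains at least one fenced bookie (again a
sketch, not the real client code):

    public class FencingCoverageSketch {
        // Fencing needs acks from at least (ensembleSize - writeQuorum + 1)
        // distinct bookies, so that every possible write quorum of that
        // size contains at least one bookie that has been fenced.
        static boolean canFence(int ensembleSize, int writeQuorum, int reachableBookies) {
            return reachableBookies >= ensembleSize - writeQuorum + 1;
        }

        public static void main(String[] args) {
            // Ensemble 3, quorum 2: the quorum [B2, B3] must be covered too.
            System.out.println(canFence(3, 2, 2)); // true:  only B3 down, B1 and B2 reachable
            System.out.println(canFence(3, 2, 1)); // false: B2 and B3 down, only B1 reachable
        }
    }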

As a result, the underreplicated node for L1 hangs around forever.

I have some ideas for a fix, but none is straightforward, so I'd like
to hear other opinions first.

-Ivan

RE: Problem in rereplication algorithm

Posted by Rakesh R <ra...@huawei.com>.
>>Yes, in the case that Ivan described, the ledger was "leaked", probably created by a process that got restarted before using the ledger. These ledgers will be left in OPEN state forever (or at least >>until some admin tool can decide that the ledger was leaked and will remove it) as the writer was already gone.
>>The only issue, since there's no data loss (and no data to be lost), is the infinite looping of auto replication workers over it.

You might have observed one specific case where the admin can safely remove the ledger. But in general, as we discussed earlier, this would be a tough decision point for the admin (determining whether the ledger is empty or contains entries), as we have lost quorum.


Adding one more point to this discussion: I faced a similar issue, BOOKKEEPER-733, a few days back; that is also a kind of infinite hang.
My point is that there could well be a couple more cases where the RW can enter an infinite loop, but today we know of only these two.


Hi all, I've added an initial draft proposal that came to mind to address these cases together; kindly have a look at it, I'd like to see your responses. Thanks.


-Rakesh

-----Original Message-----
From: Matteo Merli [mailto:mmerli@yahoo-inc.com] 
Sent: 07 March 2014 04:17
To: bookkeeper-dev@zookeeper.apache.org
Subject: Re: Problem in rereplication algorithm

On Mar 6, 2014, at 10:49 AM, Sijie Guo <gu...@gmail.com> wrote:
> Just be curious, isn't it handled by the writer to change ensemble? 
> Unless that the ledger is idle and not being used anymore.

>>Yes, in the case that Ivan described, the ledger was "leaked", probably created by a process that got restarted before using the ledger. These ledgers will be left in OPEN state forever (or at least >>until some admin tool can decide that the ledger was leaked and will remove it) as the writer was already gone.
>>The only issue, since there's no data loss (and no data to be lost), is the infinite looping of auto replication workers over it.

>>Matteo


On Thu, Mar 6, 2014 at 8:29 AM, Ivan Kelly <iv...@apache.org> wrote:

> > OK, this comment is not entirely clear to me. I thought in your 
> > example you had ensemble 3, quorum 2, and you had lost both B2 and 
> > B3. In that case, you already lost quorum. Not for L1, but at that 
> > point there are cases in which you don't know if you've lost a 
> > record. In the specific scenario you describe, we know there is no 
> > record 1 because there is no record 0, fine. But, if you had a 
> > record 0, then we wouldn't know if we lost a record and consequently 
> > the ledger is broken. We may be able to fix this particular case by 
> > simply (not) replicating what we have and declaring success, but it 
> > is not a general solution, I'm afraid.
> After we lose the first bookie, B3, we are able to detect that the 
> ledger is empty and that a bookie is down. However, we don't do 
> anything at this point, because the bookie which is down isn't in the 
> quorum for the first entry of the ledger. The problem, is that we only 
> ever start to perceive the problem when the second bookie, B2 goes 
> down.
>
> My point is that we need to deal with the issue when the first bookie 
> goes down.
>

Just be curious, isn't it handled by the writer to change ensemble? Unless that the ledger is idle and not being used anymore.


>
> >
> > >>
> > >>
> > >>>> the postponing is already there, since the ledger couldn't be
> opened and fenced.
> > >>
> > >> Yeah Sijie you are right, it will postpone to next cycle.
> > >> AFAIK AutoRecovery feature will keep on trying to open it again 
> > >> and again, this cycle will never ends. It is a kind of hanging too.
> > > Actually, it's a little worse than that. The recovery worker will 
> > > acquire the lock on the unreplicated node, try to open, release 
> > > the lock, and repeat ad infinitum, without any pause between 
> > > loops. This will create a lot of write traffic on zookeeper for the locks.
> >
> >
> > Ok, thanks for the clarification. Having an unbounded number of 
> > attempts is definitely not good. Independent of how we solve this 
> > problem, I was thinking about keeping track of the number of 
> > attempts.
> Ya, adding a ratelimiter would probably be enough.
>
>
> -Ivan
>

Re: Problem in rereplication algorithm

Posted by Matteo Merli <mm...@yahoo-inc.com>.
On Mar 6, 2014, at 10:49 AM, Sijie Guo <gu...@gmail.com> wrote:
> Just be curious, isn't it handled by the writer to change ensemble? Unless
> that the ledger is idle and not being used anymore.

Yes, in the case that Ivan described, the ledger was "leaked", probably created by a process that got restarted before using the ledger. These ledgers will be left in the OPEN state forever (or at least until some admin tool decides that the ledger was leaked and removes it), as the writer is already gone.
The only issue, since there's no data loss (and no data to be lost), is the infinite looping of auto replication workers over it.

Matteo


Re: Problem in rereplication algorithm

Posted by Sijie Guo <gu...@gmail.com>.
On Thu, Mar 6, 2014 at 8:29 AM, Ivan Kelly <iv...@apache.org> wrote:

> > OK, this comment is not entirely clear to me. I thought in your
> > example you had ensemble 3, quorum 2, and you had lost both B2 and
> > B3. In that case, you already lost quorum. Not for L1, but at that
> > point there are cases in which you don't know if you've lost a
> > record. In the specific scenario you describe, we know there is no
> > record 1 because there is no record 0, fine. But, if you had a
> > record 0, then we wouldn't know if we lost a record and consequently
> > the ledger is broken. We may be able to fix this particular case by
> > simply (not) replicating what we have and declaring success, but it
> > is not a general solution, I'm afraid.
> After we lose the first bookie, B3, we are able to detect that the
> ledger is empty and that a bookie is down. However, we don't do
> anything at this point, because the bookie which is down isn't in the
> quorum for the first entry of the ledger. The problem, is that we only
> ever start to perceive the problem when the second bookie, B2 goes
> down.
>
> My point is that we need to deal with the issue when the first bookie
> goes down.
>

Just curious: isn't it up to the writer to handle this by changing the
ensemble? Unless the ledger is idle and not being used anymore.


>
> >
> > >>
> > >>
> > >>>> the postponing is already there, since the ledger couldn't be
> opened and fenced.
> > >>
> > >> Yeah Sijie you are right, it will postpone to next cycle.
> > >> AFAIK AutoRecovery feature will keep on trying to open it again and
> > >> again, this cycle will never ends. It is a kind of hanging too.
> > > Actually, it's a little worse than that. The recovery worker will
> > > acquire the lock on the unreplicated node, try to open, release the
> > > lock, and repeat ad infinitum, without any pause between loops. This
> > > will create a lot of write traffic on zookeeper for the locks.
> >
> >
> > Ok, thanks for the clarification. Having an unbounded number of
> > attempts is definitely not good. Independent of how we solve this
> > problem, I was thinking about keeping track of the number of
> > attempts.
> Ya, adding a ratelimiter would probably be enough.
>
>
> -Ivan
>

Re: Problem in rereplication algorithm

Posted by Ivan Kelly <iv...@apache.org>.
> OK, this comment is not entirely clear to me. I thought in your
> example you had ensemble 3, quorum 2, and you had lost both B2 and
> B3. In that case, you already lost quorum. Not for L1, but at that
> point there are cases in which you don't know if you've lost a
> record. In the specific scenario you describe, we know there is no
> record 1 because there is no record 0, fine. But, if you had a
> record 0, then we wouldn't know if we lost a record and consequently
> the ledger is broken. We may be able to fix this particular case by
> simply (not) replicating what we have and declaring success, but it
> is not a general solution, I'm afraid. 
After we lose the first bookie, B3, we are able to detect that the
ledger is empty and that a bookie is down. However, we don't do
anything at this point, because the bookie which is down isn't in the
quorum for the first entry of the ledger. The problem is that we only
start to perceive it when the second bookie, B2, goes down.

My point is that we need to deal with the issue when the first bookie
goes down.

> 
> >> 
> >> 
> >>>> the postponing is already there, since the ledger couldn't be opened and fenced.
> >> 
> >> Yeah Sijie you are right, it will postpone to next cycle. 
> >> AFAIK AutoRecovery feature will keep on trying to open it again and
> >> again, this cycle will never ends. It is a kind of hanging too.
> > Actually, it's a little worse than that. The recovery worker will
> > acquire the lock on the unreplicated node, try to open, release the
> > lock, and repeat ad infinitum, without any pause between loops. This
> > will create a lot of write traffic on zookeeper for the locks.
> 
> 
> Ok, thanks for the clarification. Having an unbounded number of
> attempts is definitely not good. Independent of how we solve this
> problem, I was thinking about keeping track of the number of
> attempts.
Yeah, adding a rate limiter would probably be enough.
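
Something along these lines, say with Guava's RateLimiter; this is only
a sketch, and lockLedger/tryRecover/releaseLedger are placeholders for
whatever the replication worker actually does:

    import com.google.common.util.concurrent.RateLimiter;

    public class ThrottledRecoverySketch {
        // Bound how often the worker grabs the ZK lock and retries the open,
        // instead of spinning as fast as ZooKeeper will let it.
        private final RateLimiter retryLimiter = RateLimiter.create(1.0); // 1 attempt/sec

        void runOnce(long ledgerId) {
            retryLimiter.acquire();   // blocks until a permit is available
            lockLedger(ledgerId);     // placeholder for the ZK lock step
            try {
                tryRecover(ledgerId); // placeholder for open-with-fencing + rereplicate
            } finally {
                releaseLedger(ledgerId);
            }
        }

        // Placeholders so the sketch compiles; the real worker talks to ZK here.
        void lockLedger(long ledgerId) {}
        void tryRecover(long ledgerId) {}
        void releaseLedger(long ledgerId) {}
    }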


-Ivan

Re: Problem in rereplication algorithm

Posted by Flavio Junqueira <fp...@yahoo.com>.
On 06 Mar 2014, at 01:36, Ivan Kelly <iv...@apache.org> wrote:

> On Thu, Mar 06, 2014 at 08:44:18AM +0000, Rakesh R wrote:
>>>>> I already pointed out. the admin should be aware of potential data loss. so no confidence.
>> 
>> In HDFS shared storage perspective, data loss is not acceptable.
> I agree. A manual tool don't really help (right now the admin just
> deletes the underreplicated node).
> 
> My thoughts on the case, it that, even though there's nothing to
> recover after the first bookie goes down, we should replace the bookie
> in the ensemble, so that if another bookie in the ensemble changes, we
> don't lose quorum. Once quorum is lost, all bets are off.
> 

OK, this comment is not entirely clear to me. I thought in your example you had ensemble 3, quorum 2, and you had lost both B2 and B3. In that case, you already lost quorum. Not for L1, but at that point there are cases in which you don't know if you've lost a record. In the specific scenario you describe, we know there is no record 1 because there is no record 0, fine. But, if you had a record 0, then we wouldn't know if we lost a record and consequently the ledger is broken. We may be able to fix this particular case by simply (not) replicating what we have and declaring success, but it is not a general solution, I'm afraid.

>> 
>> 
>>>> the postponing is already there, since the ledger couldn't be opened and fenced.
>> 
>> Yeah Sijie you are right, it will postpone to next cycle. 
>> AFAIK AutoRecovery feature will keep on trying to open it again and
>> again, this cycle will never ends. It is a kind of hanging too.
> Actually, it's a little worse than that. The recovery worker will
> acquire the lock on the unreplicated node, try to open, release the
> lock, and repeat ad infinitum, without any pause between loops. This
> will create a lot of write traffic on zookeeper for the locks.


Ok, thanks for the clarification. Having an unbounded number of attempts is definitely not good. Independent of how we solve this problem, I was thinking about keeping track of the number of attempts.

-Flavio


> 
> -Ivan


Re: Problem in rereplication algorithm

Posted by Ivan Kelly <iv...@apache.org>.
On Thu, Mar 06, 2014 at 08:44:18AM +0000, Rakesh R wrote:
> >>> I already pointed out. the admin should be aware of potential data loss. so no confidence.
> 
> In HDFS shared storage perspective, data loss is not acceptable.
I agree. A manual tool doesn't really help (right now the admin just
deletes the underreplicated node).

My thought on this case is that, even though there's nothing to
recover after the first bookie goes down, we should replace the bookie
in the ensemble, so that if another bookie in the ensemble fails, we
don't lose quorum. Once quorum is lost, all bets are off.
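
Roughly something like the following on the metadata, even though no
entries get copied; this is only a sketch, and B4 stands in for whatever
bookie the placement policy would pick. The updated ensemble would then
be written back with the usual conditional metadata update.

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    public class EnsembleReplacementSketch {
        // Swap the dead bookie out of the (empty) open segment's ensemble so
        // that a later failure of another bookie doesn't take out a whole quorum.
        static List<String> replaceDeadBookie(List<String> ensemble, String dead, String replacement) {
            List<String> updated = new ArrayList<>(ensemble);
            int idx = updated.indexOf(dead);
            if (idx >= 0) {
                updated.set(idx, replacement);
            }
            return updated;
        }

        public static void main(String[] args) {
            // 0: B1, B2, B3  ->  0: B1, B2, B4
            System.out.println(replaceDeadBookie(Arrays.asList("B1", "B2", "B3"), "B3", "B4"));
        }
    }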

> 
> 
> >> the postponing is already there, since the ledger couldn't be opened and fenced.
> 
> Yeah Sijie you are right, it will postpone to next cycle. 
> AFAIK AutoRecovery feature will keep on trying to open it again and
> again, this cycle will never ends. It is a kind of hanging too.
Actually, it's a little worse than that. The recovery worker will
acquire the lock on the unreplicated node, try to open, release the
lock, and repeat ad infinitum, without any pause between loops. This
will create a lot of write traffic on zookeeper for the locks.

-Ivan

RE: Problem in rereplication algorithm

Posted by Rakesh R <ra...@huawei.com>.
>>> I already pointed out. the admin should be aware of potential data loss. so no confidence.

From the HDFS shared-storage perspective, data loss is not acceptable.


>> the postponing is already there, since the ledger couldn't be opened and fenced.

Yeah, Sijie, you are right, it will be postponed to the next cycle.
AFAIK the AutoRecovery feature will keep trying to open it again and again, and this cycle never ends. It is a kind of hang too.

-Rakesh

-----Original Message-----
From: Sijie Guo [mailto:guosijie@gmail.com] 
Sent: 06 March 2014 13:50
To: bookkeeper-dev@zookeeper.apache.org
Subject: Re: Problem in rereplication algorithm

On Wed, Mar 5, 2014 at 9:16 PM, Rakesh R <ra...@huawei.com> wrote:

>
> If the failure is more than the tolerated failures, it would not be 
> safe to go ahead with any cleanup.
> For ex, quorum size is 2 and say failed 2 bookies out of 3, according 
> to me for this ledger allowed failure is only 1.
>
> Also, please someone tell me, how the admin will get the confidence to 
> safely do any cleanups.


I already pointed out. the admin should be aware of potential data loss. so no confidence.


> IMHO postponing the recovery would be safe.
>

the postponing is already there, since the ledger couldn't be opened and fenced.


>
> -Rakesh
>
> -----Original Message-----
> From: Uma Maheswara Rao G [mailto:hadoop.uma@gmail.com]
> Sent: 06 March 2014 10:05
> To: bookkeeper-dev@zookeeper.apache.org
> Subject: Re: Problem in rereplication algorithm
>
> >As Sijie pointed out, we lost quorum, so the ledger is not good any
> longer.
> Because we might not be able to detect such cases automatically, I was 
> wondering if we need to manually delete it.
>
> Yes. As Sijie and Flavio pointed out , how about providing a tool to 
> clean such ledgers.
> At the same time I agree, we have to think some automatic way to 
> detect it as we claim the feature as Auto.
>

at any time, if the quorum requirement is broken, we shouldn't do any auto things. leave it to human.

>
> or shall we delay such quorum failure ledgers replication cycle 
> incrementally by somehow tracking time in underreplication ledger 
> nodes? [ I am not very sure on this, we have to think more]
>
> Regards,
> Uma
>
>
>
> On Thu, Mar 6, 2014 at 7:35 AM, Flavio Junqueira 
> <fpjunqueira@yahoo.com
> >wrote:
>
> > I'm not sure what the desirable outcome is here. When you say that 
> > the underreplicated L1 node hangs around forever, does it mean that 
> > we keep trying to create new replicas?
>

The hang means that the ledger couldn't be opened and fenced.


> >
> > As Sijie pointed out, we lost quorum, so the ledger is not good any
> longer.
> > Because we might not be able to detect such cases automatically, I 
> > was wondering if we need to manually delete it.
> >
> > -Flavio
> >
> >
> > -----Original Message-----
> > From: Ivan Kelly [mailto:ivank@apache.org]
> > Sent: Wednesday, March 5, 2014 5:15 AM
> > To: bookkeeper-dev@zookeeper.apache.org
> > Subject: Problem in rereplication algorithm
> >
> > Hi folks,
> >
> > We've come across a problem in autorecovery, which I've been banging 
> > my head against for the last day so I decided to open it up to 
> > everyone to see if a solution is any clearer.
> >
> > The problem was observed in production, and while it doesn't cause 
> > data loss, it does appear to the admin as if entries have been lost.
> >
> > = Problem scenario =
> >
> > You have a ledger L1. There is one segment in the ledger with quorum 
> > 2, ensemble 3 starting at entry 0. This segment is on the bookie B1,
> > B2 & B3. So metadata looks like
> >
> > 0: B1, B2, B3
> >
> > No data has been written to the ledger.
> >
> > B3 crashes. The auditor notes that L1 contains a segment with B3, so 
> > scheduled the ledger to be checked. A recovery worker opens the 
> > ledger without fencing. The recovery worker sees that the segment is 
> > still open and that the lastAddConfirmed is less than the segment 
> > start id, so it reads forward. Ultimately it gets a lastAddConfirmed 
> > which is less than the segment start id, as all bookies in the 
> > quorum [B1,B2] respond with NoSuchEntry for entry 0. So the recovery 
> > worker sees that there are no underreplicated fragments, so there's 
> > nothing to recovery. So far, so good.
> >
> > But now consider if B2 crashes. L1 will be scheduled to be checked 
> > again. A recovery worker will try to open with fencing. It won't be 
> > able to reach all quorums; [B2, B3] is now unavailable. Open will 
> > fail.
> >
> > As a result, the underreplicated node for L1 hangs around forever.
> >
> > I have some ideas for a fix, but none is straightforward, so I'd 
> > like to hear other opinions first.
> >
> > -Ivan
> >
> >
>

Re: Problem in rereplication algorithm

Posted by Sijie Guo <gu...@gmail.com>.
On Wed, Mar 5, 2014 at 9:16 PM, Rakesh R <ra...@huawei.com> wrote:

>
> If the failure is more than the tolerated failures, it would not be safe
> to go ahead with any cleanup.
> For ex, quorum size is 2 and say failed 2 bookies out of 3, according to
> me for this ledger allowed failure is only 1.
>
> Also, please someone tell me, how the admin will get the confidence to
> safely do any cleanups.


I already pointed out that the admin should be aware of potential data
loss, so there is no confidence to be had.


> IMHO postponing the recovery would be safe.
>

the postponing is already there, since the ledger couldn't be opened and
fenced.


>
> -Rakesh
>
> -----Original Message-----
> From: Uma Maheswara Rao G [mailto:hadoop.uma@gmail.com]
> Sent: 06 March 2014 10:05
> To: bookkeeper-dev@zookeeper.apache.org
> Subject: Re: Problem in rereplication algorithm
>
> >As Sijie pointed out, we lost quorum, so the ledger is not good any
> longer.
> Because we might not be able to detect such cases automatically, I was
> wondering if we need to manually delete it.
>
> Yes. As Sijie and Flavio pointed out , how about providing a tool to clean
> such ledgers.
> At the same time I agree, we have to think some automatic way to detect it
> as we claim the feature as Auto.
>

At any time, if the quorum requirement is broken, we shouldn't do anything
automatically; leave it to a human.

>
> or shall we delay such quorum failure ledgers replication cycle
> incrementally by somehow tracking time in underreplication ledger nodes? [
> I am not very sure on this, we have to think more]
>
> Regards,
> Uma
>
>
>
> On Thu, Mar 6, 2014 at 7:35 AM, Flavio Junqueira <fpjunqueira@yahoo.com
> >wrote:
>
> > I'm not sure what the desirable outcome is here. When you say that the
> > underreplicated L1 node hangs around forever, does it mean that we
> > keep trying to create new replicas?
>

The hang means that the ledger couldn't be opened and fenced.


> >
> > As Sijie pointed out, we lost quorum, so the ledger is not good any
> longer.
> > Because we might not be able to detect such cases automatically, I was
> > wondering if we need to manually delete it.
> >
> > -Flavio
> >
> >
> > -----Original Message-----
> > From: Ivan Kelly [mailto:ivank@apache.org]
> > Sent: Wednesday, March 5, 2014 5:15 AM
> > To: bookkeeper-dev@zookeeper.apache.org
> > Subject: Problem in rereplication algorithm
> >
> > Hi folks,
> >
> > We've come across a problem in autorecovery, which I've been banging
> > my head against for the last day so I decided to open it up to
> > everyone to see if a solution is any clearer.
> >
> > The problem was observed in production, and while it doesn't cause
> > data loss, it does appear to the admin as if entries have been lost.
> >
> > = Problem scenario =
> >
> > You have a ledger L1. There is one segment in the ledger with quorum
> > 2, ensemble 3 starting at entry 0. This segment is on the bookie B1,
> > B2 & B3. So metadata looks like
> >
> > 0: B1, B2, B3
> >
> > No data has been written to the ledger.
> >
> > B3 crashes. The auditor notes that L1 contains a segment with B3, so
> > scheduled the ledger to be checked. A recovery worker opens the ledger
> > without fencing. The recovery worker sees that the segment is still
> > open and that the lastAddConfirmed is less than the segment start id,
> > so it reads forward. Ultimately it gets a lastAddConfirmed which is
> > less than the segment start id, as all bookies in the quorum [B1,B2]
> > respond with NoSuchEntry for entry 0. So the recovery worker sees that
> > there are no underreplicated fragments, so there's nothing to
> > recovery. So far, so good.
> >
> > But now consider if B2 crashes. L1 will be scheduled to be checked
> > again. A recovery worker will try to open with fencing. It won't be
> > able to reach all quorums; [B2, B3] is now unavailable. Open will
> > fail.
> >
> > As a result, the underreplicated node for L1 hangs around forever.
> >
> > I have some ideas for a fix, but none is straightforward, so I'd like
> > to hear other opinions first.
> >
> > -Ivan
> >
> >
>

RE: Problem in rereplication algorithm

Posted by Rakesh R <ra...@huawei.com>.
If the failures exceed the tolerated failures, it would not be safe to go ahead with any cleanup.
For example, if the quorum size is 2 and, say, 2 bookies out of 3 have failed, then in my view the allowed failure for this ledger is only 1.

Also, could someone please tell me how the admin will get the confidence to safely do any cleanups? IMHO postponing the recovery would be safe.

-Rakesh 

-----Original Message-----
From: Uma Maheswara Rao G [mailto:hadoop.uma@gmail.com] 
Sent: 06 March 2014 10:05
To: bookkeeper-dev@zookeeper.apache.org
Subject: Re: Problem in rereplication algorithm

>As Sijie pointed out, we lost quorum, so the ledger is not good any longer.
Because we might not be able to detect such cases automatically, I was wondering if we need to manually delete it.

Yes. As Sijie and Flavio pointed out , how about providing a tool to clean such ledgers.
At the same time I agree, we have to think some automatic way to detect it as we claim the feature as Auto.

or shall we delay such quorum failure ledgers replication cycle incrementally by somehow tracking time in underreplication ledger nodes? [ I am not very sure on this, we have to think more]

Regards,
Uma



On Thu, Mar 6, 2014 at 7:35 AM, Flavio Junqueira <fp...@yahoo.com>wrote:

> I'm not sure what the desirable outcome is here. When you say that the 
> underreplicated L1 node hangs around forever, does it mean that we 
> keep trying to create new replicas?
>
> As Sijie pointed out, we lost quorum, so the ledger is not good any longer.
> Because we might not be able to detect such cases automatically, I was 
> wondering if we need to manually delete it.
>
> -Flavio
>
>
> -----Original Message-----
> From: Ivan Kelly [mailto:ivank@apache.org]
> Sent: Wednesday, March 5, 2014 5:15 AM
> To: bookkeeper-dev@zookeeper.apache.org
> Subject: Problem in rereplication algorithm
>
> Hi folks,
>
> We've come across a problem in autorecovery, which I've been banging 
> my head against for the last day so I decided to open it up to 
> everyone to see if a solution is any clearer.
>
> The problem was observed in production, and while it doesn't cause 
> data loss, it does appear to the admin as if entries have been lost.
>
> = Problem scenario =
>
> You have a ledger L1. There is one segment in the ledger with quorum 
> 2, ensemble 3 starting at entry 0. This segment is on the bookie B1,
> B2 & B3. So metadata looks like
>
> 0: B1, B2, B3
>
> No data has been written to the ledger.
>
> B3 crashes. The auditor notes that L1 contains a segment with B3, so 
> scheduled the ledger to be checked. A recovery worker opens the ledger 
> without fencing. The recovery worker sees that the segment is still 
> open and that the lastAddConfirmed is less than the segment start id, 
> so it reads forward. Ultimately it gets a lastAddConfirmed which is 
> less than the segment start id, as all bookies in the quorum [B1,B2] 
> respond with NoSuchEntry for entry 0. So the recovery worker sees that 
> there are no underreplicated fragments, so there's nothing to 
> recovery. So far, so good.
>
> But now consider if B2 crashes. L1 will be scheduled to be checked 
> again. A recovery worker will try to open with fencing. It won't be 
> able to reach all quorums; [B2, B3] is now unavailable. Open will 
> fail.
>
> As a result, the underreplicated node for L1 hangs around forever.
>
> I have some ideas for a fix, but none is straightforward, so I'd like 
> to hear other opinions first.
>
> -Ivan
>
>

Re: Problem in rereplication algorithm

Posted by Uma Maheswara Rao G <ha...@gmail.com>.
>As Sijie pointed out, we lost quorum, so the ledger is not good any longer.
Because we might not be able to detect such cases automatically, I was
wondering if we need to manually delete it.

Yes. As Sijie and Flavio pointed out, how about providing a tool to clean
up such ledgers?
At the same time, I agree we have to think of some automatic way to detect
it, since we claim the feature to be automatic.

Or shall we delay the replication cycle for such quorum-failure ledgers
incrementally, by somehow tracking time in the underreplication ledger
nodes? [I am not very sure about this; we have to think more.]
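
Something like keeping an attempt count on the underreplicated node and
backing off exponentially up to a cap, perhaps. A sketch of the idea
only, not a worked-out design:

    public class ReplicationBackoffSketch {
        // Delay grows with the number of failed attempts recorded against the
        // underreplication node, up to a cap, instead of retrying immediately.
        static long nextDelayMs(int attempts, long baseMs, long maxMs) {
            long delay = baseMs << Math.min(attempts, 20); // bound the shift to avoid overflow
            return Math.min(delay, maxMs);
        }

        public static void main(String[] args) {
            // base 1s, cap 1h: prints 1000, 2000, 4000, 8000, 16000
            for (int attempts = 0; attempts < 5; attempts++) {
                System.out.println(nextDelayMs(attempts, 1000, 3_600_000));
            }
        }
    }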

Regards,
Uma



On Thu, Mar 6, 2014 at 7:35 AM, Flavio Junqueira <fp...@yahoo.com>wrote:

> I'm not sure what the desirable outcome is here. When you say that the
> underreplicated L1 node hangs around forever, does it mean that we keep
> trying to create new replicas?
>
> As Sijie pointed out, we lost quorum, so the ledger is not good any longer.
> Because we might not be able to detect such cases automatically, I was
> wondering if we need to manually delete it.
>
> -Flavio
>
>
> -----Original Message-----
> From: Ivan Kelly [mailto:ivank@apache.org]
> Sent: Wednesday, March 5, 2014 5:15 AM
> To: bookkeeper-dev@zookeeper.apache.org
> Subject: Problem in rereplication algorithm
>
> Hi folks,
>
> We've come across a problem in autorecovery, which I've been banging my
> head
> against for the last day so I decided to open it up to everyone to see if a
> solution is any clearer.
>
> The problem was observed in production, and while it doesn't cause data
> loss, it does appear to the admin as if entries have been lost.
>
> = Problem scenario =
>
> You have a ledger L1. There is one segment in the ledger with quorum 2,
> ensemble 3 starting at entry 0. This segment is on the bookie B1,
> B2 & B3. So metadata looks like
>
> 0: B1, B2, B3
>
> No data has been written to the ledger.
>
> B3 crashes. The auditor notes that L1 contains a segment with B3, so
> scheduled the ledger to be checked. A recovery worker opens the ledger
> without fencing. The recovery worker sees that the segment is still open
> and
> that the lastAddConfirmed is less than the segment start id, so it reads
> forward. Ultimately it gets a lastAddConfirmed which is less than the
> segment start id, as all bookies in the quorum [B1,B2] respond with
> NoSuchEntry for entry 0. So the recovery worker sees that there are no
> underreplicated fragments, so there's nothing to recovery. So far, so good.
>
> But now consider if B2 crashes. L1 will be scheduled to be checked again. A
> recovery worker will try to open with fencing. It won't be able to reach
> all
> quorums; [B2, B3] is now unavailable. Open will fail.
>
> As a result, the underreplicated node for L1 hangs around forever.
>
> I have some ideas for a fix, but none is straightforward, so I'd like to
> hear other opinions first.
>
> -Ivan
>
>

RE: Problem in rereplication algorithm

Posted by Flavio Junqueira <fp...@yahoo.com>.
I'm not sure what the desirable outcome is here. When you say that the
underreplicated L1 node hangs around forever, does it mean that we keep
trying to create new replicas?

As Sijie pointed out, we lost quorum, so the ledger is not good any longer.
Because we might not be able to detect such cases automatically, I was
wondering if we need to manually delete it.

-Flavio


-----Original Message-----
From: Ivan Kelly [mailto:ivank@apache.org] 
Sent: Wednesday, March 5, 2014 5:15 AM
To: bookkeeper-dev@zookeeper.apache.org
Subject: Problem in rereplication algorithm

Hi folks,

We've come across a problem in autorecovery, which I've been banging my head
against for the last day so I decided to open it up to everyone to see if a
solution is any clearer.

The problem was observed in production, and while it doesn't cause data
loss, it does appear to the admin as if entries have been lost.

= Problem scenario =

You have a ledger L1. There is one segment in the ledger with quorum 2,
ensemble 3 starting at entry 0. This segment is on the bookie B1,
B2 & B3. So metadata looks like

0: B1, B2, B3

No data has been written to the ledger.

B3 crashes. The auditor notes that L1 contains a segment with B3, so
scheduled the ledger to be checked. A recovery worker opens the ledger
without fencing. The recovery worker sees that the segment is still open and
that the lastAddConfirmed is less than the segment start id, so it reads
forward. Ultimately it gets a lastAddConfirmed which is less than the
segment start id, as all bookies in the quorum [B1,B2] respond with
NoSuchEntry for entry 0. So the recovery worker sees that there are no
underreplicated fragments, so there's nothing to recovery. So far, so good.

But now consider if B2 crashes. L1 will be scheduled to be checked again. A
recovery worker will try to open with fencing. It won't be able to reach all
quorums; [B2, B3] is now unavailable. Open will fail. 

As a result, the underreplicated node for L1 hangs around forever.

I have some ideas for a fix, but none is straightforward, so I'd like to
hear other opinions first.

-Ivan


Re: Problem in rereplication algorithm

Posted by Sijie Guo <gu...@gmail.com>.
In your case, you have already lost a quorum. Any action here could cause
potential data loss. If you really want to address it, provide a tool that
lets the admin force-close the ledger, while being aware of the potential
data loss.
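
As a sketch of what such a tool could do (the metadata interfaces below
are hypothetical placeholders, not the real admin API): read the
metadata, close the ledger at the last entry before the open segment,
and require the admin to explicitly acknowledge the possible loss.

    public class ForceCloseSketch {
        // Hypothetical view of the ledger metadata; stands in for whatever
        // the real metadata store returns.
        interface LedgerMeta {
            long firstEntryOfOpenSegment();
            LedgerMeta closedAt(long lastEntryId);
        }

        interface MetaStore {
            LedgerMeta read(long ledgerId);
            void writeIfUnchanged(long ledgerId, LedgerMeta updated); // CAS-style update
        }

        // Force-closing discards anything that might have existed in the open
        // segment, so it must only run with an explicit acknowledgement.
        static void forceClose(MetaStore store, long ledgerId, boolean ackDataLoss) {
            if (!ackDataLoss) {
                throw new IllegalStateException("refusing to force-close without explicit ack");
            }
            LedgerMeta meta = store.read(ledgerId);
            long lastEntry = meta.firstEntryOfOpenSegment() - 1; // -1 in this scenario, i.e. closed as empty
            store.writeIfUnchanged(ledgerId, meta.closedAt(lastEntry));
        }
    }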

- Sijie


On Wed, Mar 5, 2014 at 10:01 AM, Ivan Kelly <iv...@apache.org> wrote:

> It was during the open that it failed, but it was at the
> readLastAddConfirmed part, not at recovery, as recovery didn't run
> because it was opening without fencing.
>
> -Ivan
>
> On Wed, Mar 05, 2014 at 02:50:26PM +0000, Rakesh R wrote:
> > Hi Ivan,
> >
> > I hope the following would have happened in your env.
> >
> > During fencing, ReplicationWorker(RW) is hitting the exception
> "org.apache.bookkeeper.client.BKException$BKLedgerRecoveryException"
> > as ledger did not hear success responses from all quorums. Now again and
> again RW will try to do fence and this cycle never ends, isn't it ?
> >
> >
> > If that is the case, I think graceful fencing will be difficult we may
> need to find some alternate way of handling this case.
> >
> >
> > -Rakesh
> >
> > -----Original Message-----
> > From: Ivan Kelly [mailto:ivank@apache.org]
> > Sent: 05 March 2014 18:45
> > To: bookkeeper-dev@zookeeper.apache.org
> > Subject: Problem in rereplication algorithm
> >
> > Hi folks,
> >
> > We've come across a problem in autorecovery, which I've been banging my
> head against for the last day so I decided to open it up to everyone to see
> if a solution is any clearer.
> >
> > The problem was observed in production, and while it doesn't cause data
> loss, it does appear to the admin as if entries have been lost.
> >
> > = Problem scenario =
> >
> > You have a ledger L1. There is one segment in the ledger with quorum 2,
> ensemble 3 starting at entry 0. This segment is on the bookie B1,
> > B2 & B3. So metadata looks like
> >
> > 0: B1, B2, B3
> >
> > No data has been written to the ledger.
> >
> > B3 crashes. The auditor notes that L1 contains a segment with B3, so
> scheduled the ledger to be checked. A recovery worker opens the ledger
> without fencing. The recovery worker sees that the segment is still open
> and that the lastAddConfirmed is less than the segment start id, so it
> reads forward. Ultimately it gets a lastAddConfirmed which is less than the
> segment start id, as all bookies in the quorum [B1,B2] respond with
> NoSuchEntry for entry 0. So the recovery worker sees that there are no
> underreplicated fragments, so there's nothing to recovery. So far, so good.
> >
> > But now consider if B2 crashes. L1 will be scheduled to be checked
> again. A recovery worker will try to open with fencing. It won't be able to
> reach all quorums; [B2, B3] is now unavailable. Open will fail.
> >
> > As a result, the underreplicated node for L1 hangs around forever.
> >
> > I have some ideas for a fix, but none is straightforward, so I'd like to
> hear other opinions first.
> >
> > -Ivan
>

Re: Problem in rereplication algorithm

Posted by Ivan Kelly <iv...@apache.org>.
It was during the open that it failed, but it was at the
readLastAddConfirmed part, not at recovery, as recovery didn't run
because it was opening without fencing.

-Ivan

On Wed, Mar 05, 2014 at 02:50:26PM +0000, Rakesh R wrote:
> Hi Ivan,
> 
> I hope the following would have happened in your env.
> 
> During fencing, ReplicationWorker(RW) is hitting the exception "org.apache.bookkeeper.client.BKException$BKLedgerRecoveryException" 
> as ledger did not hear success responses from all quorums. Now again and again RW will try to do fence and this cycle never ends, isn't it ?
> 
> 
> If that is the case, I think graceful fencing will be difficult we may need to find some alternate way of handling this case.
> 
> 
> -Rakesh
> 
> -----Original Message-----
> From: Ivan Kelly [mailto:ivank@apache.org] 
> Sent: 05 March 2014 18:45
> To: bookkeeper-dev@zookeeper.apache.org
> Subject: Problem in rereplication algorithm
> 
> Hi folks,
> 
> We've come across a problem in autorecovery, which I've been banging my head against for the last day so I decided to open it up to everyone to see if a solution is any clearer.
> 
> The problem was observed in production, and while it doesn't cause data loss, it does appear to the admin as if entries have been lost.
> 
> = Problem scenario =
> 
> You have a ledger L1. There is one segment in the ledger with quorum 2, ensemble 3 starting at entry 0. This segment is on the bookie B1,
> B2 & B3. So metadata looks like
> 
> 0: B1, B2, B3
> 
> No data has been written to the ledger.
> 
> B3 crashes. The auditor notes that L1 contains a segment with B3, so scheduled the ledger to be checked. A recovery worker opens the ledger without fencing. The recovery worker sees that the segment is still open and that the lastAddConfirmed is less than the segment start id, so it reads forward. Ultimately it gets a lastAddConfirmed which is less than the segment start id, as all bookies in the quorum [B1,B2] respond with NoSuchEntry for entry 0. So the recovery worker sees that there are no underreplicated fragments, so there's nothing to recovery. So far, so good.
> 
> But now consider if B2 crashes. L1 will be scheduled to be checked again. A recovery worker will try to open with fencing. It won't be able to reach all quorums; [B2, B3] is now unavailable. Open will fail. 
> 
> As a result, the underreplicated node for L1 hangs around forever.
> 
> I have some ideas for a fix, but none is straightforward, so I'd like to hear other opinions first.
> 
> -Ivan

RE: Problem in rereplication algorithm

Posted by Rakesh R <ra...@huawei.com>.
Hi Ivan,

I presume the following is what happened in your environment.

During fencing, the ReplicationWorker (RW) hits the exception "org.apache.bookkeeper.client.BKException$BKLedgerRecoveryException",
as the ledger did not get success responses from all quorums. The RW will then try to fence again and again, and this cycle never ends, isn't it?


If that is the case, I think graceful fencing will be difficult; we may need to find some alternate way of handling this case.


-Rakesh

-----Original Message-----
From: Ivan Kelly [mailto:ivank@apache.org] 
Sent: 05 March 2014 18:45
To: bookkeeper-dev@zookeeper.apache.org
Subject: Problem in rereplication algorithm

Hi folks,

We've come across a problem in autorecovery, which I've been banging my head against for the last day so I decided to open it up to everyone to see if a solution is any clearer.

The problem was observed in production, and while it doesn't cause data loss, it does appear to the admin as if entries have been lost.

= Problem scenario =

You have a ledger L1. There is one segment in the ledger with quorum 2, ensemble 3 starting at entry 0. This segment is on the bookie B1,
B2 & B3. So metadata looks like

0: B1, B2, B3

No data has been written to the ledger.

B3 crashes. The auditor notes that L1 contains a segment with B3, so scheduled the ledger to be checked. A recovery worker opens the ledger without fencing. The recovery worker sees that the segment is still open and that the lastAddConfirmed is less than the segment start id, so it reads forward. Ultimately it gets a lastAddConfirmed which is less than the segment start id, as all bookies in the quorum [B1,B2] respond with NoSuchEntry for entry 0. So the recovery worker sees that there are no underreplicated fragments, so there's nothing to recovery. So far, so good.

But now consider if B2 crashes. L1 will be scheduled to be checked again. A recovery worker will try to open with fencing. It won't be able to reach all quorums; [B2, B3] is now unavailable. Open will fail. 

As a result, the underreplicated node for L1 hangs around forever.

I have some ideas for a fix, but none is straightforward, so I'd like to hear other opinions first.

-Ivan