Posted to dev@bookkeeper.apache.org by Ivan Kelly <iv...@apache.org> on 2012/07/03 10:50:03 UTC

Re: Race condition between LedgerChecker and Ensemble reformation from client

On Fri, Jun 29, 2012 at 07:01:45PM +0000, Uma Maheswara Rao G wrote:
> >>2. If the failed bookie is in the last ensemble of the ledger, we
> >>reopen the ledger using fencing. This stops the client from writing
> >>any further entries to the ledger. Then recovery can continue as if
> >>the ledger had already been closed.
> How can a failed BK be present in the last ensemble? The only case I
> can see is when there are multiple BK failures and ensemble
> reformation is in progress (the bookies failed 1/2 times while
> writing the same entry). Within this window, the RW may trigger and
> find the fragment underreplicated, as I explained in my previous
> post. If my understanding is correct here, how about delaying the
> replication for this last fragment and retrying after some time? The
> client will have the chance to change the ensemble on the next entry
> if it is alive. So, after that delay, this fragment would no longer
> be the last fragment.
> 
> I ask because I am a bit worried about fencing in this situation; it
> will cause an unnecessary Namenode switch.
The scenario in which a failed bookie is in the final ensemble is the
case where nothing has been written to the ledger since the bookie
failed. The scenario you mentioned can also happen, so having a
grace period is a good idea. Will you open a JIRA for adding this? I
think we should try to keep each JIRA as focussed as possible to make
review and integration as straightforward as possible.

-Ivan
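
The grace period discussed above could be sketched roughly as follows.
This is only an illustration of the decision logic, not BookKeeper code:
the class name, the Action enum, the decide() method and the 30 second
grace period are all hypothetical, and the caller is assumed to re-read
the ledger metadata before each retry so that isLastFragment reflects
any ensemble change the client has made in the meantime.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical sketch; none of these names exist in BookKeeper.
public class LastFragmentGracePeriod {

    public enum Action { REPLICATE_NOW, DEFER, FENCE_AND_RECOVER }

    /**
     * Decide how to handle an under-replicated fragment.
     *
     * @param ledgerClosed    whether the ledger metadata says it is closed
     * @param isLastFragment  whether the fragment is in the last ensemble
     * @param firstSeen       when the fragment was first seen under-replicated
     * @param now             current time
     * @param gracePeriod     how long to wait for the client to change ensemble
     */
    public static Action decide(boolean ledgerClosed,
                                boolean isLastFragment,
                                Instant firstSeen,
                                Instant now,
                                Duration gracePeriod) {
        // Closed ledgers and earlier fragments cannot race with the writer.
        if (ledgerClosed || !isLastFragment) {
            return Action.REPLICATE_NOW;
        }
        // Last fragment of an open ledger: give a live client time to
        // reform the ensemble, after which this is no longer the last one.
        Duration waited = Duration.between(firstSeen, now);
        if (waited.compareTo(gracePeriod) < 0) {
            return Action.DEFER;
        }
        // Still the last fragment after the grace period: nothing has been
        // written since the failure, so fall back to fencing.
        return Action.FENCE_AND_RECOVER;
    }

    public static void main(String[] args) {
        Instant t0 = Instant.parse("2012-07-03T10:50:00Z");
        Duration grace = Duration.ofSeconds(30); // assumed value
        System.out.println(decide(false, true, t0, t0.plusSeconds(10), grace));
        System.out.println(decide(false, false, t0, t0.plusSeconds(10), grace));
        System.out.println(decide(false, true, t0, t0.plusSeconds(45), grace));
    }
}
```

With this shape, fencing (and any resulting Namenode switch) only
happens when the writer has made no progress for the whole grace period.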

RE: Race condition between LedgerChecker and Ensemble reformation from client

Posted by Uma Maheswara Rao G <ma...@huawei.com>.
Sure. I will file a separate JIRA for this.

Thanks a lot Ivan.

Regards,
Uma