Posted to dev@bookkeeper.apache.org by Ivan Kelly <iv...@apache.org> on 2017/10/06 10:07:44 UTC

Cookies and empty disks

Hi folks,

Following up from the meeting yesterday, I said I would look into the
code to verify the behaviour because there could be a correctness
problem.

I think there could be an issue. The code is convoluted, but my
understanding of it is as follows.

We check all ledger, journal and index directories for a cookie. If it
doesn't exist, it gets added to a missingCookieDirs list. We then
iterate over this list. If any directory in missingCookieDirs
isn't listed as a ledger directory in the journal dir cookies, or
isn't empty, we fail to start.

The issue is that a journal dir could be emptied and we wouldn't
detect it. It would be great if someone else could eyeball the code
and tell me I'm wrong. The code is in Bookie#checkEnvironment.

This breaks correctness. Imagine we have a ledger on b1, b2, b3.
Writer w1 is writing to the ledger.
The state of the ledger on the bookies is:

b1: e0     Fenced: false, LAC: -
b2: e0     Fenced: false, LAC: -
b3: e0     Fenced: false, LAC: -

w1 gets partitioned from the network. w2 tries to recover the ledger, so
it tries to fence it on all bookies. The message to b3 gets lost. b1 and
b2 acknowledge the fencing, so w2 continues to recover and closes the
ledger with e0 as the last entry.

b1: e0     Fenced: true, LAC: e0
b2: e0     Fenced: true, LAC: e0
b3: e0     Fenced: false, LAC: -

If w1 became unpartitioned at this point, it wouldn't be able to add a
new entry to the ledger as any quorum would see fenced on b1 or b2.

However, imagine that the fenced message is only in the journal on b2,
b2 crashes, something wipes the journal directory and then b2 comes
back up. The new state of the ledger on the bookies will be:

b1: e0     Fenced: true, LAC: e0
b2: e0     Fenced: false, LAC: -
b3: e0     Fenced: false, LAC: -

Now w1 can write a new entry, e1, and b2 & b3 would both acknowledge
it, even though the end of the ledger is e0.
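
The sequence above can be replayed with a toy model. This is only a sketch in Python, not BookKeeper code: the `Bookie` class, the `lose_journal` method and the ack quorum of 2 are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Bookie:
    """Toy bookie holding a single ledger (hypothetical, not the real class)."""
    name: str
    entries: list = field(default_factory=lambda: ["e0"])
    fenced: bool = False

    def fence(self) -> None:
        self.fenced = True

    def lose_journal(self) -> None:
        # Assumption: the fence bit lived only in the journal, so wiping
        # the journal directory silently drops it.
        self.fenced = False

    def add_entry(self, entry: str) -> bool:
        if self.fenced:
            return False  # a fenced ledger rejects adds
        self.entries.append(entry)
        return True


def w1_can_add_after_recovery(wipe_b2_journal: bool, ack_quorum: int = 2) -> bool:
    b1, b2, b3 = Bookie("b1"), Bookie("b2"), Bookie("b3")
    # w2 recovers the ledger: fencing reaches b1 and b2, the message to b3 is lost.
    b1.fence()
    b2.fence()
    if wipe_b2_journal:
        b2.lose_journal()
    # w1 comes back and tries to add e1; it needs ack_quorum acknowledgements.
    acks = sum(1 for b in (b1, b2, b3) if b.add_entry("e1"))
    return acks >= ack_quorum
```

Without the wipe, only b3 acknowledges and the add fails; with the wipe, b2 and b3 both acknowledge e1 even though the ledger was closed at e0.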

It requires many planets to be aligned for it to harm us, but we must fix this.

Regards,
Ivan

Re: Cookies and empty disks

Posted by Ivan Kelly <iv...@apache.org>.
> Just that may not be sufficient.
> 1. The UUID needs to be part of the ledger metadata so that the auditor
> knows it is looking at different bookie.
Agreed, we should include it in ensemble info.

> 2. Bookie need to know if the writes and reads from the client are intended
> for it or not. If not in your case C1 can come back to life and start to
> write without any problem.
Yes, it needs to be part of the addentry. For reads I'm not so sure
it's needed, at least from the correctness perspective.
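
A rough sketch of what carrying the instance id on the add path could look like. This is a toy Python model; `target_instance_id`, `WrongIncarnationError` and the `Bookie` class are hypothetical names and not part of the BookKeeper protocol.

```python
class WrongIncarnationError(Exception):
    """Raised when a write is addressed to a different incarnation of the bookie."""


class Bookie:
    def __init__(self, instance_id: str) -> None:
        self.instance_id = instance_id  # assumed to be generated at format time
        self.entries = {}

    def add_entry(self, ledger_id: int, entry_id: int, data: bytes,
                  target_instance_id: str) -> None:
        # Reject writes meant for an earlier incarnation at the same address.
        if target_instance_id != self.instance_id:
            raise WrongIncarnationError(
                f"write addressed to {target_instance_id}, "
                f"but this bookie is {self.instance_id}"
            )
        self.entries[(ledger_id, entry_id)] = data
```

A writer that still holds the old instance id from the ensemble info would get an error instead of an acknowledgement.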

-Ivan

Re: Cookies and empty disks

Posted by Venkateswara Rao Jujjuri <ju...@gmail.com>.
On Fri, Oct 6, 2017 at 10:01 AM, Ivan Kelly <iv...@apache.org> wrote:

> On Fri, Oct 6, 2017 at 6:35 PM, Venkateswara Rao Jujjuri
> <ju...@gmail.com> wrote:
> >> However, imagine that the fenced message is only in the journal on b2,
> >> b2 crashes, something wipes the journal directory and then b2 comes
> >> back up.
> >
> > In this case what happened?
> > 1. We have WQ = 1
> > 2. We had data loss (crash and comeup clean)
> Ah, maybe this was unclear. I meant, WQ = 2, the fencing is persisted
> fully on b1, but only onto the journal on b2.
>
> > But yeah, in addition to dataloss we have fencing violation too.
> > The problem is not just wiped journal dir, but how we recognize the
> bookie.
> > Bookie is just recognized by its ip address, not by its incarnation.
> > Bookie1 at T1  (b1t1) ; and same bookie1 at T2 after bookie format (b1t2)
> > should be two different bookies, isn't it?
> This is something I want to change also for the zk stuff. There should
> be a uuid generated for the bookie when it is formatted, so that we
> can distinguish between instances. This uuid should be part of the
> cookie. Once we detect the uuid for a bookie has changed, all ledgers
> on that bookie should be checked somehow.
>

Just that may not be sufficient.
1. The UUID needs to be part of the ledger metadata so that the auditor
knows it is looking at a different bookie.
2. The bookie needs to know whether the writes and reads from the client are
intended for it or not. Otherwise, in your case, C1 can come back to life and
start to write without any problem.



> -Ivan
>



-- 
Jvrao
---
First they ignore you, then they laugh at you, then they fight you, then
you win. - Mahatma Gandhi

Re: Cookies and empty disks

Posted by Ivan Kelly <iv...@apache.org>.
On Fri, Oct 6, 2017 at 6:35 PM, Venkateswara Rao Jujjuri
<ju...@gmail.com> wrote:
>> However, imagine that the fenced message is only in the journal on b2,
>> b2 crashes, something wipes the journal directory and then b2 comes
>> back up.
>
> In this case what happened?
> 1. We have WQ = 1
> 2. We had data loss (crash and comeup clean)
Ah, maybe this was unclear. I meant, WQ = 2, the fencing is persisted
fully on b1, but only onto the journal on b2.

> But yeah, in addition to dataloss we have fencing violation too.
> The problem is not just wiped journal dir, but how we recognize the bookie.
> Bookie is just recognized by its ip address, not by its incarnation.
> Bookie1 at T1  (b1t1) ; and same bookie1 at T2 after bookie format (b1t2)
> should be two different bookies, isn't it?
This is something I want to change also for the zk stuff. There should
be a uuid generated for the bookie when it is formatted, so that we
can distinguish between instances. This uuid should be part of the
cookie. Once we detect the uuid for a bookie has changed, all ledgers
on that bookie should be checked somehow.
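
As a sketch of the idea, assuming a cookie that carries the bookie address plus an instance id (the `Cookie` fields, `format_bookie` and `same_incarnation` are hypothetical names, written here in Python for brevity):

```python
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class Cookie:
    bookie_address: str
    instance_id: str  # assumed new field, generated once per format


def format_bookie(bookie_address: str) -> Cookie:
    # Every format produces a fresh random id, so b1t1 and b1t2 differ
    # even though the address is the same.
    return Cookie(bookie_address, str(uuid.uuid4()))


def same_incarnation(registered: Cookie, local: Cookie) -> bool:
    return (registered.bookie_address == local.bookie_address
            and registered.instance_id == local.instance_id)
```

If `same_incarnation` is false for a known address, the bookie has been reformatted and its ledgers would need checking.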

-Ivan

Re: Cookies and empty disks

Posted by Ivan Kelly <iv...@apache.org>.
>> In any case, why not instead of refusing to start if any ledgers
>> reference the bookie, on boot the bookie checks which ledgers it is
>> supposed to have,
>
> How can you do this without querying the big oracle? You can use the local
> view as source of truth. Maybe I am missing one piece

Sorry, I was unclear on this. I meant to say, that if we do go down
the big oracle route, doing this may be a better option.

-Ivan

Re: Cookies and empty disks

Posted by Enrico Olivelli <eo...@gmail.com>.
On Mon, Oct 9, 2017, 10:54 Ivan Kelly <iv...@apache.org> wrote:

> Hi folks,
>
> I was travelling over the weekend, so I didn't have a chance to reply
> to anything on this thread. First off, as Enrico said, there's a lot
> of different topics being discussed at once. Perhaps each should be
> broken into a github issue, and then we can continue each conversation
> there, as it's getting a but unwieldy for email.
>
> I've created a cookie monster project, which we can throw all the issues
> into.
> https://github.com/apache/bookkeeper/projects/1
>
> There's a few individual opinions I'd like to give here though.
>
> > Needing the check the instance of the bookie when auditing
>
> The auditor, while it does check when bookies have disappeared, it
> also periodically checks all ledgers by reading the first and last
> entry of each segment. So even if a bookie has resurrected, the
> auditor will find that it is missing entries it is supposed to have.
>
> > UUID in ledger metadata
>
> At least for the write path, I'm not sure if this is needed, but
> consider the following.
>
> Only one writer can "vote" on the entries of the ledger. Other writers
> are fencing writers. A fencing writer has to hit a majority of bookies
> to proceed to closing the ledger. Unless a majority have been wiped,
> it will not proceed to close as an empty ledger. However, if a
> majority have been wiped, the correct behaviour would be for it not be
> possible to close the ledger, as we cannot know what the end of the
> ledger is.
>
> That said, not boot if any ledger refers to a bookie could solve this.
>
> > No ledgers referencing bookie? (Sijie's suggestion)
>
> I'm resistant this idea, because it assumes a central oracle where all
> ledgers can be queried. I know we currently have this, but I don't
> think it scales for each bookie to read the metadata of the whole
> system.
>
Makes sense

>
> In any case, why not instead of refusing to start if any ledgers
> reference the bookie, on boot the bookie checks which ledgers it is
> supposed to have,

How can you do this without querying the big oracle? You can use the local
view as source of truth. Maybe I am missing one piece

and if it doesn't have them, start pulling the data
> for them itself. While doing this replication it should avoid all new
> writes.
>
> > Storing the list of files in the cookie? (Enrico's suggestion)
>
> I don't think this is needed. The purpose of the cookie is to protect
> against stuff like a mount not coming up, or a machine being
> completely wiped. We assume that on a journalled filesystem, files
> don't just disappear arbitrarily. There may be corruption in
> individual files, but see my first point.
>

I am fine with this assumption. Indeed, I have never seen this type of corruption.
I just wanted to enumerate as many error cases as possible.

>
> Anyhow, as I said earlier, we should decide the broad topics here and
> move into issues. I've made a first pass.
>
> Regards,
> Ivan
>
-- 


-- Enrico Olivelli

Re: Cookies and empty disks

Posted by Sijie Guo <gu...@gmail.com>.
Hi all,

I've sent out a pull request to fix the cookie issue. Please take a look.

https://github.com/apache/bookkeeper/pull/712

On Mon, Oct 9, 2017 at 1:50 PM, Ivan Kelly <iv...@apache.org> wrote:

> On Mon, Oct 9, 2017 at 6:32 PM, Venkateswara Rao Jujjuri
> <ju...@gmail.com> wrote:
> > Can we have a doc to put all these things? Thread has grown enough to
> cause
> > confusion.
> I created a github project earlier today. We can manage the different
> streams of work from there. Each stream should have a doc though.
>
> https://github.com/apache/bookkeeper/projects/1
>
> > Immediate things.
> > 1. Don't assume new bookie if journal dir is empty.
> I've already created an issue for this.
>
> > 2. Put cookies through bookie format, and bookie never boots on an empty
> > cookie or mismatched cookie.
> The reason it was like this in the first place was for backward
> compatibility I think. If a bookie was upgraded from the software that
> didn't have cookies, the cookies would be created automatically.
> Perhaps this would have been better if it was a "format" like command
> that would create the cookies on old bookies, but we didn't think of
> it at the time. If we change the cookies, we'd probably need to add an
> upgrade command also.
>
> Let's discuss more in a doc.
>
> > 3. We can live with operations procedure to deal with incarnation issue.
> > Infact we run an automated bookie decomm script which runs through the
> > entire metadata and makes sure that the bookie is not part of any ledger.
> >
> > For next step:
> > 1. Establish incarnation support.
> > 2. Deal with bitrot.
> >
> > Makes sense?
> lgtm.
>
> -Ivan
>

Re: Cookies and empty disks

Posted by Ivan Kelly <iv...@apache.org>.
On Mon, Oct 9, 2017 at 6:32 PM, Venkateswara Rao Jujjuri
<ju...@gmail.com> wrote:
> Can we have a doc to put all these things? Thread has grown enough to cause
> confusion.
I created a github project earlier today. We can manage the different
streams of work from there. Each stream should have a doc though.

https://github.com/apache/bookkeeper/projects/1

> Immediate things.
> 1. Don't assume new bookie if journal dir is empty.
I've already created an issue for this.

> 2. Put cookies through bookie format, and bookie never boots on an empty
> cookie or mismatched cookie.
The reason it was like this in the first place was for backward
compatibility I think. If a bookie was upgraded from the software that
didn't have cookies, the cookies would be created automatically.
Perhaps this would have been better if it was a "format" like command
that would create the cookies on old bookies, but we didn't think of
it at the time. If we change the cookies, we'd probably need to add an
upgrade command also.

Let's discuss more in a doc.

> 3. We can live with operations procedure to deal with incarnation issue.
> Infact we run an automated bookie decomm script which runs through the
> entire metadata and makes sure that the bookie is not part of any ledger.
>
> For next step:
> 1. Establish incarnation support.
> 2. Deal with bitrot.
>
> Makes sense?
lgtm.

-Ivan

Re: Cookies and empty disks

Posted by Enrico Olivelli <eo...@gmail.com>.
I like this too.
Sorry, I have no time to work on this right now.
Maybe the only blocker issue is the boot with empty dirs, which Sijie
pointed out.

Enrico

On Mon, Oct 9, 2017, 19:08 Sijie Guo <gu...@gmail.com> wrote:

> +1. I liked this summary.
>
> JV, is this related to what you were writing? or anyone else want to drive
> this?
>
> - Sijie
>
> On Mon, Oct 9, 2017 at 9:32 AM, Venkateswara Rao Jujjuri <
> jujjuri@gmail.com>
> wrote:
>
> > Can we have a doc to put all these things? Thread has grown enough to
> cause
> > confusion.
> >
> > Immediate things.
> > 1. Don't assume new bookie if journal dir is empty.
> > 2. Put cookies through bookie format, and bookie never boots on an empty
> > cookie or mismatched cookie.
> > 3. We can live with operations procedure to deal with incarnation issue.
> > Infact we run an automated bookie decomm script which runs through the
> > entire metadata and makes sure that the bookie is not part of any ledger.
> >
> > For next step:
> > 1. Establish incarnation support.
> > 2. Deal with bitrot.
> >
> > Makes sense?
> >
> > JV
> >
> > On Mon, Oct 9, 2017 at 8:55 AM, Sijie Guo <gu...@gmail.com> wrote:
> >
> > > On Oct 9, 2017 1:54 AM, "Ivan Kelly" <iv...@apache.org> wrote:
> > >
> > > Hi folks,
> > >
> > > I was travelling over the weekend, so I didn't have a chance to reply
> > > to anything on this thread. First off, as Enrico said, there's a lot
> > > of different topics being discussed at once. Perhaps each should be
> > > broken into a github issue, and then we can continue each conversation
> > > there, as it's getting a but unwieldy for email.
> > >
> > > I've created a cookie monster project, which we can throw all the
> issues
> > > into.
> > > https://github.com/apache/bookkeeper/projects/1
> > >
> > > There's a few individual opinions I'd like to give here though.
> > >
> > > > Needing the check the instance of the bookie when auditing
> > >
> > > The auditor, while it does check when bookies have disappeared, it
> > > also periodically checks all ledgers by reading the first and last
> > > entry of each segment. So even if a bookie has resurrected, the
> > > auditor will find that it is missing entries it is supposed to have.
> > >
> > > > UUID in ledger metadata
> > >
> > > At least for the write path, I'm not sure if this is needed, but
> > > consider the following.
> > >
> > > Only one writer can "vote" on the entries of the ledger. Other writers
> > > are fencing writers. A fencing writer has to hit a majority of bookies
> > > to proceed to closing the ledger. Unless a majority have been wiped,
> > > it will not proceed to close as an empty ledger. However, if a
> > > majority have been wiped, the correct behaviour would be for it not be
> > > possible to close the ledger, as we cannot know what the end of the
> > > ledger is.
> > >
> > > That said, not boot if any ledger refers to a bookie could solve this.
> > >
> > > > No ledgers referencing bookie? (Sijie's suggestion)
> > >
> > > I'm resistant this idea, because it assumes a central oracle where all
> > > ledgers can be queried. I know we currently have this, but I don't
> > > think it scales for each bookie to read the metadata of the whole
> > > system.
> > >
> > > In any case, why not instead of refusing to start if any ledgers
> > > reference the bookie, on boot the bookie checks which ledgers it is
> > > supposed to have, and if it doesn't have them, start pulling the data
> > > for them itself. While doing this replication it should avoid all new
> > > writes.
> > >
> > >
> > > Yes, that's another thing we need to improve for auto recovery. It is
> not
> > > only on boot, you need to do it periodically, in the garbage collection
> > > thread. The bookie need to scan what ledgers are missing and what
> entries
> > > are missing and replicate them.
> > >
> > >
> > >
> > > > Storing the list of files in the cookie? (Enrico's suggestion)
> > >
> > > I don't think this is needed. The purpose of the cookie is to protect
> > > against stuff like a mount not coming up, or a machine being
> > > completely wiped. We assume that on a journalled filesystem, files
> > > don't just disappear arbitrarily. There may be corruption in
> > > individual files, but see my first point.
> > >
> > > Anyhow, as I said earlier, we should decide the broad topics here and
> > > move into issues. I've made a first pass.
> > >
> > > Regards,
> > > Ivan
> > >
> >
> >
> >
> > --
> > Jvrao
> > ---
> > First they ignore you, then they laugh at you, then they fight you, then
> > you win. - Mahatma Gandhi
> >
>
-- 


-- Enrico Olivelli

Re: Cookies and empty disks

Posted by Sijie Guo <gu...@gmail.com>.
+1. I liked this summary.

JV, is this related to what you were writing? Or does anyone else want to
drive this?

- Sijie

On Mon, Oct 9, 2017 at 9:32 AM, Venkateswara Rao Jujjuri <ju...@gmail.com>
wrote:

> Can we have a doc to put all these things? Thread has grown enough to cause
> confusion.
>
> Immediate things.
> 1. Don't assume new bookie if journal dir is empty.
> 2. Put cookies through bookie format, and bookie never boots on an empty
> cookie or mismatched cookie.
> 3. We can live with operations procedure to deal with incarnation issue.
> Infact we run an automated bookie decomm script which runs through the
> entire metadata and makes sure that the bookie is not part of any ledger.
>
> For next step:
> 1. Establish incarnation support.
> 2. Deal with bitrot.
>
> Makes sense?
>
> JV
>
> On Mon, Oct 9, 2017 at 8:55 AM, Sijie Guo <gu...@gmail.com> wrote:
>
> > On Oct 9, 2017 1:54 AM, "Ivan Kelly" <iv...@apache.org> wrote:
> >
> > Hi folks,
> >
> > I was travelling over the weekend, so I didn't have a chance to reply
> > to anything on this thread. First off, as Enrico said, there's a lot
> > of different topics being discussed at once. Perhaps each should be
> > broken into a github issue, and then we can continue each conversation
> > there, as it's getting a but unwieldy for email.
> >
> > I've created a cookie monster project, which we can throw all the issues
> > into.
> > https://github.com/apache/bookkeeper/projects/1
> >
> > There's a few individual opinions I'd like to give here though.
> >
> > > Needing the check the instance of the bookie when auditing
> >
> > The auditor, while it does check when bookies have disappeared, it
> > also periodically checks all ledgers by reading the first and last
> > entry of each segment. So even if a bookie has resurrected, the
> > auditor will find that it is missing entries it is supposed to have.
> >
> > > UUID in ledger metadata
> >
> > At least for the write path, I'm not sure if this is needed, but
> > consider the following.
> >
> > Only one writer can "vote" on the entries of the ledger. Other writers
> > are fencing writers. A fencing writer has to hit a majority of bookies
> > to proceed to closing the ledger. Unless a majority have been wiped,
> > it will not proceed to close as an empty ledger. However, if a
> > majority have been wiped, the correct behaviour would be for it not be
> > possible to close the ledger, as we cannot know what the end of the
> > ledger is.
> >
> > That said, not boot if any ledger refers to a bookie could solve this.
> >
> > > No ledgers referencing bookie? (Sijie's suggestion)
> >
> > I'm resistant this idea, because it assumes a central oracle where all
> > ledgers can be queried. I know we currently have this, but I don't
> > think it scales for each bookie to read the metadata of the whole
> > system.
> >
> > In any case, why not instead of refusing to start if any ledgers
> > reference the bookie, on boot the bookie checks which ledgers it is
> > supposed to have, and if it doesn't have them, start pulling the data
> > for them itself. While doing this replication it should avoid all new
> > writes.
> >
> >
> > Yes, that's another thing we need to improve for auto recovery. It is not
> > only on boot, you need to do it periodically, in the garbage collection
> > thread. The bookie need to scan what ledgers are missing and what entries
> > are missing and replicate them.
> >
> >
> >
> > > Storing the list of files in the cookie? (Enrico's suggestion)
> >
> > I don't think this is needed. The purpose of the cookie is to protect
> > against stuff like a mount not coming up, or a machine being
> > completely wiped. We assume that on a journalled filesystem, files
> > don't just disappear arbitrarily. There may be corruption in
> > individual files, but see my first point.
> >
> > Anyhow, as I said earlier, we should decide the broad topics here and
> > move into issues. I've made a first pass.
> >
> > Regards,
> > Ivan
> >
>
>
>
> --
> Jvrao
> ---
> First they ignore you, then they laugh at you, then they fight you, then
> you win. - Mahatma Gandhi
>

Re: Cookies and empty disks

Posted by Venkateswara Rao Jujjuri <ju...@gmail.com>.
Can we have a doc to put all these things? Thread has grown enough to cause
confusion.

Immediate things.
1. Don't assume new bookie if journal dir is empty.
2. Put cookies through bookie format, and bookie never boots on an empty
cookie or mismatched cookie.
3. We can live with an operational procedure to deal with the incarnation
issue. In fact, we run an automated bookie decomm script which runs through
the entire metadata and makes sure that the bookie is not part of any ledger.
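
Items 1 and 2 could be sketched roughly as follows. This is a toy Python model of a stricter startup check, not the actual Bookie#checkEnvironment code; `check_cookies`, `InvalidCookieError` and the dictionary layout are assumptions.

```python
class InvalidCookieError(Exception):
    """Raised when the bookie must refuse to start."""


def check_cookies(master_cookie, dir_cookies):
    """Refuse to boot unless every configured directory has the expected cookie.

    master_cookie: the cookie written by an explicit format step, or None.
    dir_cookies: maps each journal/ledger/index directory to the cookie
                 found in it, or None if the directory has no cookie.
    """
    if master_cookie is None:
        # Cookies are only ever created by format, never implicitly on boot.
        raise InvalidCookieError("bookie has not been formatted")
    for directory, cookie in sorted(dir_cookies.items()):
        if cookie is None:
            # An empty journal dir is treated as lost data, not as a new bookie.
            raise InvalidCookieError(f"{directory}: cookie missing")
        if cookie != master_cookie:
            raise InvalidCookieError(f"{directory}: cookie mismatch")
```

With this shape, a wiped journal directory stops the boot instead of being mistaken for a fresh bookie.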

For next step:
1. Establish incarnation support.
2. Deal with bitrot.

Makes sense?

JV

On Mon, Oct 9, 2017 at 8:55 AM, Sijie Guo <gu...@gmail.com> wrote:

> On Oct 9, 2017 1:54 AM, "Ivan Kelly" <iv...@apache.org> wrote:
>
> Hi folks,
>
> I was travelling over the weekend, so I didn't have a chance to reply
> to anything on this thread. First off, as Enrico said, there's a lot
> of different topics being discussed at once. Perhaps each should be
> broken into a github issue, and then we can continue each conversation
> there, as it's getting a but unwieldy for email.
>
> I've created a cookie monster project, which we can throw all the issues
> into.
> https://github.com/apache/bookkeeper/projects/1
>
> There's a few individual opinions I'd like to give here though.
>
> > Needing the check the instance of the bookie when auditing
>
> The auditor, while it does check when bookies have disappeared, it
> also periodically checks all ledgers by reading the first and last
> entry of each segment. So even if a bookie has resurrected, the
> auditor will find that it is missing entries it is supposed to have.
>
> > UUID in ledger metadata
>
> At least for the write path, I'm not sure if this is needed, but
> consider the following.
>
> Only one writer can "vote" on the entries of the ledger. Other writers
> are fencing writers. A fencing writer has to hit a majority of bookies
> to proceed to closing the ledger. Unless a majority have been wiped,
> it will not proceed to close as an empty ledger. However, if a
> majority have been wiped, the correct behaviour would be for it not be
> possible to close the ledger, as we cannot know what the end of the
> ledger is.
>
> That said, not boot if any ledger refers to a bookie could solve this.
>
> > No ledgers referencing bookie? (Sijie's suggestion)
>
> I'm resistant this idea, because it assumes a central oracle where all
> ledgers can be queried. I know we currently have this, but I don't
> think it scales for each bookie to read the metadata of the whole
> system.
>
> In any case, why not instead of refusing to start if any ledgers
> reference the bookie, on boot the bookie checks which ledgers it is
> supposed to have, and if it doesn't have them, start pulling the data
> for them itself. While doing this replication it should avoid all new
> writes.
>
>
> Yes, that's another thing we need to improve for auto recovery. It is not
> only on boot, you need to do it periodically, in the garbage collection
> thread. The bookie need to scan what ledgers are missing and what entries
> are missing and replicate them.
>
>
>
> > Storing the list of files in the cookie? (Enrico's suggestion)
>
> I don't think this is needed. The purpose of the cookie is to protect
> against stuff like a mount not coming up, or a machine being
> completely wiped. We assume that on a journalled filesystem, files
> don't just disappear arbitrarily. There may be corruption in
> individual files, but see my first point.
>
> Anyhow, as I said earlier, we should decide the broad topics here and
> move into issues. I've made a first pass.
>
> Regards,
> Ivan
>



-- 
Jvrao
---
First they ignore you, then they laugh at you, then they fight you, then
you win. - Mahatma Gandhi

Re: Cookies and empty disks

Posted by Sijie Guo <gu...@gmail.com>.
On Oct 9, 2017 1:54 AM, "Ivan Kelly" <iv...@apache.org> wrote:

Hi folks,

I was travelling over the weekend, so I didn't have a chance to reply
to anything on this thread. First off, as Enrico said, there's a lot
of different topics being discussed at once. Perhaps each should be
broken into a github issue, and then we can continue each conversation
there, as it's getting a bit unwieldy for email.

I've created a cookie monster project, which we can throw all the issues
into.
https://github.com/apache/bookkeeper/projects/1

There's a few individual opinions I'd like to give here though.

> Needing to check the instance of the bookie when auditing

The auditor, while it does check when bookies have disappeared, it
also periodically checks all ledgers by reading the first and last
entry of each segment. So even if a bookie has resurrected, the
auditor will find that it is missing entries it is supposed to have.

> UUID in ledger metadata

At least for the write path, I'm not sure if this is needed, but
consider the following.

Only one writer can "vote" on the entries of the ledger. Other writers
are fencing writers. A fencing writer has to hit a majority of bookies
to proceed to closing the ledger. Unless a majority have been wiped,
it will not proceed to close as an empty ledger. However, if a
majority have been wiped, the correct behaviour would be for it not to be
possible to close the ledger, as we cannot know what the end of the
ledger is.

That said, refusing to boot if any ledger refers to the bookie could solve this.

> No ledgers referencing bookie? (Sijie's suggestion)

I'm resistant to this idea, because it assumes a central oracle where all
ledgers can be queried. I know we currently have this, but I don't
think it scales for each bookie to read the metadata of the whole
system.

In any case, why not instead of refusing to start if any ledgers
reference the bookie, on boot the bookie checks which ledgers it is
supposed to have, and if it doesn't have them, start pulling the data
for them itself. While doing this replication it should avoid all new
writes.


Yes, that's another thing we need to improve for auto recovery. It is not
only on boot; you need to do it periodically, in the garbage collection
thread. The bookie needs to scan which ledgers are missing and which entries
are missing and replicate them.



> Storing the list of files in the cookie? (Enrico's suggestion)

I don't think this is needed. The purpose of the cookie is to protect
against stuff like a mount not coming up, or a machine being
completely wiped. We assume that on a journalled filesystem, files
don't just disappear arbitrarily. There may be corruption in
individual files, but see my first point.

Anyhow, as I said earlier, we should decide the broad topics here and
move into issues. I've made a first pass.

Regards,
Ivan

Re: Cookies and empty disks

Posted by Ivan Kelly <iv...@apache.org>.
Hi folks,

I was travelling over the weekend, so I didn't have a chance to reply
to anything on this thread. First off, as Enrico said, there's a lot
of different topics being discussed at once. Perhaps each should be
broken into a github issue, and then we can continue each conversation
there, as it's getting a bit unwieldy for email.

I've created a cookie monster project, which we can throw all the issues into.
https://github.com/apache/bookkeeper/projects/1

There's a few individual opinions I'd like to give here though.

> Needing to check the instance of the bookie when auditing

While the auditor does check when bookies have disappeared, it
also periodically checks all ledgers by reading the first and last
entry of each segment. So even if a bookie has been resurrected, the
auditor will find that it is missing entries it is supposed to have.
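
A minimal sketch of that periodic check, as a toy Python model rather than the auditor's real code. The function name and data layout are hypothetical, and it assumes every bookie in a segment's ensemble stores every entry (no striping).

```python
def missing_boundary_entries(segments, stored):
    """Report bookies that lack the first or last entry of a segment.

    segments: list of (ensemble, first_entry_id, last_entry_id) tuples.
    stored:   maps bookie name -> set of entry ids it actually holds.
    """
    problems = []
    for ensemble, first, last in segments:
        for bookie in ensemble:
            held = stored.get(bookie, set())
            # A set avoids checking the same id twice for one-entry segments.
            for entry_id in sorted({first, last}):
                if entry_id not in held:
                    problems.append((bookie, entry_id))
    return problems
```

A bookie that came back empty shows up in the result for every segment it belongs to.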

> UUID in ledger metadata

At least for the write path, I'm not sure if this is needed, but
consider the following.

Only one writer can "vote" on the entries of the ledger. Other writers
are fencing writers. A fencing writer has to hit a majority of bookies
to proceed to closing the ledger. Unless a majority have been wiped,
it will not proceed to close as an empty ledger. However, if a
majority have been wiped, the correct behaviour would be for it not to be
possible to close the ledger, as we cannot know what the end of the
ledger is.

That said, refusing to boot if any ledger refers to the bookie could solve this.

> No ledgers referencing bookie? (Sijie's suggestion)

I'm resistant to this idea, because it assumes a central oracle where all
ledgers can be queried. I know we currently have this, but I don't
think it scales for each bookie to read the metadata of the whole
system.

In any case, instead of refusing to start if any ledgers reference the
bookie, why not have the bookie check on boot which ledgers it is
supposed to have and, if it doesn't have them, start pulling the data
for them itself? While doing this replication it should avoid all new
writes.
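
As a rough illustration of this boot-time check, here is a toy Python sketch. `ledgers_to_recover` and `accepts_writes` are hypothetical names, and the ledger-to-ensemble mapping stands in for whatever metadata source would be used.

```python
def ledgers_to_recover(bookie_id, ledger_ensembles, local_ledgers):
    """Return the ledgers this bookie should hold but does not.

    ledger_ensembles: maps ledger id -> collection of bookie ids in its ensembles.
    local_ledgers:    ledger ids found on the bookie's own disks.
    """
    expected = {
        ledger_id
        for ledger_id, ensemble in ledger_ensembles.items()
        if bookie_id in ensemble
    }
    return sorted(expected - set(local_ledgers))


def accepts_writes(missing_ledgers) -> bool:
    # The bookie stays closed to new writes until nothing is left to pull.
    return not missing_ledgers
```

The bookie would re-replicate each returned ledger from its peers and only then start accepting writes.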

> Storing the list of files in the cookie? (Enrico's suggestion)

I don't think this is needed. The purpose of the cookie is to protect
against stuff like a mount not coming up, or a machine being
completely wiped. We assume that on a journalled filesystem, files
don't just disappear arbitrarily. There may be corruption in
individual files, but see my first point.

Anyhow, as I said earlier, we should decide the broad topics here and
move into issues. I've made a first pass.

Regards,
Ivan

Re: Cookies and empty disks

Posted by Sijie Guo <gu...@gmail.com>.
On Oct 9, 2017 1:13 AM, "Enrico Olivelli" <eo...@gmail.com> wrote:

2017-10-09 9:21 GMT+02:00 Sijie Guo <gu...@gmail.com>:

> okay, but why do you want to track the list of files? I don't get your
> idea here.
>


If you allow a bookie to start with a journal directory that contains the
cookie file but not the other files the bookie believes have been
persisted durably, you will fall into the correctness issue we are talking
about; you will lose fence bits, for instance.
So having a directory that contains the cookie file is not enough to say
that the bookie is in a good state.


Sure. The case you described can happen. Bits can also be corrupted. The
case of missing files is the same as bit corruption. This has to be
covered by auto recovery or a disk scrubber rather than by reusing the cookie.


-- Enrico





>
> - Sijie
>
> On Sun, Oct 8, 2017 at 11:45 PM, Enrico Olivelli <eo...@gmail.com>
> wrote:
>
> > 2017-10-09 7:52 GMT+02:00 Sijie Guo <gu...@gmail.com>:
> >
> > > On Sat, Oct 7, 2017 at 9:53 AM, Enrico Olivelli <eo...@gmail.com>
> > > wrote:
> > >
> > > > On Sat, Oct 7, 2017, 00:27 Sijie Guo <gu...@gmail.com> wrote:
> > > >
> > > > > Enrico,
> > > > >
> > > > > Let's try to come to a conclusion or an agreement what we should
> fix
> > > and
> > > > > improve, before talking who is going to drive this.
> > > > >
> > > >
> > > > Sure.
> > > >
> > > > This is my point of view:
> > > > View have separate issues:
> > > > - missing checksums, to protect fence bits
> > > > - have a bug in bookie boot, we should not allow empty directories
> > > > - have a clear lifecycle for the bookie, add/remove
> > > > - deal with reincarnation of bookies
> > > > - ensuring the correctness of the contents of the directories of the
> > > bookie
> > > >
> > > > I would like to add a new point, we have rhe cookie inside every
> > > configured
> > > > directory managed by the bookie.
> > > > No cookie -> no boot
> > > > This will not be enough, we have to write in that file not only the
> > > > identity of the bookie but the list of files expected to be in the
> > > > directory.
> > > > This way you will not boot with a corrupted directory.
> > > > Config ->  list of dirs -> list of files
> > > >
> > >
> > > I am not sure why this is a new point. This is exactly what cookie is
> > > doing, no?
> > >
> >
> > Sorry, I can't find such behavior in code on master brach
> > https://github.com/apache/bookkeeper/blob/master/
> > bookkeeper-server/src/main/java/org/apache/bookkeeper/bookie/Cookie.java
> >
> > I we have a copy of the cookie inside each directory (index + data +
> > journal) I mean that each file should carry the exact list of files
> > expected to be present in the directory at boot.
> > So for instance when you add a new file to the set of files on a journal
> > directory you must update the file in that directory, same for index,
> > data.....
> >
> > Maybe I am missing something.
> > It seems to me that cookie contains only a list a of directories not of
> > "files"
> >
> > Enrico
> >
> >
> >
> >
> > >
> > >
> > > >
> > > > I agree on the fact that the bookie should be added (bookie format)
> > only
> > > if
> > > > there is no reference to it in zk.
> > > > The bookie format operation should write the cookie in any
configured
> > > > directory so that a bookie with empty directories won't ever start.
> > > >
> > > > I have to think more about this, but I wanted to share my first
> > thoughts
> > > >
> > > > Enrico
> > > >
> > > >
> > > > > - Sijie
> > > > >
> > > > > On Fri, Oct 6, 2017 at 1:14 PM, Enrico Olivelli <
> eolivelli@gmail.com
> > >
> > > > > wrote:
> > > > >
> > > > > > +1 for fixing the problem of missing cookie in 4.6
> > > > > >
> > > > > > Who drives the issue?
> > > > > >
> > > > > > Thank you all for the interesting points
> > > > > > Enrico
> > > > > >
> > > > > > Il ven 6 ott 2017, 21:27 Venkateswara Rao Jujjuri <
> > jujjuri@gmail.com
> > > >
> > > > ha
> > > > > > scritto:
> > > > > >
> > > > > > > Thanks for the writeup Sijie, comments below.
> > > > > > >
> > > > > > > On Fri, Oct 6, 2017 at 12:14 PM, Sijie Guo <guosijie@gmail.com
> >
> > > > wrote:
> > > > > > >
> > > > > > > > I think the question is mainly around "how do we recognize
> the
> > > > > bookie"
> > > > > > or
> > > > > > > > "incarnations". And the purpose of a cookie is designed for
> > > > > addressing
> > > > > > > > "incarnations".
> > > > > > > >
> > > > > > > > I will try to cover following aspects, and will try to
answer
> > > > > questions
> > > > > > > > that Ivan and JV raised.
> > > > > > > >
> > > > > > > > - what is cookie?
> > > > > > > > - how the behavior became bad?
> > > > > > > > - how do we fix current bad behavior?
> > > > > > > > - is the cookie enough?
> > > > > > > >
> > > > > > > >
> > > > > > > > *What is Cookie?*
> > > > > > > >
> > > > > > > > Cookie is originally introduced in this commit -
> > > > > > > >
> > > > > > > https://github.com/apache/bookkeeper/commit/
> > > > > > c6cc7cca3a85603c8e935ba6d06fbf
> > > > > > > > 3d8d7a7eb5
> > > > > > > > .
> > > > > > > >
> > > > > > > > A cookie is a identifier of a bookie. A cookie is created on
> > > > > zookeeper
> > > > > > > when
> > > > > > > > a brand new bookie joint the cluster, the cookie is
> > representing
> > > > the
> > > > > > > bookie
> > > > > > > > instance
> > > > > > > > during its lifecycle. The cookie is stored on all the disks
> for
> > > > > > > > verification purpose. so if any of the disks misses the
> cookie
> > > > (e.g.
> > > > > > > disks
> > > > > > > > were reformat or wiped out,
> > > > > > > > disks are not mounted correctly), a bookie will reject to
> > start.
> > > > > > > >
> > > > > > > >
> > > > > > > > *How the behavior became bad?*
> > > > > > > >
> > > > > > > > The original behavior worked as expected to use the cookie
in
> > > > > zookeeper
> > > > > > > as
> > > > > > > > the source of truth. See
> > > > > > > >
> > > > > > > https://github.com/apache/bookkeeper/commit/
> > > > > > c6cc7cca3a85603c8e935ba6d06fbf
> > > > > > > > 3d8d7a7eb5
> > > > > > > >
> > > > > > > >
> > > > > > > > The behavior was changed at
> > > > > > > >
> > > > > > > https://github.com/apache/bookkeeper/commit/
> > > > > > 19b821c63b91293960041bca7b0316
> > > > > > > > 14a109a7b8
> > > > > > > > when trying to support both ip and hostname . It used
journal
> > > > > directory
> > > > > > > as
> > > > > > > > the source-of-truth for verifying cookies.
> > > > > > > >
> > > > > > > > At the community meeting, I was saying a bookie should
reject
> > > start
> > > > > > when
> > > > > > > a
> > > > > > > > cookie file is missing locally and that was my operational
> > > > > experience.
> > > > > > It
> > > > > > > > turns out twitter's branch didn't include the change at
> > > > > > > > 19b821c63b91293960041bca7b031614a109a7b8,
> > > > > > > > so it was still the original behavior at
> > > > > > > > c6cc7cca3a85603c8e935ba6d06fbf3d8d7a7eb5 .
> > > > > > > >
> > > > > > > > *How do we fix current bad behavior?*
> > > > > > > >
> > > > > > > > We basically need to revert the current behaviour to the
> > original
> > > > > > > designed
> > > > > > > > behavior. The cookie in zookeeper should be the
> source-of-truth
> > > for
> > > > > > > > validation.
> > > > > > > >
> > > > > > > > If the cookie works as expected (change the behavior to the
> > > > original
> > > > > > > > behavior), then it is the operational or lifecycle
management
> > > > issue I
> > > > > > > > explained above.
> > > > > > > >
> > > > > > > > If a bookie failed with missing cookie, it should be:
> > > > > > > >
> > > > > > > > 1. taken out of the cluster
> > > > > > > > 2. run re-replication (autorecovery or manual recovery)
> > > > > > > > 3. ensure no ledgers using this bookie any more
> > > > > > > > 4. reformat the bookie
> > > > > > > > 5. add it back
> > > > > > > >
> > > > > > > > This can be automated by hooking into a scheduler (like k8s
> or
> > > > > mesos).
> > > > > > > But
> > > > > > > > it requires some sort of lifecycle management in order to
> > > automate
> > > > > such
> > > > > > > > operations. There is a BP-4:
> > > > > > > > https://cwiki.apache.org/confluence/display/BOOKKEEPER/
> > > > > > > > BP-4+-+BookKeeper+Lifecycle+Management
> > > > > > > > proposed for this purpose.
> > > > > > > >
> > > > > > > >
> > > > > > > > *Is the cookie enough?*
> > > > > > > >
> > > > > > > > Cookie (if we revert the current behavior to the original
> > > > behavior),
> > > > > > > should
> > > > > > > > be able to address most of the issues related to
> > "incarnations".
> > > > > > > >
> > > > > > > > There are still some corner cases will violate correctness
> > > issues.
> > > > > They
> > > > > > > are
> > > > > > > > related to "dangling writers" described in Ivan's first
> > comment.
> > > > > > > >
> > > > > > > > How can a writer tell whether bookies changed or ledger
> changed
> > > > when
> > > > > it
> > > > > > > > gets network partitioned?
> > > > > > > >
> > > > > > > > 1) Bookie Changed.
> > > > > > > >
> > > > > > > > Bookie can be reformatted and re-added to the cluster. Ivan
> and
> > > JV
> > > > > > > already
> > > > > > > > touch this on adding UUID.
> > > > > > > >
> > > > > > > > I think the UUID doesn't have to be part of ledger metadata.
> > > > because
> > > > > > > > auditor and replication worker would use the lifecycle
> > management
> > > > for
> > > > > > > > managing the lifecycle of bookies.
> > > > > > > >
> > > > > > >
> > > > > > > You are suggesting that the 'manual/scripted' lifecycle tool
is
> > to
> > > > the
> > > > > > > rescue.
> > > > > > > a side cart solution.
> > > > > > >
> > > > > > > But what are we saving by not keeping this info in the
> metadata?
> > > > > > > metadata size? sure it is a huge win in ZK environment.
> > > > > > >
> > > > > > > >
> > > > > > > > But the connection should have the UUID informations.
> > > > > > > >
> > > > > > >
> > > > > > > By this you are suggesting  service discovery portion need to
> > have
> > > > UUID
> > > > > > > info
> > > > > > > but not metadata portion. Won't it be confusing to handle a
> case
> > > > where
> > > > > > > write fails
> > > > > > > on bookie because of UUID mismatch, and you may need to handle
> > that
> > > > > case
> > > > > > > and if you go back to the same bookie then no ensmeble
changes.
> > > > > > >
> > > > > > > On the other hand if we introduce UUID into metadata, then we
> > don't
> > > > > need
> > > > > > to
> > > > > > > be
> > > > > > > explicitly depend on the side-cart solution.
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > > Basically, any bookie client connects to a bookie, it needs
> to
> > > > carry
> > > > > > the
> > > > > > > > namespace uuid and the bookie uuid to ensure bookie is
> > connecting
> > > > to
> > > > > a
> > > > > > > > right bookie. This would prevent "dangling writers" connect
> to
> > > > > bookies
> > > > > > > that
> > > > > > > > are reformatted and added back.
> > > > > > > >
> > > > > > > >  While this is an issue, the problem can only get exposed in
> > > > > > pathological
> > > > > > > scenario
> > > > > > > where AQ bookies have went through this scenario, which is ~ 3
> > > > > > >
> > > > > > >
> > > > > > > 2) Ledger Changed.
> > > > > > > >
> > > > > > > > It is similar as what the case that Ivan' described. If a
> > writer
> > > > > > becomes
> > > > > > > > 'network partitioned', and the ledger is deleted during this
> > > > period,
> > > > > > > after
> > > > > > > > the writer comes back, the writer can still successfully
> write
> > > > > entries
> > > > > > to
> > > > > > > > the bookies, because the ledgers are already deleted and all
> > the
> > > > > > fencing
> > > > > > > > bits are gone.
> > > > > > > >
> > > > > > > > This violates the expectation of "fencing". but I am not
sure
> > we
> > > > need
> > > > > > to
> > > > > > > > spend time on fixing this, because the ledger is already
> > > explicitly
> > > > > > > deleted
> > > > > > > > by the application. so I think the behavior should be
> > categorized
> > > > as
> > > > > > > > "undefined", just like "deleting a ledger when a writer is
> > still
> > > > > > writing
> > > > > > > > entries" is a undefined behavior.
> > > > > > > >
> > > > > > > >
> > > > > > > > To summarize my thought on this:
> > > > > > > >
> > > > > > > > 1. we need to revert the cookie behaviour to the original
> > > behavior.
> > > > > > make
> > > > > > > > sure the cookie works as expected.
> > > > > > > > 2. introduce UUID or epoch in the cookie. client connection
> > > should
> > > > > > carry
> > > > > > > > namespace uuid and bookie uuid when establishing the
> > connection.
> > > > > > > > 3. work on BP-4 to have a complete lifecycle management to
> take
> > > > > bookie
> > > > > > > out
> > > > > > > > and add bookie out.
> > > > > > > >
> > > > > > > > 1 is the immediate fix, so correct operations can still
> > guarantee
> > > > the
> > > > > > > > correctness.
> > > > > > > >
> > > > > > >
> > > > > > > I agree we need to take care of #1 ASAP and have a Issues
> opened
> > > and
> > > > > > > designs for #2 and #3.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > JV
> > > > > > >
> > > > > > > >
> > > > > > > > - Sijie
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > On Fri, Oct 6, 2017 at 9:35 AM, Venkateswara Rao Jujjuri <
> > > > > > > > jujjuri@gmail.com>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > > However, imagine that the fenced message is only in the
> > > journal
> > > > > on
> > > > > > > b2,
> > > > > > > > > > b2 crashes, something wipes the journal directory and
> then
> > b2
> > > > > comes
> > > > > > > > > > back up.
> > > > > > > > >
> > > > > > > > > In this case what happened?
> > > > > > > > > 1. We have WQ = 1
> > > > > > > > > 2. We had data loss (crash and comeup clean)
> > > > > > > > >
> > > > > > > > > But yeah, in addition to dataloss we have fencing
violation
> > > too.
> > > > > > > > > The problem is not just wiped journal dir, but how we
> > recognize
> > > > the
> > > > > > > > bookie.
> > > > > > > > > Bookie is just recognized by its ip address, not by its
> > > > > incarnation.
> > > > > > > > > Bookie1 at T1  (b1t1) ; and same bookie1 at T2 after
bookie
> > > > format
> > > > > > > (b1t2)
> > > > > > > > > should be two different bookies, isn;t it?
> > > > > > > > > this is needed for the replication worker and the auditor
> > too.
> > > > > > > > >
> > > > > > > > > Also, bookie needs to know if the writer/reader is
intended
> > to
> > > > read
> > > > > > > from
> > > > > > > > > b1t2 not from b1t1.
> > > > > > > > > Looks like we have a hole here? Or I may not be fully
> > > > understanding
> > > > > > > > cookie
> > > > > > > > > verification mechanism.
> > > > > > > > >
> > > > > > > > > Also as Ivan pointed out, we appear to think the lack of
> > > journal
> > > > is
> > > > > > > > > implicitly a new bookie, but overall cluster doesn't
> > > > differentiate
> > > > > > > > between
> > > > > > > > > incarnations.
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > JV
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Fri, Oct 6, 2017 at 8:46 AM, Ivan Kelly <
> ivank@apache.org
> > >
> > > > > wrote:
> > > > > > > > >
> > > > > > > > > > > The case you described here is "almost correct". But
> > there
> > > is
> > > > > an
> > > > > > > key
> > > > > > > > > > here:
> > > > > > > > > > > B2 can't startup itself if journal disk is wiped out,
> > > because
> > > > > the
> > > > > > > > > cookie
> > > > > > > > > > is
> > > > > > > > > > > missed.
> > > > > > > > > > This is what I expected to see, but isn't the case.
> > > > > > > > > > <snip>
> > > > > > > > > >       List<Cookie> journalCookies =
Lists.newArrayList();
> > > > > > > > > >             // try to read cookie from journal
directory.
> > > > > > > > > >             for (File journalDirectory :
> > journalDirectories)
> > > {
> > > > > > > > > >                 try {
> > > > > > > > > >                     Cookie journalCookie =
> > > > > > > > > > Cookie.readFromDirectory(journalDirectory);
> > > > > > > > > >                     journalCookies.add(journalCookie);
> > > > > > > > > >                     if
> > > > > (journalCookie.isBookieHostCreatedFromIp())
> > > > > > {
> > > > > > > > > >                         conf.setUseHostNameAsBookieID(
> > > false);
> > > > > > > > > >                     } else {
> > > > > > > > > >                         conf.setUseHostNameAsBookieID(
> > true);
> > > > > > > > > >                     }
> > > > > > > > > >                 } catch (FileNotFoundException fnf) {
> > > > > > > > > >                     newEnv = true;
> > > > > > > > > >                     missedCookieDirs.add(
> > journalDirectory);
> > > > > > > > > >                 }
> > > > > > > > > >             }
> > > > > > > > > > </snip>
> > > > > > > > > >
> > > > > > > > > > So if a journal is missing the cookie, newEnv is set to
> > true.
> > > > > This
> > > > > > > > > > disabled the later checks.
> > > > > > > > > >
> > > > > > > > > > > Hower it can still happen in a different case: bit
> flap.
> > In
> > > > > your
> > > > > > > > case,
> > > > > > > > > if
> > > > > > > > > > > fence bit in b2 is already persisted on disk, but it
> got
> > > > > > corrupted.
> > > > > > > > > Then
> > > > > > > > > > it
> > > > > > > > > > > will cause the issue you described. One problem is we
> > don't
> > > > > have
> > > > > > > > > checksum
> > > > > > > > > > > on the index file header when it stores those fence
> bits.
> > > > > > > > > > Yes, this is also an issue.
> > > > > > > > > >
> > > > > > > > > > -Ivan
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > Jvrao
> > > > > > > > > ---
> > > > > > > > > First they ignore you, then they laugh at you, then they
> > fight
> > > > you,
> > > > > > > then
> > > > > > > > > you win. - Mahatma Gandhi
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Jvrao
> > > > > > > ---
> > > > > > > First they ignore you, then they laugh at you, then they fight
> > you,
> > > > > then
> > > > > > > you win. - Mahatma Gandhi
> > > > > > >
> > > > > > --
> > > > > >
> > > > > >
> > > > > > -- Enrico Olivelli
> > > > > >
> > > > >
> > > > --
> > > >
> > > >
> > > > -- Enrico Olivelli
> > > >
> > >
> >
>

Re: Cookies and empty disks

Posted by Enrico Olivelli <eo...@gmail.com>.
2017-10-09 9:21 GMT+02:00 Sijie Guo <gu...@gmail.com>:

> okay, but why do you want to track the list of files? I don't get your idea
> here.
>


If you allow a bookie to start with a journal directory which contains the
cookie file but not the other files that the bookie believes were
persisted durably, you will fall into the correctness issue we are talking
about: you will lose fence bits, for instance.
So having a directory which contains the cookie file is not enough to say
that the bookie is in a good state.
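
As a rough sketch of the check I have in mind, assuming the expected file
names have been recorded somewhere (for example alongside the cookie).
DirectoryManifestCheck and missingFiles are hypothetical names, not
existing BookKeeper code:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Stream;

// Hypothetical helper for illustration only.
final class DirectoryManifestCheck {
    /** Returns the expected file names that are not present in the directory. */
    static Set<String> missingFiles(Path dir, Set<String> expectedFileNames) throws IOException {
        Set<String> missing = new TreeSet<>(expectedFileNames);
        // Remove every name that is actually on disk; what remains is missing.
        try (Stream<Path> entries = Files.list(dir)) {
            entries.forEach(p -> missing.remove(p.getFileName().toString()));
        }
        return missing;
    }
}
```

With a check like this, a journal directory that still has its cookie but
has lost a journal file would be reported instead of booting silently.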

-- Enrico





>
> - Sijie
>
> On Sun, Oct 8, 2017 at 11:45 PM, Enrico Olivelli <eo...@gmail.com>
> wrote:
>
> > 2017-10-09 7:52 GMT+02:00 Sijie Guo <gu...@gmail.com>:
> >
> > > On Sat, Oct 7, 2017 at 9:53 AM, Enrico Olivelli <eo...@gmail.com>
> > > wrote:
> > >
> > > > On Sat, 7 Oct 2017, 00:27 Sijie Guo <gu...@gmail.com> wrote:
> > > >
> > > > > Enrico,
> > > > >
> > > > > Let's try to come to a conclusion or an agreement on what we should
> > > > > fix and improve, before talking about who is going to drive this.
> > > > >
> > > >
> > > > Sure.
> > > >
> > > > This is my point of view:
> > > > We have separate issues:
> > > > - missing checksums, to protect fence bits
> > > > - a bug in bookie boot: we should not allow empty directories
> > > > - having a clear lifecycle for the bookie, add/remove
> > > > - dealing with reincarnation of bookies
> > > > - ensuring the correctness of the contents of the directories of the
> > > >   bookie
> > > >
> > > > I would like to add a new point: we have the cookie inside every
> > > > configured directory managed by the bookie.
> > > > No cookie -> no boot
> > > > This will not be enough, we have to write in that file not only the
> > > > identity of the bookie but the list of files expected to be in the
> > > > directory.
> > > > This way you will not boot with a corrupted directory.
> > > > Config ->  list of dirs -> list of files
> > > >
> > >
> > > I am not sure why this is a new point. This is exactly what cookie is
> > > doing, no?
> > >
> >
> > Sorry, I can't find such behavior in the code on the master branch:
> > https://github.com/apache/bookkeeper/blob/master/bookkeeper-server/src/main/java/org/apache/bookkeeper/bookie/Cookie.java
> >
> > If we have a copy of the cookie inside each directory (index + data +
> > journal), I mean that each copy should carry the exact list of files
> > expected to be present in the directory at boot.
> > So for instance when you add a new file to the set of files in a journal
> > directory, you must update the cookie file in that directory; same for
> > index, data, ...
> >
> > Maybe I am missing something.
> > It seems to me that the cookie contains only a list of directories, not
> > of "files".
> >
> > Enrico
> >
> >
> >
> >
> > >
> > >
> > > >
> > > > I agree that the bookie should be added (bookie format) only if
> > > > there is no reference to it in zk.
> > > > The bookie format operation should write the cookie in every configured
> > > > directory so that a bookie with empty directories won't ever start.
> > > >
> > > > I have to think more about this, but I wanted to share my first
> > > > thoughts.
> > > >
> > > > Enrico
> > > >
> > > >
> > > > > - Sijie
> > > > >
> > > > > On Fri, Oct 6, 2017 at 1:14 PM, Enrico Olivelli <eolivelli@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > +1 for fixing the problem of missing cookie in 4.6
> > > > > >
> > > > > > Who drives the issue?
> > > > > >
> > > > > > Thank you all for the interesting points
> > > > > > Enrico
> > > > > >
> > > > > > On Fri, 6 Oct 2017, 21:27 Venkateswara Rao Jujjuri <jujjuri@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > Thanks for the writeup Sijie, comments below.
> > > > > > >
> > > > > > > On Fri, Oct 6, 2017 at 12:14 PM, Sijie Guo <guosijie@gmail.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > I think the question is mainly around "how do we recognize the
> > > > > > > > bookie" or "incarnations". And the cookie is designed for
> > > > > > > > addressing "incarnations".
> > > > > > > >
> > > > > > > > I will try to cover the following aspects, and will try to answer
> > > > > > > > the questions that Ivan and JV raised.
> > > > > > > >
> > > > > > > > - what is cookie?
> > > > > > > > - how the behavior became bad?
> > > > > > > > - how do we fix current bad behavior?
> > > > > > > > - is the cookie enough?
> > > > > > > >
> > > > > > > >
> > > > > > > > *What is Cookie?*
> > > > > > > >
> > > > > > > > Cookie was originally introduced in this commit:
> > > > > > > > https://github.com/apache/bookkeeper/commit/c6cc7cca3a85603c8e935ba6d06fbf3d8d7a7eb5
> > > > > > > >
> > > > > > > > A cookie is an identifier of a bookie. A cookie is created in
> > > > > > > > zookeeper when a brand new bookie joins the cluster; the cookie
> > > > > > > > represents the bookie instance during its lifecycle. The cookie
> > > > > > > > is stored on all the disks for verification purposes, so if any
> > > > > > > > of the disks is missing the cookie (e.g. disks were reformatted
> > > > > > > > or wiped out, or disks are not mounted correctly), the bookie
> > > > > > > > will refuse to start.
> > > > > > > >
> > > > > > > >
> > > > > > > > *How the behavior became bad?*
> > > > > > > >
> > > > > > > > The original behavior worked as expected, using the cookie in
> > > > > > > > zookeeper as the source of truth. See
> > > > > > > > https://github.com/apache/bookkeeper/commit/c6cc7cca3a85603c8e935ba6d06fbf3d8d7a7eb5
> > > > > > > >
> > > > > > > > The behavior was changed at
> > > > > > > > https://github.com/apache/bookkeeper/commit/19b821c63b91293960041bca7b031614a109a7b8
> > > > > > > > when trying to support both ip and hostname. It used the journal
> > > > > > > > directory as the source of truth for verifying cookies.
> > > > > > > >
> > > > > > > > At the community meeting, I was saying a bookie should refuse to
> > > > > > > > start when a cookie file is missing locally, and that was my
> > > > > > > > operational experience. It turns out twitter's branch didn't
> > > > > > > > include the change at 19b821c63b91293960041bca7b031614a109a7b8,
> > > > > > > > so it still had the original behavior from
> > > > > > > > c6cc7cca3a85603c8e935ba6d06fbf3d8d7a7eb5.
> > > > > > > >
> > > > > > > > *How do we fix current bad behavior?*
> > > > > > > >
> > > > > > > > We basically need to revert the current behavior to the
> > > > > > > > originally designed behavior. The cookie in zookeeper should be
> > > > > > > > the source of truth for validation.
> > > > > > > >
> > > > > > > > If the cookie works as expected (change the behavior back to the
> > > > > > > > original behavior), then it is the operational or lifecycle
> > > > > > > > management issue I explained above.
> > > > > > > >
> > > > > > > > If a bookie fails with a missing cookie, we should:
> > > > > > > >
> > > > > > > > 1. take it out of the cluster
> > > > > > > > 2. run re-replication (autorecovery or manual recovery)
> > > > > > > > 3. ensure no ledgers are using this bookie any more
> > > > > > > > 4. reformat the bookie
> > > > > > > > 5. add it back
> > > > > > > >
> > > > > > > > This can be automated by hooking into a scheduler (like k8s or
> > > > > > > > mesos). But it requires some sort of lifecycle management in
> > > > > > > > order to automate such operations. BP-4 has been proposed for
> > > > > > > > this purpose:
> > > > > > > > https://cwiki.apache.org/confluence/display/BOOKKEEPER/BP-4+-+BookKeeper+Lifecycle+Management
> > > > > > > >
> > > > > > > >
> > > > > > > > *Is the cookie enough?*
> > > > > > > >
> > > > > > > > The cookie (if we revert the current behavior to the original
> > > > > > > > behavior) should be able to address most of the issues related
> > > > > > > > to "incarnations".
> > > > > > > >
> > > > > > > > There are still some corner cases that will violate correctness.
> > > > > > > > They are related to the "dangling writers" described in Ivan's
> > > > > > > > first comment.
> > > > > > > >
> > > > > > > > How can a writer tell whether bookies changed or the ledger
> > > > > > > > changed when it gets network partitioned?
> > > > > > > >
> > > > > > > > 1) Bookie Changed.
> > > > > > > >
> > > > > > > > A bookie can be reformatted and re-added to the cluster. Ivan and
> > > > > > > > JV already touched on this with adding a UUID.
> > > > > > > >
> > > > > > > > I think the UUID doesn't have to be part of the ledger metadata,
> > > > > > > > because the auditor and replication worker would use the
> > > > > > > > lifecycle management for managing the lifecycle of bookies.
> > > > > > > >
> > > > > > >
> > > > > > > You are suggesting that the 'manual/scripted' lifecycle tool comes
> > > > > > > to the rescue, as a side-cart solution.
> > > > > > >
> > > > > > > But what are we saving by not keeping this info in the metadata?
> > > > > > > Metadata size? Sure, it is a huge win in a ZK environment.
> > > > > > >
> > > > > > > >
> > > > > > > > But the connection should carry the UUID information.
> > > > > > > >
> > > > > > >
> > > > > > > By this you are suggesting that the service discovery portion needs
> > > > > > > to have the UUID info but not the metadata portion. Won't it be
> > > > > > > confusing to handle a case where a write fails on a bookie because
> > > > > > > of a UUID mismatch? You may need to handle that case, and if you go
> > > > > > > back to the same bookie then there is no ensemble change.
> > > > > > >
> > > > > > > On the other hand, if we introduce the UUID into the metadata, then
> > > > > > > we don't need to explicitly depend on the side-cart solution.
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > > Basically, when any bookie client connects to a bookie, it needs
> > > > > > > > to carry the namespace uuid and the bookie uuid to ensure the
> > > > > > > > client is connecting to the right bookie. This would prevent
> > > > > > > > "dangling writers" from connecting to bookies that have been
> > > > > > > > reformatted and added back.
> > > > > > > >
> > > > > > > While this is an issue, the problem can only get exposed in a
> > > > > > > pathological scenario where AQ bookies have gone through this
> > > > > > > scenario, which is ~ 3
> > > > > > >
> > > > > > >
> > > > > > > > 2) Ledger Changed.
> > > > > > > >
> > > > > > > > It is similar to the case that Ivan described. If a writer
> > > > > > > > becomes 'network partitioned', and the ledger is deleted during
> > > > > > > > this period, then after the writer comes back, it can still
> > > > > > > > successfully write entries to the bookies, because the ledgers
> > > > > > > > are already deleted and all the fencing bits are gone.
> > > > > > > >
> > > > > > > > This violates the expectation of "fencing", but I am not sure we
> > > > > > > > need to spend time on fixing this, because the ledger has already
> > > > > > > > been explicitly deleted by the application. So I think the
> > > > > > > > behavior should be categorized as "undefined", just like
> > > > > > > > "deleting a ledger when a writer is still writing entries" is
> > > > > > > > undefined behavior.
> > > > > > > >
> > > > > > > >
> > > > > > > > To summarize my thoughts on this:
> > > > > > > >
> > > > > > > > 1. we need to revert the cookie behavior to the original
> > > > > > > > behavior, and make sure the cookie works as expected.
> > > > > > > > 2. introduce a UUID or epoch in the cookie. The client connection
> > > > > > > > should carry the namespace uuid and bookie uuid when establishing
> > > > > > > > the connection.
> > > > > > > > 3. work on BP-4 to have complete lifecycle management for taking
> > > > > > > > a bookie out and adding a bookie back.
> > > > > > > >
> > > > > > > > 1 is the immediate fix, so correct operations can still guarantee
> > > > > > > > correctness.
> > > > > > > >
> > > > > > >
> > > > > > > I agree we need to take care of #1 ASAP and have issues opened and
> > > > > > > designs for #2 and #3.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > JV
> > > > > > >
> > > > > > > >
> > > > > > > > - Sijie
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > On Fri, Oct 6, 2017 at 9:35 AM, Venkateswara Rao Jujjuri
> > > > > > > > <jujjuri@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > > However, imagine that the fenced message is only in the
> > > > > > > > > > journal on b2, b2 crashes, something wipes the journal
> > > > > > > > > > directory and then b2 comes back up.
> > > > > > > > >
> > > > > > > > > In this case what happened?
> > > > > > > > > 1. We have WQ = 1
> > > > > > > > > 2. We had data loss (crash and come up clean)
> > > > > > > > >
> > > > > > > > > But yeah, in addition to data loss we have a fencing violation
> > > > > > > > > too.
> > > > > > > > > The problem is not just the wiped journal dir, but how we
> > > > > > > > > recognize the bookie.
> > > > > > > > > A bookie is recognized just by its ip address, not by its
> > > > > > > > > incarnation.
> > > > > > > > > Bookie1 at T1 (b1t1) and the same bookie1 at T2 after bookie
> > > > > > > > > format (b1t2) should be two different bookies, shouldn't they?
> > > > > > > > > This is needed for the replication worker and the auditor too.
> > > > > > > > >
> > > > > > > > > Also, the bookie needs to know whether the writer/reader intends
> > > > > > > > > to read from b1t2 and not from b1t1.
> > > > > > > > > Looks like we have a hole here? Or I may not be fully
> > > > > > > > > understanding the cookie verification mechanism.
> > > > > > > > >
> > > > > > > > > Also, as Ivan pointed out, we appear to treat the lack of a
> > > > > > > > > journal as implicitly a new bookie, but the overall cluster
> > > > > > > > > doesn't differentiate between incarnations.
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > JV
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > On Fri, Oct 6, 2017 at 8:46 AM, Ivan Kelly <ivank@apache.org> wrote:
> > > > > > > > > >
> > > > > > > > > > > > The case you described here is "almost correct". But there is a key
> > > > > > > > > > > > point here: B2 can't start up if the journal disk is wiped out,
> > > > > > > > > > > > because the cookie is missing.
> > > > > > > > > > > This is what I expected to see, but isn't the case.
> > > > > > > > > > > <snip>
> > > > > > > > > > >       List<Cookie> journalCookies = Lists.newArrayList();
> > > > > > > > > > >             // try to read cookie from journal directory.
> > > > > > > > > > >             for (File journalDirectory : journalDirectories) {
> > > > > > > > > > >                 try {
> > > > > > > > > > >                     Cookie journalCookie = Cookie.readFromDirectory(journalDirectory);
> > > > > > > > > > >                     journalCookies.add(journalCookie);
> > > > > > > > > > >                     if (journalCookie.isBookieHostCreatedFromIp()) {
> > > > > > > > > > >                         conf.setUseHostNameAsBookieID(false);
> > > > > > > > > > >                     } else {
> > > > > > > > > > >                         conf.setUseHostNameAsBookieID(true);
> > > > > > > > > > >                     }
> > > > > > > > > > >                 } catch (FileNotFoundException fnf) {
> > > > > > > > > > >                     newEnv = true;
> > > > > > > > > > >                     missedCookieDirs.add(journalDirectory);
> > > > > > > > > > >                 }
> > > > > > > > > > >             }
> > > > > > > > > > > </snip>
> > > > > > > > > > >
> > > > > > > > > > > So if a journal is missing the cookie, newEnv is set to true. This
> > > > > > > > > > > disables the later checks.
> > > > > > > > > > >
> > > > > > > > > > > > However, it can still happen in a different case: a bit flip. In your
> > > > > > > > > > > > case, if the fence bit in b2 is already persisted on disk but it got
> > > > > > > > > > > > corrupted, then it will cause the issue you described. One problem is
> > > > > > > > > > > > that we don't have a checksum on the index file header where it stores
> > > > > > > > > > > > those fence bits.
> > > > > > > > > > > Yes, this is also an issue.
> > > > > > > > > >
> > > > > > > > > > -Ivan
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > Jvrao
> > > > > > > > > ---
> > > > > > > > > > First they ignore you, then they laugh at you, then they fight you,
> > > > > > > > > > then you win. - Mahatma Gandhi
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Jvrao
> > > > > > > ---
> > > > > > > > First they ignore you, then they laugh at you, then they fight you,
> > > > > > > > then you win. - Mahatma Gandhi
> > > > > > >
> > > > > > --
> > > > > >
> > > > > >
> > > > > > -- Enrico Olivelli
> > > > > >
> > > > >
> > > > --
> > > >
> > > >
> > > > -- Enrico Olivelli
> > > >
> > >
> >
>

Re: Cookies and empty disks

Posted by Sijie Guo <gu...@gmail.com>.
okay, but why do you want to track the list of files? I don't get your idea
here.

- Sijie

On Sun, Oct 8, 2017 at 11:45 PM, Enrico Olivelli <eo...@gmail.com>
wrote:

> 2017-10-09 7:52 GMT+02:00 Sijie Guo <gu...@gmail.com>:
>
> > On Sat, Oct 7, 2017 at 9:53 AM, Enrico Olivelli <eo...@gmail.com>
> > wrote:
> >
> > > On Sat, Oct 7, 2017 at 00:27, Sijie Guo <gu...@gmail.com> wrote:
> > >
> > > > Enrico,
> > > >
> > > > Let's try to come to a conclusion or an agreement on what we should
> > > > fix and improve, before talking about who is going to drive this.
> > > >
> > >
> > > Sure.
> > >
> > > This is my point of view:
> > > We have separate issues:
> > > - missing checksums, to protect fence bits
> > > - a bug in bookie boot; we should not allow empty directories
> > > - having a clear lifecycle for the bookie, add/remove
> > > - dealing with reincarnation of bookies
> > > - ensuring the correctness of the contents of the directories of the
> > >   bookie
> > >
> > > I would like to add a new point: we have the cookie inside every
> > > configured directory managed by the bookie.
> > > No cookie -> no boot
> > > This will not be enough; we have to write in that file not only the
> > > identity of the bookie but also the list of files expected to be in
> > > the directory.
> > > This way you will not boot with a corrupted directory.
> > > Config ->  list of dirs -> list of files
> > >
> >
> > I am not sure why this is a new point. This is exactly what cookie is
> > doing, no?
> >
>
> Sorry, I can't find such behavior in the code on the master branch:
> https://github.com/apache/bookkeeper/blob/master/bookkeeper-server/src/main/java/org/apache/bookkeeper/bookie/Cookie.java
>
> If we have a copy of the cookie inside each directory (index + data +
> journal), I mean that each cookie file should carry the exact list of
> files expected to be present in that directory at boot.
> So, for instance, when you add a new file to the set of files in a journal
> directory, you must update the cookie file in that directory; the same
> goes for the index and data directories.
>
> Maybe I am missing something.
> It seems to me that the cookie contains only a list of directories, not of
> "files".
>
> Enrico
>
> > >
> > > I agree that the bookie should be added (bookie format) only if
> > > there is no reference to it in zk.
> > > The bookie format operation should write the cookie in every configured
> > > directory so that a bookie with empty directories won't ever start.
> > >
> > > I have to think more about this, but I wanted to share my first
> > > thoughts.
> > >
> > > Enrico
> > >

Re: Cookies and empty disks

Posted by Enrico Olivelli <eo...@gmail.com>.
2017-10-09 7:52 GMT+02:00 Sijie Guo <gu...@gmail.com>:

> On Sat, Oct 7, 2017 at 9:53 AM, Enrico Olivelli <eo...@gmail.com>
> wrote:
>
> > On Sat, Oct 7, 2017 at 00:27, Sijie Guo <gu...@gmail.com> wrote:
> >
> > > Enrico,
> > >
> > > Let's try to come to a conclusion or an agreement on what we should fix
> > > and improve, before talking about who is going to drive this.
> > >
> >
> > Sure.
> >
> > This is my point of view:
> > We have separate issues:
> > - missing checksums, to protect fence bits
> > - a bug in bookie boot; we should not allow empty directories
> > - having a clear lifecycle for the bookie, add/remove
> > - dealing with reincarnation of bookies
> > - ensuring the correctness of the contents of the directories of the
> >   bookie
> >
> > I would like to add a new point: we have the cookie inside every
> > configured directory managed by the bookie.
> > No cookie -> no boot
> > This will not be enough; we have to write in that file not only the
> > identity of the bookie but also the list of files expected to be in the
> > directory.
> > This way you will not boot with a corrupted directory.
> > Config ->  list of dirs -> list of files
> >
>
> I am not sure why this is a new point. This is exactly what cookie is
> doing, no?
>

Sorry, I can't find such behavior in the code on the master branch:
https://github.com/apache/bookkeeper/blob/master/bookkeeper-server/src/main/java/org/apache/bookkeeper/bookie/Cookie.java

If we have a copy of the cookie inside each directory (index + data +
journal), I mean that each cookie file should carry the exact list of files
expected to be present in that directory at boot.
So, for instance, when you add a new file to the set of files in a journal
directory, you must update the cookie file in that directory; the same goes
for the index and data directories.
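
To make the idea concrete, here is a minimal sketch of what I have in mind.
It is not how Cookie.java works today: the class name DirectoryManifest, the
file name MANIFEST and both method names are made up for illustration, and
the file list is kept in a separate file only to keep the example short.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical sketch; not part of BookKeeper's Cookie class. */
final class DirectoryManifest {
    // Assumed file name, chosen only for this example.
    static final String MANIFEST_NAME = "MANIFEST";

    /** Records the names of the regular files currently present in dir. */
    static void write(Path dir) throws IOException {
        Files.write(dir.resolve(MANIFEST_NAME), listFiles(dir));
    }

    /**
     * Returns the expected files that are missing from dir.
     * An empty result means every recorded file is still present.
     * A missing manifest is treated as an error, so an emptied
     * directory is never mistaken for a new one.
     */
    static Set<String> missingFiles(Path dir) throws IOException {
        Path manifest = dir.resolve(MANIFEST_NAME);
        if (!Files.exists(manifest)) {
            throw new IOException("No manifest in " + dir + ", refusing to boot");
        }
        Set<String> expected = new TreeSet<>(Files.readAllLines(manifest));
        expected.removeAll(listFiles(dir));
        return expected;
    }

    /** Lists regular file names in dir, excluding the manifest itself. */
    private static Set<String> listFiles(Path dir) throws IOException {
        try (Stream<Path> entries = Files.list(dir)) {
            return entries
                    .filter(Files::isRegularFile)
                    .map(p -> p.getFileName().toString())
                    .filter(name -> !name.equals(MANIFEST_NAME))
                    .collect(Collectors.toCollection(TreeSet::new));
        }
    }
}
```

At boot the bookie would call missingFiles on every configured directory and
refuse to start if any result is non-empty; write would have to be called
again whenever a file is added to or removed from a directory.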

Maybe I am missing something.
It seems to me that the cookie contains only a list of directories, not of
"files".

Enrico




>
>
> >
> > I agree that the bookie should be added (bookie format) only if
> > there is no reference to it in zk.
> > The bookie format operation should write the cookie in every configured
> > directory so that a bookie with empty directories won't ever start.
> >
> > I have to think more about this, but I wanted to share my first thoughts.
> >
> > Enrico
> >
> >
> > > - Sijie
> > >
> > > On Fri, Oct 6, 2017 at 1:14 PM, Enrico Olivelli <eo...@gmail.com>
> > > wrote:
> > >
> > > > +1 for fixing the problem of missing cookie in 4.6
> > > >
> > > > Who drives the issue?
> > > >
> > > > Thank you all for the interesting points
> > > > Enrico
> > > >
> > > > > On Fri, Oct 6, 2017 at 21:27, Venkateswara Rao Jujjuri <jujjuri@gmail.com>
> > > > > wrote:
> > > >
> > > > > Thanks for the writeup Sijie, comments below.
> > > > >
> > > > > On Fri, Oct 6, 2017 at 12:14 PM, Sijie Guo <gu...@gmail.com>
> > wrote:
> > > > >
> > > > > > I think the question is mainly around "how do we recognize the
> > > bookie"
> > > > or
> > > > > > "incarnations". And the purpose of a cookie is designed for
> > > addressing
> > > > > > "incarnations".
> > > > > >
> > > > > > I will try to cover following aspects, and will try to answer
> > > questions
> > > > > > that Ivan and JV raised.
> > > > > >
> > > > > > - what is cookie?
> > > > > > - how the behavior became bad?
> > > > > > - how do we fix current bad behavior?
> > > > > > - is the cookie enough?
> > > > > >
> > > > > >
> > > > > > *What is Cookie?*
> > > > > >
> > > > > > > The cookie was originally introduced in this commit:
> > > > > > > https://github.com/apache/bookkeeper/commit/c6cc7cca3a85603c8e935ba6d06fbf3d8d7a7eb5
> > > > > > >
> > > > > > > A cookie is an identifier of a bookie. A cookie is created in
> > > > > > > ZooKeeper when a brand new bookie joins the cluster, and it
> > > > > > > represents the bookie instance during its lifecycle. The cookie is
> > > > > > > stored on all the disks for verification purposes, so if any of the
> > > > > > > disks is missing the cookie (e.g. disks were reformatted or wiped
> > > > > > > out, or disks are not mounted correctly), a bookie will refuse to
> > > > > > > start.
> > > > > >
> > > > > >
> > > > > > *How the behavior became bad?*
> > > > > >
> > > > > > > The original behavior worked as expected, using the cookie in
> > > > > > > ZooKeeper as the source of truth. See
> > > > > > > https://github.com/apache/bookkeeper/commit/c6cc7cca3a85603c8e935ba6d06fbf3d8d7a7eb5
> > > > > > >
> > > > > > > The behavior was changed at
> > > > > > > https://github.com/apache/bookkeeper/commit/19b821c63b91293960041bca7b031614a109a7b8
> > > > > > > when trying to support both IP and hostname. That change used the
> > > > > > > journal directory as the source of truth for verifying cookies.
> > > > > > >
> > > > > > > At the community meeting, I was saying a bookie should refuse to
> > > > > > > start when a cookie file is missing locally, and that was my
> > > > > > > operational experience. It turns out Twitter's branch didn't include
> > > > > > > the change at 19b821c63b91293960041bca7b031614a109a7b8,
> > > > > > > so it still had the original behavior from
> > > > > > > c6cc7cca3a85603c8e935ba6d06fbf3d8d7a7eb5.
> > > > > >
> > > > > > *How do we fix current bad behavior?*
> > > > > >
> > > > > > We basically need to revert the current behaviour to the original
> > > > > designed
> > > > > > behavior. The cookie in zookeeper should be the source-of-truth
> for
> > > > > > validation.
> > > > > >
> > > > > > If the cookie works as expected (change the behavior to the
> > original
> > > > > > behavior), then it is the operational or lifecycle management
> > issue I
> > > > > > explained above.
> > > > > >
> > > > > > If a bookie failed with missing cookie, it should be:
> > > > > >
> > > > > > 1. taken out of the cluster
> > > > > > 2. run re-replication (autorecovery or manual recovery)
> > > > > > 3. ensure no ledgers using this bookie any more
> > > > > > 4. reformat the bookie
> > > > > > 5. add it back
> > > > > >
> > > > > > This can be automated by hooking into a scheduler (like k8s or
> > > mesos).
> > > > > But
> > > > > > it requires some sort of lifecycle management in order to
> automate
> > > such
> > > > > > operations. There is a BP-4:
> > > > > > https://cwiki.apache.org/confluence/display/BOOKKEEPER/
> > > > > > BP-4+-+BookKeeper+Lifecycle+Management
> > > > > > proposed for this purpose.
> > > > > >
> > > > > >
> > > > > > *Is the cookie enough?*
> > > > > >
> > > > > > Cookie (if we revert the current behavior to the original
> > behavior),
> > > > > should
> > > > > > be able to address most of the issues related to "incarnations".
> > > > > >
> > > > > > There are still some corner cases will violate correctness
> issues.
> > > They
> > > > > are
> > > > > > related to "dangling writers" described in Ivan's first comment.
> > > > > >
> > > > > > How can a writer tell whether bookies changed or ledger changed
> > when
> > > it
> > > > > > gets network partitioned?
> > > > > >
> > > > > > 1) Bookie Changed.
> > > > > >
> > > > > > Bookie can be reformatted and re-added to the cluster. Ivan and
> JV
> > > > > already
> > > > > > touch this on adding UUID.
> > > > > >
> > > > > > I think the UUID doesn't have to be part of ledger metadata.
> > because
> > > > > > auditor and replication worker would use the lifecycle management
> > for
> > > > > > managing the lifecycle of bookies.
> > > > > >
> > > > >
> > > > > You are suggesting that the 'manual/scripted' lifecycle tool is to
> > the
> > > > > rescue.
> > > > > a side cart solution.
> > > > >
> > > > > But what are we saving by not keeping this info in the metadata?
> > > > > metadata size? sure it is a huge win in ZK environment.
> > > > >
> > > > > >
> > > > > > But the connection should have the UUID informations.
> > > > > >
> > > > >
> > > > > By this you are suggesting  service discovery portion need to have
> > UUID
> > > > > info
> > > > > but not metadata portion. Won't it be confusing to handle a case
> > where
> > > > > write fails
> > > > > on bookie because of UUID mismatch, and you may need to handle that
> > > case
> > > > > and if you go back to the same bookie then no ensmeble changes.
> > > > >
> > > > > On the other hand if we introduce UUID into metadata, then we don't
> > > need
> > > > to
> > > > > be
> > > > > explicitly depend on the side-cart solution.
> > > > >
> > > > >
> > > > >
> > > > > > Basically, any bookie client connects to a bookie, it needs to
> > carry
> > > > the
> > > > > > namespace uuid and the bookie uuid to ensure bookie is connecting
> > to
> > > a
> > > > > > right bookie. This would prevent "dangling writers" connect to
> > > bookies
> > > > > that
> > > > > > are reformatted and added back.
> > > > > >
> > > > > >  While this is an issue, the problem can only get exposed in
> > > > pathological
> > > > > scenario
> > > > > where AQ bookies have went through this scenario, which is ~ 3
> > > > >
> > > > >
> > > > > 2) Ledger Changed.
> > > > > >
> > > > > > It is similar as what the case that Ivan' described. If a writer
> > > > becomes
> > > > > > 'network partitioned', and the ledger is deleted during this
> > period,
> > > > > after
> > > > > > the writer comes back, the writer can still successfully write
> > > entries
> > > > to
> > > > > > the bookies, because the ledgers are already deleted and all the
> > > > fencing
> > > > > > bits are gone.
> > > > > >
> > > > > > This violates the expectation of "fencing". but I am not sure we
> > need
> > > > to
> > > > > > spend time on fixing this, because the ledger is already
> explicitly
> > > > > deleted
> > > > > > by the application. so I think the behavior should be categorized
> > as
> > > > > > "undefined", just like "deleting a ledger when a writer is still
> > > > writing
> > > > > > entries" is a undefined behavior.
> > > > > >
> > > > > >
> > > > > > To summarize my thought on this:
> > > > > >
> > > > > > > 1. We need to revert the cookie behaviour to the original behavior
> > > > > > > and make sure the cookie works as expected.
> > > > > > > 2. Introduce a UUID or epoch in the cookie. A client connection
> > > > > > > should carry the namespace UUID and the bookie UUID when
> > > > > > > establishing the connection.
> > > > > > > 3. Work on BP-4 to have complete lifecycle management for taking a
> > > > > > > bookie out and adding a bookie back.
> > > > > > >
> > > > > > > 1 is the immediate fix, so correct operations can still guarantee
> > > > > > > correctness.
> > > > > >
> > > > >
> > > > > I agree we need to take care of #1 ASAP and have a Issues opened
> and
> > > > > designs for #2 and #3.
> > > > >
> > > > > Thanks,
> > > > > JV
> > > > >
> > > > > >
> > > > > > - Sijie
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Fri, Oct 6, 2017 at 9:35 AM, Venkateswara Rao Jujjuri <
> > > > > > jujjuri@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > > However, imagine that the fenced message is only in the
> journal
> > > on
> > > > > b2,
> > > > > > > > b2 crashes, something wipes the journal directory and then b2
> > > comes
> > > > > > > > back up.
> > > > > > >
> > > > > > > In this case what happened?
> > > > > > > 1. We have WQ = 1
> > > > > > > 2. We had data loss (crash and comeup clean)
> > > > > > >
> > > > > > > But yeah, in addition to dataloss we have fencing violation
> too.
> > > > > > > The problem is not just wiped journal dir, but how we recognize
> > the
> > > > > > bookie.
> > > > > > > Bookie is just recognized by its ip address, not by its
> > > incarnation.
> > > > > > > Bookie1 at T1  (b1t1) ; and same bookie1 at T2 after bookie
> > format
> > > > > (b1t2)
> > > > > > > should be two different bookies, isn;t it?
> > > > > > > this is needed for the replication worker and the auditor too.
> > > > > > >
> > > > > > > Also, bookie needs to know if the writer/reader is intended to
> > read
> > > > > from
> > > > > > > b1t2 not from b1t1.
> > > > > > > Looks like we have a hole here? Or I may not be fully
> > understanding
> > > > > > cookie
> > > > > > > verification mechanism.
> > > > > > >
> > > > > > > Also as Ivan pointed out, we appear to think the lack of
> journal
> > is
> > > > > > > implicitly a new bookie, but overall cluster doesn't
> > differentiate
> > > > > > between
> > > > > > > incarnations.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > JV
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Fri, Oct 6, 2017 at 8:46 AM, Ivan Kelly <iv...@apache.org>
> > > wrote:
> > > > > > >
> > > > > > > > > The case you described here is "almost correct". But there
> is
> > > an
> > > > > key
> > > > > > > > here:
> > > > > > > > > B2 can't startup itself if journal disk is wiped out,
> because
> > > the
> > > > > > > cookie
> > > > > > > > is
> > > > > > > > > missed.
> > > > > > > > This is what I expected to see, but isn't the case.
> > > > > > > > <snip>
> > > > > > > >       List<Cookie> journalCookies = Lists.newArrayList();
> > > > > > > >             // try to read cookie from journal directory.
> > > > > > > >             for (File journalDirectory : journalDirectories)
> {
> > > > > > > >                 try {
> > > > > > > >                     Cookie journalCookie =
> > > > > > > > Cookie.readFromDirectory(journalDirectory);
> > > > > > > >                     journalCookies.add(journalCookie);
> > > > > > > >                     if
> > > (journalCookie.isBookieHostCreatedFromIp())
> > > > {
> > > > > > > >                         conf.setUseHostNameAsBookieID(
> false);
> > > > > > > >                     } else {
> > > > > > > >                         conf.setUseHostNameAsBookieID(true);
> > > > > > > >                     }
> > > > > > > >                 } catch (FileNotFoundException fnf) {
> > > > > > > >                     newEnv = true;
> > > > > > > >                     missedCookieDirs.add(journalDirectory);
> > > > > > > >                 }
> > > > > > > >             }
> > > > > > > > </snip>
> > > > > > > >
> > > > > > > > So if a journal is missing the cookie, newEnv is set to true.
> > > This
> > > > > > > > disabled the later checks.
> > > > > > > >
> > > > > > > > > Hower it can still happen in a different case: bit flap. In
> > > your
> > > > > > case,
> > > > > > > if
> > > > > > > > > fence bit in b2 is already persisted on disk, but it got
> > > > corrupted.
> > > > > > > Then
> > > > > > > > it
> > > > > > > > > will cause the issue you described. One problem is we don't
> > > have
> > > > > > > checksum
> > > > > > > > > on the index file header when it stores those fence bits.
> > > > > > > > Yes, this is also an issue.
> > > > > > > >
> > > > > > > > -Ivan
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Jvrao
> > > > > > > ---
> > > > > > > First they ignore you, then they laugh at you, then they fight
> > you,
> > > > > then
> > > > > > > you win. - Mahatma Gandhi
> > > > > > >
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Jvrao
> > > > > ---
> > > > > First they ignore you, then they laugh at you, then they fight you,
> > > then
> > > > > you win. - Mahatma Gandhi
> > > > >
> > > > --
> > > >
> > > >
> > > > -- Enrico Olivelli
> > > >
> > >
> > --
> >
> >
> > -- Enrico Olivelli
> >
>

Re: Cookies and empty disks

Posted by Sijie Guo <gu...@gmail.com>.
On Sat, Oct 7, 2017 at 9:53 AM, Enrico Olivelli <eo...@gmail.com> wrote:

> On Sat, Oct 7, 2017, 00:27 Sijie Guo <gu...@gmail.com> wrote:
>
> > Enrico,
> >
> > Let's try to come to a conclusion or an agreement what we should fix and
> > improve, before talking who is going to drive this.
> >
>
> Sure.
>
> This is my point of view:
> We have separate issues:
> - checksums are missing to protect the fence bits
> - there is a bug in bookie boot: we should not allow empty directories
> - we need a clear lifecycle for the bookie (add/remove)
> - we need to deal with reincarnation of bookies
> - we need to ensure the correctness of the contents of the bookie's directories
>
> I would like to add a new point: we have the cookie inside every configured
> directory managed by the bookie.
> No cookie -> no boot
> This will not be enough; we have to write in that file not only the
> identity of the bookie but also the list of files expected to be in the
> directory.
> This way you will not boot with a corrupted directory.
> Config ->  list of dirs -> list of files
>

I am not sure why this is a new point. This is exactly what the cookie is
doing, no?


>
> I agree on the fact that the bookie should be added (bookie format) only if
> there is no reference to it in zk.
> The bookie format operation should write the cookie in any configured
> directory so that a bookie with empty directories won't ever start.
>
> I have to think more about this, but I wanted to share my first thoughts
>
> Enrico
>

Re: Cookies and empty disks

Posted by Enrico Olivelli <eo...@gmail.com>.
On Sat, Oct 7, 2017, 00:27 Sijie Guo <gu...@gmail.com> wrote:

> Enrico,
>
> Let's try to come to a conclusion or an agreement what we should fix and
> improve, before talking who is going to drive this.
>

Sure.

This is my point of view:
We have separate issues:
- checksums are missing to protect the fence bits
- there is a bug in bookie boot: we should not allow empty directories
- we need a clear lifecycle for the bookie (add/remove)
- we need to deal with reincarnation of bookies
- we need to ensure the correctness of the contents of the bookie's directories

I would like to add a new point: we have the cookie inside every configured
directory managed by the bookie.
No cookie -> no boot
This will not be enough; we have to write in that file not only the
identity of the bookie but also the list of files expected to be in the
directory.
This way you will not boot with a corrupted directory.
Config ->  list of dirs -> list of files
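
A minimal sketch of the "Config -> list of dirs -> list of files" check,
under stated assumptions: the cookie file name ("cookie"), its layout (first
line is the bookie identity, each following line is an expected file name)
and the class and method names are all hypothetical and are not taken from
the BookKeeper code base.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class DirectoryManifestCheck {
    // Hypothetical file name and layout, chosen for this sketch only:
    // line 1 is the bookie identity, every following line is an expected file.
    static final String COOKIE_FILE = "cookie";

    /**
     * Returns the problems found in one configured directory.
     * An empty list means the directory may be used to boot.
     */
    static List<String> verify(Path dir, String expectedBookieId) throws IOException {
        List<String> problems = new ArrayList<>();
        Path cookie = dir.resolve(COOKIE_FILE);
        if (!Files.isRegularFile(cookie)) {
            // No cookie -> no boot.
            problems.add("missing cookie in " + dir);
            return problems;
        }
        List<String> lines = Files.readAllLines(cookie);
        if (lines.isEmpty() || !lines.get(0).equals(expectedBookieId)) {
            problems.add("cookie identity mismatch in " + dir);
        }
        // Every file named in the cookie must still be present.
        for (String name : lines.subList(Math.min(1, lines.size()), lines.size())) {
            if (!name.isEmpty() && !Files.exists(dir.resolve(name))) {
                problems.add("expected file missing: " + dir.resolve(name));
            }
        }
        return problems;
    }
}
```

The bookie would run such a check for every directory in the configuration
and refuse to boot if any returned list is non-empty.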

I agree on the fact that the bookie should be added (bookie format) only if
there is no reference to it in zk.
The bookie format operation should write the cookie in any configured
directory so that a bookie with empty directories won't ever start.
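
A rough sketch of that format rule, with loud assumptions: an in-memory
Set<String> stands in for the references kept in zk, and the cookie file
name, class and method names are hypothetical, not BookKeeper APIs.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Collections;
import java.util.List;
import java.util.Set;

public class BookieFormatSketch {
    static final String COOKIE_FILE = "cookie"; // hypothetical file name

    /**
     * Formats a bookie: refuses if the registry (standing in for zk) already
     * references the bookie id, otherwise writes a cookie into every
     * configured directory and registers the bookie.
     */
    static void format(String bookieId, Set<String> registry, List<Path> dirs) throws IOException {
        if (registry.contains(bookieId)) {
            throw new IllegalStateException("bookie " + bookieId + " is still referenced");
        }
        for (Path dir : dirs) {
            Files.createDirectories(dir);
            Files.write(dir.resolve(COOKIE_FILE), Collections.singletonList(bookieId));
        }
        registry.add(bookieId);
    }

    /** Startup check: every configured directory must already hold the cookie. */
    static boolean canStart(String bookieId, List<Path> dirs) throws IOException {
        for (Path dir : dirs) {
            Path cookie = dir.resolve(COOKIE_FILE);
            if (!Files.isRegularFile(cookie)
                    || !Files.readAllLines(cookie).equals(Collections.singletonList(bookieId))) {
                return false; // an empty or foreign directory blocks the boot
            }
        }
        return true;
    }
}
```

With this split, only the explicit format step can create cookies, so a
directory that was wiped after formatting makes canStart return false
instead of being treated as a new environment.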

I have to think more about this, but I wanted to share my first thoughts

Enrico


-- 


-- Enrico Olivelli

Re: Cookies and empty disks

Posted by Sijie Guo <gu...@gmail.com>.
Enrico,

Let's try to come to a conclusion or an agreement on what we should fix and
improve, before talking about who is going to drive this.

- Sijie

On Fri, Oct 6, 2017 at 1:14 PM, Enrico Olivelli <eo...@gmail.com> wrote:

> +1 for fixing the problem of missing cookie in 4.6
>
> Who drives the issue?
>
> Thank you all for the interesting points
> Enrico
>
> On Fri, Oct 6, 2017, 21:27 Venkateswara Rao Jujjuri <ju...@gmail.com>
> wrote:
>
> > Thanks for the writeup Sijie, comments below.
> >
> > On Fri, Oct 6, 2017 at 12:14 PM, Sijie Guo <gu...@gmail.com> wrote:
> >
> > > I think the question is mainly around "how do we recognize the bookie"
> > > or "incarnations". The cookie was designed to address "incarnations".
> > >
> > > I will try to cover the following aspects and to answer the questions
> > > that Ivan and JV raised.
> > >
> > > - what is cookie?
> > > - how the behavior became bad?
> > > - how do we fix current bad behavior?
> > > - is the cookie enough?
> > >
> > >
> > > *What is Cookie?*
> > >
> > > The cookie was originally introduced in this commit:
> > > https://github.com/apache/bookkeeper/commit/c6cc7cca3a85603c8e935ba6d06fbf3d8d7a7eb5
> > >
> > > A cookie is an identifier of a bookie. A cookie is created on zookeeper
> > > when a brand new bookie joins the cluster, and it represents the bookie
> > > instance during its lifecycle. The cookie is stored on all the disks for
> > > verification purposes, so if any of the disks is missing the cookie (e.g.
> > > disks were reformatted or wiped out, or disks are not mounted correctly),
> > > the bookie will refuse to start.
> > >
> > >
> > > *How the behavior became bad?*
> > >
> > > The original behavior worked as expected to use the cookie in zookeeper
> > as
> > > the source of truth. See
> > >
> > https://github.com/apache/bookkeeper/commit/
> c6cc7cca3a85603c8e935ba6d06fbf
> > > 3d8d7a7eb5
> > >
> > >
> > > The behavior was changed at
> > >
> > https://github.com/apache/bookkeeper/commit/
> 19b821c63b91293960041bca7b0316
> > > 14a109a7b8
> > > when trying to support both ip and hostname . It used journal directory
> > as
> > > the source-of-truth for verifying cookies.
> > >
> > > At the community meeting, I was saying a bookie should reject start
> when
> > a
> > > cookie file is missing locally and that was my operational experience.
> It
> > > turns out twitter's branch didn't include the change at
> > > 19b821c63b91293960041bca7b031614a109a7b8,
> > > so it was still the original behavior at
> > > c6cc7cca3a85603c8e935ba6d06fbf3d8d7a7eb5 .
> > >
> > > *How do we fix current bad behavior?*
> > >
> > > We basically need to revert the current behaviour to the original
> > designed
> > > behavior. The cookie in zookeeper should be the source-of-truth for
> > > validation.
> > >
> > > If the cookie works as expected (change the behavior to the original
> > > behavior), then it is the operational or lifecycle management issue I
> > > explained above.
> > >
> > > If a bookie failed with missing cookie, it should be:
> > >
> > > 1. taken out of the cluster
> > > 2. run re-replication (autorecovery or manual recovery)
> > > 3. ensure no ledgers using this bookie any more
> > > 4. reformat the bookie
> > > 5. add it back
> > >
> > > This can be automated by hooking into a scheduler (like k8s or mesos).
> > But
> > > it requires some sort of lifecycle management in order to automate such
> > > operations. There is a BP-4:
> > > https://cwiki.apache.org/confluence/display/BOOKKEEPER/
> > > BP-4+-+BookKeeper+Lifecycle+Management
> > > proposed for this purpose.
> > >
> > >
> > > *Is the cookie enough?*
> > >
> > > Cookie (if we revert the current behavior to the original behavior),
> > should
> > > be able to address most of the issues related to "incarnations".
> > >
> > > There are still some corner cases will violate correctness issues. They
> > are
> > > related to "dangling writers" described in Ivan's first comment.
> > >
> > > How can a writer tell whether bookies changed or ledger changed when it
> > > gets network partitioned?
> > >
> > > 1) Bookie Changed.
> > >
> > > Bookie can be reformatted and re-added to the cluster. Ivan and JV
> > already
> > > touch this on adding UUID.
> > >
> > > I think the UUID doesn't have to be part of ledger metadata. because
> > > auditor and replication worker would use the lifecycle management for
> > > managing the lifecycle of bookies.
> > >
> >
> > You are suggesting that the 'manual/scripted' lifecycle tool is to the
> > rescue.
> > a side cart solution.
> >
> > But what are we saving by not keeping this info in the metadata?
> > metadata size? sure it is a huge win in ZK environment.
> >
> > >
> > > But the connection should have the UUID informations.
> > >
> >
> > By this you are suggesting  service discovery portion need to have UUID
> > info
> > but not metadata portion. Won't it be confusing to handle a case where
> > write fails
> > on bookie because of UUID mismatch, and you may need to handle that case
> > and if you go back to the same bookie then no ensmeble changes.
> >
> > On the other hand if we introduce UUID into metadata, then we don't need
> to
> > be
> > explicitly depend on the side-cart solution.
> >
> >
> >
> > > Basically, any bookie client connects to a bookie, it needs to carry
> the
> > > namespace uuid and the bookie uuid to ensure bookie is connecting to a
> > > right bookie. This would prevent "dangling writers" connect to bookies
> > that
> > > are reformatted and added back.
> > >
> > >  While this is an issue, the problem can only get exposed in
> pathological
> > scenario
> > where AQ bookies have went through this scenario, which is ~ 3
> >
> >
> > 2) Ledger Changed.
> > >
> > > It is similar as what the case that Ivan' described. If a writer
> becomes
> > > 'network partitioned', and the ledger is deleted during this period,
> > after
> > > the writer comes back, the writer can still successfully write entries
> to
> > > the bookies, because the ledgers are already deleted and all the
> fencing
> > > bits are gone.
> > >
> > > This violates the expectation of "fencing". but I am not sure we need
> to
> > > spend time on fixing this, because the ledger is already explicitly
> > deleted
> > > by the application. so I think the behavior should be categorized as
> > > "undefined", just like "deleting a ledger when a writer is still
> writing
> > > entries" is a undefined behavior.
> > >
> > >
> > > To summarize my thought on this:
> > >
> > > 1. we need to revert the cookie behaviour to the original behavior.
> make
> > > sure the cookie works as expected.
> > > 2. introduce UUID or epoch in the cookie. client connection should
> carry
> > > namespace uuid and bookie uuid when establishing the connection.
> > > 3. work on BP-4 to have a complete lifecycle management to take bookie
> > out
> > > and add bookie out.
> > >
> > > 1 is the immediate fix, so correct operations can still guarantee the
> > > correctness.
> > >
> >
> > I agree we need to take care of #1 ASAP and have a Issues opened and
> > designs for #2 and #3.
> >
> > Thanks,
> > JV
> >
> > >
> > > - Sijie
> > >
> > >
> > >
> > > On Fri, Oct 6, 2017 at 9:35 AM, Venkateswara Rao Jujjuri <
> > > jujjuri@gmail.com>
> > > wrote:
> > >
> > > > > However, imagine that the fenced message is only in the journal on
> > b2,
> > > > > b2 crashes, something wipes the journal directory and then b2 comes
> > > > > back up.
> > > >
> > > > In this case what happened?
> > > > 1. We have WQ = 1
> > > > 2. We had data loss (crash and comeup clean)
> > > >
> > > > But yeah, in addition to dataloss we have fencing violation too.
> > > > The problem is not just wiped journal dir, but how we recognize the
> > > bookie.
> > > > Bookie is just recognized by its ip address, not by its incarnation.
> > > > Bookie1 at T1  (b1t1) ; and same bookie1 at T2 after bookie format
> > (b1t2)
> > > > should be two different bookies, isn;t it?
> > > > this is needed for the replication worker and the auditor too.
> > > >
> > > > Also, bookie needs to know if the writer/reader is intended to read
> > from
> > > > b1t2 not from b1t1.
> > > > Looks like we have a hole here? Or I may not be fully understanding
> > > cookie
> > > > verification mechanism.
> > > >
> > > > Also as Ivan pointed out, we appear to think the lack of journal is
> > > > implicitly a new bookie, but overall cluster doesn't differentiate
> > > between
> > > > incarnations.
> > > >
> > > > Thanks,
> > > > JV
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > On Fri, Oct 6, 2017 at 8:46 AM, Ivan Kelly <iv...@apache.org> wrote:
> > > >
> > > > > > The case you described here is "almost correct". But there is an
> > key
> > > > > here:
> > > > > > B2 can't startup itself if journal disk is wiped out, because the
> > > > cookie
> > > > > is
> > > > > > missed.
> > > > > This is what I expected to see, but isn't the case.
> > > > > <snip>
> > > > >       List<Cookie> journalCookies = Lists.newArrayList();
> > > > >             // try to read cookie from journal directory.
> > > > >             for (File journalDirectory : journalDirectories) {
> > > > >                 try {
> > > > >                     Cookie journalCookie =
> > > > > Cookie.readFromDirectory(journalDirectory);
> > > > >                     journalCookies.add(journalCookie);
> > > > >                     if (journalCookie.isBookieHostCreatedFromIp())
> {
> > > > >                         conf.setUseHostNameAsBookieID(false);
> > > > >                     } else {
> > > > >                         conf.setUseHostNameAsBookieID(true);
> > > > >                     }
> > > > >                 } catch (FileNotFoundException fnf) {
> > > > >                     newEnv = true;
> > > > >                     missedCookieDirs.add(journalDirectory);
> > > > >                 }
> > > > >             }
> > > > > </snip>
> > > > >
> > > > > So if a journal is missing the cookie, newEnv is set to true. This
> > > > > disabled the later checks.
> > > > >
> > > > > > Hower it can still happen in a different case: bit flap. In your
> > > case,
> > > > if
> > > > > > fence bit in b2 is already persisted on disk, but it got
> corrupted.
> > > > Then
> > > > > it
> > > > > > will cause the issue you described. One problem is we don't have
> > > > checksum
> > > > > > on the index file header when it stores those fence bits.
> > > > > Yes, this is also an issue.
> > > > >
> > > > > -Ivan
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Jvrao
> > > > ---
> > > > First they ignore you, then they laugh at you, then they fight you,
> > then
> > > > you win. - Mahatma Gandhi
> > > >
> > >
> >
> >
> >
> > --
> > Jvrao
> > ---
> > First they ignore you, then they laugh at you, then they fight you, then
> > you win. - Mahatma Gandhi
> >
> --
>
>
> -- Enrico Olivelli
>

Re: Cookies and empty disks

Posted by Enrico Olivelli <eo...@gmail.com>.
+1 for fixing the problem of missing cookie in 4.6

Who drives the issue?

Thank you all for the interesting points
Enrico

On Fri, Oct 6, 2017 at 21:27, Venkateswara Rao Jujjuri <ju...@gmail.com> wrote:

> Thanks for the writeup Sijie, comments below.
>
> On Fri, Oct 6, 2017 at 12:14 PM, Sijie Guo <gu...@gmail.com> wrote:
>
> > I think the question is mainly around "how do we recognize the bookie" or
> > "incarnations". And the purpose of a cookie is designed for addressing
> > "incarnations".
> >
> > I will try to cover following aspects, and will try to answer questions
> > that Ivan and JV raised.
> >
> > - what is cookie?
> > - how the behavior became bad?
> > - how do we fix current bad behavior?
> > - is the cookie enough?
> >
> >
> > *What is Cookie?*
> >
> > Cookie is originally introduced in this commit -
> >
> > https://github.com/apache/bookkeeper/commit/c6cc7cca3a85603c8e935ba6d06fbf3d8d7a7eb5
> > .
> >
> > A cookie is a identifier of a bookie. A cookie is created on zookeeper
> when
> > a brand new bookie joint the cluster, the cookie is representing the
> bookie
> > instance
> > during its lifecycle. The cookie is stored on all the disks for
> > verification purpose. so if any of the disks misses the cookie (e.g.
> disks
> > were reformat or wiped out,
> > disks are not mounted correctly), a bookie will reject to start.
> >
> >
> > *How the behavior became bad?*
> >
> > The original behavior worked as expected to use the cookie in zookeeper as
> > the source of truth. See
> > https://github.com/apache/bookkeeper/commit/c6cc7cca3a85603c8e935ba6d06fbf3d8d7a7eb5
> >
> >
> > The behavior was changed at
> > https://github.com/apache/bookkeeper/commit/19b821c63b91293960041bca7b031614a109a7b8
> > when trying to support both ip and hostname. It used journal directory as
> > the source-of-truth for verifying cookies.
> >
> > At the community meeting, I was saying a bookie should reject start when
> a
> > cookie file is missing locally and that was my operational experience. It
> > turns out twitter's branch didn't include the change at
> > 19b821c63b91293960041bca7b031614a109a7b8,
> > so it was still the original behavior at
> > c6cc7cca3a85603c8e935ba6d06fbf3d8d7a7eb5 .
> >
> > *How do we fix current bad behavior?*
> >
> > We basically need to revert the current behaviour to the original
> designed
> > behavior. The cookie in zookeeper should be the source-of-truth for
> > validation.
> >
> > If the cookie works as expected (change the behavior to the original
> > behavior), then it is the operational or lifecycle management issue I
> > explained above.
> >
> > If a bookie failed with missing cookie, it should be:
> >
> > 1. taken out of the cluster
> > 2. run re-replication (autorecovery or manual recovery)
> > 3. ensure no ledgers using this bookie any more
> > 4. reformat the bookie
> > 5. add it back
> >
> > This can be automated by hooking into a scheduler (like k8s or mesos).
> But
> > it requires some sort of lifecycle management in order to automate such
> > operations. There is a BP-4:
> > https://cwiki.apache.org/confluence/display/BOOKKEEPER/BP-4+-+BookKeeper+Lifecycle+Management
> > proposed for this purpose.
> >
> >
> > *Is the cookie enough?*
> >
> > Cookie (if we revert the current behavior to the original behavior),
> should
> > be able to address most of the issues related to "incarnations".
> >
> > There are still some corner cases will violate correctness issues. They
> are
> > related to "dangling writers" described in Ivan's first comment.
> >
> > How can a writer tell whether bookies changed or ledger changed when it
> > gets network partitioned?
> >
> > 1) Bookie Changed.
> >
> > Bookie can be reformatted and re-added to the cluster. Ivan and JV
> already
> > touch this on adding UUID.
> >
> > I think the UUID doesn't have to be part of ledger metadata. because
> > auditor and replication worker would use the lifecycle management for
> > managing the lifecycle of bookies.
> >
>
> You are suggesting that the 'manual/scripted' lifecycle tool is to the
> rescue.
> a side cart solution.
>
> But what are we saving by not keeping this info in the metadata?
> metadata size? sure it is a huge win in ZK environment.
>
> >
> > But the connection should have the UUID informations.
> >
>
> By this you are suggesting  service discovery portion need to have UUID
> info
> but not metadata portion. Won't it be confusing to handle a case where
> write fails
> on bookie because of UUID mismatch, and you may need to handle that case
> and if you go back to the same bookie then no ensmeble changes.
>
> On the other hand if we introduce UUID into metadata, then we don't need to
> be
> explicitly depend on the side-cart solution.
>
>
>
> > Basically, any bookie client connects to a bookie, it needs to carry the
> > namespace uuid and the bookie uuid to ensure bookie is connecting to a
> > right bookie. This would prevent "dangling writers" connect to bookies
> that
> > are reformatted and added back.
> >
> >  While this is an issue, the problem can only get exposed in pathological
> scenario
> where AQ bookies have went through this scenario, which is ~ 3
>
>
> 2) Ledger Changed.
> >
> > It is similar as what the case that Ivan' described. If a writer becomes
> > 'network partitioned', and the ledger is deleted during this period,
> after
> > the writer comes back, the writer can still successfully write entries to
> > the bookies, because the ledgers are already deleted and all the fencing
> > bits are gone.
> >
> > This violates the expectation of "fencing". but I am not sure we need to
> > spend time on fixing this, because the ledger is already explicitly
> deleted
> > by the application. so I think the behavior should be categorized as
> > "undefined", just like "deleting a ledger when a writer is still writing
> > entries" is a undefined behavior.
> >
> >
> > To summarize my thought on this:
> >
> > 1. we need to revert the cookie behaviour to the original behavior. make
> > sure the cookie works as expected.
> > 2. introduce UUID or epoch in the cookie. client connection should carry
> > namespace uuid and bookie uuid when establishing the connection.
> > 3. work on BP-4 to have a complete lifecycle management to take bookie
> out
> > and add bookie out.
> >
> > 1 is the immediate fix, so correct operations can still guarantee the
> > correctness.
> >
>
> I agree we need to take care of #1 ASAP and have a Issues opened and
> designs for #2 and #3.
>
> Thanks,
> JV
>
> >
> > - Sijie
> >
> >
> >
> > On Fri, Oct 6, 2017 at 9:35 AM, Venkateswara Rao Jujjuri <
> > jujjuri@gmail.com>
> > wrote:
> >
> > > > However, imagine that the fenced message is only in the journal on
> b2,
> > > > b2 crashes, something wipes the journal directory and then b2 comes
> > > > back up.
> > >
> > > In this case what happened?
> > > 1. We have WQ = 1
> > > 2. We had data loss (crash and comeup clean)
> > >
> > > But yeah, in addition to dataloss we have fencing violation too.
> > > The problem is not just wiped journal dir, but how we recognize the
> > bookie.
> > > Bookie is just recognized by its ip address, not by its incarnation.
> > > Bookie1 at T1  (b1t1) ; and same bookie1 at T2 after bookie format
> (b1t2)
> > > should be two different bookies, isn;t it?
> > > this is needed for the replication worker and the auditor too.
> > >
> > > Also, bookie needs to know if the writer/reader is intended to read
> from
> > > b1t2 not from b1t1.
> > > Looks like we have a hole here? Or I may not be fully understanding
> > cookie
> > > verification mechanism.
> > >
> > > Also as Ivan pointed out, we appear to think the lack of journal is
> > > implicitly a new bookie, but overall cluster doesn't differentiate
> > between
> > > incarnations.
> > >
> > > Thanks,
> > > JV
> > >
> > >
> > >
> > >
> > >
> > > On Fri, Oct 6, 2017 at 8:46 AM, Ivan Kelly <iv...@apache.org> wrote:
> > >
> > > > > The case you described here is "almost correct". But there is an
> key
> > > > here:
> > > > > B2 can't startup itself if journal disk is wiped out, because the
> > > cookie
> > > > is
> > > > > missed.
> > > > This is what I expected to see, but isn't the case.
> > > > <snip>
> > > >       List<Cookie> journalCookies = Lists.newArrayList();
> > > >             // try to read cookie from journal directory.
> > > >             for (File journalDirectory : journalDirectories) {
> > > >                 try {
> > > >                     Cookie journalCookie =
> > > > Cookie.readFromDirectory(journalDirectory);
> > > >                     journalCookies.add(journalCookie);
> > > >                     if (journalCookie.isBookieHostCreatedFromIp()) {
> > > >                         conf.setUseHostNameAsBookieID(false);
> > > >                     } else {
> > > >                         conf.setUseHostNameAsBookieID(true);
> > > >                     }
> > > >                 } catch (FileNotFoundException fnf) {
> > > >                     newEnv = true;
> > > >                     missedCookieDirs.add(journalDirectory);
> > > >                 }
> > > >             }
> > > > </snip>
> > > >
> > > > So if a journal is missing the cookie, newEnv is set to true. This
> > > > disabled the later checks.
> > > >
> > > > > Hower it can still happen in a different case: bit flap. In your
> > case,
> > > if
> > > > > fence bit in b2 is already persisted on disk, but it got corrupted.
> > > Then
> > > > it
> > > > > will cause the issue you described. One problem is we don't have
> > > checksum
> > > > > on the index file header when it stores those fence bits.
> > > > Yes, this is also an issue.
> > > >
> > > > -Ivan
> > > >
> > >
> > >
> > >
> > > --
> > > Jvrao
> > > ---
> > > First they ignore you, then they laugh at you, then they fight you,
> then
> > > you win. - Mahatma Gandhi
> > >
> >
>
>
>
> --
> Jvrao
> ---
> First they ignore you, then they laugh at you, then they fight you, then
> you win. - Mahatma Gandhi
>
-- 


-- Enrico Olivelli

Re: Cookies and empty disks

Posted by Sijie Guo <gu...@gmail.com>.
On Fri, Oct 6, 2017 at 12:27 PM, Venkateswara Rao Jujjuri <jujjuri@gmail.com> wrote:

> Thanks for the writeup Sijie, comments below.
>
> On Fri, Oct 6, 2017 at 12:14 PM, Sijie Guo <gu...@gmail.com> wrote:
>
> > I think the question is mainly around "how do we recognize the bookie" or
> > "incarnations". And the purpose of a cookie is designed for addressing
> > "incarnations".
> >
> > I will try to cover following aspects, and will try to answer questions
> > that Ivan and JV raised.
> >
> > - what is cookie?
> > - how the behavior became bad?
> > - how do we fix current bad behavior?
> > - is the cookie enough?
> >
> >
> > *What is Cookie?*
> >
> > Cookie is originally introduced in this commit -
> > https://github.com/apache/bookkeeper/commit/c6cc7cca3a85603c8e935ba6d06fbf3d8d7a7eb5
> > .
> >
> > A cookie is a identifier of a bookie. A cookie is created on zookeeper
> when
> > a brand new bookie joint the cluster, the cookie is representing the
> bookie
> > instance
> > during its lifecycle. The cookie is stored on all the disks for
> > verification purpose. so if any of the disks misses the cookie (e.g.
> disks
> > were reformat or wiped out,
> > disks are not mounted correctly), a bookie will reject to start.
> >
> >
> > *How the behavior became bad?*
> >
> > The original behavior worked as expected to use the cookie in zookeeper as
> > the source of truth. See
> > https://github.com/apache/bookkeeper/commit/c6cc7cca3a85603c8e935ba6d06fbf3d8d7a7eb5
> >
> >
> > The behavior was changed at
> > https://github.com/apache/bookkeeper/commit/19b821c63b91293960041bca7b031614a109a7b8
> > when trying to support both ip and hostname. It used journal directory as
> > the source-of-truth for verifying cookies.
> >
> > At the community meeting, I was saying a bookie should reject start when
> a
> > cookie file is missing locally and that was my operational experience. It
> > turns out twitter's branch didn't include the change at
> > 19b821c63b91293960041bca7b031614a109a7b8,
> > so it was still the original behavior at
> > c6cc7cca3a85603c8e935ba6d06fbf3d8d7a7eb5 .
> >
> > *How do we fix current bad behavior?*
> >
> > We basically need to revert the current behaviour to the original
> designed
> > behavior. The cookie in zookeeper should be the source-of-truth for
> > validation.
> >
> > If the cookie works as expected (change the behavior to the original
> > behavior), then it is the operational or lifecycle management issue I
> > explained above.
> >
> > If a bookie failed with missing cookie, it should be:
> >
> > 1. taken out of the cluster
> > 2. run re-replication (autorecovery or manual recovery)
> > 3. ensure no ledgers using this bookie any more
> > 4. reformat the bookie
> > 5. add it back
> >
> > This can be automated by hooking into a scheduler (like k8s or mesos).
> But
> > it requires some sort of lifecycle management in order to automate such
> > operations. There is a BP-4:
> > https://cwiki.apache.org/confluence/display/BOOKKEEPER/BP-4+-+BookKeeper+Lifecycle+Management
> > proposed for this purpose.
> >
> >
> > *Is the cookie enough?*
> >
> > Cookie (if we revert the current behavior to the original behavior),
> should
> > be able to address most of the issues related to "incarnations".
> >
> > There are still some corner cases will violate correctness issues. They
> are
> > related to "dangling writers" described in Ivan's first comment.
> >
> > How can a writer tell whether bookies changed or ledger changed when it
> > gets network partitioned?
> >
> > 1) Bookie Changed.
> >
> > Bookie can be reformatted and re-added to the cluster. Ivan and JV already
> > touch this on adding UUID.
> >
> > I think the UUID doesn't have to be part of ledger metadata. because
> > auditor and replication worker would use the lifecycle management for
> > managing the lifecycle of bookies.
> >
>
> You are suggesting that the 'manual/scripted' lifecycle tool is to the
> rescue.
> a side cart solution.
>
> But what are we saving by not keeping this info in the metadata?
> metadata size? sure it is a huge win in ZK environment.
>


If you never add a malformed bookie back (that is, never remove its cookie
from zookeeper and reformat it), then the cookie is already enough to
guarantee correctness.
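
To make that concrete, here is a rough sketch of the start-up check I have in
mind, with the cookie in zookeeper as the source of truth. The class and method
names are made up for illustration and cookies are reduced to plain strings; it
is not the actual Bookie#checkEnvironment code.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical sketch only: CookieCheck and Decision do not exist in BookKeeper.
final class CookieCheck {
    enum Decision { START_NEW, START_EXISTING, REFUSE }

    /**
     * @param zkCookie   cookie registered in zookeeper for this bookie id, if any
     * @param dirCookies one entry per journal/ledger/index directory; an empty
     *                   Optional means the directory holds no cookie file
     */
    static Decision decide(Optional<String> zkCookie, List<Optional<String>> dirCookies) {
        boolean anyLocal = dirCookies.stream().anyMatch(Optional::isPresent);
        if (!zkCookie.isPresent()) {
            // Never registered. Assumption: leftover local cookies without a
            // zookeeper cookie mean the bookie was not reformatted, so refuse.
            return anyLocal ? Decision.REFUSE : Decision.START_NEW;
        }
        // Registered before: every directory must carry the zookeeper cookie.
        // A wiped journal directory therefore blocks start-up instead of
        // being treated as a new environment.
        for (Optional<String> cookie : dirCookies) {
            if (!cookie.isPresent() || !cookie.get().equals(zkCookie.get())) {
                return Decision.REFUSE;
            }
        }
        return Decision.START_EXISTING;
    }
}
```

The important difference from the current behavior is that a missing cookie in
the journal directory never flips the bookie into "new environment" mode once a
cookie exists in zookeeper.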

So the real correctness issues arise only when we add a bookie back (by
removing its cookie from zookeeper). To allow a bookie to be added back, you
need clean lifecycle management in bookkeeper to ensure correctness. It
answers the question "when is it okay to add a bookie back?".

The approach I am suggesting (regardless of how it is implemented) is to
enforce the rule "*a bookie can be added to a cluster only after no ledgers
are referencing it*". This can be done by the auditor and replication worker,
or by whatever mechanism can achieve it.
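
As an illustration of the rule, the check before re-adding a bookie could look
roughly like the sketch below. The names are hypothetical, and the map stands in
for whatever the auditor would read from the ledger metadata store.

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch only: ReAddCheck is not an existing BookKeeper class.
final class ReAddCheck {
    /**
     * @param bookieAddress   address of the bookie we want to add back
     * @param ledgerEnsembles ledger id mapped to every bookie address that
     *                        appears in any ensemble of that ledger
     * @return true only if no ledger references the bookie any more
     */
    static boolean canReAdd(String bookieAddress, Map<Long, List<String>> ledgerEnsembles) {
        for (List<String> bookies : ledgerEnsembles.values()) {
            if (bookies.contains(bookieAddress)) {
                // Re-replication has not finished for this ledger yet.
                return false;
            }
        }
        return true;
    }
}
```

Only when this returns true would the cookie be removed from zookeeper and the
bookie reformatted and added back.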

The approach you are suggesting (adding a UUID to the ledger metadata) doesn't
enforce any rule of this kind. It says it is okay to add a bookie back at any
time, because the UUID in the ledger metadata will handle it. In theory, it is
correct. However, I have concerns around simplicity, operations and
debuggability with this approach. That is:

- it requires quite a few changes to the metadata format, on clients and also
in replication. Backward compatibility is also a big concern.
- when a problem occurs at midnight, it is going to be hard to investigate,
because a ledger might contain the same bookie at different points in its
lifecycle.


From an operational perspective, I would prefer any simple solution. That is,
I like the simplicity of enforcing "a bookie can be added to a cluster only
after no ledgers are referencing it", but I am fine with any approach that
takes us there.




>
> >
> > But the connection should have the UUID informations.
> >
>
> By this you are suggesting that the service discovery portion needs to have
> the UUID info but the metadata portion does not. Won't it be confusing to
> handle a case where a write fails on a bookie because of a UUID mismatch? You
> may need to handle that case, and if you go back to the same bookie then
> there is no ensemble change.
>
> On the other hand, if we introduce the UUID into the metadata, then we don't
> need to explicitly depend on the sidecar solution.
>


If you keep "a bookie can be added to a cluster only after no ledgers are
referencing it" in mind, then the approach of sending the uuids when
establishing a connection makes sense. At any given time, there is only one
bookie running at a given host:ip, with its namespace uuid and instance uuid.
The combination of namespace uuid and bookie uuid identifies a bookie in a
given cluster. The client uses this information to establish the connection.
If you are carrying the wrong uuids, you are connecting to the wrong cluster
or to a bad bookie, so no operation should succeed. It is just like
'authentication'. It is simple and straightforward, applies to any type of
request in the bookie, and scales easily if we add new types of requests in
the future.
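
A minimal sketch of that connection check, assuming the bookie gets a fresh
instance uuid every time it is reformatted. BookieIdentity and its methods are
hypothetical names; nothing like this exists in the wire protocol today.

```java
import java.util.UUID;

// Hypothetical sketch only: BookieIdentity is not an existing BookKeeper class.
final class BookieIdentity {
    private final UUID namespaceId; // identifies the cluster
    private final UUID instanceId;  // assumed to change on every reformat

    BookieIdentity(UUID namespaceId, UUID instanceId) {
        this.namespaceId = namespaceId;
        this.instanceId = instanceId;
    }

    /** Identity of the same host after a reformat: same cluster, new incarnation. */
    BookieIdentity reformatted(UUID newInstanceId) {
        return new BookieIdentity(namespaceId, newInstanceId);
    }

    /** Checked once per connection, before any add, read or fence request is served. */
    boolean acceptsConnection(UUID clientNamespaceId, UUID clientInstanceId) {
        return namespaceId.equals(clientNamespaceId)
                && instanceId.equals(clientInstanceId);
    }
}
```

A dangling writer that still holds the uuids of b1t1 would be rejected by b1t2
at connection time, so none of its later requests can succeed.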




>
>
> > Basically, any bookie client connects to a bookie, it needs to carry the
> > namespace uuid and the bookie uuid to ensure bookie is connecting to a
> > right bookie. This would prevent "dangling writers" connect to bookies that
> > are reformatted and added back.
> >
>
> While this is an issue, the problem can only get exposed in a pathological
> scenario where AQ bookies have gone through this scenario, which is ~ 3
>


I am not sure I understand the comment here.


>
>
> 2) Ledger Changed.
> >
> > It is similar as what the case that Ivan' described. If a writer becomes
> > 'network partitioned', and the ledger is deleted during this period,
> after
> > the writer comes back, the writer can still successfully write entries to
> > the bookies, because the ledgers are already deleted and all the fencing
> > bits are gone.
> >
> > This violates the expectation of "fencing". but I am not sure we need to
> > spend time on fixing this, because the ledger is already explicitly
> deleted
> > by the application. so I think the behavior should be categorized as
> > "undefined", just like "deleting a ledger when a writer is still writing
> > entries" is a undefined behavior.
> >
> >
> > To summarize my thought on this:
> >
> > 1. we need to revert the cookie behaviour to the original behavior. make
> > sure the cookie works as expected.
> > 2. introduce UUID or epoch in the cookie. client connection should carry
> > namespace uuid and bookie uuid when establishing the connection.
> > 3. work on BP-4 to have a complete lifecycle management to take bookie
> out
> > and add bookie out.
> >
> > 1 is the immediate fix, so correct operations can still guarantee the
> > correctness.
> >
>
> I agree we need to take care of #1 ASAP and have a Issues opened and
> designs for #2 and #3.
>
> Thanks,
> JV
>
> >
> > - Sijie
> >
> >
> >
> > On Fri, Oct 6, 2017 at 9:35 AM, Venkateswara Rao Jujjuri <
> > jujjuri@gmail.com>
> > wrote:
> >
> > > > However, imagine that the fenced message is only in the journal on
> b2,
> > > > b2 crashes, something wipes the journal directory and then b2 comes
> > > > back up.
> > >
> > > In this case what happened?
> > > 1. We have WQ = 1
> > > 2. We had data loss (crash and comeup clean)
> > >
> > > But yeah, in addition to dataloss we have fencing violation too.
> > > The problem is not just wiped journal dir, but how we recognize the
> > bookie.
> > > Bookie is just recognized by its ip address, not by its incarnation.
> > > Bookie1 at T1  (b1t1) ; and same bookie1 at T2 after bookie format
> (b1t2)
> > > should be two different bookies, isn;t it?
> > > this is needed for the replication worker and the auditor too.
> > >
> > > Also, bookie needs to know if the writer/reader is intended to read
> from
> > > b1t2 not from b1t1.
> > > Looks like we have a hole here? Or I may not be fully understanding
> > cookie
> > > verification mechanism.
> > >
> > > Also as Ivan pointed out, we appear to think the lack of journal is
> > > implicitly a new bookie, but overall cluster doesn't differentiate
> > between
> > > incarnations.
> > >
> > > Thanks,
> > > JV
> > >
> > >
> > >
> > >
> > >
> > > On Fri, Oct 6, 2017 at 8:46 AM, Ivan Kelly <iv...@apache.org> wrote:
> > >
> > > > > The case you described here is "almost correct". But there is an
> key
> > > > here:
> > > > > B2 can't startup itself if journal disk is wiped out, because the
> > > cookie
> > > > is
> > > > > missed.
> > > > This is what I expected to see, but isn't the case.
> > > > <snip>
> > > >       List<Cookie> journalCookies = Lists.newArrayList();
> > > >             // try to read cookie from journal directory.
> > > >             for (File journalDirectory : journalDirectories) {
> > > >                 try {
> > > >                     Cookie journalCookie =
> > > > Cookie.readFromDirectory(journalDirectory);
> > > >                     journalCookies.add(journalCookie);
> > > >                     if (journalCookie.isBookieHostCreatedFromIp()) {
> > > >                         conf.setUseHostNameAsBookieID(false);
> > > >                     } else {
> > > >                         conf.setUseHostNameAsBookieID(true);
> > > >                     }
> > > >                 } catch (FileNotFoundException fnf) {
> > > >                     newEnv = true;
> > > >                     missedCookieDirs.add(journalDirectory);
> > > >                 }
> > > >             }
> > > > </snip>
> > > >
> > > > So if a journal is missing the cookie, newEnv is set to true. This
> > > > disabled the later checks.
> > > >
> > > > > Hower it can still happen in a different case: bit flap. In your
> > case,
> > > if
> > > > > fence bit in b2 is already persisted on disk, but it got corrupted.
> > > Then
> > > > it
> > > > > will cause the issue you described. One problem is we don't have
> > > checksum
> > > > > on the index file header when it stores those fence bits.
> > > > Yes, this is also an issue.
> > > >
> > > > -Ivan
> > > >
> > >
> > >
> > >
> > > --
> > > Jvrao
> > > ---
> > > First they ignore you, then they laugh at you, then they fight you,
> then
> > > you win. - Mahatma Gandhi
> > >
> >
>
>
>
> --
> Jvrao
> ---
> First they ignore you, then they laugh at you, then they fight you, then
> you win. - Mahatma Gandhi
>

Re: Cookies and empty disks

Posted by Venkateswara Rao Jujjuri <ju...@gmail.com>.
Thanks for the writeup Sijie, comments below.

On Fri, Oct 6, 2017 at 12:14 PM, Sijie Guo <gu...@gmail.com> wrote:

> I think the question is mainly around "how do we recognize the bookie" or
> "incarnations". And the purpose of a cookie is designed for addressing
> "incarnations".
>
> I will try to cover following aspects, and will try to answer questions
> that Ivan and JV raised.
>
> - what is cookie?
> - how the behavior became bad?
> - how do we fix current bad behavior?
> - is the cookie enough?
>
>
> *What is Cookie?*
>
> Cookie is originally introduced in this commit -
> https://github.com/apache/bookkeeper/commit/c6cc7cca3a85603c8e935ba6d06fbf
> 3d8d7a7eb5
> .
>
> A cookie is a identifier of a bookie. A cookie is created on zookeeper when
> a brand new bookie joint the cluster, the cookie is representing the bookie
> instance
> during its lifecycle. The cookie is stored on all the disks for
> verification purpose. so if any of the disks misses the cookie (e.g. disks
> were reformat or wiped out,
> disks are not mounted correctly), a bookie will reject to start.
>
>
> *How the behavior became bad?*
>
> The original behavior worked as expected to use the cookie in zookeeper as
> the source of truth. See
> https://github.com/apache/bookkeeper/commit/c6cc7cca3a85603c8e935ba6d06fbf
> 3d8d7a7eb5
>
>
> The behavior was changed at
> https://github.com/apache/bookkeeper/commit/19b821c63b91293960041bca7b0316
> 14a109a7b8
> when trying to support both ip and hostname . It used journal directory as
> the source-of-truth for verifying cookies.
>
> At the community meeting, I was saying a bookie should reject start when a
> cookie file is missing locally and that was my operational experience. It
> turns out twitter's branch didn't include the change at
> 19b821c63b91293960041bca7b031614a109a7b8,
> so it was still the original behavior at
> c6cc7cca3a85603c8e935ba6d06fbf3d8d7a7eb5 .
>
> *How do we fix current bad behavior?*
>
> We basically need to revert the current behaviour to the original designed
> behavior. The cookie in zookeeper should be the source-of-truth for
> validation.
>
> If the cookie works as expected (change the behavior to the original
> behavior), then it is the operational or lifecycle management issue I
> explained above.
>
> If a bookie failed with missing cookie, it should be:
>
> 1. taken out of the cluster
> 2. run re-replication (autorecovery or manual recovery)
> 3. ensure no ledgers using this bookie any more
> 4. reformat the bookie
> 5. add it back
>
> This can be automated by hooking into a scheduler (like k8s or mesos). But
> it requires some sort of lifecycle management in order to automate such
> operations. There is a BP-4:
> https://cwiki.apache.org/confluence/display/BOOKKEEPER/
> BP-4+-+BookKeeper+Lifecycle+Management
> proposed for this purpose.
>
>
> *Is the cookie enough?*
>
> Cookie (if we revert the current behavior to the original behavior), should
> be able to address most of the issues related to "incarnations".
>
> There are still some corner cases will violate correctness issues. They are
> related to "dangling writers" described in Ivan's first comment.
>
> How can a writer tell whether bookies changed or ledger changed when it
> gets network partitioned?
>
> 1) Bookie Changed.
>
> Bookie can be reformatted and re-added to the cluster. Ivan and JV already
> touch this on adding UUID.
>
> I think the UUID doesn't have to be part of ledger metadata. because
> auditor and replication worker would use the lifecycle management for
> managing the lifecycle of bookies.
>

You are suggesting that the 'manual/scripted' lifecycle tool comes to the
rescue, as a sidecar solution.

But what are we saving by not keeping this info in the metadata?
Metadata size? Sure, that is a huge win in a ZK environment.

>
> But the connection should have the UUID informations.
>

By this you are suggesting that the service discovery portion needs to have
the UUID info but the metadata portion does not. Won't it be confusing to
handle a case where a write fails on a bookie because of a UUID mismatch?
You may need to handle that case, and if you go back to the same bookie then
there is no ensemble change.

On the other hand, if we introduce the UUID into the metadata, then we don't
need to explicitly depend on the sidecar solution.



> Basically, any bookie client connects to a bookie, it needs to carry the
> namespace uuid and the bookie uuid to ensure bookie is connecting to a
> right bookie. This would prevent "dangling writers" connect to bookies that
> are reformatted and added back.
>

While this is an issue, the problem can only get exposed in a pathological
scenario where AQ bookies have gone through this scenario, which is ~ 3


> 2) Ledger Changed.
>
> It is similar as what the case that Ivan' described. If a writer becomes
> 'network partitioned', and the ledger is deleted during this period, after
> the writer comes back, the writer can still successfully write entries to
> the bookies, because the ledgers are already deleted and all the fencing
> bits are gone.
>
> This violates the expectation of "fencing". but I am not sure we need to
> spend time on fixing this, because the ledger is already explicitly deleted
> by the application. so I think the behavior should be categorized as
> "undefined", just like "deleting a ledger when a writer is still writing
> entries" is a undefined behavior.
>
>
> To summarize my thought on this:
>
> 1. we need to revert the cookie behaviour to the original behavior. make
> sure the cookie works as expected.
> 2. introduce UUID or epoch in the cookie. client connection should carry
> namespace uuid and bookie uuid when establishing the connection.
> 3. work on BP-4 to have a complete lifecycle management to take bookie out
> and add bookie out.
>
> 1 is the immediate fix, so correct operations can still guarantee the
> correctness.
>

I agree we need to take care of #1 ASAP and have issues opened and
designs for #2 and #3.

Thanks,
JV

>
> - Sijie
>
>
>
> On Fri, Oct 6, 2017 at 9:35 AM, Venkateswara Rao Jujjuri <
> jujjuri@gmail.com>
> wrote:
>
> > > However, imagine that the fenced message is only in the journal on b2,
> > > b2 crashes, something wipes the journal directory and then b2 comes
> > > back up.
> >
> > In this case what happened?
> > 1. We have WQ = 1
> > 2. We had data loss (crash and comeup clean)
> >
> > But yeah, in addition to dataloss we have fencing violation too.
> > The problem is not just wiped journal dir, but how we recognize the
> bookie.
> > Bookie is just recognized by its ip address, not by its incarnation.
> > Bookie1 at T1  (b1t1) ; and same bookie1 at T2 after bookie format (b1t2)
> > should be two different bookies, isn;t it?
> > this is needed for the replication worker and the auditor too.
> >
> > Also, bookie needs to know if the writer/reader is intended to read from
> > b1t2 not from b1t1.
> > Looks like we have a hole here? Or I may not be fully understanding
> cookie
> > verification mechanism.
> >
> > Also as Ivan pointed out, we appear to think the lack of journal is
> > implicitly a new bookie, but overall cluster doesn't differentiate
> between
> > incarnations.
> >
> > Thanks,
> > JV
> >
> >
> >
> >
> >
> > On Fri, Oct 6, 2017 at 8:46 AM, Ivan Kelly <iv...@apache.org> wrote:
> >
> > > > The case you described here is "almost correct". But there is an key
> > > here:
> > > > B2 can't startup itself if journal disk is wiped out, because the
> > cookie
> > > is
> > > > missed.
> > > This is what I expected to see, but isn't the case.
> > > <snip>
> > >       List<Cookie> journalCookies = Lists.newArrayList();
> > >             // try to read cookie from journal directory.
> > >             for (File journalDirectory : journalDirectories) {
> > >                 try {
> > >                     Cookie journalCookie =
> > > Cookie.readFromDirectory(journalDirectory);
> > >                     journalCookies.add(journalCookie);
> > >                     if (journalCookie.isBookieHostCreatedFromIp()) {
> > >                         conf.setUseHostNameAsBookieID(false);
> > >                     } else {
> > >                         conf.setUseHostNameAsBookieID(true);
> > >                     }
> > >                 } catch (FileNotFoundException fnf) {
> > >                     newEnv = true;
> > >                     missedCookieDirs.add(journalDirectory);
> > >                 }
> > >             }
> > > </snip>
> > >
> > > So if a journal is missing the cookie, newEnv is set to true. This
> > > disabled the later checks.
> > >
> > > > Hower it can still happen in a different case: bit flap. In your
> case,
> > if
> > > > fence bit in b2 is already persisted on disk, but it got corrupted.
> > Then
> > > it
> > > > will cause the issue you described. One problem is we don't have
> > checksum
> > > > on the index file header when it stores those fence bits.
> > > Yes, this is also an issue.
> > >
> > > -Ivan
> > >
> >
> >
> >
> > --
> > Jvrao
> > ---
> > First they ignore you, then they laugh at you, then they fight you, then
> > you win. - Mahatma Gandhi
> >
>



-- 
Jvrao
---
First they ignore you, then they laugh at you, then they fight you, then
you win. - Mahatma Gandhi

Re: Cookies and empty disks

Posted by Sijie Guo <gu...@gmail.com>.
I think the question is mainly around "how do we recognize the bookie" or
"incarnations". The cookie is designed to address "incarnations".

I will try to cover the following aspects, and will try to answer the
questions that Ivan and JV raised.

- what is a cookie?
- how did the behavior become bad?
- how do we fix the current bad behavior?
- is the cookie enough?


*What is a cookie?*

The cookie was originally introduced in this commit -
https://github.com/apache/bookkeeper/commit/c6cc7cca3a85603c8e935ba6d06fbf3d8d7a7eb5
.

A cookie is an identifier of a bookie. A cookie is created in zookeeper when
a brand new bookie joins the cluster, and it represents the bookie instance
during its lifecycle. The cookie is stored on all the disks for verification
purposes, so if any of the disks is missing the cookie (e.g. disks were
reformatted or wiped out, or disks are not mounted correctly), the bookie
will refuse to start.


*How did the behavior become bad?*

The original behavior worked as expected: it used the cookie in zookeeper as
the source of truth. See
https://github.com/apache/bookkeeper/commit/c6cc7cca3a85603c8e935ba6d06fbf3d8d7a7eb5


The behavior was changed at
https://github.com/apache/bookkeeper/commit/19b821c63b91293960041bca7b031614a109a7b8
when trying to support both IP and hostname. Since then the journal directory
has been used as the source-of-truth for verifying cookies.

At the community meeting, I said a bookie should refuse to start when a
cookie file is missing locally, and that was my operational experience. It
turns out twitter's branch didn't include the change at
19b821c63b91293960041bca7b031614a109a7b8,
so it still had the original behavior from
c6cc7cca3a85603c8e935ba6d06fbf3d8d7a7eb5.

*How do we fix the current bad behavior?*

We basically need to revert the current behaviour to the originally designed
behavior. The cookie in zookeeper should be the source-of-truth for
validation.

If the cookie works as expected (i.e. the behavior is changed back to the
original), then what remains is the operational or lifecycle management
issue I explained earlier.
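
To make the "zookeeper as source-of-truth" rule concrete, here is a minimal
sketch of the check I have in mind. It is not the actual Bookie code: the
class name ZkCookieValidator, the method validate, and the modelling of a
cookie as a plain String are all assumptions for illustration only.

```java
import java.util.Map;
import java.util.Optional;

/** Hypothetical sketch; names and types are illustrative, not BookKeeper's API. */
final class ZkCookieValidator {

    /**
     * Validates local cookies against the cookie stored in ZooKeeper.
     *
     * @param zkCookie     the cookie registered in ZooKeeper, empty for a brand new bookie
     * @param localCookies one entry per configured directory (journal, ledger, index),
     *                     empty where no cookie file was found
     * @throws IllegalStateException if the bookie should refuse to start
     */
    static void validate(Optional<String> zkCookie,
                         Map<String, Optional<String>> localCookies) {
        for (Map.Entry<String, Optional<String>> e : localCookies.entrySet()) {
            Optional<String> local = e.getValue();
            if (!zkCookie.isPresent()) {
                // ZooKeeper has never seen this bookie: every directory must be cookie-free.
                if (local.isPresent()) {
                    throw new IllegalStateException("stale cookie in " + e.getKey());
                }
            } else if (!local.equals(zkCookie)) {
                // ZooKeeper knows this bookie: a missing or different cookie is an error.
                throw new IllegalStateException(
                        "cookie missing or mismatched in " + e.getKey());
            }
        }
    }
}
```

With this rule a wiped journal directory can never be mistaken for a new
environment, because the decision no longer depends on what the journal
directory itself contains.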

If a bookie fails to start because of a missing cookie, we should:

1. take it out of the cluster
2. run re-replication (autorecovery or manual recovery)
3. ensure no ledgers are using this bookie any more
4. reformat the bookie
5. add it back

This can be automated by hooking into a scheduler (like k8s or mesos). But
it requires some sort of lifecycle management in order to automate such
operations. BP-4:
https://cwiki.apache.org/confluence/display/BOOKKEEPER/BP-4+-+BookKeeper+Lifecycle+Management
is proposed for this purpose.


*Is the cookie enough?*

The cookie (if we revert the current behavior to the original behavior)
should be able to address most of the issues related to "incarnations".

There are still some corner cases that will violate correctness. They are
related to the "dangling writers" described in Ivan's first comment.

How can a writer tell whether the bookies changed or the ledger changed when
it gets network partitioned?

1) Bookie Changed.

A bookie can be reformatted and re-added to the cluster. Ivan and JV already
touched on this when discussing adding a UUID.

I think the UUID doesn't have to be part of the ledger metadata, because the
auditor and replication worker would use the lifecycle management for
managing the lifecycle of bookies.

But the connection should carry the UUID information.

Basically, when any bookie client connects to a bookie, it needs to carry the
namespace uuid and the bookie uuid to ensure it is connecting to the right
bookie. This would prevent "dangling writers" from connecting to bookies that
were reformatted and added back.
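
A rough sketch of that connection check, under the assumption that a bookie
gets a fresh UUID every time it is formatted. BookieIdentity,
acceptsHandshake and reformatted are hypothetical names; nothing like this
exists in the wire protocol today.

```java
import java.util.UUID;

/** Hypothetical sketch of an incarnation-aware handshake; not an existing API. */
final class BookieIdentity {
    private final UUID namespaceId;
    private final UUID bookieId;

    BookieIdentity(UUID namespaceId, UUID bookieId) {
        this.namespaceId = namespaceId;
        this.bookieId = bookieId;
    }

    /** The bookie accepts a connection only if the client expects this exact incarnation. */
    boolean acceptsHandshake(UUID expectedNamespaceId, UUID expectedBookieId) {
        return namespaceId.equals(expectedNamespaceId)
                && bookieId.equals(expectedBookieId);
    }

    /** Assumption: formatting a bookie assigns a new bookie UUID in the same namespace. */
    BookieIdentity reformatted() {
        return new BookieIdentity(namespaceId, UUID.randomUUID());
    }
}
```

A writer that was partitioned while the bookie was reformatted still holds
the old bookie uuid, so its handshake is rejected even though the address is
unchanged.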

2) Ledger Changed.

It is similar to the case that Ivan described. If a writer becomes
'network partitioned', and the ledger is deleted during this period, then
after the writer comes back, it can still successfully write entries to
the bookies, because the ledgers are already deleted and all the fencing
bits are gone.

This violates the expectation of "fencing", but I am not sure we need to
spend time on fixing this, because the ledger was already explicitly deleted
by the application. So I think the behavior should be categorized as
"undefined", just like "deleting a ledger when a writer is still writing
entries" is an undefined behavior.


To summarize my thoughts on this:

1. we need to revert the cookie behaviour to the original behavior and make
sure the cookie works as expected.
2. introduce a UUID or epoch in the cookie. The client connection should
carry the namespace uuid and bookie uuid when establishing the connection.
3. work on BP-4 to have complete lifecycle management for taking a bookie
out and adding a bookie back.

1 is the immediate fix, so that correct operations can still guarantee
correctness.

- Sijie



On Fri, Oct 6, 2017 at 9:35 AM, Venkateswara Rao Jujjuri <ju...@gmail.com>
wrote:

> > However, imagine that the fenced message is only in the journal on b2,
> > b2 crashes, something wipes the journal directory and then b2 comes
> > back up.
>
> In this case what happened?
> 1. We have WQ = 1
> 2. We had data loss (crash and comeup clean)
>
> But yeah, in addition to dataloss we have fencing violation too.
> The problem is not just wiped journal dir, but how we recognize the bookie.
> Bookie is just recognized by its ip address, not by its incarnation.
> Bookie1 at T1  (b1t1) ; and same bookie1 at T2 after bookie format (b1t2)
> should be two different bookies, isn;t it?
> this is needed for the replication worker and the auditor too.
>
> Also, bookie needs to know if the writer/reader is intended to read from
> b1t2 not from b1t1.
> Looks like we have a hole here? Or I may not be fully understanding cookie
> verification mechanism.
>
> Also as Ivan pointed out, we appear to think the lack of journal is
> implicitly a new bookie, but overall cluster doesn't differentiate between
> incarnations.
>
> Thanks,
> JV
>
>
>
>
>
> On Fri, Oct 6, 2017 at 8:46 AM, Ivan Kelly <iv...@apache.org> wrote:
>
> > > The case you described here is "almost correct". But there is an key
> > here:
> > > B2 can't startup itself if journal disk is wiped out, because the
> cookie
> > is
> > > missed.
> > This is what I expected to see, but isn't the case.
> > <snip>
> >       List<Cookie> journalCookies = Lists.newArrayList();
> >             // try to read cookie from journal directory.
> >             for (File journalDirectory : journalDirectories) {
> >                 try {
> >                     Cookie journalCookie =
> > Cookie.readFromDirectory(journalDirectory);
> >                     journalCookies.add(journalCookie);
> >                     if (journalCookie.isBookieHostCreatedFromIp()) {
> >                         conf.setUseHostNameAsBookieID(false);
> >                     } else {
> >                         conf.setUseHostNameAsBookieID(true);
> >                     }
> >                 } catch (FileNotFoundException fnf) {
> >                     newEnv = true;
> >                     missedCookieDirs.add(journalDirectory);
> >                 }
> >             }
> > </snip>
> >
> > So if a journal is missing the cookie, newEnv is set to true. This
> > disabled the later checks.
> >
> > > Hower it can still happen in a different case: bit flap. In your case,
> if
> > > fence bit in b2 is already persisted on disk, but it got corrupted.
> Then
> > it
> > > will cause the issue you described. One problem is we don't have
> checksum
> > > on the index file header when it stores those fence bits.
> > Yes, this is also an issue.
> >
> > -Ivan
> >
>
>
>
> --
> Jvrao
> ---
> First they ignore you, then they laugh at you, then they fight you, then
> you win. - Mahatma Gandhi
>

Re: Cookies and empty disks

Posted by Venkateswara Rao Jujjuri <ju...@gmail.com>.
> However, imagine that the fenced message is only in the journal on b2,
> b2 crashes, something wipes the journal directory and then b2 comes
> back up.

In this case what happened?
1. We have WQ = 1
2. We had data loss (crash and come up clean)

But yeah, in addition to data loss we have a fencing violation too.
The problem is not just the wiped journal dir, but how we recognize the bookie.
A bookie is just recognized by its ip address, not by its incarnation.
Bookie1 at T1 (b1t1) and the same bookie1 at T2 after bookie format (b1t2)
should be two different bookies, shouldn't they?
This is needed for the replication worker and the auditor too.

Also, the bookie needs to know if the writer/reader intends to read from
b1t2 and not from b1t1.
Looks like we have a hole here? Or I may not be fully understanding the
cookie verification mechanism.

Also, as Ivan pointed out, we appear to treat the lack of a journal as
implicitly meaning a new bookie, but the overall cluster doesn't
differentiate between incarnations.

Thanks,
JV





On Fri, Oct 6, 2017 at 8:46 AM, Ivan Kelly <iv...@apache.org> wrote:

> > The case you described here is "almost correct". But there is an key
> here:
> > B2 can't startup itself if journal disk is wiped out, because the cookie
> is
> > missed.
> This is what I expected to see, but isn't the case.
> <snip>
>       List<Cookie> journalCookies = Lists.newArrayList();
>             // try to read cookie from journal directory.
>             for (File journalDirectory : journalDirectories) {
>                 try {
>                     Cookie journalCookie =
> Cookie.readFromDirectory(journalDirectory);
>                     journalCookies.add(journalCookie);
>                     if (journalCookie.isBookieHostCreatedFromIp()) {
>                         conf.setUseHostNameAsBookieID(false);
>                     } else {
>                         conf.setUseHostNameAsBookieID(true);
>                     }
>                 } catch (FileNotFoundException fnf) {
>                     newEnv = true;
>                     missedCookieDirs.add(journalDirectory);
>                 }
>             }
> </snip>
>
> So if a journal is missing the cookie, newEnv is set to true. This
> disabled the later checks.
>
> > Hower it can still happen in a different case: bit flap. In your case, if
> > fence bit in b2 is already persisted on disk, but it got corrupted. Then
> it
> > will cause the issue you described. One problem is we don't have checksum
> > on the index file header when it stores those fence bits.
> Yes, this is also an issue.
>
> -Ivan
>



-- 
Jvrao
---
First they ignore you, then they laugh at you, then they fight you, then
you win. - Mahatma Gandhi

Re: Cookies and empty disks

Posted by Ivan Kelly <iv...@apache.org>.
> The case you described here is "almost correct". But there is an key here:
> B2 can't startup itself if journal disk is wiped out, because the cookie is
> missed.
This is what I expected to see, but it isn't the case.
<snip>
List<Cookie> journalCookies = Lists.newArrayList();
// try to read cookie from journal directory.
for (File journalDirectory : journalDirectories) {
    try {
        Cookie journalCookie = Cookie.readFromDirectory(journalDirectory);
        journalCookies.add(journalCookie);
        if (journalCookie.isBookieHostCreatedFromIp()) {
            conf.setUseHostNameAsBookieID(false);
        } else {
            conf.setUseHostNameAsBookieID(true);
        }
    } catch (FileNotFoundException fnf) {
        newEnv = true;
        missedCookieDirs.add(journalDirectory);
    }
}
</snip>

So if a journal is missing the cookie, newEnv is set to true. This
disables the later checks.
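
To illustrate the difference, here is a small sketch that models only the
newEnv decision, with one boolean per configured directory saying whether a
cookie file was found. The class NewEnvCheck and both method names are made
up for this mail. The first method mirrors what the snippet above does for
journal dirs, the second is one possible stricter rule, not a tested patch.

```java
import java.util.List;

/** Hypothetical model of the newEnv decision; not the actual Bookie code. */
final class NewEnvCheck {

    /** Mirrors the snippet: a single directory without a cookie flags a new environment. */
    static boolean newEnvIfAnyMissing(List<Boolean> cookiePresent) {
        return cookiePresent.contains(false);
    }

    /** Stricter alternative: a new environment only if no directory has a cookie. */
    static boolean newEnvIfAllMissing(List<Boolean> cookiePresent) {
        return !cookiePresent.isEmpty() && !cookiePresent.contains(true);
    }
}
```

For a bookie whose journal dir was wiped while its ledger dirs still hold
cookies, the first rule says "new environment" and the second does not. The
second rule alone still cannot tell a fully wiped bookie from a new one,
which is where a cookie stored outside the disks is needed.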

> Hower it can still happen in a different case: bit flap. In your case, if
> fence bit in b2 is already persisted on disk, but it got corrupted. Then it
> will cause the issue you described. One problem is we don't have checksum
> on the index file header when it stores those fence bits.
Yes, this is also an issue.

-Ivan

Re: Cookies and empty disks

Posted by Sijie Guo <gu...@gmail.com>.
On Oct 6, 2017 3:07 AM, "Ivan Kelly" <iv...@apache.org> wrote:

Hi folks,

Following up from the meeting yesterday, I said I would look into the
code to verify the behaviour because there could be a correctness
problem.

I think there could be an issue. The code is convoluted, but my
understanding of it is as follows.

We check all ledger, journal and index directories for a cookie. If it
doesn't exist, it gets added to a missingCookieDirs list. We then
iterate over this directory. If any directory in missingCookieDirs
isn't listed as a ledger directory in the journal dir cookies, or
isn't empty, we fail to start.

The issue is that a journal dir could be emptied and we wouldn't
detect it. It would be great if someone else could eyeball the code
and tell me I'm wrong. The code is in Bookie#checkEnvironment.

This breaks correctness. Imagine we have a ledger on b1, b2, b3.
Writer w1 is writing to the ledger.
The state of the ledger on the bookies is:

b1: e0     Fenced: false, LAC: -
b2: e0     Fenced: false, LAC: -
b3: e0     Fenced: false, LAC: -

w1 gets partitioned from network. w2 tries to recover the ledger, it
tries to fence on all bookies. The message to b3 gets lost. b1 and b2
acknowledge the fencing, so w2 continues to recover and close the
ledger with e0 as the last entry.

b1: e0     Fenced: true, LAC: e0
b2: e0     Fenced: true, LAC: e0
b3: e0     Fenced: false, LAC: -

If w1 became unpartitioned at this point, it wouldn't be able to add a
new entry to the ledger as any quorum would see fenced on b1 or b2.

However, imagine that the fenced message is only in the journal on b2,
b2 crashes, something wipes the journal directory and then b2 comes
back up.


The case you described here is "almost correct". But there is a key point
here: b2 can't start up if the journal disk is wiped out, because the cookie
is missing. So this is an operational issue or lifecycle management issue:

1) at twitter, when we took a bookie out for repair and before adding it
back, we typically made sure there were no ledgers referencing this bookie.
This is done by either auto or manual recovery.

2) we lack lifecycle management for taking a bookie out and adding it back
that would automate this. It has to guarantee that when a bookie is taken
out for repair, there are no ledgers referencing it before it is added
back.


The good thing about this case is that it only happens if you add a bookie
back by simply removing the cookie. Otherwise the cookie should do its job.

However, it can still happen in a different case: a bit flip. In your case,
if the fence bit in b2 is already persisted on disk but gets corrupted, it
will cause the issue you described. One problem is that we don't have a
checksum on the index file header where those fence bits are stored.

So I think there are two issues we can look into:

- enforce life cycle management for bookie.
- add checksum for index file headers.
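
For the second item, a minimal sketch of what a checksummed header could
look like. The layout here (ledger id, one fence byte, then a CRC32 stored
as a long) is invented for illustration and is not the real index file
header format. IndexHeaderChecksum, encode and isIntact are hypothetical
names; only java.util.zip.CRC32 and java.nio.ByteBuffer are real.

```java
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

/** Hypothetical header layout: 8-byte ledger id, 1-byte fence flag, 8-byte CRC32. */
final class IndexHeaderChecksum {
    private static final int PAYLOAD_LEN = 8 + 1;
    private static final int HEADER_LEN = PAYLOAD_LEN + 8;

    /** Serializes the header and appends a CRC32 computed over the payload bytes. */
    static byte[] encode(long ledgerId, boolean fenced) {
        ByteBuffer buf = ByteBuffer.allocate(HEADER_LEN);
        buf.putLong(ledgerId);
        buf.put((byte) (fenced ? 1 : 0));
        CRC32 crc = new CRC32();
        crc.update(buf.array(), 0, PAYLOAD_LEN);
        buf.putLong(crc.getValue());
        return buf.array();
    }

    /** Recomputes the CRC32 over the payload and compares it with the stored value. */
    static boolean isIntact(byte[] header) {
        if (header.length != HEADER_LEN) {
            return false;
        }
        CRC32 crc = new CRC32();
        crc.update(header, 0, PAYLOAD_LEN);
        long stored = ByteBuffer.wrap(header, PAYLOAD_LEN, 8).getLong();
        return stored == crc.getValue();
    }
}
```

If the fence byte flips on disk, isIntact returns false and the bookie can
refuse to serve the ledger instead of silently treating it as unfenced.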


The new state of the ledger on the bookies will be.

b1: e0     Fenced: true, LAC: e0
b2: e0     Fenced: false, LAC: -
b3: e0     Fenced: false, LAC: -

Now w1 can write a new entry, e1, and b2 & b3 would both acknowledge
it, even though the end of the ledger is e0.
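
The scenario can be replayed with a toy model. It assumes an ack quorum of 2
out of the 3 bookies and that a fenced bookie rejects every add from w1. The
class FencingScenario and its methods are made up for this illustration.

```java
/** Toy model of the scenario above; one boolean per bookie (b1, b2, b3). */
final class FencingScenario {

    /** Counts the bookies that would acknowledge an add, i.e. those not fenced. */
    static int acks(boolean[] fenced) {
        int count = 0;
        for (boolean f : fenced) {
            if (!f) {
                count++;
            }
        }
        return count;
    }

    /** An add succeeds once at least ackQuorum bookies acknowledge it. */
    static boolean addSucceeds(boolean[] fenced, int ackQuorum) {
        return acks(fenced) >= ackQuorum;
    }

    public static void main(String[] args) {
        boolean[] afterRecovery = {true, true, false};  // b1 and b2 fenced by w2
        boolean[] afterWipe = {true, false, false};     // b2 lost its fence bit
        System.out.println(addSucceeds(afterRecovery, 2)); // prints false
        System.out.println(addSucceeds(afterWipe, 2));     // prints true
    }
}
```

Losing the fence bit on a single bookie is enough to turn a rejected add
into an accepted one.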

It requires many planets to be aligned for it to harm us, but we must fix
this.

Regards,
Ivan