Posted to dev@kudu.apache.org by Mike Percy <mp...@apache.org> on 2016/04/05 09:14:36 UTC

Durability of appendable metadata for Log Block Manager

I'm working on solving an issue in the Log Block Manager (LBM) filed as
KUDU-1377. In this case, writing a partial record to the end of a container
metadata file can cause a failure to restart Kudu. One possible way for
this to happen is to run out of disk space when appending a block id to
this metadata file. I wanted to discuss a potential fix for this issue.

The LBM uses the protobuf container (PBC) file format documented in
pb_util.h for the container metadata file. The current version of this
format (V1) looks like this:


<magic number>
<container version>
repeated <record>


with each <record> looking like the following:

<uint32 data length>
<data bytes>
<data checksum>


In the LBM, each time we create a new block we append the block id to the
metadata file. On startup, we verify that all records in the file are
valid. If not, we print to the log and exit.

In the case of a full disk, we will have written a partial record to the
metadata file, and at startup we will fail validation; however, we should
be able to detect this case and ignore the partial record. Because we
still need to support deleting blocks, we must be able to continue
appending to this metadata file after startup, so we also need to truncate
the file to the last good record when this occurs.

Here is what I am thinking about to fix this issue:

1. When we are reading a container metadata file at startup, if we detect
that there is a trailing record that is too short to fit a valid record
(relative to the length of the file) then we truncate the last partial
record from the file and continue as normal.

2. To avoid truncating "good" records when the corruption is actually in
one of the length fields, we also need to extend the PBC format to add a
checksum for the record length. A record would then look like the
following:

<uint32 data length>
<length checksum>
<data bytes>
<data checksum>
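
To make the startup-validation idea concrete, here is a rough sketch of a
reader for this record layout that distinguishes a torn trailing record
(safe to truncate) from real corruption. This is illustrative only: the
function names, the in-memory "file" representation, and the plain CRC32
(standing in for Kudu's CRC32C) are assumptions, not the actual pb_util.h
code.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Stand-in checksum so the sketch is self-contained; Kudu actually uses
// CRC32C, this is the plain CRC32 polynomial.
uint32_t Crc32(const uint8_t* data, size_t n) {
  uint32_t crc = 0xFFFFFFFFu;
  for (size_t i = 0; i < n; ++i) {
    crc ^= data[i];
    for (int b = 0; b < 8; ++b)
      crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
  }
  return ~crc;
}

// Proposed record layout: <len><crc(len)><data><crc(data)>, so the
// smallest complete record (empty data) is 4 + 4 + 4 = 12 bytes.
enum class RecordStatus { kOk, kPartialTrailing, kCorrupt };

// Reads one record at *off. kPartialTrailing means the caller may
// truncate the file to *off and continue; kCorrupt means real damage.
RecordStatus ReadRecord(const std::vector<uint8_t>& f, size_t* off,
                        std::vector<uint8_t>* data) {
  size_t remaining = f.size() - *off;
  if (remaining < 12) return RecordStatus::kPartialTrailing;
  uint32_t len, len_crc;
  std::memcpy(&len, &f[*off], 4);
  std::memcpy(&len_crc, &f[*off + 4], 4);
  // A bad length checksum means we cannot trust 'len' at all, so we
  // cannot tell a torn write from corruption: report corruption.
  if (Crc32(&f[*off], 4) != len_crc) return RecordStatus::kCorrupt;
  // Length is trustworthy but the data was cut off: safe to truncate.
  if (remaining < 12 + static_cast<size_t>(len))
    return RecordStatus::kPartialTrailing;
  const uint8_t* body = &f[*off + 8];
  uint32_t data_crc;
  std::memcpy(&data_crc, body + len, 4);
  if (Crc32(body, len) != data_crc) return RecordStatus::kCorrupt;
  data->assign(body, body + len);
  *off += 12 + len;
  return RecordStatus::kOk;
}

// Writer side, only to exercise the reader.
void AppendRecord(std::vector<uint8_t>* f, const std::vector<uint8_t>& d) {
  auto put32 = [f](uint32_t v) {
    uint8_t b[4];
    std::memcpy(b, &v, 4);
    f->insert(f->end(), b, b + 4);
  };
  uint32_t len = static_cast<uint32_t>(d.size());
  uint8_t lenb[4];
  std::memcpy(lenb, &len, 4);
  put32(len);
  put32(Crc32(lenb, 4));
  f->insert(f->end(), d.begin(), d.end());
  put32(Crc32(d.data(), d.size()));
}
```

With this layout, a record torn anywhere (mid-length, mid-data, or
mid-checksum) comes back as kPartialTrailing rather than kCorrupt, so
startup can truncate to the last good record and keep appending.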


Does anyone see any drawbacks to this approach?

If you made it this far, thanks for reading.

Mike

Re: Durability of appendable metadata for Log Block Manager

Posted by Mike Percy <mp...@apache.org>.
On Tue, Apr 12, 2016 at 3:37 PM, Dan Burkert <da...@cloudera.com> wrote:

> Sounds good to me.  It should also be possible to begin preallocating in
> the future if we see a need, since it's the same on disk format, right?
>

Yeah. We'd likely just want to change the validation rules.


> The only other concern about preallocation is making sure that the length
> seed for a new record is never 0, since you could read a string of null
> bytes as repeated 0 length records.  E.G., assuming CRC(null bytes, 0) == 0


This assumption doesn't appear to hold. I wasn't sure, so I checked. This
is the test I wrote (which fails):

TEST_F(CrcTest, TestCRC32CZeroSeed) {
  uint64_t zero = 0;
  ASSERT_EQ(0, Crc32c(&zero, sizeof(zero)));
}

The output, on my machine, is:

[ RUN      ] CrcTest.TestCRC32CZeroSeed
../../src/kudu/util/crc-test.cc:92: Failure
Value of: Crc32c(&zero, sizeof(zero))
  Actual: 2351477386
Expected: 0

So I think we are safe not chaining the CRCs, and not changing the CRC
seed, unless we think that it's very important to enforce ordering.

Mike

Re: Durability of appendable metadata for Log Block Manager

Posted by Dan Burkert <da...@cloudera.com>.
Sounds good to me.  It should also be possible to begin preallocating in
the future if we see a need, since it's the same on disk format, right?
The only other concern about preallocation is making sure that the length
seed for a new record is never 0, since you could read a string of null
bytes as repeated 0-length records. E.g., assuming CRC(null bytes, 0) ==
0, and a preallocated file with a single record:

[header]
[record1 length] = 10
[CRC(record 1 length, <initial seed>)] = 0x01234567
[record1 data] = 0x01ab...
[CRC(record1 data, 0x01234567)] = 0x00000000
0x00 // beginning of preallocated trailing 0 bytes
0x00
0x00
...

My understanding is that if the server wrote this file (in which the CRC of
the record1 data just happens to be 0), when it comes back up it will read
0 length records from the preallocated remainder of the file.  I haven't
actually checked the CRC(null bytes, 0) == 0 assumption, though.
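
The hazard described here can be shown with a toy bitwise CRC32 that takes
a caller-chosen seed and applies no final XOR (this is not Kudu's CRC32C
implementation, just an illustration of the fixed point): with a zero
seed, a run of zero bytes checksums to zero, so a zeroed preallocated tail
would parse as an endless sequence of "valid" zero-length records.

```cpp
#include <cstddef>
#include <cstdint>

// Toy bitwise CRC32 with a configurable seed and no final XOR; used
// only to illustrate the zero-seed fixed point, not Kudu's CRC32C.
uint32_t ToyCrc32(const uint8_t* p, size_t n, uint32_t seed) {
  uint32_t crc = seed;
  for (size_t i = 0; i < n; ++i) {
    crc ^= p[i];
    for (int b = 0; b < 8; ++b)
      crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
  }
  return crc;
}
```

With seed 0, feeding zero bytes leaves the register at 0 forever, which is
exactly the "repeated 0-length records" trap; a nonzero seed (or a final
XOR step, as standard CRC32 implementations apply) removes the all-zero
fixed point.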

- Dan

On Tue, Apr 12, 2016 at 3:09 PM, Mike Percy <mp...@apache.org> wrote:

> Sorry for the delay in responding and thanks Dan for enumerating the error
> cases. Responses below:
>
> > 1) Header magic number does not match
> Unrecoverable (corruption).
>
> > 2) Container version is not recognized
> Unrecoverable (corruption).
>
> > 3) Record length field is 0 (perhaps this isn't an error).
> Yeah, I just checked, and this is not an error. The case is writing a
> message with an optional field that is not present. The resulting
> serialized message is 0-length.
>
> > 4) Record length field is greater than the remaining bytes in the file.
> Recoverable, assuming the CRC checks out. We can truncate.
>
> > 5) Record length CRC does not match (if we go ahead and add a CRC for the
> length).
> Unrecoverable (corruption) -- assuming the file has enough length to store
> the full CRC.
>
> > 6) Record data CRC does not match.
> Unrecoverable (corruption).
>
> > 7) After the header or final record, there are more than 0 bytes
> remaining in the file, but not enough for the length and CRC(length)
> fields.
> Recoverable, we can truncate. Actually, assuming we're doing <len,
> crc(len), data, crc(data)> and len and crc fields are each 4 bytes long, we
> can say if len(remaining bytes) > 0 && < 12 then we can safely truncate.
>
> Additionally, I'll add some additional scenarios as well, which only occur
> when preallocating space:
>
> 8) Record data CRC does not match, but it is the last record in the file.
>
> We may choose to treat this as not corruption, but a partial write that
> could be truncated. This is only a valid assumption if we are preallocating
> the size of the record before writing the data (or if we assume file
> metadata is updated before the data is fsynced). If we don't preallocate,
> for example with ext4 data mode=ordered this should never occur normally
> and should be treated as an error (same as #6).
>
> 9) Same as #8, but followed by a long run of 0s at the end of the file.
>
> Another case that only occurs with preallocation, but this time we assume
> we're preallocating more than a single record at a time. We could
> preallocate 64MB at a time, for example. If we get a partial write, we'll
> end up with some bytes followed by zeroes. Let's assume a data corruption
> issue where several bytes got corrupted to 0 on-disk, and further that
> these are trailing bytes on the record. Unfortunately, in this case we
> don't have any way to detect whether we fsynced the write to the file and
> then that record was corrupted or whether we wrote only a partial record to
> disk and the fsync never completed. The same problem holds in the case of a
> series of trailing records if we use a "scan forward" approach (see
> KUDU-1414).
>
> If we want to preallocate for performance reasons (we may call fdatasync()
> for durability instead of fsync(), thus avoiding syncing file metadata), I
> think we have to accept that corruption of trailing zeros will be
> undetectable and that data loss may occur in this (probably very rare) case
> if we treat this situation as a partially written record that may be
> truncated. To minimize the problem surface area of this heuristic we can
> enforce that only the last record may be corrupt, and must be followed by
> zeroes. Otherwise, we should consider the file corrupt, treating it the
> same as case #6.
>
> *However, if we do not preallocate, then scenarios 1-7 are exhaustive and
> we shouldn't have any heuristic cases.* Personally, I think that's
> preferable for the log block metadata, which isn't getting written to as
> much as, say, the consensus WAL. Without profiling data suggesting this is
> a bottleneck, I'd rather err on the side of durability. So for now, I'm
> assuming we're not going to be preallocating space in the PBC files (this
> is not a change from what we do today).
>
> Mike
>
> On Tue, Apr 5, 2016 at 10:42 AM, Dan Burkert <da...@cloudera.com> wrote:
>
> > 7) After the header or final record, there are more than 0 bytes
> remaining
> > in the file, but not enough for the length and CRC(length) fields.
> >
> > On Tue, Apr 5, 2016 at 10:34 AM, Dan Burkert <da...@cloudera.com> wrote:
> >
> > > Perhaps not related to the issue exactly, but there is a pretty good
> > slide
> > > deck on CRCs here:
> > > https://users.ece.cmu.edu/~koopman/pubs/KoopmanCRCWebinar9May2012.pdf.
> > > Starting on slide 38 it has some do's and dont's, including:
> > >
> > > 1) Use a CRC over the length field, and then the data field, otherwise
> a
> > 1
> > > bit error in the length field may go undetected.  Adar also brought
> this
> > up.
> > > 2) Use a seed other than 0. This prevents zeroed runs in the file from
> > > passing the CRC check.
> > > 3) "Be careful with bit ordering" - I'm not sure exactly how this
> applies
> > > to our use case.
> > > 4) The slide deck doesn't say this - but perhaps we should be chaining
> > the
> > > CRCs by using the previous CRC as the seed for the next. I'm still
> trying
> > > to find some evidence that this is a beneficial thing to do.
> > >
> > > It's probably worth thinking about these things if we are going to
> revise
> > > the format.
> > >
> > > Back to the issue at hand, I think it may help to enumerate the
> possible
> > > failure scenarios.  Here are the failures I can think of:
> > >
> > > 1) Header magic number does not match
> > > 2) Container version is not recognized
> > > 3) Record length field is 0 (perhaps this isn't an error).
> > > 4) Record length field is greater than the remaining bytes in the file.
> > > 5) Record length CRC does not match (if we go ahead and add a CRC for
> the
> > > length).
> > > 6) Record data CRC does not match.
> > >
> > > From what I understand, it's #4 that is the motivator behind KUDU-1377,
> > > right?
> > >
> > > - Dan
> > >
> > > On Tue, Apr 5, 2016 at 1:07 AM, Mike Percy <mp...@apache.org> wrote:
> > >
> > >> It looks to me like the WAL suffers from an issue very similar to
> > >> KUDU-1377, where we could be more forgiving when a partial record is
> > >> written. The WAL will rebuild a segment footer for you, but it won't
> > >> ignore
> > >> a partial record. However it's less likely because the WAL uses
> > >> preallocation.
> > >>
> > >> Mike
> > >>
> > >> On Tue, Apr 5, 2016 at 12:38 AM, Adar Dembo <ad...@cloudera.com>
> wrote:
> > >>
> > >> > Quick note: the existing checksum is actually of the length AND the
> > >> bytes,
> > >> > not just the bytes. But that doesn't help solve your problem,
> because
> > >> the
> > >> > checksum follows the data, which means if the length is corrupted,
> you
> > >> > don't really know where the checksum lies.
> > >> >
> > >> > Isn't the solution you're describing effectively the same as what's
> > >> > described in KUDU-668? That closely mirrors what the WAL does, which
> > I'd
> > >> > argue is a good thing (since it's tried and true).
> > >> >
> > >> > On Tue, Apr 5, 2016 at 12:14 AM, Mike Percy <mp...@apache.org>
> > wrote:
> > >> >
> > >> > > I'm working on solving an issue in the Log Block Manager (LBM)
> filed
> > >> as
> > >> > > KUDU-1377. In this case, writing a partial record to the end of a
> > >> > container
> > >> > > metadata file can cause a failure to restart Kudu. One possible
> > >> way
> > >> > for
> > >> > > this to happen is to run out of disk space when appending a block
> id
> > >> to
> > >> > > this metadata file. I wanted to discuss a potential fix for this
> > >> issue.
> > >> > >
> > >> > > The LBM uses the protobuf container (PBC) file format documented
> in
> > >> > > pb_util.h for the container metadata file. The current version of
> > this
> > >> > > format (V1) looks like this:
> > >> > >
> > >> > >
> > >> > > <magic number>
> > >> > > <container version>
> > >> > > repeated <record>
> > >> > >
> > >> > >
> > >> > > with each <record> looking like the following:
> > >> > >
> > >> > > <uint32 data length>
> > >> > > <data bytes>
> > >> > > <data checksum>
> > >> > >
> > >> > >
> > >> > > In the LBM, each time we create a new block we append the block id
> > to
> > >> the
> > >> > > metadata file. On startup, we verify that all records in the file
> > are
> > >> > > valid. If not, we print to the log and exit.
> > >> > >
> > >> > > In the case of a full disk, we will have written a partial record
> to
> > >> the
> > >> > > metadata file and at startup we will fail validation, however we
> > >> should
> > >> > be
> > >> > > able to detect this case and ignore the partial record on startup.
> > >> > Because
> > >> > > we still need to support deleting blocks, we need to be able to
> > >> continue
> > >> > > appending to this metadata file after startup, so we also need to
> > >> > truncate
> > >> > > the file to the last good record when this occurs.
> > >> > >
> > >> > > Here is what I am thinking about to fix this issue:
> > >> > >
> > >> > > 1. When we are reading a container metadata file at startup, if we
> > >> detect
> > >> > > that there is a trailing record that is too short to fit a valid
> > >> record
> > >> > > (relative to the length of the file) then we truncate the last
> > partial
> > >> > > record from the file and continue as normal.
> > >> > >
> > >> > > 2. To avoid truncating "good" records in the case that there is
> data
> > >> > > corruption in one of the length fields, we also need to extend the
> > PBC
> > >> > > format to add a checksum for the record length. So a record would
> > now
> > >> > look
> > >> > > like the following:
> > >> > >
> > >> > > <uint32 data length>
> > >> > >
> > >> > > <length checksum>
> > >> > >
> > >> > > <data bytes>
> > >> > > <data checksum>
> > >> > >
> > >> > >
> > >> > > Does anyone see any drawbacks to this approach?
> > >> > >
> > >> > > If you made it this far, thanks for reading.
> > >> > >
> > >> > > Mike
> > >> > >
> > >> >
> > >>
> > >
> > >
> >
>

Re: Durability of appendable metadata for Log Block Manager

Posted by Mike Percy <mp...@apache.org>.
Sorry for the delay in responding and thanks Dan for enumerating the error
cases. Responses below:

> 1) Header magic number does not match
Unrecoverable (corruption).

> 2) Container version is not recognized
Unrecoverable (corruption).

> 3) Record length field is 0 (perhaps this isn't an error).
Yeah, I just checked, and this is not an error. The case is writing a
message with an optional field that is not present. The resulting
serialized message is 0-length.

> 4) Record length field is greater than the remaining bytes in the file.
Recoverable, assuming the CRC checks out. We can truncate.

> 5) Record length CRC does not match (if we go ahead and add a CRC for the
length).
Unrecoverable (corruption) -- assuming the file has enough length to store
the full CRC.

> 6) Record data CRC does not match.
Unrecoverable (corruption).

> 7) After the header or final record, there are more than 0 bytes
remaining in the file, but not enough for the length and CRC(length) fields.
Recoverable, we can truncate. Actually, assuming we're doing <len,
crc(len), data, crc(data)> and the len and crc fields are each 4 bytes
long, we can say that if 0 < len(remaining bytes) < 12, then we can safely
truncate.
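
That bound for case 7 can be written down directly (a sketch; the 12-byte
figure assumes the 4-byte length and CRC fields described above):

```cpp
#include <cstddef>

// With <len, crc(len), data, crc(data)> and 4-byte len/CRC fields, the
// smallest complete record (empty data) is 12 bytes, so a nonempty
// remainder shorter than that can only be a torn append.
constexpr size_t kMinRecordSize = 4 + 4 + 4;

bool CanSafelyTruncateRemainder(size_t remaining) {
  return remaining > 0 && remaining < kMinRecordSize;
}
```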

Additionally, here are some scenarios that only occur when preallocating
space:

8) Record data CRC does not match, but it is the last record in the file.

We may choose to treat this as not corruption, but a partial write that
could be truncated. This is only a valid assumption if we are preallocating
the size of the record before writing the data (or if we assume file
metadata is updated before the data is fsynced). If we don't preallocate,
for example with ext4 data mode=ordered this should never occur normally
and should be treated as an error (same as #6).

9) Same as #8, but followed by a long run of 0s at the end of the file.

Another case that only occurs with preallocation, but this time we assume
we're preallocating more than a single record at a time. We could
preallocate 64MB at a time, for example. If we get a partial write, we'll
end up with some bytes followed by zeroes. Let's assume a data corruption
issue where several bytes got corrupted to 0 on-disk, and further that
these are trailing bytes on the record. Unfortunately, in this case we
don't have any way to detect whether we fsynced the write to the file and
then that record was corrupted or whether we wrote only a partial record to
disk and the fsync never completed. The same problem holds in the case of a
series of trailing records if we use a "scan forward" approach (see
KUDU-1414).

If we want to preallocate for performance reasons (we may call fdatasync()
for durability instead of fsync(), thus avoiding syncing file metadata), I
think we have to accept that corruption of trailing zeros will be
undetectable and that data loss may occur in this (probably very rare) case
if we treat this situation as a partially written record that may be
truncated. To minimize the problem surface area of this heuristic we can
enforce that only the last record may be corrupt, and must be followed by
zeroes. Otherwise, we should consider the file corrupt, treating it the
same as case #6.
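
The "must be followed by zeroes" guard from scenario 9 could be sketched
as a simple check (a hypothetical helper, not existing Kudu code): given
the offset where the CRC-failing trailing record claims to end, everything
from there to EOF must be zero, i.e. untouched preallocated space, before
we allow truncation.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Scenario-9 heuristic (sketch): a CRC-failing final record may be
// treated as a torn write only if every byte after its claimed extent
// is zero; any nonzero byte there means real corruption (case #6).
bool MayTreatAsTornWrite(const std::vector<uint8_t>& file,
                         size_t bad_record_end) {
  return std::all_of(file.begin() + bad_record_end, file.end(),
                     [](uint8_t b) { return b == 0; });
}
```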

*However, if we do not preallocate, then scenarios 1-7 are exhaustive and
we shouldn't have any heuristic cases.* Personally, I think that's
preferable for the log block metadata, which isn't getting written to as
much as, say, the consensus WAL. Without profiling data suggesting this is
a bottleneck, I'd rather err on the side of durability. So for now, I'm
assuming we're not going to be preallocating space in the PBC files (this
is not a change from what we do today).

Mike

On Tue, Apr 5, 2016 at 10:42 AM, Dan Burkert <da...@cloudera.com> wrote:

> 7) After the header or final record, there are more than 0 bytes remaining
> in the file, but not enough for the length and CRC(length) fields.
>
> On Tue, Apr 5, 2016 at 10:34 AM, Dan Burkert <da...@cloudera.com> wrote:
>
> > Perhaps not related to the issue exactly, but there is a pretty good
> slide
> > deck on CRCs here:
> > https://users.ece.cmu.edu/~koopman/pubs/KoopmanCRCWebinar9May2012.pdf.
> > Starting on slide 38 it has some do's and dont's, including:
> >
> > 1) Use a CRC over the length field, and then the data field, otherwise a
> 1
> > bit error in the length field may go undetected.  Adar also brought this
> up.
> > 2) Use a seed other than 0. This prevents zeroed runs in the file from
> > passing the CRC check.
> > 3) "Be careful with bit ordering" - I'm not sure exactly how this applies
> > to our use case.
> > 4) The slide deck doesn't say this - but perhaps we should be chaining
> the
> > CRCs by using the previous CRC as the seed for the next. I'm still trying
> > to find some evidence that this is a beneficial thing to do.
> >
> > It's probably worth thinking about these things if we are going to revise
> > the format.
> >
> > Back to the issue at hand, I think it may help to enumerate the possible
> > failure scenarios.  Here are the failures I can think of:
> >
> > 1) Header magic number does not match
> > 2) Container version is not recognized
> > 3) Record length field is 0 (perhaps this isn't an error).
> > 4) Record length field is greater than the remaining bytes in the file.
> > 5) Record length CRC does not match (if we go ahead and add a CRC for the
> > length).
> > 6) Record data CRC does not match.
> >
> > From what I understand, it's #4 that is the motivator behind KUDU-1377,
> > right?
> >
> > - Dan
> >
> > On Tue, Apr 5, 2016 at 1:07 AM, Mike Percy <mp...@apache.org> wrote:
> >
> >> It looks to me like the WAL suffers from an issue very similar to
> >> KUDU-1377, where we could be more forgiving when a partial record is
> >> written. The WAL will rebuild a segment footer for you, but it won't
> >> ignore
> >> a partial record. However it's less likely because the WAL uses
> >> preallocation.
> >>
> >> Mike
> >>
> >> On Tue, Apr 5, 2016 at 12:38 AM, Adar Dembo <ad...@cloudera.com> wrote:
> >>
> >> > Quick note: the existing checksum is actually of the length AND the
> >> bytes,
> >> > not just the bytes. But that doesn't help solve your problem, because
> >> the
> >> > checksum follows the data, which means if the length is corrupted, you
> >> > don't really know where the checksum lies.
> >> >
> >> > Isn't the solution you're describing effectively the same as what's
> >> > described in KUDU-668? That closely mirrors what the WAL does, which
> I'd
> >> > argue is a good thing (since it's tried and true).
> >> >
> >> > On Tue, Apr 5, 2016 at 12:14 AM, Mike Percy <mp...@apache.org>
> wrote:
> >> >
> >> > > I'm working on solving an issue in the Log Block Manager (LBM) filed
> >> as
> >> > > KUDU-1377. In this case, writing a partial record to the end of a
> >> > container
> >> > > metadata file can cause a failure to restart Kudu. One possible
> >> way
> >> > for
> >> > > this to happen is to run out of disk space when appending a block id
> >> to
> >> > > this metadata file. I wanted to discuss a potential fix for this
> >> issue.
> >> > >
> >> > > The LBM uses the protobuf container (PBC) file format documented in
> >> > > pb_util.h for the container metadata file. The current version of
> this
> >> > > format (V1) looks like this:
> >> > >
> >> > >
> >> > > <magic number>
> >> > > <container version>
> >> > > repeated <record>
> >> > >
> >> > >
> >> > > with each <record> looking like the following:
> >> > >
> >> > > <uint32 data length>
> >> > > <data bytes>
> >> > > <data checksum>
> >> > >
> >> > >
> >> > > In the LBM, each time we create a new block we append the block id
> to
> >> the
> >> > > metadata file. On startup, we verify that all records in the file
> are
> >> > > valid. If not, we print to the log and exit.
> >> > >
> >> > > In the case of a full disk, we will have written a partial record to
> >> the
> >> > > metadata file and at startup we will fail validation, however we
> >> should
> >> > be
> >> > > able to detect this case and ignore the partial record on startup.
> >> > Because
> >> > > we still need to support deleting blocks, we need to be able to
> >> continue
> >> > > appending to this metadata file after startup, so we also need to
> >> > truncate
> >> > > the file to the last good record when this occurs.
> >> > >
> >> > > Here is what I am thinking about to fix this issue:
> >> > >
> >> > > 1. When we are reading a container metadata file at startup, if we
> >> detect
> >> > > that there is a trailing record that is too short to fit a valid
> >> record
> >> > > (relative to the length of the file) then we truncate the last
> partial
> >> > > record from the file and continue as normal.
> >> > >
> >> > > 2. To avoid truncating "good" records in the case that there is data
> >> > > corruption in one of the length fields, we also need to extend the
> PBC
> >> > > format to add a checksum for the record length. So a record would
> now
> >> > look
> >> > > like the following:
> >> > >
> >> > > <uint32 data length>
> >> > >
> >> > > <length checksum>
> >> > >
> >> > > <data bytes>
> >> > > <data checksum>
> >> > >
> >> > >
> >> > > Does anyone see any drawbacks to this approach?
> >> > >
> >> > > If you made it this far, thanks for reading.
> >> > >
> >> > > Mike
> >> > >
> >> >
> >>
> >
> >
>

Re: Durability of appendable metadata for Log Block Manager

Posted by Dan Burkert <da...@cloudera.com>.
7) After the header or final record, there are more than 0 bytes remaining
in the file, but not enough for the length and CRC(length) fields.

On Tue, Apr 5, 2016 at 10:34 AM, Dan Burkert <da...@cloudera.com> wrote:

> Perhaps not related to the issue exactly, but there is a pretty good slide
> deck on CRCs here:
> https://users.ece.cmu.edu/~koopman/pubs/KoopmanCRCWebinar9May2012.pdf.
> Starting on slide 38 it has some do's and dont's, including:
>
> 1) Use a CRC over the length field, and then the data field, otherwise a 1
> bit error in the length field may go undetected.  Adar also brought this up.
> 2) Use a seed other than 0. This prevents zeroed runs in the file from
> passing the CRC check.
> 3) "Be careful with bit ordering" - I'm not sure exactly how this applies
> to our use case.
> 4) The slide deck doesn't say this - but perhaps we should be chaining the
> CRCs by using the previous CRC as the seed for the next. I'm still trying
> to find some evidence that this is a beneficial thing to do.
>
> It's probably worth thinking about these things if we are going to revise
> the format.
>
> Back to the issue at hand, I think it may help to enumerate the possible
> failure scenarios.  Here are the failures I can think of:
>
> 1) Header magic number does not match
> 2) Container version is not recognized
> 3) Record length field is 0 (perhaps this isn't an error).
> 4) Record length field is greater than the remaining bytes in the file.
> 5) Record length CRC does not match (if we go ahead and add a CRC for the
> length).
> 6) Record data CRC does not match.
>
> From what I understand, it's #4 that is the motivator behind KUDU-1377,
> right?
>
> - Dan
>
> On Tue, Apr 5, 2016 at 1:07 AM, Mike Percy <mp...@apache.org> wrote:
>
>> It looks to me like the WAL suffers from an issue very similar to
>> KUDU-1377, where we could be more forgiving when a partial record is
>> written. The WAL will rebuild a segment footer for you, but it won't
>> ignore
>> a partial record. However it's less likely because the WAL uses
>> preallocation.
>>
>> Mike
>>
>> On Tue, Apr 5, 2016 at 12:38 AM, Adar Dembo <ad...@cloudera.com> wrote:
>>
>> > Quick note: the existing checksum is actually of the length AND the
>> bytes,
>> > not just the bytes. But that doesn't help solve your problem, because
>> the
>> > checksum follows the data, which means if the length is corrupted, you
>> > don't really know where the checksum lies.
>> >
>> > Isn't the solution you're describing effectively the same as what's
>> > described in KUDU-668? That closely mirrors what the WAL does, which I'd
>> > argue is a good thing (since it's tried and true).
>> >
>> > On Tue, Apr 5, 2016 at 12:14 AM, Mike Percy <mp...@apache.org> wrote:
>> >
>> > > I'm working on solving an issue in the Log Block Manager (LBM) filed
>> as
>> > > KUDU-1377. In this case, writing a partial record to the end of a
>> > container
>> > > metadata file can cause a failure to restart Kudu. One possible
>> way
>> > for
>> > > this to happen is to run out of disk space when appending a block id
>> to
>> > > this metadata file. I wanted to discuss a potential fix for this
>> issue.
>> > >
>> > > The LBM uses the protobuf container (PBC) file format documented in
>> > > pb_util.h for the container metadata file. The current version of this
>> > > format (V1) looks like this:
>> > >
>> > >
>> > > <magic number>
>> > > <container version>
>> > > repeated <record>
>> > >
>> > >
>> > > with each <record> looking like the following:
>> > >
>> > > <uint32 data length>
>> > > <data bytes>
>> > > <data checksum>
>> > >
>> > >
>> > > In the LBM, each time we create a new block we append the block id to
>> the
>> > > metadata file. On startup, we verify that all records in the file are
>> > > valid. If not, we print to the log and exit.
>> > >
>> > > In the case of a full disk, we will have written a partial record to
>> the
>> > > metadata file and at startup we will fail validation, however we
>> should
>> > be
>> > > able to detect this case and ignore the partial record on startup.
>> > Because
>> > > we still need to support deleting blocks, we need to be able to
>> continue
>> > > appending to this metadata file after startup, so we also need to
>> > truncate
>> > > the file to the last good record when this occurs.
>> > >
>> > > Here is what I am thinking about to fix this issue:
>> > >
>> > > 1. When we are reading a container metadata file at startup, if we
>> detect
>> > > that there is a trailing record that is too short to fit a valid
>> record
>> > > (relative to the length of the file) then we truncate the last partial
>> > > record from the file and continue as normal.
>> > >
>> > > 2. To avoid truncating "good" records in the case that there is data
>> > > corruption in one of the length fields, we also need to extend the PBC
>> > > format to add a checksum for the record length. So a record would now
>> > look
>> > > like the following:
>> > >
>> > > <uint32 data length>
>> > >
>> > > <length checksum>
>> > >
>> > > <data bytes>
>> > > <data checksum>
>> > >
>> > >
>> > > Does anyone see any drawbacks to this approach?
>> > >
>> > > If you made it this far, thanks for reading.
>> > >
>> > > Mike
>> > >
>> >
>>
>
>

Re: Durability of appendable metadata for Log Block Manager

Posted by Dan Burkert <da...@cloudera.com>.
Perhaps not related to the issue exactly, but there is a pretty good slide
deck on CRCs here:
https://users.ece.cmu.edu/~koopman/pubs/KoopmanCRCWebinar9May2012.pdf.
Starting on slide 38, it has some dos and don'ts, including:

1) Use a CRC over the length field, and then the data field, otherwise a 1
bit error in the length field may go undetected.  Adar also brought this up.
2) Use a seed other than 0. This prevents zeroed runs in the file from
passing the CRC check.
3) "Be careful with bit ordering" - I'm not sure exactly how this applies
to our use case.
4) The slide deck doesn't say this - but perhaps we should be chaining the
CRCs by using the previous CRC as the seed for the next. I'm still trying
to find some evidence that this is a beneficial thing to do.

It's probably worth thinking about these things if we are going to revise
the format.

Back to the issue at hand, I think it may help to enumerate the possible
failure scenarios.  Here are the failures I can think of:

1) Header magic number does not match
2) Container version is not recognized
3) Record length field is 0 (perhaps this isn't an error).
4) Record length field is greater than the remaining bytes in the file.
5) Record length CRC does not match (if we go ahead and add a CRC for the
length).
6) Record data CRC does not match.

From what I understand, it's #4 that is the motivator behind KUDU-1377,
right?

- Dan

On Tue, Apr 5, 2016 at 1:07 AM, Mike Percy <mp...@apache.org> wrote:

> It looks to me like the WAL suffers from an issue very similar to
> KUDU-1377, where we could be more forgiving when a partial record is
> written. The WAL will rebuild a segment footer for you, but it won't ignore
> a partial record. However it's less likely because the WAL uses
> preallocation.
>
> Mike
>
> On Tue, Apr 5, 2016 at 12:38 AM, Adar Dembo <ad...@cloudera.com> wrote:
>
> > Quick note: the existing checksum is actually of the length AND the
> bytes,
> > not just the bytes. But that doesn't help solve your problem, because the
> > checksum follows the data, which means if the length is corrupted, you
> > don't really know where the checksum lies.
> >
> > Isn't the solution you're describing effectively the same as what's
> > described in KUDU-668? That closely mirrors what the WAL does, which I'd
> > argue is a good thing (since it's tried and true).
> >
> > On Tue, Apr 5, 2016 at 12:14 AM, Mike Percy <mp...@apache.org> wrote:
> >
> > > I'm working on solving an issue in the Log Block Manager (LBM) filed as
> > > KUDU-1377. In this case, writing a partial record to the end of a
> > container
> > > metadata file can cause a failure to restart Kudu. One possible way
> > for
> > > this to happen is to run out of disk space when appending a block id to
> > > this metadata file. I wanted to discuss a potential fix for this issue.
> > >
> > > The LBM uses the protobuf container (PBC) file format documented in
> > > pb_util.h for the container metadata file. The current version of this
> > > format (V1) looks like this:
> > >
> > >
> > > <magic number>
> > > <container version>
> > > repeated <record>
> > >
> > >
> > > with each <record> looking like the following:
> > >
> > > <uint32 data length>
> > > <data bytes>
> > > <data checksum>
> > >
> > >
> > > In the LBM, each time we create a new block we append the block id to
> the
> > > metadata file. On startup, we verify that all records in the file are
> > > valid. If not, we print to the log and exit.
> > >
> > > In the case of a full disk, we will have written a partial record to the
> > > metadata file, and at startup we will fail validation. However, we should
> > > be able to detect this case and ignore the partial record on startup.
> > Because
> > > we still need to support deleting blocks, we need to be able to
> continue
> > > appending to this metadata file after startup, so we also need to
> > truncate
> > > the file to the last good record when this occurs.
> > >
> > > Here is what I am thinking about to fix this issue:
> > >
> > > 1. When we are reading a container metadata file at startup, if we
> detect
> > > that there is a trailing record that is too short to fit a valid record
> > > (relative to the length of the file) then we truncate the last partial
> > > record from the file and continue as normal.
> > >
> > > 2. To avoid truncating "good" records in the case that there is data
> > > corruption in one of the length fields, we also need to extend the PBC
> > > format to add a checksum for the record length. So a record would now
> > look
> > > like the following:
> > >
> > > <uint32 data length>
> > >
> > > <length checksum>
> > >
> > > <data bytes>
> > > <data checksum>
> > >
> > >
> > > Does anyone see any drawbacks to this approach?
> > >
> > > If you made it this far, thanks for reading.
> > >
> > > Mike
> > >
> >
>

Re: Durability of appendable metadata for Log Block Manager

Posted by Mike Percy <mp...@apache.org>.
It looks to me like the WAL suffers from an issue very similar to
KUDU-1377, where we could be more forgiving when a partial record is
written. The WAL will rebuild a segment footer for you, but it won't ignore
a partial record. However, it's less likely because the WAL uses
preallocation.

Mike

On Tue, Apr 5, 2016 at 12:38 AM, Adar Dembo <ad...@cloudera.com> wrote:

> Quick note: the existing checksum is actually of the length AND the bytes,
> not just the bytes. But that doesn't help solve your problem, because the
> checksum follows the data, which means if the length is corrupted, you
> don't really know where the checksum lies.
>
> Isn't the solution you're describing effectively the same as what's
> described in KUDU-668? That closely mirrors what the WAL does, which I'd
> argue is a good thing (since it's tried and true).
>
> On Tue, Apr 5, 2016 at 12:14 AM, Mike Percy <mp...@apache.org> wrote:
>
> > I'm working on solving an issue in the Log Block Manager (LBM) filed as
> > KUDU-1377. In this case, writing a partial record to the end of a
> container
> > metadata file can cause a failure to restart Kudu. One possible way for
> > this to happen is to run out of disk space when appending a block id to
> > this metadata file. I wanted to discuss a potential fix for this issue.
> >
> > The LBM uses the protobuf container (PBC) file format documented in
> > pb_util.h for the container metadata file. The current version of this
> > format (V1) looks like this:
> >
> >
> > <magic number>
> > <container version>
> > repeated <record>
> >
> >
> > with each <record> looking like the following:
> >
> > <uint32 data length>
> > <data bytes>
> > <data checksum>
> >
> >
> > In the LBM, each time we create a new block we append the block id to the
> > metadata file. On startup, we verify that all records in the file are
> > valid. If not, we print to the log and exit.
> >
> > In the case of a full disk, we will have written a partial record to the
> > metadata file, and at startup we will fail validation. However, we should
> > be able to detect this case and ignore the partial record on startup.
> Because
> > we still need to support deleting blocks, we need to be able to continue
> > appending to this metadata file after startup, so we also need to
> truncate
> > the file to the last good record when this occurs.
> >
> > Here is what I am thinking about to fix this issue:
> >
> > 1. When we are reading a container metadata file at startup, if we detect
> > that there is a trailing record that is too short to fit a valid record
> > (relative to the length of the file) then we truncate the last partial
> > record from the file and continue as normal.
> >
> > 2. To avoid truncating "good" records in the case that there is data
> > corruption in one of the length fields, we also need to extend the PBC
> > format to add a checksum for the record length. So a record would now
> look
> > like the following:
> >
> > <uint32 data length>
> >
> > <length checksum>
> >
> > <data bytes>
> > <data checksum>
> >
> >
> > Does anyone see any drawbacks to this approach?
> >
> > If you made it this far, thanks for reading.
> >
> > Mike
> >
>

Re: Durability of appendable metadata for Log Block Manager

Posted by Adar Dembo <ad...@cloudera.com>.
Quick note: the existing checksum is actually of the length AND the bytes,
not just the bytes. But that doesn't help solve your problem, because the
checksum follows the data, which means if the length is corrupted, you
don't really know where the checksum lies.
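
The point above can be made concrete with a small sketch. This is an illustration under assumed layout, not Kudu's actual code: in a V1-style record the CRC covers the length and the data, but it sits after the data, so the reader must trust the length field to locate it. Corrupt one bit of the length and the reader computes the CRC offset from the wrong place.

```python
import struct
import zlib

def pack_v1(data: bytes) -> bytes:
    # Assumed V1-style layout: <uint32 length> <data> <crc of length+data>.
    # Note the CRC comes *after* the data.
    hdr = struct.pack("<I", len(data))
    return hdr + data + struct.pack("<I", zlib.crc32(hdr + data) & 0xFFFFFFFF)

rec = pack_v1(b"abcdef")          # 4 length + 6 data + 4 crc = 14 bytes

# Flip a bit in the length field: 6 becomes 2.
bad_len = struct.unpack("<I", rec[:4])[0] ^ 0x04
bad = struct.pack("<I", bad_len) + rec[4:]

# The reader now looks for the checksum at offset 4 + 2 instead of 4 + 6,
# so the four bytes it treats as "the CRC" are really payload bytes. It
# can tell validation failed, but not whether the length or the data was
# corrupted, nor where the record actually ends.
claimed_crc_offset = 4 + bad_len   # points into the middle of the data
```

A CRC stored immediately after the length field (the proposed V2 change) removes this ambiguity, since the length can be validated before it is used to find anything else.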

Isn't the solution you're describing effectively the same as what's
described in KUDU-668? That closely mirrors what the WAL does, which I'd
argue is a good thing (since it's tried and true).

On Tue, Apr 5, 2016 at 12:14 AM, Mike Percy <mp...@apache.org> wrote:

> I'm working on solving an issue in the Log Block Manager (LBM) filed as
> KUDU-1377. In this case, writing a partial record to the end of a container
> metadata file can cause a failure to restart Kudu. One possible way for
> this to happen is to run out of disk space when appending a block id to
> this metadata file. I wanted to discuss a potential fix for this issue.
>
> The LBM uses the protobuf container (PBC) file format documented in
> pb_util.h for the container metadata file. The current version of this
> format (V1) looks like this:
>
>
> <magic number>
> <container version>
> repeated <record>
>
>
> with each <record> looking like the following:
>
> <uint32 data length>
> <data bytes>
> <data checksum>
>
>
> In the LBM, each time we create a new block we append the block id to the
> metadata file. On startup, we verify that all records in the file are
> valid. If not, we print to the log and exit.
>
> In the case of a full disk, we will have written a partial record to the
> metadata file, and at startup we will fail validation. However, we should be
> able to detect this case and ignore the partial record on startup. Because
> we still need to support deleting blocks, we need to be able to continue
> appending to this metadata file after startup, so we also need to truncate
> the file to the last good record when this occurs.
>
> Here is what I am thinking about to fix this issue:
>
> 1. When we are reading a container metadata file at startup, if we detect
> that there is a trailing record that is too short to fit a valid record
> (relative to the length of the file) then we truncate the last partial
> record from the file and continue as normal.
>
> 2. To avoid truncating "good" records in the case that there is data
> corruption in one of the length fields, we also need to extend the PBC
> format to add a checksum for the record length. So a record would now look
> like the following:
>
> <uint32 data length>
>
> <length checksum>
>
> <data bytes>
> <data checksum>
>
>
> Does anyone see any drawbacks to this approach?
>
> If you made it this far, thanks for reading.
>
> Mike
>

Re: Durability of appendable metadata for Log Block Manager

Posted by Mike Percy <mp...@apache.org>.
Oh, one thing I forgot to mention with this approach. Extending the PBC
format to a V2 would require some level of backwards compatibility. The way
I was thinking about dealing with this is that new files would be created in
the V2 format, which would make it possible to truncate log block manager
container metadata files and therefore tolerate a trailing partial record.
The current PBC V1 format would continue to be supported, just no longer
created by default, and V1 files would continue to be affected by
KUDU-1377.

For people who wanted to migrate their existing metadata to the V2 format,
we could provide a migration tool that could convert V1 metadata files to
V2, or possibly even automatically upgrade to the new format on startup.
However, if someone ever wanted to roll back to an old version of Kudu for
some reason, they would not be able to do so after migrating to the new PBC
version.
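
The compatibility scheme described above boils down to dispatching on the container version read from the header. A hedged sketch follows; the magic bytes, header layout, and reader names are placeholders (the real definitions live in pb_util.h), not Kudu's actual implementation.

```python
import struct

MAGIC = b"pbcfile!"   # placeholder magic bytes, not the real ones

def write_header(version: int) -> bytes:
    return MAGIC + struct.pack("<I", version)

def open_container(buf: bytes) -> str:
    """Pick a reader based on the container version in the header."""
    if buf[:8] != MAGIC:
        raise ValueError("header magic number does not match")
    (version,) = struct.unpack_from("<I", buf, 8)
    if version == 1:
        # V1: no length CRC; a trailing partial record remains fatal.
        return "v1-reader"
    if version == 2:
        # V2: length CRC present; safe to truncate a trailing partial record.
        return "v2-reader"
    raise ValueError("unrecognized container version %d" % version)
```

Because the version is read before any records, old V1 files keep working unchanged, while a one-way migration tool (or an automatic upgrade on startup) would simply rewrite the header and records in the V2 layout.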

Mike

On Tue, Apr 5, 2016 at 12:14 AM, Mike Percy <mp...@apache.org> wrote:

> I'm working on solving an issue in the Log Block Manager (LBM) filed as
> KUDU-1377. In this case, writing a partial record to the end of a container
> metadata file can cause a failure to restart Kudu. One possible way for
> this to happen is to run out of disk space when appending a block id to
> this metadata file. I wanted to discuss a potential fix for this issue.
>
> The LBM uses the protobuf container (PBC) file format documented in
> pb_util.h for the container metadata file. The current version of this
> format (V1) looks like this:
>
>
> <magic number>
> <container version>
> repeated <record>
>
>
> with each <record> looking like the following:
>
> <uint32 data length>
> <data bytes>
> <data checksum>
>
>
> In the LBM, each time we create a new block we append the block id to the
> metadata file. On startup, we verify that all records in the file are
> valid. If not, we print to the log and exit.
>
> In the case of a full disk, we will have written a partial record to the
> metadata file, and at startup we will fail validation. However, we should be
> able to detect this case and ignore the partial record on startup. Because
> we still need to support deleting blocks, we need to be able to continue
> appending to this metadata file after startup, so we also need to truncate
> the file to the last good record when this occurs.
>
> Here is what I am thinking about to fix this issue:
>
> 1. When we are reading a container metadata file at startup, if we detect
> that there is a trailing record that is too short to fit a valid record
> (relative to the length of the file) then we truncate the last partial
> record from the file and continue as normal.
>
> 2. To avoid truncating "good" records in the case that there is data
> corruption in one of the length fields, we also need to extend the PBC
> format to add a checksum for the record length. So a record would now look
> like the following:
>
> <uint32 data length>
>
> <length checksum>
>
> <data bytes>
> <data checksum>
>
>
> Does anyone see any drawbacks to this approach?
>
> If you made it this far, thanks for reading.
>
> Mike
>
>