Posted to dev@arrow.apache.org by Micah Kornfield <em...@gmail.com> on 2019/07/12 07:56:48 UTC

[DISCUSS][FORMAT] Data Integrity

Per Antoine's recommendation, I'm splitting off the discussion about data
integrity from the previous e-mail thread about the format additions [1].
To recap, I made a proposal including data integrity [2] by adding a new
message type to the format specification.

From the previous thread, the main question was at what level to apply
digests to Arrow data (message level, array level, buffer level, or
potentially some hybrid).

Some trade-offs I've thought of for each approach:

Message level:
+ Simplest implementation; it can be applied across all message types with
pretty much the same code.
+ Smallest amount of additional data (each digest will likely be 8-64
bytes).
- Lacks the granularity to recover partial data from a record batch if
there is corruption.
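
As a rough sketch of what a message-level digest could look like in
practice (this is only an illustration with pyarrow; SHA-256 and the
variable names are arbitrary choices, not part of the proposal in [2]):

    import hashlib
    import pyarrow as pa

    batch = pa.record_batch([pa.array([1, 2, 3])], names=["x"])
    # serialize() yields the encapsulated IPC message for this batch, so a
    # single digest over those bytes covers metadata and body together.
    message_buf = batch.serialize()
    digest = hashlib.sha256(message_buf).hexdigest()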

Array level:
+ Allows for reading the non-corrupted columns.
+ Allows for potentially more complicated use-cases, like having different
compute engines "collaborate" and sign each array they computed to
establish a "chain of trust".
- Adds some implementation complexity: we will need different schemes for
message types other than RecordBatch and for message metadata. We also
need to determine digest boundaries (would a complex column be digested
entirely, or would each child array get its own digest?).
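
A minimal sketch of the array-level option, assuming each column is
digested entirely, child buffers included (again, the hash choice and
function names are only illustrative):

    import hashlib
    import pyarrow as pa

    def array_digest(arr: pa.Array) -> str:
        # buffers() returns the array's physical buffers, including those
        # of child arrays for nested types, so a complex column is
        # consumed entirely here.
        h = hashlib.sha256()
        for buf in arr.buffers():
            if buf is not None:  # e.g. an absent validity bitmap
                h.update(buf)
        return h.hexdigest()

    batch = pa.record_batch(
        [pa.array([1, None, 3]), pa.array(["a", "b", "c"])],
        names=["ints", "strs"])
    digests = {name: array_digest(col)
               for name, col in zip(batch.schema.names, batch.columns)}

One wrinkle worth noting: a zero-copy slice shares its parent's buffers,
so a physical digest like this describes the parent allocation rather than
the logical slice (which relates to the slicing question raised later in
the thread).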

Buffer level:
More or less the same issues as the array level, plus the following:
- The largest amount of additional data.
- It's not clear there is a benefit to detecting that a single buffer is
corrupted if it means we can't accurately decode the array anyway.
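
For completeness, a buffer-level sketch; every physical buffer gets its
own checksum, which is why this option carries the most additional data
(CRC32 here is an arbitrary stand-in for whatever digest we would pick):

    import zlib
    import pyarrow as pa

    arr = pa.array([1, None, 3])
    # One checksum per physical buffer: the validity bitmap and the values
    # buffer each get their own CRC.
    checksums = [None if buf is None else zlib.crc32(buf)
                 for buf in arr.buffers()]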



Other implementation options:
* Use message-level metadata (this can be a little awkward if we want
safety against corruption of the metadata itself).
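
For illustration only, a rough approximation of the metadata option using
schema-level key/value metadata (the key name is made up, and a real
design would presumably target the Message's own custom metadata field):

    import hashlib
    import pyarrow as pa

    batch = pa.record_batch([pa.array([1, 2, 3])], names=["x"])
    digest = hashlib.sha256(batch.serialize()).hexdigest()
    # The digest travels as ordinary key/value metadata. The awkwardness
    # noted above: the metadata itself is unprotected, so a corrupted
    # digest is indistinguishable from corrupted data.
    schema_with_digest = batch.schema.with_metadata({"x-body-sha256": digest})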


[1]
https://lists.apache.org/thread.html/a99124e57c14c3c9ef9d98f3c80cfe1dd25496bf3ff7046778add937@%3Cdev.arrow.apache.org%3E
[2] https://github.com/apache/arrow/pull/4815

Re: [DISCUSS][FORMAT] Data Integrity

Posted by Antoine Pitrou <an...@python.org>.
On 15/07/2019 16:15, Wes McKinney wrote:
> If we adopt the position (as we already are in practice, I think) that
> the encapsulated IPC message format is the main way that we expose
> data from one process to another, then having digests at the message
> level seems like the simplest and most useful thing.
> 
> FWIW, the Parquet format technically provides for CRC checksums but
> has never been widely implemented, so there is a certain YAGNI feeling
> to doing anything complex on this.

You may be right.  Also, if the transport uses TLS, there's some data
integrity built-in already.

I suspect checksumming may be desirable mostly for archival purposes,
which Arrow is not aimed at.

Regards

Antoine.

Re: [DISCUSS][FORMAT] Data Integrity

Posted by Wes McKinney <we...@gmail.com>.
If we adopt the position (as we already are in practice, I think) that
the encapsulated IPC message format is the main way that we expose
data from one process to another, then having digests at the message
level seems like the simplest and most useful thing.

FWIW, the Parquet format technically provides for CRC checksums but
has never been widely implemented, so there is a certain YAGNI feeling
to doing anything complex on this.

Re: [DISCUSS][FORMAT] Data Integrity

Posted by Antoine Pitrou <an...@python.org>.

On 12/07/2019 09:56, Micah Kornfield wrote:
> Per Antoine's recommendation, I'm splitting off the discussion about data
> integrity from the previous e-mail thread about the format additions [1].
> To recap, I made a proposal including data integrity [2] by adding a new
> message type to the format specification.
> 
> From the previous thread, the main question was at what level to apply
> digests to Arrow data (message level, array level, buffer level, or
> potentially some hybrid).
> 
> Some trade-offs I've thought of for each approach:
>
> Message level:
> + Simplest implementation; it can be applied across all message types with
> pretty much the same code.
> + Smallest amount of additional data (each digest will likely be 8-64
> bytes).
> - Lacks the granularity to recover partial data from a record batch if
> there is corruption.

Also:
- Will only apply to transmission errors using the IPC mechanism, not
other kinds of errors that may occur

> Array level:
> + Allows for reading the non-corrupted columns.
> + Allows for potentially more complicated use-cases, like having different
> compute engines "collaborate" and sign each array they computed to
> establish a "chain of trust".
> - Adds some implementation complexity: we will need different schemes for
> message types other than RecordBatch and for message metadata. We also
> need to determine digest boundaries (would a complex column be digested
> entirely, or would each child array get its own digest?).

Also:
- Need to compute a new checksum when slicing an array?

> Buffer level:
> More or less the same issues as the array level, plus the following:
> - The largest amount of additional data.

It's not clear that's much of a problem (currently?), especially if
checksumming is optional.  Arrow isn't well-suited for use cases with
many tiny buffers...

> - It's not clear there is a benefit to detecting that a single buffer is
> corrupted if it means we can't accurately decode the array anyway.

Also:
+ Decorrelated from the logical interpretation of the buffer (e.g.
unaffected by slicing).

I think the possibility of a hybrid scheme should be discussed as well.
For example, compute physical checksums at the buffer level, then devise
a lightweight formula for the checksum of an array based on those
physical checksums, and a formula for an IPC message's checksum based on
its type (schema, record batch, dictionary...).
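
A minimal sketch of what such derived formulas could look like, assuming
SHA-256 throughout and a simple concatenate-and-rehash rule (essentially a
small Merkle tree; all names and choices here are illustrative, not part
of any proposal):

    import hashlib
    import pyarrow as pa

    def buffer_digest(buf):
        # Physical checksum of a single buffer; an absent buffer (e.g. a
        # missing validity bitmap) hashes as empty.
        return hashlib.sha256(buf if buf is not None else b"").digest()

    def array_digest(arr):
        # The array checksum is derived purely from the buffer checksums,
        # so a buffer that was validated once never needs to be re-read.
        h = hashlib.sha256()
        for buf in arr.buffers():
            h.update(buffer_digest(buf))
        return h.digest()

    def record_batch_digest(batch):
        # The message-type-specific formula: for a record batch, fold in
        # each column's derived checksum.
        h = hashlib.sha256()
        for col in batch.columns:
            h.update(array_digest(col))
        return h.hexdigest()

    batch = pa.record_batch([pa.array([1, None, 3])], names=["x"])
    print(record_batch_digest(batch))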

Regards

Antoine.