Posted to dev@arrow.apache.org by Micah Kornfield <em...@gmail.com> on 2019/07/05 18:53:42 UTC

[Discuss] Format additions to Arrow for sparse data and data integrity

Hi Arrow-dev,

I’d like to make a straw-man proposal to cover some features that I think
would be useful to Arrow, and that I would like to make a proof-of-concept
implementation for in Java and C++.  In particular, the proposal covers
allowing for smaller data sizes via compression and encoding [1][2][8],
data integrity [3] and avoiding unnecessary data transfer [4][5].

I’ve put together a PR [6] that has proposed changes to the flatbuffer
metadata to support the new features.  The PR introduces:

   - A new “SparseRecordBatch” that can support one of multiple possible
     encodings (both dense and sparse), compression and column elision.

   - A “Digest” message type to support optional data integrity.

Going into more details on the specific features in the PR:

   1. Sparse encodings for arrays and buffers.  The guiding principle behind
      the suggested encodings is that they can be exploited by compute
      engines for more efficient computation (I don’t think Parquet-style
      bit-packing belongs in Arrow).  While the encodings don’t maintain O(1)
      data element access, they support sublinear, O(log(N)), element access
      (a rough sketch of the run-length and frame-of-reference encodings
      follows this list).  The suggested encodings are:

      1. Array encodings:

         1. Add a run-length encoding scheme to efficiently represent
            repeated values (the actual scheme encodes run ends instead of
            lengths to preserve sub-linear random access).

         2. Add a “packed” sparse representation (null values don’t take up
            space in value buffers).

      2. Buffer encodings:

         1. Add frame of reference integer encoding [7] (this allows for
            lower bit-width encoding of integer types by subtracting a
            “reference” value from all values in the buffer).

         2. Add a sparse integer set encoding.  This encoding allows more
            efficient encoding of validity bit-masks for cases when all
            values are either null or not null.

   2. Data compression.  Similar to encodings, but compression is solely for
      reduction of data at rest/on the wire.  The proposal is to allow
      compression of individual buffers.  Right now zstd is proposed, but I
      don’t feel strongly about the specific technologies here.

   3. Column Elision.  For some use-cases, like structured logging, the
      overhead of including array metadata for columns with no data present
      is non-negligible.  The proposal provides a mechanism for omitting
      metadata for such arrays.

   4. Data Integrity.  While the Arrow file format isn’t meant for archiving
      data, I think it is important to allow for optional native data
      integrity checks in the format.  To this end, I proposed a new “Digest”
      message type that can be added after other messages to record a
      digest/hash of the preceding data.  I suggested xxhash, but I don’t
      have a strong opinion here, as long as there is some minimal support
      that can potentially be expanded later.

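To make the run-length and frame-of-reference encodings concrete, here is a
rough sketch in C++ (illustrative only; these structs are not the proposed
flatbuffer metadata or any existing Arrow API):

#include <algorithm>
#include <cstdint>
#include <vector>

// Run-length encoding keyed on run *ends*: element i is located with a
// binary search over run_ends, which gives the O(log(N)) access noted above.
struct RunEndEncodedInt32 {
  std::vector<int64_t> run_ends;  // logical index one past the end of each run
  std::vector<int32_t> values;    // one value per run

  int32_t Value(int64_t i) const {
    auto it = std::upper_bound(run_ends.begin(), run_ends.end(), i);
    return values[it - run_ends.begin()];
  }
};
// Example: the logical column [7, 7, 7, 9, 9, 4] is stored as
// run_ends = {3, 5, 6}, values = {7, 9, 4}; Value(4) returns 9.

// Frame of reference: store each value minus a per-buffer "reference" so the
// deltas fit a narrower integer width; decoding adds the reference back.
struct FrameOfReferenceInt64 {
  int64_t reference;             // e.g. the minimum value in the buffer
  std::vector<uint8_t> deltas;   // value[i] - reference, narrowed to 8 bits

  int64_t Value(int64_t i) const { return reference + deltas[i]; }
};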

In the proposal I chose to use Tables and Unions everywhere for flexibility,
but in all likelihood some could be replaced by enums.

My initial plan would be to solely focus on an IPC mechanism that can send
a SparseRecordBatch and immediately translate it to a normal RecordBatch in
both Java and C++.

As a practical matter the proposal represents a lot of work to get an MVP
working in time for the 1.0.0 release (provided the features are accepted by
the community), so I'd greatly appreciate it if anyone wants to collaborate
on this.

If it would make the conversation easier, I’m happy to start a separate
thread for each feature.  I can also create a Google Doc for direct comments
if that is preferred.

Thanks,

Micah



P.S. In the interest of full disclosure, these ideas evolved in
collaboration with Brian Hulette and other colleagues at Google who are
interested in making use of Arrow in both internal and external projects.

[1] https://issues.apache.org/jira/browse/ARROW-300

[2]  https://issues.apache.org/jira/browse/ARROW-5224

[3]
https://lists.apache.org/thread.html/36ab9c2b8b5d9f04493b3f9ea3b63c3ca3bc0f90743aa726b7a3199b@%3Cdev.arrow.apache.org%3E

[4]
https://lists.apache.org/thread.html/5e09557274f9018efee770ad3712122d874447331f52d27169f99fe0@%3Cdev.arrow.apache.org%3E

[5]
https://issues.apache.org/jira/browse/ARROW-1693?page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel&focusedCommentId=16244812#comment-16244812

[6] https://github.com/apache/arrow/pull/4815

[7]
https://lemire.me/blog/2012/02/08/effective-compression-using-frame-of-reference-and-delta-coding/

[8] https://issues.apache.org/jira/browse/ARROW-5821

Re: [DISCUSS] Format additions for encoding/compression

Posted by Micah Kornfield <em...@gmail.com>.
Great, John. I'd be interested to hear about your progress.

Also, IMO we should focus only on encodings that have the potential to be
exploited for computational benefits (not just compressibility).  I think
this is what distinguishes Arrow from other formats like Parquet, and it
echoes some other sentiments on this thread.

Cheers,
Micah

On Fri, Jan 24, 2020 at 8:28 AM John Muehlhausen <jg...@jgm.org> wrote:

> Thanks Micah, I will see if I can find some time to explore this further.
>
> On Thu, Jan 23, 2020 at 10:56 PM Micah Kornfield <em...@gmail.com>
> wrote:
>
>> Hi John,
>> Not Wes, but my thoughts on this are as follows:
>>
>> 1. Alternate bit/byte arrangements can also be useful for processing [1]
>> in
>> addition to compression.
>> 2. I think they are quite a bit more complicated then the existing schemes
>> proposed in [2], so I think it would be more expedient to get the
>> integration hooks necessary to work with simpler encodings before going
>> with something more complex.  I believe the proposal is generic enough to
>> support this type of encoding.
>> 3. For prototyping, this seems like a potential use of the ExtensionType
>> [3] type mechanism already in the specification.
>> 4. I don't think these should be new types or part of the basic Array data
>> structure.  I think having a different container format in the form of
>> "SparseRecordBatch" (or perhaps it should be renamed to
>> EncodedRecordBatch)
>> and keeping the existing types with alternate encodings is a better
>> option.
>>
>> That being said if you have bandwidth to get this working for C++ and Java
>> we can potentially setup a separate development branch to see how it
>> evolves.  Personally, I've not brought my proposal up for discussion
>> again,
>> because I haven't had bandwidth to work on it, but I still think
>> introducing some level of alternate encodings is a good idea.
>>
>> Cheers,
>> Micah
>>
>>
>> [1]
>>
>> https://15721.courses.cs.cmu.edu/spring2018/papers/22-vectorization2/p31-feng.pdf
>> [2] https://github.com/apache/arrow/pull/4815
>> [3]
>>
>> https://github.com/apache/arrow/blob/master/docs/source/format/Columnar.rst#extension-types
>>
>> On Thu, Jan 23, 2020 at 11:36 AM John Muehlhausen <jg...@jgm.org> wrote:
>>
>> > Wes, what do you think about Arrow supporting a new suite of
>> fixed-length
>> > data types that unshuffle on column->Value(i) calls?  This would allow
>> > memory/swap compressors and memory maps backed by compressing
>> > filesystems (ZFS) or block devices (VDO) to operate more efficiently.
>> >
>> > By doing it with new datatypes there is no separate flag to check?
>> >
>> > On Thu, Jan 23, 2020 at 1:09 PM Wes McKinney <we...@gmail.com>
>> wrote:
>> >
>> > > On Thu, Jan 23, 2020 at 12:42 PM John Muehlhausen <jg...@jgm.org>
>> wrote:
>> > > >
>> > > > Again, I know very little about Parquet, so your patience is
>> > appreciated.
>> > > >
>> > > > At the moment I can Arrow/mmap a file without having anywhere
>> nearly as
>> > > > much available memory as the file size.  I can visit random place in
>> > the
>> > > > file (such as a binary search if it is ordered) and only the
>> locations
>> > > > visited by column->Value(i) are paged in.  Paging them out happens
>> > > without
>> > > > my awareness, if necessary.
>> > > >
>> > > > Does Parquet cover this use-case with the same elegance and at least
>> > > equal
>> > > > efficiency, or are there more copies/conversions?  Perhaps it
>> requires
>> > > the
>> > > > entire file to be transformed into Arrow memory at the beginning? Or
>> > on a
>> > > > batch/block basis? Or to get this I need to use a non-Arrow API for
>> > data
>> > > > element access?  Etc.
>> > >
>> > > Data has to be materialized / deserialized from the Parquet file on a
>> > > batch-wise per-column basis. The APIs we provide allow batches of
>> > > values to be read for a given subset of columns
>> > >
>> > > >
>> > > > IFF it covers the above use-case, which does not mention
>> compression or
>> > > > encoding, then I could consider whether it is interesting on those
>> > > points.
>> > >
>> > > My point really has to do with Parquet's design which is about
>> > > reducing file size. In the following blog post
>> > >
>> > > https://ursalabs.org/blog/2019-10-columnar-perf/
>> > >
>> > > I examined a dataset which is about 4GB as raw Arrow stream/file but
>> > > only 114 MB as a Parquet file. A 30+X compression ratio is a huge deal
>> > > if you are working with filesystems that yield < 500MB/s (which
>> > > includes pretty much all cloud filesystems AFAIK). In clickstream
>> > > analytics this kind of compression ratio is not unusual.
>> > >
>> > > >
>> > > > -John
>> > > >
>> > > > On Thu, Jan 23, 2020 at 12:06 PM Francois Saint-Jacques <
>> > > > fsaintjacques@gmail.com> wrote:
>> > > >
>> > > > > What's the point of having zero copy if the OS is doing the
>> > > > > decompression in kernel (which trumps the zero-copy argument)? You
>> > > > > might as well just use parquet without filesystem compression. I
>> > > > > prefer to have compression algorithm where the columnar engine can
>> > > > > benefit from it [1] than marginally improving a file-system-os
>> > > > > specific feature.
>> > > > >
>> > > > > François
>> > > > >
>> > > > > [1] Section 4.3
>> http://db.csail.mit.edu/pubs/abadi-column-stores.pdf
>> > > > >
>> > > > >
>> > > > >
>> > > > >
>> > > > > On Thu, Jan 23, 2020 at 12:43 PM John Muehlhausen <jg...@jgm.org>
>> > wrote:
>> > > > > >
>> > > > > > This could also have utility in memory via things like
>> zram/zswap,
>> > > right?
>> > > > > > Mac also has a memory compressor?
>> > > > > >
>> > > > > > I don't think Parquet is an option for me unless the integration
>> > with
>> > > > > Arrow
>> > > > > > is tighter than I imagine (i.e. zero-copy).  That said, I
>> confess I
>> > > know
>> > > > > > next to nothing about Parquet.
>> > > > > >
>> > > > > > On Thu, Jan 23, 2020 at 11:23 AM Antoine Pitrou <
>> > antoine@python.org>
>> > > > > wrote:
>> > > > > > >
>> > > > > > >
>> > > > > > > Le 23/01/2020 à 18:16, John Muehlhausen a écrit :
>> > > > > > > > Perhaps related to this thread, are there any current or
>> > proposed
>> > > > > tools
>> > > > > > to
>> > > > > > > > transform columns for fixed-length data types according to a
>> > > > > "shuffle?"
>> > > > > > > >  For precedent see the implementation of the shuffle filter
>> in
>> > > hdf5.
>> > > > > > > >
>> > > > > >
>> > > > >
>> > >
>> >
>> https://support.hdfgroup.org/ftp/HDF5//documentation/doc1.6/TechNotes/shuffling-algorithm-report.pdf
>> > > > > > > >
>> > > > > > > > For example, the column (length 3) would store bytes 00 00
>> 00
>> > 00
>> > > 00
>> > > > > 00
>> > > > > > 00
>> > > > > > > > 00 00 01 02 03 to represent the three 32-bit numbers 00 00
>> 00
>> > 01
>> > > 00
>> > > > > 00
>> > > > > > 00
>> > > > > > > > 02 00 00 00 03  (I'm writing big-endian even if that is not
>> > > actually
>> > > > > the
>> > > > > > > > case).
>> > > > > > > >
>> > > > > > > > Value(1) would return 00 00 00 02 by referring to some
>> metadata
>> > > flag
>> > > > > > that
>> > > > > > > > the column is shuffled, stitching the bytes back together at
>> > call
>> > > > > time.
>> > > > > > > >
>> > > > > > > > Thus if the column pages were backed by a memory map to
>> > something
>> > > > > like
>> > > > > > > > zfs/gzip-9 (my actual use-case), one would expect approx 30%
>> > > savings
>> > > > > in
>> > > > > > > > underlying disk usage due to better run lengths.
>> > > > > > > >
>> > > > > > > > It would enable a space/time tradeoff that could be useful?
>> > The
>> > > > > > filesystem
>> > > > > > > > itself cannot easily do this particular compression
>> transform
>> > > since
>> > > > > it
>> > > > > > > > benefits from knowing the shape of the data.
>> > > > > > >
>> > > > > > > For the record, there's a pull request adding this encoding to
>> > the
>> > > > > > > Parquet C++ specification.
>> > > > > > >
>> > > > > > > Regards
>> > > > > > >
>> > > > > > > Antoine.
>> > > > >
>> > >
>> >
>>
>

Re: [DISCUSS] Format additions for encoding/compression

Posted by John Muehlhausen <jg...@jgm.org>.
Thanks Micah, I will see if I can find some time to explore this further.

On Thu, Jan 23, 2020 at 10:56 PM Micah Kornfield <em...@gmail.com>
wrote:

> Hi John,
> Not Wes, but my thoughts on this are as follows:
>
> 1. Alternate bit/byte arrangements can also be useful for processing [1] in
> addition to compression.
> 2. I think they are quite a bit more complicated then the existing schemes
> proposed in [2], so I think it would be more expedient to get the
> integration hooks necessary to work with simpler encodings before going
> with something more complex.  I believe the proposal is generic enough to
> support this type of encoding.
> 3. For prototyping, this seems like a potential use of the ExtensionType
> [3] type mechanism already in the specification.
> 4. I don't think these should be new types or part of the basic Array data
> structure.  I think having a different container format in the form of
> "SparseRecordBatch" (or perhaps it should be renamed to EncodedRecordBatch)
> and keeping the existing types with alternate encodings is a better option.
>
> That being said if you have bandwidth to get this working for C++ and Java
> we can potentially setup a separate development branch to see how it
> evolves.  Personally, I've not brought my proposal up for discussion again,
> because I haven't had bandwidth to work on it, but I still think
> introducing some level of alternate encodings is a good idea.
>
> Cheers,
> Micah
>
>
> [1]
>
> https://15721.courses.cs.cmu.edu/spring2018/papers/22-vectorization2/p31-feng.pdf
> [2] https://github.com/apache/arrow/pull/4815
> [3]
>
> https://github.com/apache/arrow/blob/master/docs/source/format/Columnar.rst#extension-types
>
> On Thu, Jan 23, 2020 at 11:36 AM John Muehlhausen <jg...@jgm.org> wrote:
>
> > Wes, what do you think about Arrow supporting a new suite of fixed-length
> > data types that unshuffle on column->Value(i) calls?  This would allow
> > memory/swap compressors and memory maps backed by compressing
> > filesystems (ZFS) or block devices (VDO) to operate more efficiently.
> >
> > By doing it with new datatypes there is no separate flag to check?
> >
> > On Thu, Jan 23, 2020 at 1:09 PM Wes McKinney <we...@gmail.com>
> wrote:
> >
> > > On Thu, Jan 23, 2020 at 12:42 PM John Muehlhausen <jg...@jgm.org> wrote:
> > > >
> > > > Again, I know very little about Parquet, so your patience is
> > appreciated.
> > > >
> > > > At the moment I can Arrow/mmap a file without having anywhere nearly
> as
> > > > much available memory as the file size.  I can visit random place in
> > the
> > > > file (such as a binary search if it is ordered) and only the
> locations
> > > > visited by column->Value(i) are paged in.  Paging them out happens
> > > without
> > > > my awareness, if necessary.
> > > >
> > > > Does Parquet cover this use-case with the same elegance and at least
> > > equal
> > > > efficiency, or are there more copies/conversions?  Perhaps it
> requires
> > > the
> > > > entire file to be transformed into Arrow memory at the beginning? Or
> > on a
> > > > batch/block basis? Or to get this I need to use a non-Arrow API for
> > data
> > > > element access?  Etc.
> > >
> > > Data has to be materialized / deserialized from the Parquet file on a
> > > batch-wise per-column basis. The APIs we provide allow batches of
> > > values to be read for a given subset of columns
> > >
> > > >
> > > > IFF it covers the above use-case, which does not mention compression
> or
> > > > encoding, then I could consider whether it is interesting on those
> > > points.
> > >
> > > My point really has to do with Parquet's design which is about
> > > reducing file size. In the following blog post
> > >
> > > https://ursalabs.org/blog/2019-10-columnar-perf/
> > >
> > > I examined a dataset which is about 4GB as raw Arrow stream/file but
> > > only 114 MB as a Parquet file. A 30+X compression ratio is a huge deal
> > > if you are working with filesystems that yield < 500MB/s (which
> > > includes pretty much all cloud filesystems AFAIK). In clickstream
> > > analytics this kind of compression ratio is not unusual.
> > >
> > > >
> > > > -John
> > > >
> > > > On Thu, Jan 23, 2020 at 12:06 PM Francois Saint-Jacques <
> > > > fsaintjacques@gmail.com> wrote:
> > > >
> > > > > What's the point of having zero copy if the OS is doing the
> > > > > decompression in kernel (which trumps the zero-copy argument)? You
> > > > > might as well just use parquet without filesystem compression. I
> > > > > prefer to have compression algorithm where the columnar engine can
> > > > > benefit from it [1] than marginally improving a file-system-os
> > > > > specific feature.
> > > > >
> > > > > François
> > > > >
> > > > > [1] Section 4.3
> http://db.csail.mit.edu/pubs/abadi-column-stores.pdf
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > On Thu, Jan 23, 2020 at 12:43 PM John Muehlhausen <jg...@jgm.org>
> > wrote:
> > > > > >
> > > > > > This could also have utility in memory via things like
> zram/zswap,
> > > right?
> > > > > > Mac also has a memory compressor?
> > > > > >
> > > > > > I don't think Parquet is an option for me unless the integration
> > with
> > > > > Arrow
> > > > > > is tighter than I imagine (i.e. zero-copy).  That said, I
> confess I
> > > know
> > > > > > next to nothing about Parquet.
> > > > > >
> > > > > > On Thu, Jan 23, 2020 at 11:23 AM Antoine Pitrou <
> > antoine@python.org>
> > > > > wrote:
> > > > > > >
> > > > > > >
> > > > > > > Le 23/01/2020 à 18:16, John Muehlhausen a écrit :
> > > > > > > > Perhaps related to this thread, are there any current or
> > proposed
> > > > > tools
> > > > > > to
> > > > > > > > transform columns for fixed-length data types according to a
> > > > > "shuffle?"
> > > > > > > >  For precedent see the implementation of the shuffle filter
> in
> > > hdf5.
> > > > > > > >
> > > > > >
> > > > >
> > >
> >
> https://support.hdfgroup.org/ftp/HDF5//documentation/doc1.6/TechNotes/shuffling-algorithm-report.pdf
> > > > > > > >
> > > > > > > > For example, the column (length 3) would store bytes 00 00 00
> > 00
> > > 00
> > > > > 00
> > > > > > 00
> > > > > > > > 00 00 01 02 03 to represent the three 32-bit numbers 00 00 00
> > 01
> > > 00
> > > > > 00
> > > > > > 00
> > > > > > > > 02 00 00 00 03  (I'm writing big-endian even if that is not
> > > actually
> > > > > the
> > > > > > > > case).
> > > > > > > >
> > > > > > > > Value(1) would return 00 00 00 02 by referring to some
> metadata
> > > flag
> > > > > > that
> > > > > > > > the column is shuffled, stitching the bytes back together at
> > call
> > > > > time.
> > > > > > > >
> > > > > > > > Thus if the column pages were backed by a memory map to
> > something
> > > > > like
> > > > > > > > zfs/gzip-9 (my actual use-case), one would expect approx 30%
> > > savings
> > > > > in
> > > > > > > > underlying disk usage due to better run lengths.
> > > > > > > >
> > > > > > > > It would enable a space/time tradeoff that could be useful?
> > The
> > > > > > filesystem
> > > > > > > > itself cannot easily do this particular compression transform
> > > since
> > > > > it
> > > > > > > > benefits from knowing the shape of the data.
> > > > > > >
> > > > > > > For the record, there's a pull request adding this encoding to
> > the
> > > > > > > Parquet C++ specification.
> > > > > > >
> > > > > > > Regards
> > > > > > >
> > > > > > > Antoine.
> > > > >
> > >
> >
>

Re: [DISCUSS] Format additions for encoding/compression

Posted by Micah Kornfield <em...@gmail.com>.
Hi John,
Not Wes, but my thoughts on this are as follows:

1. Alternate bit/byte arrangements can also be useful for processing [1] in
addition to compression.
2. I think they are quite a bit more complicated than the existing schemes
proposed in [2], so I think it would be more expedient to get the
integration hooks necessary to work with simpler encodings before going
with something more complex.  I believe the proposal is generic enough to
support this type of encoding.
3. For prototyping, this seems like a potential use of the ExtensionType [3]
mechanism already in the specification.
4. I don't think these should be new types or part of the basic Array data
structure.  I think having a different container format in the form of
"SparseRecordBatch" (or perhaps it should be renamed to EncodedRecordBatch)
and keeping the existing types with alternate encodings is a better option.

That being said, if you have the bandwidth to get this working for C++ and
Java, we can potentially set up a separate development branch to see how it
evolves.  Personally, I've not brought my proposal up for discussion again
because I haven't had the bandwidth to work on it, but I still think
introducing some level of alternate encodings is a good idea.

Cheers,
Micah


[1]
https://15721.courses.cs.cmu.edu/spring2018/papers/22-vectorization2/p31-feng.pdf
[2] https://github.com/apache/arrow/pull/4815
[3]
https://github.com/apache/arrow/blob/master/docs/source/format/Columnar.rst#extension-types

On Thu, Jan 23, 2020 at 11:36 AM John Muehlhausen <jg...@jgm.org> wrote:

> Wes, what do you think about Arrow supporting a new suite of fixed-length
> data types that unshuffle on column->Value(i) calls?  This would allow
> memory/swap compressors and memory maps backed by compressing
> filesystems (ZFS) or block devices (VDO) to operate more efficiently.
>
> By doing it with new datatypes there is no separate flag to check?
>
> On Thu, Jan 23, 2020 at 1:09 PM Wes McKinney <we...@gmail.com> wrote:
>
> > On Thu, Jan 23, 2020 at 12:42 PM John Muehlhausen <jg...@jgm.org> wrote:
> > >
> > > Again, I know very little about Parquet, so your patience is
> appreciated.
> > >
> > > At the moment I can Arrow/mmap a file without having anywhere nearly as
> > > much available memory as the file size.  I can visit random place in
> the
> > > file (such as a binary search if it is ordered) and only the locations
> > > visited by column->Value(i) are paged in.  Paging them out happens
> > without
> > > my awareness, if necessary.
> > >
> > > Does Parquet cover this use-case with the same elegance and at least
> > equal
> > > efficiency, or are there more copies/conversions?  Perhaps it requires
> > the
> > > entire file to be transformed into Arrow memory at the beginning? Or
> on a
> > > batch/block basis? Or to get this I need to use a non-Arrow API for
> data
> > > element access?  Etc.
> >
> > Data has to be materialized / deserialized from the Parquet file on a
> > batch-wise per-column basis. The APIs we provide allow batches of
> > values to be read for a given subset of columns
> >
> > >
> > > IFF it covers the above use-case, which does not mention compression or
> > > encoding, then I could consider whether it is interesting on those
> > points.
> >
> > My point really has to do with Parquet's design which is about
> > reducing file size. In the following blog post
> >
> > https://ursalabs.org/blog/2019-10-columnar-perf/
> >
> > I examined a dataset which is about 4GB as raw Arrow stream/file but
> > only 114 MB as a Parquet file. A 30+X compression ratio is a huge deal
> > if you are working with filesystems that yield < 500MB/s (which
> > includes pretty much all cloud filesystems AFAIK). In clickstream
> > analytics this kind of compression ratio is not unusual.
> >
> > >
> > > -John
> > >
> > > On Thu, Jan 23, 2020 at 12:06 PM Francois Saint-Jacques <
> > > fsaintjacques@gmail.com> wrote:
> > >
> > > > What's the point of having zero copy if the OS is doing the
> > > > decompression in kernel (which trumps the zero-copy argument)? You
> > > > might as well just use parquet without filesystem compression. I
> > > > prefer to have compression algorithm where the columnar engine can
> > > > benefit from it [1] than marginally improving a file-system-os
> > > > specific feature.
> > > >
> > > > François
> > > >
> > > > [1] Section 4.3 http://db.csail.mit.edu/pubs/abadi-column-stores.pdf
> > > >
> > > >
> > > >
> > > >
> > > > On Thu, Jan 23, 2020 at 12:43 PM John Muehlhausen <jg...@jgm.org>
> wrote:
> > > > >
> > > > > This could also have utility in memory via things like zram/zswap,
> > right?
> > > > > Mac also has a memory compressor?
> > > > >
> > > > > I don't think Parquet is an option for me unless the integration
> with
> > > > Arrow
> > > > > is tighter than I imagine (i.e. zero-copy).  That said, I confess I
> > know
> > > > > next to nothing about Parquet.
> > > > >
> > > > > On Thu, Jan 23, 2020 at 11:23 AM Antoine Pitrou <
> antoine@python.org>
> > > > wrote:
> > > > > >
> > > > > >
> > > > > > Le 23/01/2020 à 18:16, John Muehlhausen a écrit :
> > > > > > > Perhaps related to this thread, are there any current or
> proposed
> > > > tools
> > > > > to
> > > > > > > transform columns for fixed-length data types according to a
> > > > "shuffle?"
> > > > > > >  For precedent see the implementation of the shuffle filter in
> > hdf5.
> > > > > > >
> > > > >
> > > >
> >
> https://support.hdfgroup.org/ftp/HDF5//documentation/doc1.6/TechNotes/shuffling-algorithm-report.pdf
> > > > > > >
> > > > > > > For example, the column (length 3) would store bytes 00 00 00
> 00
> > 00
> > > > 00
> > > > > 00
> > > > > > > 00 00 01 02 03 to represent the three 32-bit numbers 00 00 00
> 01
> > 00
> > > > 00
> > > > > 00
> > > > > > > 02 00 00 00 03  (I'm writing big-endian even if that is not
> > actually
> > > > the
> > > > > > > case).
> > > > > > >
> > > > > > > Value(1) would return 00 00 00 02 by referring to some metadata
> > flag
> > > > > that
> > > > > > > the column is shuffled, stitching the bytes back together at
> call
> > > > time.
> > > > > > >
> > > > > > > Thus if the column pages were backed by a memory map to
> something
> > > > like
> > > > > > > zfs/gzip-9 (my actual use-case), one would expect approx 30%
> > savings
> > > > in
> > > > > > > underlying disk usage due to better run lengths.
> > > > > > >
> > > > > > > It would enable a space/time tradeoff that could be useful?
> The
> > > > > filesystem
> > > > > > > itself cannot easily do this particular compression transform
> > since
> > > > it
> > > > > > > benefits from knowing the shape of the data.
> > > > > >
> > > > > > For the record, there's a pull request adding this encoding to
> the
> > > > > > Parquet C++ specification.
> > > > > >
> > > > > > Regards
> > > > > >
> > > > > > Antoine.
> > > >
> >
>

Re: [DISCUSS] Format additions for encoding/compression

Posted by John Muehlhausen <jg...@jgm.org>.
Wes, what do you think about Arrow supporting a new suite of fixed-length
data types that unshuffle on column->Value(i) calls?  This would allow
memory/swap compressors and memory maps backed by compressing
filesystems (ZFS) or block devices (VDO) to operate more efficiently.

By doing it with new datatypes there is no separate flag to check?

On Thu, Jan 23, 2020 at 1:09 PM Wes McKinney <we...@gmail.com> wrote:

> On Thu, Jan 23, 2020 at 12:42 PM John Muehlhausen <jg...@jgm.org> wrote:
> >
> > Again, I know very little about Parquet, so your patience is appreciated.
> >
> > At the moment I can Arrow/mmap a file without having anywhere nearly as
> > much available memory as the file size.  I can visit random place in the
> > file (such as a binary search if it is ordered) and only the locations
> > visited by column->Value(i) are paged in.  Paging them out happens
> without
> > my awareness, if necessary.
> >
> > Does Parquet cover this use-case with the same elegance and at least
> equal
> > efficiency, or are there more copies/conversions?  Perhaps it requires
> the
> > entire file to be transformed into Arrow memory at the beginning? Or on a
> > batch/block basis? Or to get this I need to use a non-Arrow API for data
> > element access?  Etc.
>
> Data has to be materialized / deserialized from the Parquet file on a
> batch-wise per-column basis. The APIs we provide allow batches of
> values to be read for a given subset of columns
>
> >
> > IFF it covers the above use-case, which does not mention compression or
> > encoding, then I could consider whether it is interesting on those
> points.
>
> My point really has to do with Parquet's design which is about
> reducing file size. In the following blog post
>
> https://ursalabs.org/blog/2019-10-columnar-perf/
>
> I examined a dataset which is about 4GB as raw Arrow stream/file but
> only 114 MB as a Parquet file. A 30+X compression ratio is a huge deal
> if you are working with filesystems that yield < 500MB/s (which
> includes pretty much all cloud filesystems AFAIK). In clickstream
> analytics this kind of compression ratio is not unusual.
>
> >
> > -John
> >
> > On Thu, Jan 23, 2020 at 12:06 PM Francois Saint-Jacques <
> > fsaintjacques@gmail.com> wrote:
> >
> > > What's the point of having zero copy if the OS is doing the
> > > decompression in kernel (which trumps the zero-copy argument)? You
> > > might as well just use parquet without filesystem compression. I
> > > prefer to have compression algorithm where the columnar engine can
> > > benefit from it [1] than marginally improving a file-system-os
> > > specific feature.
> > >
> > > François
> > >
> > > [1] Section 4.3 http://db.csail.mit.edu/pubs/abadi-column-stores.pdf
> > >
> > >
> > >
> > >
> > > On Thu, Jan 23, 2020 at 12:43 PM John Muehlhausen <jg...@jgm.org> wrote:
> > > >
> > > > This could also have utility in memory via things like zram/zswap,
> right?
> > > > Mac also has a memory compressor?
> > > >
> > > > I don't think Parquet is an option for me unless the integration with
> > > Arrow
> > > > is tighter than I imagine (i.e. zero-copy).  That said, I confess I
> know
> > > > next to nothing about Parquet.
> > > >
> > > > On Thu, Jan 23, 2020 at 11:23 AM Antoine Pitrou <an...@python.org>
> > > wrote:
> > > > >
> > > > >
> > > > > Le 23/01/2020 à 18:16, John Muehlhausen a écrit :
> > > > > > Perhaps related to this thread, are there any current or proposed
> > > tools
> > > > to
> > > > > > transform columns for fixed-length data types according to a
> > > "shuffle?"
> > > > > >  For precedent see the implementation of the shuffle filter in
> hdf5.
> > > > > >
> > > >
> > >
> https://support.hdfgroup.org/ftp/HDF5//documentation/doc1.6/TechNotes/shuffling-algorithm-report.pdf
> > > > > >
> > > > > > For example, the column (length 3) would store bytes 00 00 00 00
> 00
> > > 00
> > > > 00
> > > > > > 00 00 01 02 03 to represent the three 32-bit numbers 00 00 00 01
> 00
> > > 00
> > > > 00
> > > > > > 02 00 00 00 03  (I'm writing big-endian even if that is not
> actually
> > > the
> > > > > > case).
> > > > > >
> > > > > > Value(1) would return 00 00 00 02 by referring to some metadata
> flag
> > > > that
> > > > > > the column is shuffled, stitching the bytes back together at call
> > > time.
> > > > > >
> > > > > > Thus if the column pages were backed by a memory map to something
> > > like
> > > > > > zfs/gzip-9 (my actual use-case), one would expect approx 30%
> savings
> > > in
> > > > > > underlying disk usage due to better run lengths.
> > > > > >
> > > > > > It would enable a space/time tradeoff that could be useful?  The
> > > > filesystem
> > > > > > itself cannot easily do this particular compression transform
> since
> > > it
> > > > > > benefits from knowing the shape of the data.
> > > > >
> > > > > For the record, there's a pull request adding this encoding to the
> > > > > Parquet C++ specification.
> > > > >
> > > > > Regards
> > > > >
> > > > > Antoine.
> > >
>

Re: [DISCUSS] Format additions for encoding/compression

Posted by Wes McKinney <we...@gmail.com>.
On Thu, Jan 23, 2020 at 12:42 PM John Muehlhausen <jg...@jgm.org> wrote:
>
> Again, I know very little about Parquet, so your patience is appreciated.
>
> At the moment I can Arrow/mmap a file without having anywhere nearly as
> much available memory as the file size.  I can visit random place in the
> file (such as a binary search if it is ordered) and only the locations
> visited by column->Value(i) are paged in.  Paging them out happens without
> my awareness, if necessary.
>
> Does Parquet cover this use-case with the same elegance and at least equal
> efficiency, or are there more copies/conversions?  Perhaps it requires the
> entire file to be transformed into Arrow memory at the beginning? Or on a
> batch/block basis? Or to get this I need to use a non-Arrow API for data
> element access?  Etc.

Data has to be materialized / deserialized from the Parquet file on a
batch-wise per-column basis. The APIs we provide allow batches of
values to be read for a given subset of columns.
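
(As a rough illustration of the kind of API in question: the sketch below
uses names from the Arrow C++ Parquet reader, but the exact signatures vary
between releases, so treat it as a sketch rather than the authoritative API.)

#include <arrow/api.h>
#include <arrow/io/api.h>
#include <parquet/arrow/reader.h>

arrow::Status ReadTwoColumns(const std::string& path,
                             std::shared_ptr<arrow::Table>* out) {
  ARROW_ASSIGN_OR_RAISE(auto infile, arrow::io::ReadableFile::Open(path));
  std::unique_ptr<parquet::arrow::FileReader> reader;
  ARROW_RETURN_NOT_OK(parquet::arrow::OpenFile(
      infile, arrow::default_memory_pool(), &reader));
  // Only columns 0 and 2 are decoded; the values are materialized batch-wise
  // into Arrow memory rather than accessed in place.
  return reader->ReadTable({0, 2}, out);
}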

>
> IFF it covers the above use-case, which does not mention compression or
> encoding, then I could consider whether it is interesting on those points.

My point really has to do with Parquet's design, which is about
reducing file size. In the following blog post

https://ursalabs.org/blog/2019-10-columnar-perf/

I examined a dataset which is about 4GB as raw Arrow stream/file but
only 114 MB as a Parquet file. A 30+X compression ratio is a huge deal
if you are working with filesystems that yield < 500MB/s (which
includes pretty much all cloud filesystems AFAIK). In clickstream
analytics this kind of compression ratio is not unusual.

>
> -John
>
> On Thu, Jan 23, 2020 at 12:06 PM Francois Saint-Jacques <
> fsaintjacques@gmail.com> wrote:
>
> > What's the point of having zero copy if the OS is doing the
> > decompression in kernel (which trumps the zero-copy argument)? You
> > might as well just use parquet without filesystem compression. I
> > prefer to have compression algorithm where the columnar engine can
> > benefit from it [1] than marginally improving a file-system-os
> > specific feature.
> >
> > François
> >
> > [1] Section 4.3 http://db.csail.mit.edu/pubs/abadi-column-stores.pdf
> >
> >
> >
> >
> > On Thu, Jan 23, 2020 at 12:43 PM John Muehlhausen <jg...@jgm.org> wrote:
> > >
> > > This could also have utility in memory via things like zram/zswap, right?
> > > Mac also has a memory compressor?
> > >
> > > I don't think Parquet is an option for me unless the integration with
> > Arrow
> > > is tighter than I imagine (i.e. zero-copy).  That said, I confess I know
> > > next to nothing about Parquet.
> > >
> > > On Thu, Jan 23, 2020 at 11:23 AM Antoine Pitrou <an...@python.org>
> > wrote:
> > > >
> > > >
> > > > Le 23/01/2020 à 18:16, John Muehlhausen a écrit :
> > > > > Perhaps related to this thread, are there any current or proposed
> > tools
> > > to
> > > > > transform columns for fixed-length data types according to a
> > "shuffle?"
> > > > >  For precedent see the implementation of the shuffle filter in hdf5.
> > > > >
> > >
> > https://support.hdfgroup.org/ftp/HDF5//documentation/doc1.6/TechNotes/shuffling-algorithm-report.pdf
> > > > >
> > > > > For example, the column (length 3) would store bytes 00 00 00 00 00
> > 00
> > > 00
> > > > > 00 00 01 02 03 to represent the three 32-bit numbers 00 00 00 01 00
> > 00
> > > 00
> > > > > 02 00 00 00 03  (I'm writing big-endian even if that is not actually
> > the
> > > > > case).
> > > > >
> > > > > Value(1) would return 00 00 00 02 by referring to some metadata flag
> > > that
> > > > > the column is shuffled, stitching the bytes back together at call
> > time.
> > > > >
> > > > > Thus if the column pages were backed by a memory map to something
> > like
> > > > > zfs/gzip-9 (my actual use-case), one would expect approx 30% savings
> > in
> > > > > underlying disk usage due to better run lengths.
> > > > >
> > > > > It would enable a space/time tradeoff that could be useful?  The
> > > filesystem
> > > > > itself cannot easily do this particular compression transform since
> > it
> > > > > benefits from knowing the shape of the data.
> > > >
> > > > For the record, there's a pull request adding this encoding to the
> > > > Parquet C++ specification.
> > > >
> > > > Regards
> > > >
> > > > Antoine.
> >

Re: [DISCUSS] Format additions for encoding/compression

Posted by John Muehlhausen <jg...@jgm.org>.
Again, I know very little about Parquet, so your patience is appreciated.

At the moment I can Arrow/mmap a file without having anywhere near as much
available memory as the file size.  I can visit random places in the file
(such as with a binary search if it is ordered) and only the locations
visited by column->Value(i) are paged in.  Paging them out happens without
my awareness, if necessary.
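
(A minimal sketch of that pattern, using Arrow C++ IPC names; the exact
signatures differ between Arrow versions, so this is illustrative only.)

#include <memory>

#include <arrow/api.h>
#include <arrow/io/api.h>
#include <arrow/ipc/api.h>

arrow::Status BinarySearchMapped(const std::string& path, int64_t target) {
  // Map the file; pages are only faulted in when they are actually touched.
  ARROW_ASSIGN_OR_RAISE(auto mmap, arrow::io::MemoryMappedFile::Open(
                                       path, arrow::io::FileMode::READ));
  ARROW_ASSIGN_OR_RAISE(auto reader,
                        arrow::ipc::RecordBatchFileReader::Open(mmap));
  ARROW_ASSIGN_OR_RAISE(auto batch, reader->ReadRecordBatch(0));
  auto col = std::static_pointer_cast<arrow::Int64Array>(batch->column(0));
  // A binary search over a sorted column touches O(log N) locations, so only
  // those pages are brought into memory.
  int64_t lo = 0, hi = batch->num_rows();
  while (lo < hi) {
    int64_t mid = lo + (hi - lo) / 2;
    if (col->Value(mid) < target) lo = mid + 1; else hi = mid;
  }
  return arrow::Status::OK();
}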

Does Parquet cover this use-case with the same elegance and at least equal
efficiency, or are there more copies/conversions?  Perhaps it requires the
entire file to be transformed into Arrow memory at the beginning? Or on a
batch/block basis? Or, to get this, do I need to use a non-Arrow API for data
element access?  Etc.

IFF it covers the above use-case, which does not mention compression or
encoding, then I could consider whether it is interesting on those points.

-John

On Thu, Jan 23, 2020 at 12:06 PM Francois Saint-Jacques <
fsaintjacques@gmail.com> wrote:

> What's the point of having zero copy if the OS is doing the
> decompression in kernel (which trumps the zero-copy argument)? You
> might as well just use parquet without filesystem compression. I
> prefer to have compression algorithm where the columnar engine can
> benefit from it [1] than marginally improving a file-system-os
> specific feature.
>
> François
>
> [1] Section 4.3 http://db.csail.mit.edu/pubs/abadi-column-stores.pdf
>
>
>
>
> On Thu, Jan 23, 2020 at 12:43 PM John Muehlhausen <jg...@jgm.org> wrote:
> >
> > This could also have utility in memory via things like zram/zswap, right?
> > Mac also has a memory compressor?
> >
> > I don't think Parquet is an option for me unless the integration with
> Arrow
> > is tighter than I imagine (i.e. zero-copy).  That said, I confess I know
> > next to nothing about Parquet.
> >
> > On Thu, Jan 23, 2020 at 11:23 AM Antoine Pitrou <an...@python.org>
> wrote:
> > >
> > >
> > > Le 23/01/2020 à 18:16, John Muehlhausen a écrit :
> > > > Perhaps related to this thread, are there any current or proposed
> tools
> > to
> > > > transform columns for fixed-length data types according to a
> "shuffle?"
> > > >  For precedent see the implementation of the shuffle filter in hdf5.
> > > >
> >
> https://support.hdfgroup.org/ftp/HDF5//documentation/doc1.6/TechNotes/shuffling-algorithm-report.pdf
> > > >
> > > > For example, the column (length 3) would store bytes 00 00 00 00 00
> 00
> > 00
> > > > 00 00 01 02 03 to represent the three 32-bit numbers 00 00 00 01 00
> 00
> > 00
> > > > 02 00 00 00 03  (I'm writing big-endian even if that is not actually
> the
> > > > case).
> > > >
> > > > Value(1) would return 00 00 00 02 by referring to some metadata flag
> > that
> > > > the column is shuffled, stitching the bytes back together at call
> time.
> > > >
> > > > Thus if the column pages were backed by a memory map to something
> like
> > > > zfs/gzip-9 (my actual use-case), one would expect approx 30% savings
> in
> > > > underlying disk usage due to better run lengths.
> > > >
> > > > It would enable a space/time tradeoff that could be useful?  The
> > filesystem
> > > > itself cannot easily do this particular compression transform since
> it
> > > > benefits from knowing the shape of the data.
> > >
> > > For the record, there's a pull request adding this encoding to the
> > > Parquet C++ specification.
> > >
> > > Regards
> > >
> > > Antoine.
>

Re: [DISCUSS] Format additions for encoding/compression

Posted by Wes McKinney <we...@gmail.com>.
Parquet is most relevant in scenarios where filesystem IO is constrained
(spinning rust HDD, network FS, cloud storage / S3 / GCS). For those
use cases memory-mapped Arrow is not viable.

Against local NVMe (> 2000 MB/s read throughput) your mileage may vary.

On Thu, Jan 23, 2020 at 12:06 PM Francois Saint-Jacques
<fs...@gmail.com> wrote:
>
> What's the point of having zero copy if the OS is doing the
> decompression in kernel (which trumps the zero-copy argument)? You
> might as well just use parquet without filesystem compression. I
> prefer to have compression algorithm where the columnar engine can
> benefit from it [1] than marginally improving a file-system-os
> specific feature.
>
> François
>
> [1] Section 4.3 http://db.csail.mit.edu/pubs/abadi-column-stores.pdf
>
>
>
>
> On Thu, Jan 23, 2020 at 12:43 PM John Muehlhausen <jg...@jgm.org> wrote:
> >
> > This could also have utility in memory via things like zram/zswap, right?
> > Mac also has a memory compressor?
> >
> > I don't think Parquet is an option for me unless the integration with Arrow
> > is tighter than I imagine (i.e. zero-copy).  That said, I confess I know
> > next to nothing about Parquet.
> >
> > On Thu, Jan 23, 2020 at 11:23 AM Antoine Pitrou <an...@python.org> wrote:
> > >
> > >
> > > Le 23/01/2020 à 18:16, John Muehlhausen a écrit :
> > > > Perhaps related to this thread, are there any current or proposed tools
> > to
> > > > transform columns for fixed-length data types according to a "shuffle?"
> > > >  For precedent see the implementation of the shuffle filter in hdf5.
> > > >
> > https://support.hdfgroup.org/ftp/HDF5//documentation/doc1.6/TechNotes/shuffling-algorithm-report.pdf
> > > >
> > > > For example, the column (length 3) would store bytes 00 00 00 00 00 00
> > 00
> > > > 00 00 01 02 03 to represent the three 32-bit numbers 00 00 00 01 00 00
> > 00
> > > > 02 00 00 00 03  (I'm writing big-endian even if that is not actually the
> > > > case).
> > > >
> > > > Value(1) would return 00 00 00 02 by referring to some metadata flag
> > that
> > > > the column is shuffled, stitching the bytes back together at call time.
> > > >
> > > > Thus if the column pages were backed by a memory map to something like
> > > > zfs/gzip-9 (my actual use-case), one would expect approx 30% savings in
> > > > underlying disk usage due to better run lengths.
> > > >
> > > > It would enable a space/time tradeoff that could be useful?  The
> > filesystem
> > > > itself cannot easily do this particular compression transform since it
> > > > benefits from knowing the shape of the data.
> > >
> > > For the record, there's a pull request adding this encoding to the
> > > Parquet C++ specification.
> > >
> > > Regards
> > >
> > > Antoine.

Re: [DISCUSS] Format additions for encoding/compression

Posted by Francois Saint-Jacques <fs...@gmail.com>.
What's the point of having zero copy if the OS is doing the
decompression in the kernel (which trumps the zero-copy argument)? You
might as well just use Parquet without filesystem compression. I would
rather have a compression algorithm the columnar engine can benefit
from [1] than marginally improve a filesystem- or OS-specific
feature.

François

[1] Section 4.3 http://db.csail.mit.edu/pubs/abadi-column-stores.pdf




On Thu, Jan 23, 2020 at 12:43 PM John Muehlhausen <jg...@jgm.org> wrote:
>
> This could also have utility in memory via things like zram/zswap, right?
> Mac also has a memory compressor?
>
> I don't think Parquet is an option for me unless the integration with Arrow
> is tighter than I imagine (i.e. zero-copy).  That said, I confess I know
> next to nothing about Parquet.
>
> On Thu, Jan 23, 2020 at 11:23 AM Antoine Pitrou <an...@python.org> wrote:
> >
> >
> > Le 23/01/2020 à 18:16, John Muehlhausen a écrit :
> > > Perhaps related to this thread, are there any current or proposed tools
> to
> > > transform columns for fixed-length data types according to a "shuffle?"
> > >  For precedent see the implementation of the shuffle filter in hdf5.
> > >
> https://support.hdfgroup.org/ftp/HDF5//documentation/doc1.6/TechNotes/shuffling-algorithm-report.pdf
> > >
> > > For example, the column (length 3) would store bytes 00 00 00 00 00 00
> 00
> > > 00 00 01 02 03 to represent the three 32-bit numbers 00 00 00 01 00 00
> 00
> > > 02 00 00 00 03  (I'm writing big-endian even if that is not actually the
> > > case).
> > >
> > > Value(1) would return 00 00 00 02 by referring to some metadata flag
> that
> > > the column is shuffled, stitching the bytes back together at call time.
> > >
> > > Thus if the column pages were backed by a memory map to something like
> > > zfs/gzip-9 (my actual use-case), one would expect approx 30% savings in
> > > underlying disk usage due to better run lengths.
> > >
> > > It would enable a space/time tradeoff that could be useful?  The
> filesystem
> > > itself cannot easily do this particular compression transform since it
> > > benefits from knowing the shape of the data.
> >
> > For the record, there's a pull request adding this encoding to the
> > Parquet C++ specification.
> >
> > Regards
> >
> > Antoine.

Re: [DISCUSS] Format additions for encoding/compression

Posted by John Muehlhausen <jg...@jgm.org>.
This could also have utility in memory via things like zram/zswap, right?
Mac also has a memory compressor?

I don't think Parquet is an option for me unless the integration with Arrow
is tighter than I imagine (i.e. zero-copy).  That said, I confess I know
next to nothing about Parquet.

On Thu, Jan 23, 2020 at 11:23 AM Antoine Pitrou <an...@python.org> wrote:
>
>
> Le 23/01/2020 à 18:16, John Muehlhausen a écrit :
> > Perhaps related to this thread, are there any current or proposed tools
to
> > transform columns for fixed-length data types according to a "shuffle?"
> >  For precedent see the implementation of the shuffle filter in hdf5.
> >
https://support.hdfgroup.org/ftp/HDF5//documentation/doc1.6/TechNotes/shuffling-algorithm-report.pdf
> >
> > For example, the column (length 3) would store bytes 00 00 00 00 00 00
00
> > 00 00 01 02 03 to represent the three 32-bit numbers 00 00 00 01 00 00
00
> > 02 00 00 00 03  (I'm writing big-endian even if that is not actually the
> > case).
> >
> > Value(1) would return 00 00 00 02 by referring to some metadata flag
that
> > the column is shuffled, stitching the bytes back together at call time.
> >
> > Thus if the column pages were backed by a memory map to something like
> > zfs/gzip-9 (my actual use-case), one would expect approx 30% savings in
> > underlying disk usage due to better run lengths.
> >
> > It would enable a space/time tradeoff that could be useful?  The
filesystem
> > itself cannot easily do this particular compression transform since it
> > benefits from knowing the shape of the data.
>
> For the record, there's a pull request adding this encoding to the
> Parquet C++ specification.
>
> Regards
>
> Antoine.

Re: [DISCUSS] Format additions for encoding/compression

Posted by Antoine Pitrou <an...@python.org>.
Forgot to give the URL:
https://github.com/apache/arrow/pull/6005

Regards

Antoine.


Le 23/01/2020 à 18:23, Antoine Pitrou a écrit :
> 
> Le 23/01/2020 à 18:16, John Muehlhausen a écrit :
>> Perhaps related to this thread, are there any current or proposed tools to
>> transform columns for fixed-length data types according to a "shuffle?"
>>  For precedent see the implementation of the shuffle filter in hdf5.
>> https://support.hdfgroup.org/ftp/HDF5//documentation/doc1.6/TechNotes/shuffling-algorithm-report.pdf
>>
>> For example, the column (length 3) would store bytes 00 00 00 00 00 00 00
>> 00 00 01 02 03 to represent the three 32-bit numbers 00 00 00 01 00 00 00
>> 02 00 00 00 03  (I'm writing big-endian even if that is not actually the
>> case).
>>
>> Value(1) would return 00 00 00 02 by referring to some metadata flag that
>> the column is shuffled, stitching the bytes back together at call time.
>>
>> Thus if the column pages were backed by a memory map to something like
>> zfs/gzip-9 (my actual use-case), one would expect approx 30% savings in
>> underlying disk usage due to better run lengths.
>>
>> It would enable a space/time tradeoff that could be useful?  The filesystem
>> itself cannot easily do this particular compression transform since it
>> benefits from knowing the shape of the data.
> 
> For the record, there's a pull request adding this encoding to the
> Parquet C++ specification.
> 
> Regards
> 
> Antoine.
> 

Re: [DISCUSS] Format additions for encoding/compression

Posted by Antoine Pitrou <an...@python.org>.
Le 23/01/2020 à 18:16, John Muehlhausen a écrit :
> Perhaps related to this thread, are there any current or proposed tools to
> transform columns for fixed-length data types according to a "shuffle?"
>  For precedent see the implementation of the shuffle filter in hdf5.
> https://support.hdfgroup.org/ftp/HDF5//documentation/doc1.6/TechNotes/shuffling-algorithm-report.pdf
> 
> For example, the column (length 3) would store bytes 00 00 00 00 00 00 00
> 00 00 01 02 03 to represent the three 32-bit numbers 00 00 00 01 00 00 00
> 02 00 00 00 03  (I'm writing big-endian even if that is not actually the
> case).
> 
> Value(1) would return 00 00 00 02 by referring to some metadata flag that
> the column is shuffled, stitching the bytes back together at call time.
> 
> Thus if the column pages were backed by a memory map to something like
> zfs/gzip-9 (my actual use-case), one would expect approx 30% savings in
> underlying disk usage due to better run lengths.
> 
> It would enable a space/time tradeoff that could be useful?  The filesystem
> itself cannot easily do this particular compression transform since it
> benefits from knowing the shape of the data.

For the record, there's a pull request adding this encoding to the
Parquet C++ specification.

Regards

Antoine.

Re: [DISCUSS] Format additions for encoding/compression (Was: [Discuss] Format additions to Arrow for sparse data and data integrity)

Posted by John Muehlhausen <jg...@jgm.org>.
Perhaps related to this thread, are there any current or proposed tools to
transform columns for fixed-length data types according to a "shuffle"?
For precedent, see the implementation of the shuffle filter in HDF5.
https://support.hdfgroup.org/ftp/HDF5//documentation/doc1.6/TechNotes/shuffling-algorithm-report.pdf

For example, the column (length 3) would store bytes 00 00 00 00 00 00 00
00 00 01 02 03 to represent the three 32-bit numbers 00 00 00 01 00 00 00
02 00 00 00 03  (I'm writing big-endian even if that is not actually the
case).

Value(1) would return 00 00 00 02 by referring to some metadata flag
indicating that the column is shuffled, stitching the bytes back together at
call time.

Thus if the column pages were backed by a memory map to something like
zfs/gzip-9 (my actual use-case), one would expect approx 30% savings in
underlying disk usage due to better run lengths.

It would enable a space/time tradeoff that could be useful?  The filesystem
itself cannot easily do this particular compression transform since the
transform benefits from knowing the shape of the data.
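
(A sketch of the transform, with illustrative helper names rather than any
existing Arrow API:)

#include <cstdint>
#include <cstring>
#include <vector>

// Shuffle: write byte 0 of every value, then byte 1, and so on, so that
// similar high-order bytes become adjacent and compress into longer runs.
std::vector<uint8_t> Shuffle32(const std::vector<int32_t>& values) {
  const size_t n = values.size();
  std::vector<uint8_t> out(n * 4);
  const auto* in = reinterpret_cast<const uint8_t*>(values.data());
  for (size_t i = 0; i < n; ++i)
    for (size_t b = 0; b < 4; ++b)
      out[b * n + i] = in[i * 4 + b];  // byte-plane b, element i
  return out;
}

// What Value(i) on a "shuffled" type would do: stitch element i's bytes
// back together at call time.
int32_t UnshuffledValue(const std::vector<uint8_t>& shuffled,
                        size_t n, size_t i) {
  uint8_t bytes[4];
  for (size_t b = 0; b < 4; ++b) bytes[b] = shuffled[b * n + i];
  int32_t v;
  std::memcpy(&v, bytes, sizeof(v));
  return v;
}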

-John

On Sun, Aug 25, 2019 at 10:30 PM Micah Kornfield <em...@gmail.com>
wrote:
>
> Hi Ippokratis,
> Thank you for the feedback, I have some questions based on the links you
> provided.
>
>
> > I think that lightweight encodings (like the FrameOfReference Micah
> > suggests) do make a lot of sense for Arrow. There are a few
implementations
> > of those in commercial systems. One related paper in the literature is
> > http://www.cs.columbia.edu/~orestis/damon15.pdf
>
>
> This paper seems to suggest more complex encodings I was imagining for the
> the first implementation.  Specifically, I proposed using only codes that
> are 2^N bits (8, 16, 32, and 64). Do you think it is is critical to have
> the dense bit-packing in an initial version?
>
> >
> > I would actually also look into some order-preserving dictionary
encodings
> > for strings that also allow vectorized processing (predicates, joins,
..)
> > on encoded data, e.g. see
> >
https://15721.courses.cs.cmu.edu/spring2017/papers/11-compression/p283-binnig.pdf
> >  .
>
> The IPC spec [1] already has some metadata about the ordering of
> dictionaries, but this might not be sufficient.  The paper linked here
> seems to recommend two things:
> 1.  Treating dictionaries as explicit mappings between value and integer
> code today is is implicit because the dictionaries are lists indexed by
> code.  It seems like for forward-compatibility we should add a type enum
to
> the Dictionary Encoding metadata.
> 2.  Adding indexes to the dictionaries.  For this, did you imagine the
> indexes would be transferred or something built up on receiving batches?
>
> Arrow can be used as during shuffles for distributed joins/aggs and being
> > able to operate on encoded data yields benefits (e.g.
> > http://www.vldb.org/pvldb/vol7/p1355-lee.pdf).
>
> The main take-away I got after skimming this paper, as it relates to
> encodings, is that encodings (including dictionary) should be dynamic per
> batch.  The other interesting question it raises with respect to Arrow is
> one of the techniques used is delta-encoding.  I believe delta encoding
> requires linear time access.  The dense representations in Arrow was
> designed to have constant time access to elements. One open question on
how
> far we  want to relax this requirement for encoded columns.  My proposal
> uses a form of RLE that provide O(Log(N)) access).
>
> Cheers,
> Micah
>
> [1] https://github.com/apache/arrow/blob/master/format/Schema.fbs#L285
>
> On Sun, Aug 25, 2019 at 12:03 AM Ippokratis Pandis <ip...@gmail.com>
> wrote:
>
> > I think that lightweight encodings (like the FrameOfReference Micah
> > suggests) do make a lot of sense for Arrow. There are a few
implementations
> > of those in commercial systems. One related paper in the literature is
> > http://www.cs.columbia.edu/~orestis/damon15.pdf
> >
> > I would actually also look into some order-preserving dictionary
encodings
> > for strings that also allow vectorized processing (predicates, joins,
..)
> > on encoded data, e.g. see
> >
https://15721.courses.cs.cmu.edu/spring2017/papers/11-compression/p283-binnig.pdf
> >  .
> >
> > Arrow can be used as during shuffles for distributed joins/aggs and
being
> > able to operate on encoded data yields benefits (e.g.
> > http://www.vldb.org/pvldb/vol7/p1355-lee.pdf).
> >
> > Thanks,
> > -Ippokratis.
> >
> >
> > On Thu, Jul 25, 2019 at 11:06 PM Micah Kornfield <em...@gmail.com>
> > wrote:
> >
> >> >
> >> > It's not just computation libraries, it's any library peeking inside
> >> > Arrow data.  Currently, the Arrow data types are simple, which makes
it
> >> > easy and non-intimidating to build data processing utilities around
> >> > them.  If we start adding sophisticated encodings, we also raise the
> >> > cost of supporting Arrow for third-party libraries.
> >>
> >>
> >> This is another legitimate concern about complexity.
> >>
> >> To try to limit complexity. I simplified the proposal PR [1] to only
have
> >> 1
> >> buffer encoding (FrameOfReferenceIntEncoding) scheme and 1 array
encoding
> >> scheme (RLE) that I think will have the most benefit if exploited
> >> properly.  Compression is removed.
> >>
> >> I'd like to get closure on the proposal one way or another.  I think
now
> >> the question to be answered is if we are willing to introduce the
> >> additional complexity for the performance improvements they can
yield?  Is
> >> there more data that people would like to see that would influence
their
> >> decision?
> >>
> >> Thanks,
> >> Micah
> >>
> >> [1] https://github.com/apache/arrow/pull/4815
> >>
> >> On Mon, Jul 22, 2019 at 8:59 AM Antoine Pitrou <so...@pitrou.net>
> >> wrote:
> >>
> >> > On Mon, 22 Jul 2019 08:40:08 -0700
> >> > Brian Hulette <hu...@gmail.com> wrote:
> >> > > To me, the most important aspect of this proposal is the addition
of
> >> > sparse
> >> > > encodings, and I'm curious if there are any more objections to that
> >> > > specifically. So far I believe the only one is that it will make
> >> > > computation libraries more complicated. This is absolutely true,
but I
> >> > > think it's worth that cost.
> >> >
> >> > It's not just computation libraries, it's any library peeking inside
> >> > Arrow data.  Currently, the Arrow data types are simple, which makes
it
> >> > easy and non-intimidating to build data processing utilities around
> >> > them.  If we start adding sophisticated encodings, we also raise the
> >> > cost of supporting Arrow for third-party libraries.
> >> >
> >> > Regards
> >> >
> >> > Antoine.
> >> >
> >> >
> >> >
> >>
> >
> >
> > --
> > -Ippokratis.
> >
> >

Re: [DISCUSS] Format additions for encoding/compression (Was: [Discuss] Format additions to Arrow for sparse data and data integrity)

Posted by Micah Kornfield <em...@gmail.com>.
Hi Ippokratis,
Thank you for the feedback, I have some questions based on the links you
provided.


> I think that lightweight encodings (like the FrameOfReference Micah
> suggests) do make a lot of sense for Arrow. There are a few implementations
> of those in commercial systems. One related paper in the literature is
> http://www.cs.columbia.edu/~orestis/damon15.pdf


This paper seems to suggest more complex encodings than I was imagining for
the first implementation.  Specifically, I proposed using only codes that
are 2^N bits wide (8, 16, 32, and 64). Do you think it is critical to have
the dense bit-packing in an initial version?

>
> I would actually also look into some order-preserving dictionary encodings
> for strings that also allow vectorized processing (predicates, joins, ..)
> on encoded data, e.g. see
> https://15721.courses.cs.cmu.edu/spring2017/papers/11-compression/p283-binnig.pdf
>  .

The IPC spec [1] already has some metadata about the ordering of
dictionaries, but this might not be sufficient.  The paper linked here
seems to recommend two things:
1.  Treating dictionaries as explicit mappings between value and integer
code (today this is implicit, because the dictionaries are lists indexed by
code).  It seems like for forward-compatibility we should add a type enum to
the DictionaryEncoding metadata (a rough sketch of such an explicit,
order-preserving mapping follows after this list).
2.  Adding indexes to the dictionaries.  For this, did you imagine the
indexes would be transferred or something built up on receiving batches?
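
To make point 1 a bit more concrete, here is a hypothetical C++ sketch of an
explicit, order-preserving dictionary (nothing like this exists in the spec or
the PR today; the type and method names are made up):

#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical sketch: an explicit, order-preserving string dictionary.
// Codes are assigned in sorted order of the values, so a range predicate on
// strings can be rewritten as a range predicate on int32 codes and evaluated
// directly against the dictionary-encoded indices.
struct OrderPreservingDictionary {
  std::vector<std::string> sorted_values;  // code i maps to sorted_values[i]

  // Smallest code whose value compares >= s; usable as a predicate bound.
  int32_t LowerBoundCode(const std::string& s) const {
    auto it = std::lower_bound(sorted_values.begin(), sorted_values.end(), s);
    return static_cast<int32_t>(it - sorted_values.begin());
  }
};

// Example: name >= "m" becomes code >= dict.LowerBoundCode("m"), with no
// string decoding during the scan.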

> Arrow can be used during shuffles for distributed joins/aggs and being
> able to operate on encoded data yields benefits (e.g.
> http://www.vldb.org/pvldb/vol7/p1355-lee.pdf).

The main take-away I got after skimming this paper, as it relates to
encodings, is that encodings (including dictionary) should be dynamic per
batch.  The other interesting question it raises with respect to Arrow is
that one of the techniques used is delta encoding.  I believe delta encoding
requires linear-time access.  The dense representations in Arrow were
designed to have constant-time access to elements. One open question is how
far we want to relax this requirement for encoded columns.  My proposal
uses a form of RLE that provides O(log(N)) access.
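
To illustrate the access pattern, here is a minimal C++ sketch (the names are
illustrative, not the PR's flatbuffers layout) of a run-end-encoded array in
which a point lookup is a binary search over the run ends:

#include <algorithm>
#include <cstdint>
#include <vector>

// Run-end-encoded int32 column: values[i] repeats up to logical position
// run_ends[i] (exclusive).  Because run_ends is sorted, a random lookup is a
// binary search, i.e. O(log(number of runs)).
struct RunEndEncodedInt32 {
  std::vector<int64_t> run_ends;  // cumulative logical end of each run
  std::vector<int32_t> values;    // one value per run

  // Assumes 0 <= logical_index < run_ends.back().
  int32_t At(int64_t logical_index) const {
    auto it = std::upper_bound(run_ends.begin(), run_ends.end(), logical_index);
    return values[it - run_ends.begin()];
  }
};

// Example: the logical column {7, 7, 7, 9, 9, 4} is encoded as
// run_ends = {3, 5, 6}, values = {7, 9, 4}; At(4) returns 9.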

Cheers,
Micah

[1] https://github.com/apache/arrow/blob/master/format/Schema.fbs#L285

On Sun, Aug 25, 2019 at 12:03 AM Ippokratis Pandis <ip...@gmail.com>
wrote:

> I think that lightweight encodings (like the FrameOfReference Micah
> suggests) do make a lot of sense for Arrow. There are a few implementations
> of those in commercial systems. One related paper in the literature is
> http://www.cs.columbia.edu/~orestis/damon15.pdf
>
> I would actually also look into some order-preserving dictionary encodings
> for strings that also allow vectorized processing (predicates, joins, ..)
> on encoded data, e.g. see
> https://15721.courses.cs.cmu.edu/spring2017/papers/11-compression/p283-binnig.pdf
>  .
>
> Arrow can be used during shuffles for distributed joins/aggs and being
> able to operate on encoded data yields benefits (e.g.
> http://www.vldb.org/pvldb/vol7/p1355-lee.pdf).
>
> Thanks,
> -Ippokratis.
>
>
> On Thu, Jul 25, 2019 at 11:06 PM Micah Kornfield <em...@gmail.com>
> wrote:
>
>> >
>> > It's not just computation libraries, it's any library peeking inside
>> > Arrow data.  Currently, the Arrow data types are simple, which makes it
>> > easy and non-intimidating to build data processing utilities around
>> > them.  If we start adding sophisticated encodings, we also raise the
>> > cost of supporting Arrow for third-party libraries.
>>
>>
>> This is another legitimate concern about complexity.
>>
>> To try to limit complexity. I simplified the proposal PR [1] to only have
>> 1
>> buffer encoding (FrameOfReferenceIntEncoding) scheme and 1 array encoding
>> scheme (RLE) that I think will have the most benefit if exploited
>> properly.  Compression is removed.
>>
>> I'd like to get closure on the proposal one way or another.  I think now
>> the question to be answered is if we are willing to introduce the
>> additional complexity for the performance improvements they can yield?  Is
>> there more data that people would like to see that would influence their
>> decision?
>>
>> Thanks,
>> Micah
>>
>> [1] https://github.com/apache/arrow/pull/4815
>>
>> On Mon, Jul 22, 2019 at 8:59 AM Antoine Pitrou <so...@pitrou.net>
>> wrote:
>>
>> > On Mon, 22 Jul 2019 08:40:08 -0700
>> > Brian Hulette <hu...@gmail.com> wrote:
>> > > To me, the most important aspect of this proposal is the addition of
>> > sparse
>> > > encodings, and I'm curious if there are any more objections to that
>> > > specifically. So far I believe the only one is that it will make
>> > > computation libraries more complicated. This is absolutely true, but I
>> > > think it's worth that cost.
>> >
>> > It's not just computation libraries, it's any library peeking inside
>> > Arrow data.  Currently, the Arrow data types are simple, which makes it
>> > easy and non-intimidating to build data processing utilities around
>> > them.  If we start adding sophisticated encodings, we also raise the
>> > cost of supporting Arrow for third-party libraries.
>> >
>> > Regards
>> >
>> > Antoine.
>> >
>> >
>> >
>>
>
>
> --
> -Ippokratis.
>
>

Re: [DISCUSS] Format additions for encoding/compression (Was: [Discuss] Format additions to Arrow for sparse data and data integrity)

Posted by Ippokratis Pandis <ip...@gmail.com>.
I think that lightweight encodings (like the FrameOfReference Micah
suggests) do make a lot of sense for Arrow. There are a few implementations
of those in commercial systems. One related paper in the literature is
http://www.cs.columbia.edu/~orestis/damon15.pdf

I would actually also look into some order-preserving dictionary encodings
for strings that also allow vectorized processing (predicates, joins, ..)
on encoded data, e.g. see
https://15721.courses.cs.cmu.edu/spring2017/papers/11-compression/p283-binnig.pdf
 .

Arrow can be used during shuffles for distributed joins/aggs and being
able to operate on encoded data yields benefits (e.g.
http://www.vldb.org/pvldb/vol7/p1355-lee.pdf).

Thanks,
-Ippokratis.


On Thu, Jul 25, 2019 at 11:06 PM Micah Kornfield <em...@gmail.com>
wrote:

> >
> > It's not just computation libraries, it's any library peeking inside
> > Arrow data.  Currently, the Arrow data types are simple, which makes it
> > easy and non-intimidating to build data processing utilities around
> > them.  If we start adding sophisticated encodings, we also raise the
> > cost of supporting Arrow for third-party libraries.
>
>
> This is another legitimate concern about complexity.
>
> To try to limit complexity. I simplified the proposal PR [1] to only have 1
> buffer encoding (FrameOfReferenceIntEncoding) scheme and 1 array encoding
> scheme (RLE) that I think will have the most benefit if exploited
> properly.  Compression is removed.
>
> I'd like to get closure on the proposal one way or another.  I think now
> the question to be answered is if we are willing to introduce the
> additional complexity for the performance improvements they can yield?  Is
> there more data that people would like to see that would influence their
> decision?
>
> Thanks,
> Micah
>
> [1] https://github.com/apache/arrow/pull/4815
>
> On Mon, Jul 22, 2019 at 8:59 AM Antoine Pitrou <so...@pitrou.net>
> wrote:
>
> > On Mon, 22 Jul 2019 08:40:08 -0700
> > Brian Hulette <hu...@gmail.com> wrote:
> > > To me, the most important aspect of this proposal is the addition of
> > sparse
> > > encodings, and I'm curious if there are any more objections to that
> > > specifically. So far I believe the only one is that it will make
> > > computation libraries more complicated. This is absolutely true, but I
> > > think it's worth that cost.
> >
> > It's not just computation libraries, it's any library peeking inside
> > Arrow data.  Currently, the Arrow data types are simple, which makes it
> > easy and non-intimidating to build data processing utilities around
> > them.  If we start adding sophisticated encodings, we also raise the
> > cost of supporting Arrow for third-party libraries.
> >
> > Regards
> >
> > Antoine.
> >
> >
> >
>


-- 
-Ippokratis.

Re: [DISCUSS] Format additions for encoding/compression (Was: [Discuss] Format additions to Arrow for sparse data and data integrity)

Posted by Micah Kornfield <em...@gmail.com>.
>
> It's not just computation libraries, it's any library peeking inside
> Arrow data.  Currently, the Arrow data types are simple, which makes it
> easy and non-intimidating to build data processing utilities around
> them.  If we start adding sophisticated encodings, we also raise the
> cost of supporting Arrow for third-party libraries.


This is another legitimate concern about complexity.

To try to limit complexity, I simplified the proposal PR [1] to only have one
buffer encoding scheme (FrameOfReferenceIntEncoding) and one array encoding
scheme (RLE), the two that I think will have the most benefit if exploited
properly.  Compression is removed.
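
For anyone who has not looked at the PR, a rough C++ sketch of the
frame-of-reference idea (the struct and the fixed uint16 code width here are
illustrative only; the PR expresses this at the flatbuffers metadata level):

#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Frame-of-reference encoding: store one "reference" value (here the minimum)
// plus narrow offsets instead of full-width int64 values.
struct FrameOfReferenceUInt16 {
  int64_t reference = 0;
  std::vector<uint16_t> offsets;
};

inline FrameOfReferenceUInt16 Encode(const std::vector<int64_t>& values) {
  FrameOfReferenceUInt16 out;
  if (values.empty()) return out;
  out.reference = *std::min_element(values.begin(), values.end());
  out.offsets.reserve(values.size());
  for (int64_t v : values) {
    int64_t delta = v - out.reference;
    assert(delta <= UINT16_MAX);  // caller must pick a wide enough code width
    out.offsets.push_back(static_cast<uint16_t>(delta));
  }
  return out;
}

// O(1) element access is preserved: value = reference + offset.
inline int64_t Decode(const FrameOfReferenceUInt16& buf, std::size_t i) {
  return buf.reference + buf.offsets[i];
}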

I'd like to get closure on the proposal one way or another.  I think the
question to be answered now is whether we are willing to introduce the
additional complexity for the performance improvements these features can
yield.  Is there more data that people would like to see that would influence
their decision?

Thanks,
Micah

[1] https://github.com/apache/arrow/pull/4815

On Mon, Jul 22, 2019 at 8:59 AM Antoine Pitrou <so...@pitrou.net> wrote:

> On Mon, 22 Jul 2019 08:40:08 -0700
> Brian Hulette <hu...@gmail.com> wrote:
> > To me, the most important aspect of this proposal is the addition of
> sparse
> > encodings, and I'm curious if there are any more objections to that
> > specifically. So far I believe the only one is that it will make
> > computation libraries more complicated. This is absolutely true, but I
> > think it's worth that cost.
>
> It's not just computation libraries, it's any library peeking inside
> Arrow data.  Currently, the Arrow data types are simple, which makes it
> easy and non-intimidating to build data processing utilities around
> them.  If we start adding sophisticated encodings, we also raise the
> cost of supporting Arrow for third-party libraries.
>
> Regards
>
> Antoine.
>
>
>

Re: [DISCUSS] Format additions for encoding/compression (Was: [Discuss] Format additions to Arrow for sparse data and data integrity)

Posted by Antoine Pitrou <so...@pitrou.net>.
On Mon, 22 Jul 2019 08:40:08 -0700
Brian Hulette <hu...@gmail.com> wrote:
> To me, the most important aspect of this proposal is the addition of sparse
> encodings, and I'm curious if there are any more objections to that
> specifically. So far I believe the only one is that it will make
> computation libraries more complicated. This is absolutely true, but I
> think it's worth that cost.

It's not just computation libraries, it's any library peeking inside
Arrow data.  Currently, the Arrow data types are simple, which makes it
easy and non-intimidating to build data processing utilities around
them.  If we start adding sophisticated encodings, we also raise the
cost of supporting Arrow for third-party libraries.

Regards

Antoine.



Re: [DISCUSS] Format additions for encoding/compression (Was: [Discuss] Format additions to Arrow for sparse data and data integrity)

Posted by Brian Hulette <hu...@gmail.com>.
To me, the most important aspect of this proposal is the addition of sparse
encodings, and I'm curious if there are any more objections to that
specifically. So far I believe the only one is that it will make
computation libraries more complicated. This is absolutely true, but I
think it's worth that cost.

It's been suggested on this list and elsewhere [1] that sparse encodings
that can be operated on without fully decompressing should be added to the
Arrow format. The longer we continue to develop computation libraries
without considering those schemes, the harder it will be to add them.

[1]
https://dbmsmusings.blogspot.com/2017/10/apache-arrow-vs-parquet-and-orc-do-we.html


On Sat, Jul 13, 2019 at 9:35 AM Wes McKinney <we...@gmail.com> wrote:

> On Sat, Jul 13, 2019 at 11:23 AM Antoine Pitrou <so...@pitrou.net>
> wrote:
> >
> > On Fri, 12 Jul 2019 20:37:15 -0700
> > Micah Kornfield <em...@gmail.com> wrote:
> > >
> > > If the latter, I wonder why Parquet cannot simply be used instead of
> > > > reinventing something similar but different.
> > >
> > > This is a reasonable point.  However, there is a continuum here between
> > > file size and read and write times.  Parquet will likely always be the
> > > smallest, with the largest times to convert to and from Arrow.  An
> > > uncompressed Feather/Arrow file will likely always take the most space
> > > but will have much faster conversion times.
> >
> > I'm curious whether the Parquet conversion times are inherent to the
> > Parquet format or due to inefficiencies in the implementation.
> >
>
> Parquet is fundamentally more complex to decode. Consider several
> layers of logic that must happen for values to end up in the right
> place
>
> * Data pages are usually compressed, and a column consists of many
> data pages each having a Thrift header that must be deserialized
> * Values are usually dictionary-encoded, dictionary indices are
> encoded using hybrid bit-packed / RLE scheme
> * Null/not-null is encoded in definition levels
> * Only non-null values are stored, so when decoding to Arrow, values
> have to be "moved into place"
>
> The current C++ implementation could certainly be made faster. One
> consideration with Parquet is that the files are much smaller, so when
> you are reading them over the network the effective end-to-end time
> including IO and deserialization will frequently win.
>
> > Regards
> >
> > Antoine.
> >
> >
>

Re: [DISCUSS] Format additions for encoding/compression (Was: [Discuss] Format additions to Arrow for sparse data and data integrity)

Posted by Wes McKinney <we...@gmail.com>.
On Sat, Jul 13, 2019 at 11:23 AM Antoine Pitrou <so...@pitrou.net> wrote:
>
> On Fri, 12 Jul 2019 20:37:15 -0700
> Micah Kornfield <em...@gmail.com> wrote:
> >
> > If the latter, I wonder why Parquet cannot simply be used instead of
> > > reinventing something similar but different.
> >
> > This is a reasonable point.  However, there is a continuum here between file
> > size and read and write times.  Parquet will likely always be the smallest,
> > with the largest times to convert to and from Arrow.  An uncompressed
> > Feather/Arrow file will likely always take the most space but will have much
> > faster conversion times.
>
> I'm curious whether the Parquet conversion times are inherent to the
> Parquet format or due to inefficiencies in the implementation.
>

Parquet is fundamentally more complex to decode. Consider several
layers of logic that must happen for values to end up in the right
place:

* Data pages are usually compressed, and a column consists of many
data pages each having a Thrift header that must be deserialized
* Values are usually dictionary-encoded, dictionary indices are
encoded using hybrid bit-packed / RLE scheme
* Null/not-null is encoded in definition levels
* Only non-null values are stored, so when decoding to Arrow, values
have to be "moved into place"

The current C++ implementation could certainly be made faster. One
consideration with Parquet is that the files are much smaller, so when
you are reading them over the network the effective end-to-end time
including IO and deserialization will frequently win.
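
To illustrate the last step, here is a rough sketch of "moving values into
place" for a flat nullable column (max definition level 1); a real decoder
also has to deal with pages, dictionary indices, and the bit-packed/RLE hybrid:

#include <cstddef>
#include <cstdint>
#include <vector>

// Parquet stores only the non-null values, so decoding to Arrow means
// scattering them into their logical slots and building a validity bitmap.
void MoveIntoPlace(const std::vector<int16_t>& def_levels,
                   const std::vector<int32_t>& non_null_values,
                   std::vector<int32_t>* out_values,
                   std::vector<uint8_t>* out_validity_bitmap) {
  out_values->assign(def_levels.size(), 0);
  out_validity_bitmap->assign((def_levels.size() + 7) / 8, 0);
  std::size_t value_index = 0;
  for (std::size_t i = 0; i < def_levels.size(); ++i) {
    if (def_levels[i] == 1) {  // definition level 1 == value present
      (*out_values)[i] = non_null_values[value_index++];
      (*out_validity_bitmap)[i / 8] |= static_cast<uint8_t>(1u << (i % 8));
    }  // null slots keep the zero fill and a cleared validity bit
  }
}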

> Regards
>
> Antoine.
>
>

Re: [DISCUSS] Format additions for encoding/compression (Was: [Discuss] Format additions to Arrow for sparse data and data integrity)

Posted by Antoine Pitrou <so...@pitrou.net>.
On Fri, 12 Jul 2019 20:37:15 -0700
Micah Kornfield <em...@gmail.com> wrote:
> 
> If the latter, I wonder why Parquet cannot simply be used instead of
> > reinventing something similar but different.  
> 
> This is a reasonable point.  However, there is a continuum here between file
> size and read and write times.  Parquet will likely always be the smallest,
> with the largest times to convert to and from Arrow.  An uncompressed
> Feather/Arrow file will likely always take the most space but will have much
> faster conversion times.

I'm curious whether the Parquet conversion times are inherent to the
Parquet format or due to inefficiencies in the implementation.

Regards

Antoine.



Re: [DISCUSS] Format additions for encoding/compression (Was: [Discuss] Format additions to Arrow for sparse data and data integrity)

Posted by Micah Kornfield <em...@gmail.com>.
Hi Antoine,
I think Liya Fan raised some good points in his reply but I'd like to
answer your questions directly.


> So the question is whether this really needs to be in the in-memory
> format, i.e. is it desired to operate directly on this compressed
> format, or is it solely for transport?

I tried to separate the two concepts into Encodings (things Arrow can
operate directly on) and Compression (solely for transport).  While there
is some overlap I think the two features can be considered separately.

For each encoding there is additional implementation complexity to properly
exploit it.  However, the benefit for some workloads can be large [1][2].

If the latter, I wonder why Parquet cannot simply be used instead of
> reinventing something similar but different.


This is a reasonable point.  However, there is a continuum here between file
size and read and write times.  Parquet will likely always be the smallest,
with the largest times to convert to and from Arrow.  An uncompressed
Feather/Arrow file will likely always take the most space but will have much
faster conversion times.  The question is whether a buffer-level or some
other sub-file-level compression scheme provides enough value compared
with compressing the entire Feather file.  This is somewhat hand-wavy,
but if we feel we might want to investigate this further I can write some
benchmarks to quantify the differences.
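
To make "buffer level" concrete, here is a small sketch using the zstd C API
that compresses a single Arrow buffer independently (this only illustrates the
granularity being discussed, not the framing proposed in the PR):

#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>
#include <zstd.h>  // assumes linking against libzstd

// Compress one buffer on its own, as opposed to compressing the whole
// Feather/IPC stream.  A benchmark would compare total size and
// encode/decode time of the two approaches on real record batches.
std::vector<uint8_t> CompressBuffer(const uint8_t* data, std::size_t size,
                                    int level = 3) {
  std::vector<uint8_t> out(ZSTD_compressBound(size));
  std::size_t written = ZSTD_compress(out.data(), out.size(), data, size, level);
  if (ZSTD_isError(written)) {
    throw std::runtime_error(ZSTD_getErrorName(written));
  }
  out.resize(written);
  return out;
}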

Cheers,
Micah

[1] http://db.csail.mit.edu/projects/cstore/abadicidr07.pdf
[2] http://db.csail.mit.edu/projects/cstore/abadisigmod06.pdf

On Fri, Jul 12, 2019 at 2:24 AM Antoine Pitrou <an...@python.org> wrote:

>
> Le 12/07/2019 à 10:08, Micah Kornfield a écrit :
> > OK, I've created a separate thread for data integrity/digests [1], and
> > retitled this thread to continue the discussion on compression and
> > encodings.  As a reminder the PR for the format additions [2] suggested a
> > new SparseRecordBatch that would allow for the following features:
> > 1.  Different data encodings at the Array (e.g. RLE) and Buffer levels
> > (e.g. narrower bit-width integers)
> > 2.  Compression at the buffer level
> > 3.  Eliding all metadata and data for empty columns.
>
> So the question is whether this really needs to be in the in-memory
> format, i.e. is it desired to operate directly on this compressed
> format, or is it solely for transport?
>
> If the latter, I wonder why Parquet cannot simply be used instead of
> reinventing something similar but different.
>
> Regards
>
> Antoine.
>

Re: [DISCUSS] Format additions for encoding/compression (Was: [Discuss] Format additions to Arrow for sparse data and data integrity)

Posted by Fan Liya <li...@gmail.com>.
@Antoine Pitrou,

Good question. I think the answer depends on the concrete encoding scheme.

For some encoding schemes, it is not a good idea to use them for in-memory
data compression.
For others, it is beneficial to operate directly on the compressed data.

For example, it is beneficial to directly work on RLE data, with better
locality and fewer cache misses.
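
A small C++ sketch of that point: a sum over a run-length-encoded column
touches one entry per run rather than one per logical row (run lengths are
used here for simplicity; the proposal encodes run ends):

#include <cstddef>
#include <cstdint>
#include <vector>

// Sum a run-length-encoded int64 column without expanding it.
int64_t SumRunLengthEncoded(const std::vector<int64_t>& run_lengths,
                            const std::vector<int64_t>& values) {
  int64_t sum = 0;
  for (std::size_t i = 0; i < run_lengths.size(); ++i) {
    sum += run_lengths[i] * values[i];
  }
  return sum;
}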

Best,
Liya Fan

On Fri, Jul 12, 2019 at 5:24 PM Antoine Pitrou <an...@python.org> wrote:

>
> Le 12/07/2019 à 10:08, Micah Kornfield a écrit :
> > OK, I've created a separate thread for data integrity/digests [1], and
> > retitled this thread to continue the discussion on compression and
> > encodings.  As a reminder the PR for the format additions [2] suggested a
> > new SparseRecordBatch that would allow for the following features:
> > 1.  Different data encodings at the Array (e.g. RLE) and Buffer levels
> > (e.g. narrower bit-width integers)
> > 2.  Compression at the buffer level
> > 3.  Eliding all metadata and data for empty columns.
>
> So the question is whether this really needs to be in the in-memory
> format, i.e. is it desired to operate directly on this compressed
> format, or is it solely for transport?
>
> If the latter, I wonder why Parquet cannot simply be used instead of
> reinventing something similar but different.
>
> Regards
>
> Antoine.
>

Re: [DISCUSS] Format additions for encoding/compression (Was: [Discuss] Format additions to Arrow for sparse data and data integrity)

Posted by Antoine Pitrou <an...@python.org>.
Le 12/07/2019 à 10:08, Micah Kornfield a écrit :
> OK, I've created a separate thread for data integrity/digests [1], and
> retitled this thread to continue the discussion on compression and
> encodings.  As a reminder the PR for the format additions [2] suggested a
> new SparseRecordBatch that would allow for the following features:
> 1.  Different data encodings at the Array (e.g. RLE) and Buffer levels
> (e.g. narrower bit-width integers)
> 2.  Compression at the buffer level
> 3.  Eliding all metadata and data for empty columns.

So the question is whether this really needs to be in the in-memory
format, i.e. is it desired to operate directly on this compressed
format, or is it solely for transport?

If the latter, I wonder why Parquet cannot simply be used instead of
reinventing something similar but different.

Regards

Antoine.

[DISCUSS] Format additions for encoding/compression (Was: [Discuss] Format additions to Arrow for sparse data and data integrity)

Posted by Micah Kornfield <em...@gmail.com>.
OK, I've created a separate thread for data integrity/digests [1], and
retitled this thread to continue the discussion on compression and
encodings.  As a reminder the PR for the format additions [2] suggested a
new SparseRecordBatch that would allow for the following features:
1.  Different data encodings at the Array (e.g. RLE) and Buffer levels
(e.g. narrower bit-width integers)
2.  Compression at the buffer level
3.  Eliding all metadata and data for empty columns.

To recap my understanding of the highlights of the discussion so far:

Encodings:
There are some concerns over the efficiency of some of the encodings in
different scenarios.
 * Eliding null values makes many algorithms less efficient.
 * Joins might become harder with these encodings.
 * The additional code complexity also came up on the Arrow sync call.

Compression:
- Buffer level compression might be too small a granularity for data
compression.
- General purpose compression at this level might not add much value, so it
might be better to keep it at the transport level.

Alternative designs:
* Put buffer level compression in specific transports (e.g. flight)
* Try to use the extension mechanism to support different encodings

Thanks,
Micah


[1]
https://lists.apache.org/thread.html/23c95508dcba432caa73253062520157346fad82fce9943ba6f681dd@%3Cdev.arrow.apache.org%3E
[2] https://github.com/apache/arrow/pull/4815

On Fri, Jul 12, 2019 at 12:15 AM Antoine Pitrou <an...@python.org> wrote:

>
> I think it would be worthwhile to split the discussion into two separate
> threads.  One thread for compression & encodings (which are related or
> even the same topic), one thread for data integrity.
>
> Regards
>
> Antoine.
>
>
> Le 08/07/2019 à 07:22, Micah Kornfield a écrit :
> >
> > - Compression:
> >    *  Use parquet for random access to data elements.
> >        -  This is one option, the main downside I see to this is
> generally
> > higher encoding/decoding costs.  Per below, I think it is reasonable to
> > wait until we have more data to add compression into the spec.
> >    *  Have the transport layer do buffer specific compression:
> >       - I'm not a fan of this approach.  One nice thing about the
> current
> > communication protocols is once you strip away "framing" data all the
> byte
> > streams are equivalent.  I think the simplicity that follows in code from
> > this is a nice feature.
> >
> >
> > *Computational efficiency of array encodings:*
> >
> >> How does "more efficient computation" play out for operations such as
> >> hash or join?
> >
> > You would still need to likely materialize rows in most cases.  In some
> > "join" cases the sparse encoding of the null bitmap buffer could be a win
> > because it serves as an index to non-null values.
> >
> > I think I should clarify that these encodings aren't always a win
> depending
> > on workload/data shape, but can have a large impact when used
> appropriately
> > (especially at the "Expression evaluation stage").  Also, any wins don't
> > come for free; exploiting encodings properly will add some level of
> > complication to existing computation code.
> >
> > On a packed sparse array representation:
> >
> >> This would be fine for simple SIMD aggregations like count/avg/mean, but
> >> compacting null slots complicates more advanced parallel routines that
> >> execute independently and rely on indices aligning with an element's
> >> logical position.
> >
> >
> > The main use-case I had in mind here was for scenarios like loading data
> > directly from Parquet (i.e. nulls are already elided), doing some computation
> and
> > then potentially translating to a dense representation.  Similarly it
> > appears others have had advantages in some contexts for saving time at
> > shuffle [1].  In many cases there is an overlap with RLE, so I'd be open
> to
> > removing this from the proposal.
> >
> >
> > *On buffer encodings:*
> > To paraphrase, the main concern here seems to be it is similar to
> metadata
> > that was already removed [2].
> >
> > A few points on this:
> > 1.  There was a typo in the original e-mail on sparse-integer set
> encoding
> > where it said "all" values are either null or not null.  This should have
> > read "most" values.  The elision of buffers is a separate feature.
> > 2.  I believe these are different from the previous metadata because this
> > isn't repetitive information. It provides new information about the
> > contents of buffers not available anywhere else.
> > 3.  The proposal is to create a new message type for this feature so
> it
> > wouldn't be bringing back the old code and hopefully would have minimal
> > impact on already existing IPC code.
> >
> >
> > *On Compression:*
> > So far my take is the consensus is that this can probably be applied at
> the
> > transport level without being in the spec directly.  There might be value
> > in more specific types of compression at the buffer level, but we should
> > benchmark them first.
> >
> > *Data Integrity/Digest:*
> >
> >> one question is whether this occurs at the table level, column level,
> >> sequential array level, etc.
> >
> > This is a good question, it seemed like the batch level was easiest and
> > that is why I proposed it, but I'd be open to other options.  One nice
> > thing about the batch level is that it works for all other message types
> > out of the box (i.e. we can ensure the schema has been transmitted
> > faithfully).
> >
> > Cheers,
> > Micah
> >
> > [1] https://issues.apache.org/jira/browse/ARROW-5821
> > [2] https://github.com/apache/arrow/pull/1297/files
> > [3] https://jira.apache.org/jira/browse/ARROW-300
> >
> >
> > On Sat, Jul 6, 2019 at 11:17 AM Paul Taylor <pt...@gmail.com>
> > wrote:
> >
> >> Hi Micah,
> >>
> >> Similar to Jacques I'm not disagreeing, but wondering if they belong in
> >> Arrow vs. can be done externally. I'm mostly interested in changes that
> >> might impact SIMD processing, considering Arrow's already made conscious
> >> design decisions to trade memory for speed. Apologies in advance if I've
> >> misunderstood any of the proposals.
> >>
> >>> a. Add a run-length encoding scheme to efficiently represent repeated
> >>> values (the actual scheme encodes run ends instead of length to
> preserve
> >>> sub-linear random access).
> >> Couldn't one do RLE at the buffer level via a custom
> >> FixedSizeBinary/Binary/Utf8 encoding? Perhaps as a new ExtensionType?
> >>
> >>> b. Add a “packed” sparse representation (null values don’t take up
> >>> space in value buffers)
> >> This would be fine for simple SIMD aggregations like count/avg/mean, but
> >> compacting null slots complicates more advanced parallel routines that
> >> execute independently and rely on indices aligning with an element's
> >> logical position.
> >>
> >> It sounds like here the logical position depends on knowing the number
> >> of nulls up to that point (via something like sequentially iterating
> >> both data and validity buffers). An efficient parallel routine would
> >> likely need to scan beforehand to inflate the packed representation,
> >> where today it can simply slice/mmap the data buffer directly.
> >>
> >>> a. Add frame of reference integer encoding [7] (this allows for lower
> >>> bit-width encoding of integer types by subtracting a
> >>> “reference” value from all values in the buffer).
> >> I agree this is useful, but couldn't it also live in userland/an
> >> ExtensionType?
> >>
> >>> b. Add a sparse integer set encoding.  This encoding allows more
> >>> efficient encoding of validity bit-masks for cases when all values are
> >>> either null or not null.
> >> If this is in reference to the discussion at link #4 [1], it sounds
> >> similar to the BufferLayout metadata that used to exist but was removed
> >> a while back [2]. Knowing the buffer layouts allows an implementation to
> >> generically elide any buffer at will, but would probably be a lot to
> >> bring back in. I can't say whether adding a different set of metadata
> >> would raise the same concerns/issues Jacques mentioned in the JIRA
> >> thread in [2].
> >>
> >>> Data compression.  Similar to encodings but compression is solely for
> >>> reduction of data at rest/on the wire.  The proposal is to allow
> >>> compression of individual buffers. Right now zstd is proposed, but I
> >> don’t
> >>> feel strongly on the specific technologies here.
> >> What's the goal for this? Random element access into compressed
> >> in-memory columns, or compression at I/O boundaries?
> >>
> >> * If the former, is Parquet a better alternative here? Again, I'm
> >> cautious about the impact to parallel routines. CPU speeds are
> >> plateauing while memory and tx/rx keep growing. Compressed element
> >> access seems to be on the CPU side of that equation (meanwhile parallel
> >> deflate already exists, and I remember seeing research into parallel
> >> inflate).
> >>
> >> * If the latter, could we do a comparison of Arrow dictionary-encoding +
> >> different compression formats, vs. building them into the spec? I know
> >> content-aware compression yields significant size reductions, but I
> >> wonder if the maintenance burden on Arrow contributors is worth the cost
> >> vs. a simpler dictionary-encoding + streaming gzip.
> >>
> >>> Data Integrity.  While the arrow file format isn’t meant for archiving
> >>> data, I think it is important to allow for optional native data
> integrity
> >>> checks in the format.  To this end, I proposed a new “Digest” message
> >> type
> >>> that can be added after other messages to record a digest/hash of the
> >>> preceding data. I suggested xxhash, but I don’t have a strong opinion
> >> here,
> >>> as long as there is some minimal support that can potentially be
> expanded
> >>> later.
> >> :thumbs up:
> >>
> >>
> >> Best,
> >> Paul
> >>
> >>
> >> 1.
> >>
> >>
> https://lists.apache.org/thread.html/5e09557274f9018efee770ad3712122d874447331f52d27169f99fe0@%3Cdev.arrow.apache.org%3E
> >>
> >> 2.
> >>
> >>
> https://issues.apache.org/jira/browse/ARROW-1693?focusedCommentId=16236902&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16236902
> >>
> >> On 7/5/19 11:53 AM, Micah Kornfield wrote:
> >>> Hi Arrow-dev,
> >>>
> >>> I’d like to make a straw-man proposal to cover some features that I
> think
> >>> would be useful to Arrow, and that I would like to make a
> >> proof-of-concept
> >>> implementation for in Java and C++.  In particular, the proposal covers
> >>> allowing for smaller data sizes via compression and encoding [1][2][8],
> >>> data integrity [3] and avoiding unnecessary data transfer [4][5].
> >>>
> >>> I’ve put together a PR  [6] that has proposed changes to the flatbuffer
> >>> metadata to support the new features.  The PR introduces:
> >>>
> >>>     -
> >>>
> >>>     A new “SparseRecordBatch” that can support one of multiple possible
> >>>     encodings (both dense and sparse), compression and column elision.
> >>>     -
> >>>
> >>>     A “Digest” message type to support optional data integrity.
> >>>
> >>>
> >>> Going into more details on the specific features in the PR:
> >>>
> >>>     1.
> >>>
> >>>     Sparse encodings for arrays and buffers.  The guiding principles
> >> behind
> >>>     the suggested encodings are to support encodings that can be
> >> exploited by
> >>>     compute engines for more efficient computation (I don’t think
> >> parquet style
> >>>     bit-packing belongs in Arrow).  While the encodings don’t maintain
> >> O(1)
> >>>     data element access, they support sublinear, O(log(N)), element
> >> access. The
> >>>     suggested encodings are:
> >>>     1.
> >>>
> >>>        Array encodings:
> >>>        1.
> >>>
> >>>           Add a run-length encoding scheme to efficiently represent
> >> repeated
> >>>           values (the actual scheme encodes run ends instead of length
> >>> to preserve
> >>>           sub-linear random access).
> >>>           2.
> >>>
> >>>           Add a “packed” sparse representation (null values don’t take
> up
> >>>           space in value buffers)
> >>>           2.
> >>>
> >>>        Buffer encodings:
> >>>        1.
> >>>
> >>>           Add frame of reference integer encoding [7] (this allows for
> >> lower
> >>>           bit-width encoding of integer types by subtracting a
> >>> “reference” value from
> >>>           all values in the buffer).
> >>>           2.
> >>>
> >>>           Add a sparse integer set encoding.  This encoding allows more
> >>>           efficient encoding of validity bit-masks for cases when all
> >> values are
> >>>           either null or not null.
> >>>           2.
> >>>
> >>>     Data compression.  Similar to encodings but compression is solely
> for
> >>>     reduction of data at rest/on the wire.  The proposal is to allow
> >>>     compression of individual buffers. Right now zstd is proposed, but
> I
> >> don’t
> >>>     feel strongly on the specific technologies here.
> >>>     3.
> >>>
> >>>     Column Elision.  For some use-cases, like structured logging, the
> >>>     overhead of including array metadata for columns with no data
> present
> >>>     represents non-negligible overhead.   The proposal provides a
> >> mechanism for
> >>>     omitting meta-data for such arrays.
> >>>     4.
> >>>
> >>>     Data Integrity.  While the arrow file format isn’t meant for
> >> archiving
> >>>     data, I think it is important to allow for optional native data
> >> integrity
> >>>     checks in the format.  To this end, I proposed a new “Digest”
> >> message type
> >>>     that can be added after other messages to record a digest/hash of
> the
> >>>     preceding data. I suggested xxhash, but I don’t have a strong
> >> opinion here,
> >>>     as long as there is some minimal support that can potentially be
> >> expanded
> >>>     later.
> >>>
> >>>
> >>> In the proposal I chose to use Tables and Unions everywhere for
> >> flexibility
> >>> but in all likelihood some could be replaced by enums.
> >>>
> >>> My initial plan would be to solely focus on an IPC mechanism that can
> >> send
> >>> a SparseRecordBatch and immediately translate it to a normal
> RecordBatch
> >> in
> >>> both Java and C++.
> >>>
> >>> As a practical matter the proposal represents a lot of work to get an
> MVP
> >>> working in time for 1.0.0 release (provided they are accepted by the
> >>> community), so I'd greatly appreciate if anyone wants to collaborate on
> >>> this.
> >>>
> >>> If it is easier I’m happy to start a separate thread for feature if
> >> people
> >>> feel like it would make the conversation easier.  I can also create a
> >>> Google Doc for direct comments if that is preferred.
> >>>
> >>> Thanks,
> >>>
> >>> Micah
> >>>
> >>>
> >>>
> >>> P.S. In the interest of full disclosure, these ideas evolved in
> >>> collaboration with Brian Hulette and other colleagues at Google who are
> >>> interested in making use of Arrow in both internal and external
> projects.
> >>>
> >>> [1] https://issues.apache.org/jira/browse/ARROW-300
> >>>
> >>> [2]  https://issues.apache.org/jira/browse/ARROW-5224
> >>>
> >>> [3]
> >>>
> >>
> https://lists.apache.org/thread.html/36ab9c2b8b5d9f04493b3f9ea3b63c3ca3bc0f90743aa726b7a3199b@%3Cdev.arrow.apache.org%3E
> >>>
> >>> [4]
> >>>
> >>
> https://lists.apache.org/thread.html/5e09557274f9018efee770ad3712122d874447331f52d27169f99fe0@%3Cdev.arrow.apache.org%3E
> >>>
> >>> [5]
> >>>
> >>
> https://issues.apache.org/jira/browse/ARROW-1693?page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel&focusedCommentId=16244812#comment-16244812
> >>>
> >>> [6] https://github.com/apache/arrow/pull/4815
> >>>
> >>> [7]
> >>>
> >>
> https://lemire.me/blog/2012/02/08/effective-compression-using-frame-of-reference-and-delta-coding/
> >>>
> >>> [8] https://issues.apache.org/jira/browse/ARROW-5821
> >>>
> >>
> >>
> >
>

Re: [Discuss] Format additions to Arrow for sparse data and data integrity

Posted by Antoine Pitrou <an...@python.org>.
I think it would be worthwhile to split the discussion into two separate
threads.  One thread for compression & encodings (which are related or
even the same topic), one thread for data integrity.

Regards

Antoine.


Le 08/07/2019 à 07:22, Micah Kornfield a écrit :
> 
> - Compression:
>    *  Use parquet for random access to data elements.
>        -  This is one option, the main downside I see to this is generally
> higher encoding/decoding costs.  Per below, I think it is reasonable to
> wait until we have more data to add compression into the spec.
>    *  Have the transport layer do buffer specific compression:
>       - I'm not a fan of this approach.  One nice thing about the current
> communication protocols is once you strip away "framing" data all the byte
> streams are equivalent.  I think the simplicity that follows in code from
> this is a nice feature.
> 
> 
> *Computational efficiency of array encodings:*
> 
>> How does "more efficient computation" play out for operations such as
>> hash or join?
> 
> You would still need to likely materialize rows in most cases.  In some
> "join" cases the sparse encoding of the null bitmap buffer could be a win
> because it serves as an index to non-null values.
> 
> I think I should clarify that these encodings aren't always a win depending
> on workload/data shape, but can have a large impact when used appropriately
> (especially at the "Expression evaluation stage").  Also, any wins don't
> come for free; exploiting encodings properly will add some level of
> complication to existing computation code.
> 
> On a packed sparse array representation:
> 
>> This would be fine for simple SIMD aggregations like count/avg/mean, but
>> compacting null slots complicates more advanced parallel routines that
>> execute independently and rely on indices aligning with an element's
>> logical position.
> 
> 
> The main use-case I had in mind here was for scenarios like loading data
> directly from Parquet (i.e. nulls are already elided), doing some computation and
> then potentially translating to a dense representation.  Similarly it
> appears others have had advantages in some contexts for saving time at
> shuffle [1].  In many cases there is an overlap with RLE, so I'd be open to
> removing this from the proposal.
> 
> 
> *On buffer encodings:*
> To paraphrase, the main concern here seems to be it is similar to metadata
> that was already removed [2].
> 
> A few points on this:
> 1.  There was a typo in the original e-mail on sparse-integer set encoding
> where it said "all" values are either null or not null.  This should have
> read "most" values.  The elision of buffers is a separate feature.
> 2.  I believe these are different from the previous metadata because this
> isn't repetitive information. It provides new information about the
> contents of buffers not available anywhere else.
> 3.  The proposal is to create a new message type for this feature so it
> wouldn't be bringing back the old code and hopefully would have minimal
> impact on already existing IPC code.
> 
> 
> *On Compression:*
> So far my take is the consensus is that this can probably be applied at the
> transport level without being in the spec directly.  There might be value
> in more specific types of compression at the buffer level, but we should
> benchmark them first.
> 
> *Data Integrity/Digest:*
> 
>> one question is whether this occurs at the table level, column level,
>> sequential array level, etc.
> 
> This is a good question, it seemed like the batch level was easiest and
> that is why I proposed it, but I'd be open to other options.  One nice
> thing about the batch level is that it works for all other message types
> out of the box (i.e. we can ensure the schema has been transmitted
> faithfully).
> 
> Cheers,
> Micah
> 
> [1] https://issues.apache.org/jira/browse/ARROW-5821
> [2] https://github.com/apache/arrow/pull/1297/files
> [3] https://jira.apache.org/jira/browse/ARROW-300
> 
> 
> On Sat, Jul 6, 2019 at 11:17 AM Paul Taylor <pt...@gmail.com>
> wrote:
> 
>> Hi Micah,
>>
>> Similar to Jacques I'm not disagreeing, but wondering if they belong in
>> Arrow vs. can be done externally. I'm mostly interested in changes that
>> might impact SIMD processing, considering Arrow's already made conscious
>> design decisions to trade memory for speed. Apologies in advance if I've
>> misunderstood any of the proposals.
>>
>>> a. Add a run-length encoding scheme to efficiently represent repeated
>>> values (the actual scheme encodes run ends instead of length to preserve
>>> sub-linear random access).
>> Couldn't one do RLE at the buffer level via a custom
>> FixedSizeBinary/Binary/Utf8 encoding? Perhaps as a new ExtensionType?
>>
>>> b. Add a “packed” sparse representation (null values don’t take up
>>> space in value buffers)
>> This would be fine for simple SIMD aggregations like count/avg/mean, but
>> compacting null slots complicates more advanced parallel routines that
>> execute independently and rely on indices aligning with an element's
>> logical position.
>>
>> It sounds like here the logical position depends on knowing the number
>> of nulls up to that point (via something like sequentially iterating
>> both data and validity buffers). An efficient parallel routine would
>> likely need to scan beforehand to inflate the packed representation,
>> where today it can simply slice/mmap the data buffer directly.
>>
>>> a. Add frame of reference integer encoding [7] (this allows for lower
>>> bit-width encoding of integer types by subtracting a
>>> “reference” value from all values in the buffer).
>> I agree this is useful, but couldn't it also live in userland/an
>> ExtensionType?
>>
>>> b. Add a sparse integer set encoding.  This encoding allows more
>>> efficient encoding of validity bit-masks for cases when all values are
>>> either null or not null.
>> If this is in reference to the discussion at link #4 [1], it sounds
>> similar to the BufferLayout metadata that used to exist but was removed
>> a while back [2]. Knowing the buffer layouts allows an implementation to
>> generically elide any buffer at will, but would probably be a lot to
>> bring back in. I can't say whether adding a different set of metadata
>> would raise the same concerns/issues Jacques mentioned in the JIRA
>> thread in [2].
>>
>>> Data compression.  Similar to encodings but compression is solely for
>>> reduction of data at rest/on the wire.  The proposal is to allow
>>> compression of individual buffers. Right now zstd is proposed, but I
>> don’t
>>> feel strongly on the specific technologies here.
>> What's the goal for this? Random element access into compressed
>> in-memory columns, or compression at I/O boundaries?
>>
>> * If the former, is Parquet a better alternative here? Again, I'm
>> cautious about the impact to parallel routines. CPU speeds are
>> plateauing while memory and tx/rx keep growing. Compressed element
>> access seems to be on the CPU side of that equation (meanwhile parallel
>> deflate already exists, and I remember seeing research into parallel
>> inflate).
>>
>> * If the latter, could we do a comparison of Arrow dictionary-encoding +
>> different compression formats, vs. building them into the spec? I know
>> content-aware compression yields significant size reductions, but I
>> wonder if the maintenance burden on Arrow contributors is worth the cost
>> vs. a simpler dictionary-encoding + streaming gzip.
>>
>>> Data Integrity.  While the arrow file format isn’t meant for archiving
>>> data, I think it is important to allow for optional native data integrity
>>> checks in the format.  To this end, I proposed a new “Digest” message
>> type
>>> that can be added after other messages to record a digest/hash of the
>>> preceding data. I suggested xxhash, but I don’t have a strong opinion
>> here,
>>> as long as there is some minimal support that can potentially be expanded
>>> later.
>> :thumbs up:
>>
>>
>> Best,
>> Paul
>>
>>
>> 1.
>>
>> https://lists.apache.org/thread.html/5e09557274f9018efee770ad3712122d874447331f52d27169f99fe0@%3Cdev.arrow.apache.org%3E
>>
>> 2.
>>
>> https://issues.apache.org/jira/browse/ARROW-1693?focusedCommentId=16236902&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16236902
>>
>> On 7/5/19 11:53 AM, Micah Kornfield wrote:
>>> Hi Arrow-dev,
>>>
>>> I’d like to make a straw-man proposal to cover some features that I think
>>> would be useful to Arrow, and that I would like to make a
>> proof-of-concept
>>> implementation for in Java and C++.  In particular, the proposal covers
>>> allowing for smaller data sizes via compression and encoding [1][2][8],
>>> data integrity [3] and avoiding unnecessary data transfer [4][5].
>>>
>>> I’ve put together a PR  [6] that has proposed changes to the flatbuffer
>>> metadata to support the new features.  The PR introduces:
>>>
>>>     -
>>>
>>>     A new “SparseRecordBatch” that can support one of multiple possible
>>>     encodings (both dense and sparse), compression and column elision.
>>>     -
>>>
>>>     A “Digest” message type to support optional data integrity.
>>>
>>>
>>> Going into more details on the specific features in the PR:
>>>
>>>     1.
>>>
>>>     Sparse encodings for arrays and buffers.  The guiding principles
>> behind
>>>     the suggested encodings are to support encodings that can be
>> exploited by
>>>     compute engines for more efficient computation (I don’t think
>> parquet style
>>>     bit-packing belongs in Arrow).  While the encodings don’t maintain
>> O(1)
>>>     data element access, they support sublinear, O(log(N)), element
>> access. The
>>>     suggested encodings are:
>>>     1.
>>>
>>>        Array encodings:
>>>        1.
>>>
>>>           Add a run-length encoding scheme to efficiently represent
>> repeated
>>>           values (the actual scheme encodes run ends instead of length
>>> to preserve
>>>           sub-linear random access).
>>>           2.
>>>
>>>           Add a “packed” sparse representation (null values don’t take up
>>>           space in value buffers)
>>>           2.
>>>
>>>        Buffer encodings:
>>>        1.
>>>
>>>           Add frame of reference integer encoding [7] (this allows for
>> lower
>>>           bit-width encoding of integer types by subtracting a
>>> “reference” value from
>>>           all values in the buffer).
>>>           2.
>>>
>>>           Add a sparse integer set encoding.  This encoding allows more
>>>           efficient encoding of validity bit-masks for cases when all
>> values are
>>>           either null or not null.
>>>           2.
>>>
>>>     Data compression.  Similar to encodings but compression is solely for
>>>     reduction of data at rest/on the wire.  The proposal is to allow
>>>     compression of individual buffers. Right now zstd is proposed, but I
>> don’t
>>>     feel strongly on the specific technologies here.
>>>     3.
>>>
>>>     Column Elision.  For some use-cases, like structured logging, the
>>>     overhead of including array metadata for columns with no data present
>>>     represents non-negligible overhead.   The proposal provides a
>> mechanism for
>>>     omitting meta-data for such arrays.
>>>     4.
>>>
>>>     Data Integrity.  While the arrow file format isn’t meant for
>> archiving
>>>     data, I think it is important to allow for optional native data
>> integrity
>>>     checks in the format.  To this end, I proposed a new “Digest”
>> message type
>>>     that can be added after other messages to record a digest/hash of the
>>>     preceding data. I suggested xxhash, but I don’t have a strong
>> opinion here,
>>>     as long as there is some minimal support that can potentially be
>> expanded
>>>     later.
>>>
>>>
>>> In the proposal I chose to use Tables and Unions everywhere for
>> flexibility
>>> but in all likelihood some could be replaced by enums.
>>>
>>> My initial plan would be to solely focus on an IPC mechanism that can
>> send
>>> a SparseRecordBatch and immediately translate it to a normal RecordBatch
>> in
>>> both Java and C++.
>>>
>>> As a practical matter the proposal represents a lot of work to get an MVP
>>> working in time for 1.0.0 release (provided they are accepted by the
>>> community), so I'd greatly appreciate if anyone wants to collaborate on
>>> this.
>>>
>>> If it is easier I’m happy to start a separate thread for feature if
>> people
>>> feel like it would make the conversation easier.  I can also create a
>>> Google Doc for direct comments if that is preferred.
>>>
>>> Thanks,
>>>
>>> Micah
>>>
>>>
>>>
>>> P.S. In the interest of full disclosure, these ideas evolved in
>>> collaboration with Brian Hulette and other colleagues at Google who are
>>> interested in making use of Arrow in both internal and external projects.
>>>
>>> [1] https://issues.apache.org/jira/browse/ARROW-300
>>>
>>> [2]  https://issues.apache.org/jira/browse/ARROW-5224
>>>
>>> [3]
>>>
>> https://lists.apache.org/thread.html/36ab9c2b8b5d9f04493b3f9ea3b63c3ca3bc0f90743aa726b7a3199b@%3Cdev.arrow.apache.org%3E
>>>
>>> [4]
>>>
>> https://lists.apache.org/thread.html/5e09557274f9018efee770ad3712122d874447331f52d27169f99fe0@%3Cdev.arrow.apache.org%3E
>>>
>>> [5]
>>>
>> https://issues.apache.org/jira/browse/ARROW-1693?page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel&focusedCommentId=16244812#comment-16244812
>>>
>>> [6] https://github.com/apache/arrow/pull/4815
>>>
>>> [7]
>>>
>> https://lemire.me/blog/2012/02/08/effective-compression-using-frame-of-reference-and-delta-coding/
>>>
>>> [8] https://issues.apache.org/jira/browse/ARROW-5821
>>>
>>
>>
> 

Re: [Discuss] Format additions to Arrow for sparse data and data integrity

Posted by Ji Liu <ni...@aliyun.com.INVALID>.
Hi Micah,
Thanks for opening this discussion.

Similar to Liya Fan, I generally agree with you on most features. As you mentioned above, we have made some attempts in our application to reduce data size, for example data encoding and RecordBatch compaction [1], and they have brought significant performance benefits in our systems.

Since these kinds of features are not supported in Arrow, users have to implement them themselves, which is not convenient.

+1 for supporting these kinds of features, and after finalizing the plan, we would like to take part in the work.

Thanks,
Ji Liu

[1] https://issues.apache.org/jira/browse/ARROW-5821


------------------------------------------------------------------
From:Fan Liya <li...@gmail.com>
Send Time: Monday, July 8, 2019, 15:09
To:dev <de...@arrow.apache.org>; Micah Kornfield <em...@gmail.com>
Cc:ptaylor <pt...@apache.org>
Subject:Re: [Discuss] Format additions to Arrow for sparse data and data integrity

Hi Micah,

Thanks for opening this discussion.

For me, most of the features are super useful, especially RLE and integer
encoding.

IMO, to support these new features, we need some basic algorithms first
(e.g. sort and search).
For example, RLE and sort are often used in combination.
These new features should be at a higher level compared with the basic
algorithms.

Some of the basic algorithms are in progress (e.g. [1] and [2]), but I think
more are needed.

Best,
Liya Fan

[1] https://github.com/apache/arrow/pull/4788
[2] https://github.com/apache/arrow/pull/4699

On Mon, Jul 8, 2019 at 1:23 PM Micah Kornfield <em...@gmail.com>
wrote:

> Hi Paul, Jacques and Antoine,
> Thank you for the valuable feedback.  I'm going to try to address it all in
> this e-mail to help consolidate the conversation.  I've grouped my
> responses by topic and included snippets from other e-mails where relevant.
>
> *Timeline of any features: *
> -  So far the sentiment is that this is too much for the 1.0.0 release.
> This seems reasonable to me.  In general, I tend to be over optimistic on
> what I can get done :)
>
>
> *Design Alternatives proposed (more are welcome)*
> - Encodings:
>   *  Use extension types (and other user land options).   This is one
> potential way of accomplishing these but I think it is suboptimal for a few
> reasons:
>      1.  The encodings are targeted at already existing logical types, not
> new ones.  So it is a little bit awkward to have a user-defined "int32"
> value.
>      2.  The extension types are at a schema level.  It is very useful to
> adapt encodings per batch level.  So in some cases a more compact encoding
> might be warranted but in others using the normal dense encoding would be
> appropriate.
>      3.  Using binary blobs removes much of the efficiency of the encodings
> (i.e. the 4 byte overhead per row).
>      4.  In the long run, I'd like to see encodings be exploited in
> computation engines.  This becomes harder/impossible when using user
> defined types.
>
> - Compression:
>    *  Use parquet for random access to data elements.
>        -  This is one option, the main downside I see to this is generally
> higher encoding/decoding costs.  Per below, I think it is reasonable to
> wait until we have more data to add compression into the spec.
>    *  Have the transport layer do buffer specific compression:
>       - I'm not a fan of this approach.  One nice thing about the current
> communication protocols is once you strip away "framing" data all the byte
> streams are equivalent.  I think the simplicity that follows in code from
> this is a nice feature.
>
>
> *Computational efficiency of array encodings:*
>
> > How does "more efficient computation" play out for operations such as
> > hash or join?
>
> You would still need to likely materialize rows in most cases.  In some
> "join" cases the sparse encoding of the null bitmap buffer could be a win
> because it serves as an index to non-null values.
>
> I think I should clarify that these encodings aren't always a win depending
> on workload/data shape, but can have a large impact when used appropriately
> (especially at the "Expression evaluation stage").  Also, any wins don't
> come for free; exploiting encodings properly will add some level of
> complication to existing computation code.
>
> On a packed sparse array representation:
>
> > This would be fine for simple SIMD aggregations like count/avg/mean, but
> > compacting null slots complicates more advanced parallel routines that
> > execute independently and rely on indices aligning with an element's
> > logical position.
>
>
> The main use-case I had in mind here was for scenarios like loading data
> directly parquet (i.e. nulls are already elided) doing some computation and
> then potentially translating to a dense representation.  Similarly it
> appears other have had advantage in some contexts for saving time at
> shuffle [1].  In many cases there is an overlap with RLE, so I'd be open to
> removing this from the proposal.
>
>
> *On buffer encodings:*
> To paraphrase, the main concern here seems to be it is similar to metadata
> that was already removed [2].
>
> A few points on this:
> 1.  There was a typo in the original e-mail on sparse-integer set encoding
> where it said "all" values are either null or not null.  This should have
> read "most" values.  The elision of buffers is a separate feature.
> 2.  I believe these are different then the previous metadata because this
> isn't repetitive information. It provides new information about the
> contents of buffers not available anywhere else.
> 3.  The proposal is to create a new message type for the this feature so it
> wouldn't be bringing back the old code and hopefully would have minimal
> impact on already existing IPC code.
>
>
> *On Compression:*
> So far my take is the consensus is that this can probably be applied at the
> transport level without being in the spec directly.  There might be value
> in more specific types of compression at the buffer level, but we should
> benchmark them first..
>
> *Data Integrity/Digest:*
>
> > one question is whether this occurs at the table level, column level,
> > sequential array level, etc.
>
> This is a good question, it seemed like the batch level was easiest and
> that is why I proposed it, but I'd be open to other options.  One nice
> thing about the batch level is that it works for all other message types
> out of the box (i.e. we can ensure the schema has been transmitted
> faithfully).
>
> Cheers,
> Micah
>
> [1] https://issues.apache.org/jira/browse/ARROW-5821
> [2] https://github.com/apache/arrow/pull/1297/files
> [3] https://jira.apache.org/jira/browse/ARROW-300
>
>
> On Sat, Jul 6, 2019 at 11:17 AM Paul Taylor <pt...@gmail.com>
> wrote:
>
> > Hi Micah,
> >
> > Similar to Jacques I'm not disagreeing, but wondering if they belong in
> > Arrow vs. can be done externally. I'm mostly interested in changes that
> > might impact SIMD processing, considering Arrow's already made conscious
> > design decisions to trade memory for speed. Apologies in advance if I've
> > misunderstood any of the proposals.
> >
> > > a. Add a run-length encoding scheme to efficiently represent repeated
> > > values (the actual scheme encodes run ends instead of length to
> preserve
> > > sub-linear random access).
> > Couldn't one do RLE at the buffer level via a custom
> > FixedSizeBinary/Binary/Utf8 encoding? Perhaps as a new ExtensionType?
> >
> > > b. Add a “packed” sparse representation (null values don’t take up
> > > space in value buffers)
> > This would be fine for simple SIMD aggregations like count/avg/mean, but
> > compacting null slots complicates more advanced parallel routines that
> > execute independently and rely on indices aligning with an element's
> > logical position.
> >
> > It sounds like here the logical position depends on knowing the number
> > of nulls up to that point (via something like sequentially iterating
> > both data and validity buffers). An efficient parallel routine would
> > likely need to scan beforehand to inflate the packed representation,
> > where today it can simply slice/mmap the data buffer directly.
> >
> > > a. Add frame of reference integer encoding [7] (this allows for lower
> > > bit-width encoding of integer types by subtracting a
> > > “reference” value from all values in the buffer).
> > I agree this is useful, but couldn't it also live in userland/an
> > ExtensionType?
> >
> > > b. Add a sparse integer set encoding.  This encoding allows more
> > > efficient encoding of validity bit-masks for cases when all values are
> > > either null or not null.
> > If this is in reference to the discussion at link #4 [1], it sounds
> > similar to the BufferLayout metadata that used to exist but was removed
> > a while back [2]. Knowing the buffer layouts allows an implementation to
> > generically elide any buffer at will, but would probably be a lot to
> > bring back in. I can't say whether adding a different set of metadata
> > would raise the same concerns issues Jacques mentioned in the JIRA
> > thread in [2].
> >
> > > Data compression.  Similar to encodings but compression is solely for
> > > reduction of data at rest/on the wire.  The proposal is to allow
> > > compression of individual buffers. Right now zstd is proposed, but I
> > don’t
> > > feel strongly on the specific technologies here.
> > What's the goal for this? Random element access into compressed
> > in-memory columns, or compression at I/O boundaries?
> >
> > * If the former, is Parquet a better alternative here? Again, I'm
> > cautious about the impact to parallel routines. CPU speeds are
> > plateauing while memory and tx/rx keep growing. Compressed element
> > access seems to be on the CPU side of that equation (meanwhile parallel
> > deflate already exists, and I remember seeing research into parallel
> > inflate).
> >
> > * If the later, could we do a comparison of Arrow dictionary-encoding +
> > different compression formats, vs. building them into the spec? I know
> > content-aware compression yields significant size reductions, but I
> > wonder if the maintenance burden on Arrow contributors is worth the cost
> > vs. a simpler dictionary-encoding + streaming gzip.
> >
> > > Data Integrity.  While the arrow file format isn’t meant for archiving
> > > data, I think it is important to allow for optional native data
> integrity
> > > checks in the format.  To this end, I proposed a new “Digest” message
> > type
> > > that can be added after other messages to record a digest/hash of the
> > > preceding data. I suggested xxhash, but I don’t have a strong opinion
> > here,
> > > as long as there is some minimal support that can potentially be
> expanded
> > > later.
> > :thumbs up:
> >
> >
> > Best,
> > Paul
> >
> >
> > 1.
> >
> >
> https://lists.apache.org/thread.html/5e09557274f9018efee770ad3712122d874447331f52d27169f99fe0@%3Cdev.arrow.apache.org%3E
> >
> > 2.
> >
> >
> https://issues.apache.org/jira/browse/ARROW-1693?focusedCommentId=16236902&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16236902
> >
> > On 7/5/19 11:53 AM, Micah Kornfield wrote:
> > > Hi Arrow-dev,
> > >
> > > I’d like to make a straw-man proposal to cover some features that I
> think
> > > would be useful to Arrow, and that I would like to make a
> > proof-of-concept
> > > implementation for in Java and C++.  In particular, the proposal covers
> > > allowing for smaller data sizes via compression and encoding [1][2][8],
> > > data integrity [3] and avoiding unnecessary data transfer [4][5].
> > >
> > > I’ve put together a PR  [6] that has proposed changes to the flatbuffer
> > > metadata to support the new features.  The PR introduces:
> > >
> > >     -
> > >
> > >     A new “SparseRecordBatch” that can support one of multiple possible
> > >     encodings (both dense and sparse), compression and column elision.
> > >     -
> > >
> > >     A “Digest” message type to support optional data integrity.
> > >
> > >
> > > Going into more details on the specific features in the PR:
> > >
> > >     1.
> > >
> > >     Sparse encodings for arrays and buffers.  The guiding principles
> > behind
> > >     the suggested encodings are to support encodings that can be
> > exploited by
> > >     compute engines for more efficient computation (I don’t think
> > parquet style
> > >     bit-packing belongs in Arrow).  While the encodings don’t maintain
> > O(1)
> > >     data element access, they support sublinear, O(log(N)), element
> > access. The
> > >     suggested encodings are:
> > >     1.
> > >
> > >        Array encodings:
> > >        1.
> > >
> > >           Add a run-length encoding scheme to efficiently represent
> > repeated
> > >           values (the actual scheme encodes run ends instead of length
> > > to preserve
> > >           sub-linear random access).
> > >           2.
> > >
> > >           Add a “packed” sparse representation (null values don’t take
> up
> > >           space in value buffers)
> > >           2.
> > >
> > >        Buffer encodings:
> > >        1.
> > >
> > >           Add frame of reference integer encoding [7] (this allows for
> > lower
> > >           bit-width encoding of integer types by subtracting a
> > > “reference” value from
> > >           all values in the buffer).
> > >           2.
> > >
> > >           Add a sparse integer set encoding.  This encoding allows more
> > >           efficient encoding of validity bit-masks for cases when all
> > values are
> > >           either null or not null.
> > >           2.
> > >
> > >     Data compression.  Similar to encodings but compression is solely
> for
> > >     reduction of data at rest/on the wire.  The proposal is to allow
> > >     compression of individual buffers. Right now zstd is proposed, but
> I
> > don’t
> > >     feel strongly on the specific technologies here.
> > >     3.
> > >
> > >     Column Elision.  For some use-cases, like structured logging, the
> > >     overhead of including array metadata for columns with no data
> present
> > >     represents non-negligible overhead.   The proposal provides a
> > mechanism for
> > >     omitting meta-data for such arrays.
> > >     4.
> > >
> > >     Data Integrity.  While the arrow file format isn’t meant for
> > archiving
> > >     data, I think it is important to allow for optional native data
> > integrity
> > >     checks in the format.  To this end, I proposed a new “Digest”
> > message type
> > >     that can be added after other messages to record a digest/hash of
> the
> > >     preceding data. I suggested xxhash, but I don’t have a strong
> > opinion here,
> > >     as long as there is some minimal support that can potentially be
> > expanded
> > >     later.
> > >
> > >
> > > In the proposal I chose to use Tables and Unions everywhere for
> > flexibility
> > > but in all likelihood some could be replaced by enums.
> > >
> > > My initial plan would be to solely focus on an IPC mechanism that can
> > send
> > > a SparseRecordBatch and immediately translate it to a normal
> RecordBatch
> > in
> > > both Java and C++.
> > >
> > > As a practical matter the proposal represents a lot of work to get an
> MVP
> > > working in time for 1.0.0 release (provided they are accepted by the
> > > community), so I'd greatly appreciate if anyone wants to collaborate on
> > > this.
> > >
> > > If it is easier I’m happy to start a separate thread for feature if
> > people
> > > feel like it would make the conversation easier.  I can also create a
> > > Google Doc for direct comments if that is preferred.
> > >
> > > Thanks,
> > >
> > > Micah
> > >
> > >
> > >
> > > P.S. In the interest of full disclosure, these ideas evolved in
> > > collaboration with Brian Hulette and other colleagues at Google who are
> > > interested in making use of Arrow in both internal and external
> projects.
> > >
> > > [1] https://issues.apache.org/jira/browse/ARROW-300
> > >
> > > [2]  https://issues.apache.org/jira/browse/ARROW-5224
> > >
> > > [3]
> > >
> >
> https://lists.apache.org/thread.html/36ab9c2b8b5d9f04493b3f9ea3b63c3ca3bc0f90743aa726b7a3199b@%3Cdev.arrow.apache.org%3E
> > >
> > > [4]
> > >
> >
> https://lists.apache.org/thread.html/5e09557274f9018efee770ad3712122d874447331f52d27169f99fe0@%3Cdev.arrow.apache.org%3E
> > >
> > > [5]
> > >
> >
> https://issues.apache.org/jira/browse/ARROW-1693?page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel&focusedCommentId=16244812#comment-16244812
> > >
> > > [6] https://github.com/apache/arrow/pull/4815
> > >
> > > [7]
> > >
> >
> https://lemire.me/blog/2012/02/08/effective-compression-using-frame-of-reference-and-delta-coding/
> > >
> > > [8] https://issues.apache.org/jira/browse/ARROW-5821
> > >
> >
> >
>

Re: [Discuss] Format additions to Arrow for sparse data and data integrity

Posted by Micah Kornfield <em...@gmail.com>.
Hi Paul, Jacques and Antoine,
Thank you for the valuable feedback.  I'm going to try to address it all in
this e-mail to help consolidate the conversation.  I've grouped my
responses by topic and included snippets from other e-mails where relevant.

*Timeline of any features: *
-  So far the sentiment is that this is too much for the 1.0.0 release.
This seems reasonable to me.  In general, I tend to be over-optimistic about
what I can get done :)


*Design Alternatives proposed (more are welcome)*
- Encodings:
  *  Use extension types (and other user land options).   This is one
potential way of accomplishing these but I think it is suboptimal for a few
reasons:
     1.  The encodings are targeted at already existing logical types, not
new ones.  So it is a little bit awkward to have a user-defined type for what
is really an "int32" value.
     2.  The extension types are at the schema level.  It is very useful to
adapt encodings at the batch level.  So in some cases a more compact encoding
might be warranted but in others using the normal dense encoding would be
appropriate.
     3.  Using binary blobs removes much of the efficiency of the encodings
(i.e. the 4-byte overhead per row).
     4.  In the long run, I'd like to see encodings be exploited in
computation engines.  This becomes harder/impossible when using user
defined types.

- Compression:
   *  Use parquet for random access to data elements.
       -  This is one option; the main downside I see to this is generally
higher encoding/decoding costs.  Per below, I think it is reasonable to
wait until we have more data before adding compression into the spec.
   *  Have the transport layer do buffer specific compression:
      - I'm not a fan of this approach.  One nice thing about the current
communication protocols is that once you strip away "framing" data, all the
byte streams are equivalent.  I think the simplicity that follows in code from
this is a nice feature.


*Computational efficiency of array encodings:*

> How does "more efficient computation" play out for operations such as
> hash or join?

You would still likely need to materialize rows in most cases.  In some
"join" cases the sparse encoding of the null bitmap buffer could be a win
because it serves as an index to non-null values.
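
As a rough illustration of that point (plain C++, hypothetical layout and
names, not the proposed format), a sorted list of non-null positions lets a
probe loop skip null slots without ever scanning a bitmap:

  #include <cstdint>
  #include <functional>
  #include <vector>

  // Hypothetical sparse validity layout: the sorted positions of non-null
  // slots are stored explicitly instead of (or alongside) a validity bitmap.
  struct SparseColumn {
    std::vector<int64_t> valid_positions;  // e.g. {2, 17, 901} for a mostly-null column
    std::vector<int64_t> values;           // one entry per valid position
  };

  // A join-style probe only visits non-null slots; the validity encoding
  // itself says where they are, so there is no bitmap to scan.
  void ForEachNonNull(const SparseColumn& col,
                      const std::function<void(int64_t row, int64_t key)>& probe) {
    for (size_t i = 0; i < col.valid_positions.size(); ++i) {
      probe(col.valid_positions[i], col.values[i]);
    }
  }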

I think I should clarify that these encodings aren't always a win depending
on workload/data shape, but can have a large impact when used appropriately
(especially at the "Expression evaluation stage").  Also, any wins don't
come for free; exploiting encodings properly will add some level of
complication to existing computation code.

On a packed sparse array representation:

> This would be fine for simple SIMD aggregations like count/avg/mean, but
> compacting null slots complicates more advanced parallel routines that
> execute independently and rely on indices aligning with an element's
> logical position.


The main use-case I had in mind here was for scenarios like loading data
directly from Parquet (where nulls are already elided), doing some computation,
and then potentially translating to a dense representation.  Similarly, it
appears others have seen an advantage in some contexts for saving time at
shuffle [1].  In many cases there is an overlap with RLE, so I'd be open to
removing this from the proposal.
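
For what it's worth, that last translation step is roughly the following (a
sketch with hypothetical names; not actual Arrow or Parquet APIs):

  #include <cstdint>
  #include <vector>

  inline bool BitIsSet(const std::vector<uint8_t>& bits, int64_t i) {
    return (bits[i / 8] >> (i % 8)) & 1;
  }

  // Expand a packed value buffer (one entry per non-null slot, which is what a
  // Parquet reader naturally produces) into a dense buffer with one slot per row.
  std::vector<int32_t> PackedToDense(const std::vector<uint8_t>& validity,
                                     const std::vector<int32_t>& packed,
                                     int64_t length) {
    std::vector<int32_t> dense(length, 0);  // null slots are left as 0 here
    int64_t next_packed = 0;
    for (int64_t i = 0; i < length; ++i) {
      if (BitIsSet(validity, i)) dense[i] = packed[next_packed++];
    }
    return dense;
  }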


*On buffer encodings:*
To paraphrase, the main concern here seems to be that it is similar to metadata
that was already removed [2].

A few points on this:
1.  There was a typo in the original e-mail on sparse-integer set encoding
where it said "all" values are either null or not null.  This should have
read "most" values (see the sketch after this list).  The elision of buffers
is a separate feature.
2.  I believe these are different from the previous metadata because this
isn't repetitive information. It provides new information about the
contents of buffers not available anywhere else.
3.  The proposal is to create a new message type for this feature, so it
wouldn't be bringing back the old code and hopefully would have minimal
impact on already existing IPC code.
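
To spell out what point 1 above has in mind, the sketch below (plain C++,
hypothetical names; the actual flatbuffer layout is in the PR) stores only the
exceptional positions when most slots are null or most are valid:

  #include <algorithm>
  #include <cstdint>
  #include <vector>

  // Sparse integer set validity: when nulls (or non-nulls) are rare, a sorted
  // list of the exceptional positions can be much smaller than a bitmap of
  // length N and still supports O(log N) lookups via binary search.
  struct SparseValidity {
    int64_t length;                   // logical number of slots
    bool exceptions_are_null;         // true: listed positions are null
    std::vector<int64_t> exceptions;  // sorted positions
  };

  bool IsNull(const SparseValidity& v, int64_t i) {
    bool listed = std::binary_search(v.exceptions.begin(), v.exceptions.end(), i);
    return v.exceptions_are_null ? listed : !listed;
  }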


*On Compression:*
So far my take is that the consensus is this can probably be applied at the
transport level without being in the spec directly.  There might be value
in more specific types of compression at the buffer level, but we should
benchmark them first.

*Data Integrity/Digest:*

> one question is whether this occurs at the table level, column level,
> sequential array level, etc.

This is a good question, it seemed like the batch level was easiest and
that is why I proposed it, but I'd be open to other options.  One nice
thing about the batch level is that it works for all other message types
out of the box (i.e. we can ensure the schema has been transmitted
faithfully).
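
As a sketch of what batch-level verification could look like (this assumes the
xxHash C library's XXH64 entry point; the surrounding Digest framing here is
only illustrative, the real layout is in the PR):

  #include <cstddef>
  #include <cstdint>
  #include "xxhash.h"  // xxHash C API

  // Hash the serialized bytes of the preceding message (metadata + body) so a
  // reader can verify them before trusting the contents.
  uint64_t ComputeBatchDigest(const uint8_t* data, size_t length) {
    return XXH64(data, length, /*seed=*/0);
  }

  // On the read side, recompute the hash and compare it with the value carried
  // in the Digest message that follows the batch.
  bool VerifyBatch(const uint8_t* data, size_t length, uint64_t expected) {
    return ComputeBatchDigest(data, length) == expected;
  }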

Cheers,
Micah

[1] https://issues.apache.org/jira/browse/ARROW-5821
[2] https://github.com/apache/arrow/pull/1297/files
[3] https://jira.apache.org/jira/browse/ARROW-300


On Sat, Jul 6, 2019 at 11:17 AM Paul Taylor <pt...@gmail.com>
wrote:

> Hi Micah,
>
> Similar to Jacques I'm not disagreeing, but wondering if they belong in
> Arrow vs. can be done externally. I'm mostly interested in changes that
> might impact SIMD processing, considering Arrow's already made conscious
> design decisions to trade memory for speed. Apologies in advance if I've
> misunderstood any of the proposals.
>
> > a. Add a run-length encoding scheme to efficiently represent repeated
> > values (the actual scheme encodes run ends instead of length to preserve
> > sub-linear random access).
> Couldn't one do RLE at the buffer level via a custom
> FixedSizeBinary/Binary/Utf8 encoding? Perhaps as a new ExtensionType?
>
> > b. Add a “packed” sparse representation (null values don’t take up
> > space in value buffers)
> This would be fine for simple SIMD aggregations like count/avg/mean, but
> compacting null slots complicates more advanced parallel routines that
> execute independently and rely on indices aligning with an element's
> logical position.
>
> It sounds like here the logical position depends on knowing the number
> of nulls up to that point (via something like sequentially iterating
> both data and validity buffers). An efficient parallel routine would
> likely need to scan beforehand to inflate the packed representation,
> where today it can simply slice/mmap the data buffer directly.
>
> > a. Add frame of reference integer encoding [7] (this allows for lower
> > bit-width encoding of integer types by subtracting a
> > “reference” value from all values in the buffer).
> I agree this is useful, but couldn't it also live in userland/an
> ExtensionType?
>
> > b. Add a sparse integer set encoding.  This encoding allows more
> > efficient encoding of validity bit-masks for cases when all values are
> > either null or not null.
> If this is in reference to the discussion at link #4 [1], it sounds
> similar to the BufferLayout metadata that used to exist but was removed
> a while back [2]. Knowing the buffer layouts allows an implementation to
> generically elide any buffer at will, but would probably be a lot to
> bring back in. I can't say whether adding a different set of metadata
> would raise the same concerns issues Jacques mentioned in the JIRA
> thread in [2].
>
> > Data compression.  Similar to encodings but compression is solely for
> > reduction of data at rest/on the wire.  The proposal is to allow
> > compression of individual buffers. Right now zstd is proposed, but I
> don’t
> > feel strongly on the specific technologies here.
> What's the goal for this? Random element access into compressed
> in-memory columns, or compression at I/O boundaries?
>
> * If the former, is Parquet a better alternative here? Again, I'm
> cautious about the impact to parallel routines. CPU speeds are
> plateauing while memory and tx/rx keep growing. Compressed element
> access seems to be on the CPU side of that equation (meanwhile parallel
> deflate already exists, and I remember seeing research into parallel
> inflate).
>
> * If the later, could we do a comparison of Arrow dictionary-encoding +
> different compression formats, vs. building them into the spec? I know
> content-aware compression yields significant size reductions, but I
> wonder if the maintenance burden on Arrow contributors is worth the cost
> vs. a simpler dictionary-encoding + streaming gzip.
>
> > Data Integrity.  While the arrow file format isn’t meant for archiving
> > data, I think it is important to allow for optional native data integrity
> > checks in the format.  To this end, I proposed a new “Digest” message
> type
> > that can be added after other messages to record a digest/hash of the
> > preceding data. I suggested xxhash, but I don’t have a strong opinion
> here,
> > as long as there is some minimal support that can potentially be expanded
> > later.
> :thumbs up:
>
>
> Best,
> Paul
>
>
> 1.
>
> https://lists.apache.org/thread.html/5e09557274f9018efee770ad3712122d874447331f52d27169f99fe0@%3Cdev.arrow.apache.org%3E
>
> 2.
>
> https://issues.apache.org/jira/browse/ARROW-1693?focusedCommentId=16236902&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16236902
>
> On 7/5/19 11:53 AM, Micah Kornfield wrote:
> > Hi Arrow-dev,
> >
> > I’d like to make a straw-man proposal to cover some features that I think
> > would be useful to Arrow, and that I would like to make a
> proof-of-concept
> > implementation for in Java and C++.  In particular, the proposal covers
> > allowing for smaller data sizes via compression and encoding [1][2][8],
> > data integrity [3] and avoiding unnecessary data transfer [4][5].
> >
> > I’ve put together a PR  [6] that has proposed changes to the flatbuffer
> > metadata to support the new features.  The PR introduces:
> >
> >     -
> >
> >     A new “SparseRecordBatch” that can support one of multiple possible
> >     encodings (both dense and sparse), compression and column elision.
> >     -
> >
> >     A “Digest” message type to support optional data integrity.
> >
> >
> > Going into more details on the specific features in the PR:
> >
> >     1.
> >
> >     Sparse encodings for arrays and buffers.  The guiding principles
> behind
> >     the suggested encodings are to support encodings that can be
> exploited by
> >     compute engines for more efficient computation (I don’t think
> parquet style
> >     bit-packing belongs in Arrow).  While the encodings don’t maintain
> O(1)
> >     data element access, they support sublinear, O(log(N)), element
> access. The
> >     suggested encodings are:
> >     1.
> >
> >        Array encodings:
> >        1.
> >
> >           Add a run-length encoding scheme to efficiently represent
> repeated
> >           values (the actual scheme encodes run ends instead of length
> > to preserve
> >           sub-linear random access).
> >           2.
> >
> >           Add a “packed” sparse representation (null values don’t take up
> >           space in value buffers)
> >           2.
> >
> >        Buffer encodings:
> >        1.
> >
> >           Add frame of reference integer encoding [7] (this allows for
> lower
> >           bit-width encoding of integer types by subtracting a
> > “reference” value from
> >           all values in the buffer).
> >           2.
> >
> >           Add a sparse integer set encoding.  This encoding allows more
> >           efficient encoding of validity bit-masks for cases when all
> values are
> >           either null or not null.
> >           2.
> >
> >     Data compression.  Similar to encodings but compression is solely for
> >     reduction of data at rest/on the wire.  The proposal is to allow
> >     compression of individual buffers. Right now zstd is proposed, but I
> don’t
> >     feel strongly on the specific technologies here.
> >     3.
> >
> >     Column Elision.  For some use-cases, like structured logging, the
> >     overhead of including array metadata for columns with no data present
> >     represents non-negligible overhead.   The proposal provides a
> mechanism for
> >     omitting meta-data for such arrays.
> >     4.
> >
> >     Data Integrity.  While the arrow file format isn’t meant for
> archiving
> >     data, I think it is important to allow for optional native data
> integrity
> >     checks in the format.  To this end, I proposed a new “Digest”
> message type
> >     that can be added after other messages to record a digest/hash of the
> >     preceding data. I suggested xxhash, but I don’t have a strong
> opinion here,
> >     as long as there is some minimal support that can potentially be
> expanded
> >     later.
> >
> >
> > In the proposal I chose to use Tables and Unions everywhere for
> flexibility
> > but in all likelihood some could be replaced by enums.
> >
> > My initial plan would be to solely focus on an IPC mechanism that can
> send
> > a SparseRecordBatch and immediately translate it to a normal RecordBatch
> in
> > both Java and C++.
> >
> > As a practical matter the proposal represents a lot of work to get an MVP
> > working in time for 1.0.0 release (provided they are accepted by the
> > community), so I'd greatly appreciate if anyone wants to collaborate on
> > this.
> >
> > If it is easier I’m happy to start a separate thread for feature if
> people
> > feel like it would make the conversation easier.  I can also create a
> > Google Doc for direct comments if that is preferred.
> >
> > Thanks,
> >
> > Micah
> >
> >
> >
> > P.S. In the interest of full disclosure, these ideas evolved in
> > collaboration with Brian Hulette and other colleagues at Google who are
> > interested in making use of Arrow in both internal and external projects.
> >
> > [1] https://issues.apache.org/jira/browse/ARROW-300
> >
> > [2]  https://issues.apache.org/jira/browse/ARROW-5224
> >
> > [3]
> >
> https://lists.apache.org/thread.html/36ab9c2b8b5d9f04493b3f9ea3b63c3ca3bc0f90743aa726b7a3199b@%3Cdev.arrow.apache.org%3E
> >
> > [4]
> >
> https://lists.apache.org/thread.html/5e09557274f9018efee770ad3712122d874447331f52d27169f99fe0@%3Cdev.arrow.apache.org%3E
> >
> > [5]
> >
> https://issues.apache.org/jira/browse/ARROW-1693?page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel&focusedCommentId=16244812#comment-16244812
> >
> > [6] https://github.com/apache/arrow/pull/4815
> >
> > [7]
> >
> https://lemire.me/blog/2012/02/08/effective-compression-using-frame-of-reference-and-delta-coding/
> >
> > [8] https://issues.apache.org/jira/browse/ARROW-5821
> >
>
>

Re: [Discuss] Format additions to Arrow for sparse data and data integrity

Posted by Paul Taylor <pt...@gmail.com>.
Hi Micah,

Similar to Jacques I'm not disagreeing, but wondering whether they belong in
Arrow or can be done externally. I'm mostly interested in changes that
might impact SIMD processing, considering Arrow's already made conscious 
design decisions to trade memory for speed. Apologies in advance if I've 
misunderstood any of the proposals.

> a. Add a run-length encoding scheme to efficiently represent repeated
> values (the actual scheme encodes run ends instead of length to preserve
> sub-linear random access).
Couldn't one do RLE at the buffer level via a custom 
FixedSizeBinary/Binary/Utf8 encoding? Perhaps as a new ExtensionType?

> b. Add a “packed” sparse representation (null values don’t take up
> space in value buffers)
This would be fine for simple SIMD aggregations like count/avg/mean, but 
compacting null slots complicates more advanced parallel routines that 
execute independently and rely on indices aligning with an element's 
logical position.

It sounds like here the logical position depends on knowing the number 
of nulls up to that point (via something like sequentially iterating 
both data and validity buffers). An efficient parallel routine would 
likely need to scan beforehand to inflate the packed representation, 
where today it can simply slice/mmap the data buffer directly.
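
To make the concern concrete, here is roughly the extra work involved (a
sketch, plain C++, not Arrow code): mapping a logical row to its slot in a
packed buffer is a rank query over the validity bits, so parallel access needs
something like precomputed prefix counts:

  #include <cstdint>
  #include <vector>

  // With a dense layout, element i lives at values[i].  With a packed layout it
  // lives at values[number of valid slots before i], so random or parallel
  // access needs a rank query.  Precomputing per-byte prefix counts restores
  // near-O(1) access, at the cost of an extra scan and an extra buffer.
  std::vector<int64_t> BuildPrefixCounts(const std::vector<uint8_t>& validity_bytes) {
    std::vector<int64_t> prefix(validity_bytes.size() + 1, 0);
    for (size_t b = 0; b < validity_bytes.size(); ++b) {
      uint8_t byte = validity_bytes[b];
      int count = 0;
      while (byte) { count += byte & 1; byte >>= 1; }
      prefix[b + 1] = prefix[b] + count;
    }
    return prefix;  // prefix[b] = number of valid slots before bit position 8*b
  }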

> a. Add frame of reference integer encoding [7] (this allows for lower
> bit-width encoding of integer types by subtracting a
> “reference” value from all values in the buffer).
I agree this is useful, but couldn't it also live in userland/an 
ExtensionType?
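
For reference, the userland version could be as small as the following sketch
(standard C++ only, hypothetical names): store the minimum once and keep each
value as a narrow offset from it.

  #include <algorithm>
  #include <cstdint>
  #include <vector>

  // Frame-of-reference: store min(values) once and each value as a small delta.
  // Decoding is a single add per element, so it stays SIMD friendly.
  // Assumes values is non-empty and (max - min) fits in 16 bits.
  struct ForEncoded {
    int64_t reference;             // the frame of reference (minimum value)
    std::vector<uint16_t> deltas;  // narrower than the original 64-bit buffer
  };

  ForEncoded ForEncode(const std::vector<int64_t>& values) {
    ForEncoded out{*std::min_element(values.begin(), values.end()), {}};
    out.deltas.reserve(values.size());
    for (int64_t v : values) out.deltas.push_back(static_cast<uint16_t>(v - out.reference));
    return out;
  }

  int64_t ForDecode(const ForEncoded& e, size_t i) { return e.reference + e.deltas[i]; }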

> b. Add a sparse integer set encoding.  This encoding allows more
> efficient encoding of validity bit-masks for cases when all values are
> either null or not null.
If this is in reference to the discussion at link #4 [1], it sounds 
similar to the BufferLayout metadata that used to exist but was removed 
a while back [2]. Knowing the buffer layouts allows an implementation to 
generically elide any buffer at will, but would probably be a lot to 
bring back in. I can't say whether adding a different set of metadata 
would raise the same concerns Jacques mentioned in the JIRA 
thread in [2].

> Data compression.  Similar to encodings but compression is solely for
> reduction of data at rest/on the wire.  The proposal is to allow
> compression of individual buffers. Right now zstd is proposed, but I don’t
> feel strongly on the specific technologies here.
What's the goal for this? Random element access into compressed 
in-memory columns, or compression at I/O boundaries?

* If the former, is Parquet a better alternative here? Again, I'm 
cautious about the impact to parallel routines. CPU speeds are 
plateauing while memory and tx/rx keep growing. Compressed element 
access seems to be on the CPU side of that equation (meanwhile parallel 
deflate already exists, and I remember seeing research into parallel 
inflate).

* If the latter, could we do a comparison of Arrow dictionary-encoding + 
different compression formats, vs. building them into the spec? I know 
content-aware compression yields significant size reductions, but I 
wonder if the maintenance burden on Arrow contributors is worth the cost 
vs. a simpler dictionary-encoding + streaming gzip.

> Data Integrity.  While the arrow file format isn’t meant for archiving
> data, I think it is important to allow for optional native data integrity
> checks in the format.  To this end, I proposed a new “Digest” message type
> that can be added after other messages to record a digest/hash of the
> preceding data. I suggested xxhash, but I don’t have a strong opinion here,
> as long as there is some minimal support that can potentially be expanded
> later.
:thumbs up:


Best,
Paul


1. 
https://lists.apache.org/thread.html/5e09557274f9018efee770ad3712122d874447331f52d27169f99fe0@%3Cdev.arrow.apache.org%3E

2. 
https://issues.apache.org/jira/browse/ARROW-1693?focusedCommentId=16236902&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16236902

On 7/5/19 11:53 AM, Micah Kornfield wrote:
> Hi Arrow-dev,
>
> I’d like to make a straw-man proposal to cover some features that I think
> would be useful to Arrow, and that I would like to make a proof-of-concept
> implementation for in Java and C++.  In particular, the proposal covers
> allowing for smaller data sizes via compression and encoding [1][2][8],
> data integrity [3] and avoiding unnecessary data transfer [4][5].
>
> I’ve put together a PR  [6] that has proposed changes to the flatbuffer
> metadata to support the new features.  The PR introduces:
>
>     -
>
>     A new “SparseRecordBatch” that can support one of multiple possible
>     encodings (both dense and sparse), compression and column elision.
>     -
>
>     A “Digest” message type to support optional data integrity.
>
>
> Going into more details on the specific features in the PR:
>
>     1.
>
>     Sparse encodings for arrays and buffers.  The guiding principles behind
>     the suggested encodings are to support encodings that can be exploited by
>     compute engines for more efficient computation (I don’t think parquet style
>     bit-packing belongs in Arrow).  While the encodings don’t maintain O(1)
>     data element access, they support sublinear, O(log(N)), element access. The
>     suggested encodings are:
>     1.
>
>        Array encodings:
>        1.
>
>           Add a run-length encoding scheme to efficiently represent repeated
>           values (the actual scheme encodes run ends instead of length
> to preserve
>           sub-linear random access).
>           2.
>
>           Add a “packed” sparse representation (null values don’t take up
>           space in value buffers)
>           2.
>
>        Buffer encodings:
>        1.
>
>           Add frame of reference integer encoding [7] (this allows for lower
>           bit-width encoding of integer types by subtracting a
> “reference” value from
>           all values in the buffer).
>           2.
>
>           Add a sparse integer set encoding.  This encoding allows more
>           efficient encoding of validity bit-masks for cases when all values are
>           either null or not null.
>           2.
>
>     Data compression.  Similar to encodings but compression is solely for
>     reduction of data at rest/on the wire.  The proposal is to allow
>     compression of individual buffers. Right now zstd is proposed, but I don’t
>     feel strongly on the specific technologies here.
>     3.
>
>     Column Elision.  For some use-cases, like structured logging, the
>     overhead of including array metadata for columns with no data present
>     represents non-negligible overhead.   The proposal provides a mechanism for
>     omitting meta-data for such arrays.
>     4.
>
>     Data Integrity.  While the arrow file format isn’t meant for archiving
>     data, I think it is important to allow for optional native data integrity
>     checks in the format.  To this end, I proposed a new “Digest” message type
>     that can be added after other messages to record a digest/hash of the
>     preceding data. I suggested xxhash, but I don’t have a strong opinion here,
>     as long as there is some minimal support that can potentially be expanded
>     later.
>
>
> In the proposal I chose to use Tables and Unions everywhere for flexibility
> but in all likelihood some could be replaced by enums.
>
> My initial plan would be to solely focus on an IPC mechanism that can send
> a SparseRecordBatch and immediately translate it to a normal RecordBatch in
> both Java and C++.
>
> As a practical matter the proposal represents a lot of work to get an MVP
> working in time for 1.0.0 release (provided they are accepted by the
> community), so I'd greatly appreciate if anyone wants to collaborate on
> this.
>
> If it is easier I’m happy to start a separate thread for feature if people
> feel like it would make the conversation easier.  I can also create a
> Google Doc for direct comments if that is preferred.
>
> Thanks,
>
> Micah
>
>
>
> P.S. In the interest of full disclosure, these ideas evolved in
> collaboration with Brian Hulette and other colleagues at Google who are
> interested in making use of Arrow in both internal and external projects.
>
> [1] https://issues.apache.org/jira/browse/ARROW-300
>
> [2]  https://issues.apache.org/jira/browse/ARROW-5224
>
> [3]
> https://lists.apache.org/thread.html/36ab9c2b8b5d9f04493b3f9ea3b63c3ca3bc0f90743aa726b7a3199b@%3Cdev.arrow.apache.org%3E
>
> [4]
> https://lists.apache.org/thread.html/5e09557274f9018efee770ad3712122d874447331f52d27169f99fe0@%3Cdev.arrow.apache.org%3E
>
> [5]
> https://issues.apache.org/jira/browse/ARROW-1693?page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel&focusedCommentId=16244812#comment-16244812
>
> [6] https://github.com/apache/arrow/pull/4815
>
> [7]
> https://lemire.me/blog/2012/02/08/effective-compression-using-frame-of-reference-and-delta-coding/
>
> [8] https://issues.apache.org/jira/browse/ARROW-5821
>


Re: [Discuss] Format additions to Arrow for sparse data and data integrity

Posted by Micah Kornfield <em...@gmail.com>.
Hi Jacques,

> That's quite interesting. Can you share more about the use case.

Sorry, I realized I missed answering this.  We are still investigating, so
the initial diagnosis might be off.  The use-case is a data transfer
application, reading data at rest, translating it to Arrow, and sending it
out to clients.

I look forward to hearing your thoughts on the rest of the proposal.

Thanks,
Micah



On Sat, Jul 6, 2019 at 2:53 PM Jacques Nadeau <ja...@apache.org> wrote:

> What is the driving force for transport compression? Are you seeing that
>>> as a major bottleneck in particular circumstances? (I'm not disagreeing,
>>> just want to clearly define the particular problem you're worried about.)
>>
>>
>> I've been working on a 20% project where we appear to be IO bound for
>> transporting record batches.   Also, I believe Ji Liu (tianchen92) has been
>> seeing some of the same bottlenecks with the query engine they are is
>> working on.  Trading off some CPU here would allow us to lower the overall
>> latency in the system.
>>
>
> That's quite interesting. Can you share more about the use case. With the
> exception of broadcast and round-robin type distribution patterns, we find
> that there is typically more cycles focused on partitioning the sending
> data such that IO bounding is less of a problem. In most of our operations,
> almost all the largest workloads are done via partitioning thus it isn't
> typically a problem. (We also have clients with 10gbps and 100gbps network
> interconnects...) Are you partitioning the data pre-send?
>
>
>
>> Random thought: what do you think of defining this at the transport level
>>> rather than the record batch level? (e.g. in Arrow Flight). This is one way
>>> to avoid extending the core record batch concept with something that isn't
>>> related to processing (at least in your initial proposal)
>>
>>
>> Per above, this seems like a reasonable approach to me if we want to hold
>> off on buffer level compression.  Another use-case for buffer/record-batch
>> level compression would be the Feather file format for only decompressing
>> subset of columns/rows.  If this use-case isn't compelling, I'd be happy to
>> hold off adding compression to sparse batches until we have benchmarks
>> showing the trade-off between channel level and buffer level compression.
>>
>
> I was proposing that type specific buffer encodings be done at the Flight
> level, not message level encodings. Just want to make sure the formats
> don't leak into the core spec until we're ready.
>

Re: [Discuss] Format additions to Arrow for sparse data and data integrity

Posted by Jacques Nadeau <ja...@apache.org>.
>
> What is the driving force for transport compression? Are you seeing that
>> as a major bottleneck in particular circumstances? (I'm not disagreeing,
>> just want to clearly define the particular problem you're worried about.)
>
>
> I've been working on a 20% project where we appear to be IO bound for
> transporting record batches.   Also, I believe Ji Liu (tianchen92) has been
> seeing some of the same bottlenecks with the query engine they are is
> working on.  Trading off some CPU here would allow us to lower the overall
> latency in the system.
>

That's quite interesting. Can you share more about the use case? With the
exception of broadcast and round-robin type distribution patterns, we find
that there are typically more cycles focused on partitioning the data being
sent, so being IO bound is less of a problem. In most of our operations,
almost all of the largest workloads are done via partitioning, so it isn't
typically a problem. (We also have clients with 10gbps and 100gbps network
interconnects...) Are you partitioning the data pre-send?



> Random thought: what do you think of defining this at the transport level
>> rather than the record batch level? (e.g. in Arrow Flight). This is one way
>> to avoid extending the core record batch concept with something that isn't
>> related to processing (at least in your initial proposal)
>
>
> Per above, this seems like a reasonable approach to me if we want to hold
> off on buffer level compression.  Another use-case for buffer/record-batch
> level compression would be the Feather file format for only decompressing
> subset of columns/rows.  If this use-case isn't compelling, I'd be happy to
> hold off adding compression to sparse batches until we have benchmarks
> showing the trade-off between channel level and buffer level compression.
>

I was proposing that type-specific buffer encodings be done at the Flight
level, not message-level encodings. Just want to make sure the formats
don't leak into the core spec until we're ready.

Re: [Discuss] Format additions to Arrow for sparse data and data integrity

Posted by Micah Kornfield <em...@gmail.com>.
Hi Jacques,
I think our e-mails might have crossed, so I'm consolidating my responses
from the previous e-mail as well.

I don't think most of this should be targeted for 1.0. It is a lot of
> change/enhancement and seems like it would likely substantially delay 1.0.


I agree it shouldn't block 1.0.  I think time-based releases are working
well for the community.  But if the features are implemented in Java and
C++ with integration tests in time for 1.0, should we explicitly rule it
out?  If not for 1.0, would the subsequent release make sense?

What is the driving force for transport compression? Are you seeing that as
> a major bottleneck in particular circumstances? (I'm not disagreeing, just
> want to clearly define the particular problem you're worried about.)


I've been working on a 20% project where we appear to be IO bound for
transporting record batches.  Also, I believe Ji Liu (tianchen92) has been
seeing some of the same bottlenecks with the query engine they are
working on.  Trading off some CPU here would allow us to lower the overall
latency in the system.

You suggested that this be done on the buffer level but it seems like that
> maybe too narrow depending on batch size? What is the thinking here about
> tradeoffs around message versus batch.


Two reasons for this proposal:
- I'm not sure there is much added value at the batch level vs simply
compressing the whole transport channel.  It could be that for small batch
sizes compression mostly goes unused.  But if it is seen as valuable we
could certainly incorporate a batch-level aspect as well.
- At the buffer level you can use more specialized compression techniques
that don't require larger amounts of data to be effective.  For example,
there is a JIRA open to consider using PFOR [1] which, if I understand
correctly, starts being effective once you have ~128 integers (see the
sketch below).
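
To make the buffer-level point concrete, here is a minimal frame-of-reference
sketch (illustrative only; it is not the FastPFor library referenced above,
and the struct/function names are hypothetical).  The reference value is
subtracted out so the remaining deltas need fewer bits, while element access
stays O(1):

#include <algorithm>
#include <cstdint>
#include <vector>

// Frame-of-reference (FOR) encoding of an int64 buffer: store the minimum
// once as the "reference" and the remaining values as small unsigned deltas,
// which can then be bit-packed or compressed further.
struct ForEncoded {
  int64_t reference = 0;         // minimum of the original values
  std::vector<uint32_t> deltas;  // value - reference (assumed to fit in 32 bits)
};

ForEncoded ForEncode(const std::vector<int64_t>& values) {
  ForEncoded out;
  if (values.empty()) return out;
  out.reference = *std::min_element(values.begin(), values.end());
  out.deltas.reserve(values.size());
  for (int64_t v : values) {
    out.deltas.push_back(static_cast<uint32_t>(v - out.reference));
  }
  return out;
}

// Random access stays O(1): a single element decodes without touching the rest.
int64_t ForDecodeAt(const ForEncoded& enc, size_t i) {
  return enc.reference + static_cast<int64_t>(enc.deltas[i]);
}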

Random thought: what do you think of defining this at the transport level
> rather than the record batch level? (e.g. in Arrow Flight). This is one way
> to avoid extending the core record batch concept with something that isn't
> related to processing (at least in your initial proposal)


Per above, this seems like a reasonable approach to me if we want to hold
off on buffer-level compression.  Another use-case for buffer/record-batch
level compression would be the Feather file format, where only a subset of
columns/rows needs to be decompressed.  If this use-case isn't compelling,
I'd be happy to hold off adding compression to sparse batches until we have
benchmarks showing the trade-off between channel-level and buffer-level
compression.

If we implement buffer-level encodings, we should also see a decent size win
even without compression.
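
As a rough illustration of the per-buffer approach (assuming zstd, with
made-up helper names; the actual metadata layout is what the PR proposes),
each buffer is compressed independently, so a Feather-style reader could skip
decompression entirely for columns it does not read:

#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>
#include <zstd.h>

// Compress a single Arrow buffer on its own; other buffers in the batch are
// untouched, so they can be decompressed (or skipped) independently.
std::vector<uint8_t> CompressBuffer(const uint8_t* data, size_t size,
                                    int level = 3) {
  std::vector<uint8_t> out(ZSTD_compressBound(size));
  size_t n = ZSTD_compress(out.data(), out.size(), data, size, level);
  if (ZSTD_isError(n)) throw std::runtime_error(ZSTD_getErrorName(n));
  out.resize(n);
  return out;
}

// The uncompressed size would come from the (uncompressed) buffer metadata.
std::vector<uint8_t> DecompressBuffer(const std::vector<uint8_t>& compressed,
                                      size_t original_size) {
  std::vector<uint8_t> out(original_size);
  size_t n = ZSTD_decompress(out.data(), out.size(), compressed.data(),
                             compressed.size());
  if (ZSTD_isError(n)) throw std::runtime_error(ZSTD_getErrorName(n));
  out.resize(n);
  return out;
}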

Thanks,
Micah

[1] https://github.com/lemire/FastPFor

On Fri, Jul 5, 2019 at 1:48 PM Jacques Nadeau <ja...@apache.org> wrote:

> One question and a random thought:
>
> What is the driving force for transport compression? Are you seeing that
> as a major bottleneck in particular circumstances? (I'm not disagreeing,
> just want to clearly define the particular problem you're worried about.)
>
> Random thought: what do you think of defining this at the transport level
> rather than the record batch level? (e.g. in Arrow Flight). This is one way
> to avoid extending the core record batch concept with something that isn't
> related to processing (at least in your initial proposal).
>

Re: [Discuss] Format additions to Arrow for sparse data and data integrity

Posted by Jacques Nadeau <ja...@apache.org>.
One question and a random thought:

What is the driving force for transport compression? Are you seeing that as
a major bottleneck in particular circumstances? (I'm not disagreeing, just
want to clearly define the particular problem you're worried about.)

Random thought: what do you think of defining this at the transport level
rather than the record batch level? (e.g. in Arrow Flight). This is one way
to avoid extending the core record batch concept with something that isn't
related to processing (at least in your initial proposal).

Re: [Discuss] Format additions to Arrow for sparse data and data integrity

Posted by Micah Kornfield <em...@gmail.com>.
Hi Jacques,
Thanks for the quick response.

I don't think most of this should be targeted for 1.0. It is a lot of
> change/enhancement and seems like it would likely substantially delay 1.0.


I agree it shouldn't block 1.0.  I think time-based releases are working
well for the community.  But if the features are implemented in Java and
C++ with integration tests between the two in time for 1.0, should we
explicitly rule it out?  If not for 1.0, would the subsequent release make
sense?

You suggested that this be done on the buffer level but it seems like that
> maybe too narrow depending on batch size? What is the thinking here about
> tradeoffs around message versus batch.


Two reasons for this proposal:
- I'm not sure there is much added value at the batch level vs simply
compressing the whole transport channel.  It could be that for small batch
sizes compression mostly goes unused.  But if it is seen as valuable we
could certainly incorporate a batch-level aspect as well.
- At the buffer level you can potentially use more specialized compression
techniques that don't require larger amounts of data to be effective.  For
example, there is a JIRA open to consider using PFOR [1] which, if I
understand correctly, starts being effective once you have ~128 integers.

Thanks,
Micah

[1] https://github.com/lemire/FastPFor




On Fri, Jul 5, 2019 at 12:38 PM Jacques Nadeau <ja...@apache.org> wrote:

> Initial thought: I don't think most of this should be targeted for 1.0. It
> is a lot of change/enhancement and seems like it would likely substantially
> delay 1.0. The one piece that seems least disruptive would be basic on the
> wire compression. You suggested that this be done on the buffer level but
> it seems like that maybe too narrow depending on batch size? What is the
> thinking here about tradeoffs around message versus batch. When pipelining,
> we target relatively small batches typically of 256k-1mb. Sometimes we
> might go up to 10mb but that is a pretty rare use case.
>
> On Fri, Jul 5, 2019 at 12:32 PM Jacques Nadeau <ja...@apache.org> wrote:
>
>> Hey Micah, your formatting seems to be messed up on this mail. Some
>> kind of copy/paste error?
>>
>> On Fri, Jul 5, 2019 at 11:54 AM Micah Kornfield <em...@gmail.com>
>> wrote:
>>
>>> [...]
>>

Re: [Discuss] Format additions to Arrow for sparse data and data integrity

Posted by Jacques Nadeau <ja...@apache.org>.
Initial thought: I don't think most of this should be targeted for 1.0. It
is a lot of change/enhancement and seems like it would likely substantially
delay 1.0. The one piece that seems least disruptive would be basic
on-the-wire compression. You suggested that this be done at the buffer
level, but it seems like that may be too narrow depending on batch size?
What is the thinking here about the tradeoffs around message versus batch?
When pipelining, we target relatively small batches, typically 256KB-1MB.
Sometimes we might go up to 10MB, but that is a pretty rare use case.

On Fri, Jul 5, 2019 at 12:32 PM Jacques Nadeau <ja...@apache.org> wrote:

> Hey Micah, your formatting seems to be messed up on this mail. Some kind
> of copy/paste error?
>
> On Fri, Jul 5, 2019 at 11:54 AM Micah Kornfield <em...@gmail.com>
> wrote:
>
>> [...]
>>
>

Re: [Discuss] Format additions to Arrow for sparse data and data integrity

Posted by Micah Kornfield <em...@gmail.com>.
Strange, I've pasted the contents into a Google document at [1].

[1]
https://docs.google.com/document/d/1uJzWh63Iqk7FRbElHPhHrsmlfe0NIJ6M8-0kejPmwIw/edit


On Fri, Jul 5, 2019 at 12:32 PM Jacques Nadeau <ja...@apache.org> wrote:

> Hey Micah, your formatting seems to be messed up on this mail. Some kind
> of copy/paste error?
>
> On Fri, Jul 5, 2019 at 11:54 AM Micah Kornfield <em...@gmail.com>
> wrote:
>
> > [...]
> >
>

Re: [Discuss] Format additions to Arrow for sparse data and data integrity

Posted by Jacques Nadeau <ja...@apache.org>.
Hey Micah, your formatting seems to be messed up on this mail. Some kind
of copy/paste error?

On Fri, Jul 5, 2019 at 11:54 AM Micah Kornfield <em...@gmail.com>
wrote:

> [...]
>

Re: [Discuss] Format additions to Arrow for sparse data and data integrity

Posted by Antoine Pitrou <an...@python.org>.
Hi Micah,

On 05/07/2019 at 20:53, Micah Kornfield wrote:
> 
> Going into more details on the specific features in the PR:
> 
>    1.
> 
>    Sparse encodings for arrays and buffers.  The guiding principles behind
>    the suggested encodings are to support encodings that can be exploited by
>    compute engines for more efficient computation (I don’t think parquet style
>    bit-packing belongs in Arrow).

How does "more efficient computation" play out for operations such as
hash or join?
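
(For reference, the sublinear-access claim quoted above can be made concrete
with a small sketch, assuming int32 run ends and hypothetical names: a
run-end encoded array can still be probed element by element via binary
search, which is the property a hash or join kernel would rely on.)

#include <algorithm>
#include <cstdint>
#include <vector>

// Run-end encoded values: run_ends[i] is the exclusive logical end index of
// run i, and values[i] is the value repeated throughout that run.
struct RunEndEncoded {
  std::vector<int32_t> run_ends;
  std::vector<int64_t> values;
};

// O(log N) random access: find the first run whose end exceeds the index.
// Assumes 0 <= logical_index < run_ends.back().
int64_t ValueAt(const RunEndEncoded& arr, int32_t logical_index) {
  auto it = std::upper_bound(arr.run_ends.begin(), arr.run_ends.end(),
                             logical_index);
  return arr.values[it - arr.run_ends.begin()];
}

// Example: logical array [7, 7, 7, 9, 9] -> run_ends = {3, 5}, values = {7, 9};
// ValueAt(arr, 4) == 9.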

>          2.
> 
>    Data compression.  Similar to encodings but compression is solely for
>    reduction of data at rest/on the wire.  The proposal is to allow
>    compression of individual buffers. Right now zstd is proposed, but I don’t
>    feel strongly on the specific technologies here.

Is it useful at the Arrow format level? Any transmission layer can add
its own compression, especially a general-purpose one such as zstd or lz4.
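
(As a sketch of the transmission-layer option being alluded to, assuming
plain gRPC rather than anything Flight-specific: compression can be enabled
on the channel itself, independent of the Arrow format. gRPC's built-in
algorithms do not include zstd, so gzip stands in here purely for
illustration.)

#include <memory>
#include <string>
#include <grpcpp/grpcpp.h>

// Channel-level compression: every message on the channel is compressed by
// the transport, with no changes to the record batch metadata.
std::shared_ptr<grpc::Channel> MakeCompressedChannel(const std::string& target) {
  grpc::ChannelArguments args;
  args.SetCompressionAlgorithm(GRPC_COMPRESS_GZIP);
  return grpc::CreateCustomChannel(target, grpc::InsecureChannelCredentials(),
                                   args);
}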

>    4.
> 
>    Data Integrity.  While the arrow file format isn’t meant for archiving
>    data, I think it is important to allow for optional native data integrity
>    checks in the format.  To this end, I proposed a new “Digest” message type
>    that can be added after other messages to record a digest/hash of the
>    preceding data. I suggested xxhash, but I don’t have a strong opinion here,
>    as long as there is some minimal support that can potentially be expanded
>    later.

This sounds potentially useful, though one question is whether this
occurs at the table level, column level, sequential array level, etc.
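
(Whatever granularity is chosen, the mechanics are small; a minimal sketch
assuming xxhash64 and hypothetical helper names, with the digest covering the
serialized bytes of whatever unit precedes it, a message in this case:)

#include <cstdint>
#include <vector>
#include <xxhash.h>

// Compute a 64-bit digest over the bytes being protected; the hash function
// and seed would be recorded alongside the digest in the metadata.
uint64_t DigestOf(const std::vector<uint8_t>& bytes) {
  return XXH64(bytes.data(), bytes.size(), /*seed=*/0);
}

// A reader recomputes the digest and compares before trusting the data.
bool Verify(const std::vector<uint8_t>& bytes, uint64_t expected) {
  return DigestOf(bytes) == expected;
}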

> As a practical matter the proposal represents a lot of work to get an MVP
> working in time for 1.0.0 release (provided they are accepted by the
> community), so I'd greatly appreciate if anyone wants to collaborate on
> this.

I don't think this is workable for 1.0.0.  The plan currently is for
1.0.0 to come out reasonably "quickly" after 0.14.0, i.e. perhaps in 6-8
weeks?

Regards

Antoine.