You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@arrow.apache.org by Antoine Pitrou <an...@python.org> on 2020/01/23 17:23:26 UTC

Re: [DISCUSS] Format additions for encoding/compression

Le 23/01/2020 à 18:16, John Muehlhausen a écrit :
> Perhaps related to this thread, are there any current or proposed tools to
> transform columns for fixed-length data types according to a "shuffle?"
>  For precedent see the implementation of the shuffle filter in hdf5.
> https://support.hdfgroup.org/ftp/HDF5//documentation/doc1.6/TechNotes/shuffling-algorithm-report.pdf
> 
> For example, the column (length 3) would store bytes 00 00 00 00 00 00 00
> 00 00 01 02 03 to represent the three 32-bit numbers 00 00 00 01 00 00 00
> 02 00 00 00 03  (I'm writing big-endian even if that is not actually the
> case).
> 
> Value(1) would return 00 00 00 02 by referring to some metadata flag that
> the column is shuffled, stitching the bytes back together at call time.
> 
> Thus if the column pages were backed by a memory map to something like
> zfs/gzip-9 (my actual use-case), one would expect approx 30% savings in
> underlying disk usage due to better run lengths.
> 
> It would enable a space/time tradeoff that could be useful?  The filesystem
> itself cannot easily do this particular compression transform since it
> benefits from knowing the shape of the data.

For the record, there's a pull request adding this encoding to the
Parquet C++ specification.

Regards

Antoine.

Re: [DISCUSS] Format additions for encoding/compression

Posted by Micah Kornfield <em...@gmail.com>.
Great John, I'd be interesting to hear about progress.

Also, IMO I think we should be only focusing on encoding that have the
potential to be exploited for computational benefits (not just
compressibility).  I think this is what distinguishes Arrow from other
formats like Parquet. I think this echos some others sentiments on this
thread.

Cheers,
Micah

On Fri, Jan 24, 2020 at 8:28 AM John Muehlhausen <jg...@jgm.org> wrote:

> Thanks Micah, I will see if I can find some time to explore this further.
>
> On Thu, Jan 23, 2020 at 10:56 PM Micah Kornfield <em...@gmail.com>
> wrote:
>
>> Hi John,
>> Not Wes, but my thoughts on this are as follows:
>>
>> 1. Alternate bit/byte arrangements can also be useful for processing [1]
>> in
>> addition to compression.
>> 2. I think they are quite a bit more complicated then the existing schemes
>> proposed in [2], so I think it would be more expedient to get the
>> integration hooks necessary to work with simpler encodings before going
>> with something more complex.  I believe the proposal is generic enough to
>> support this type of encoding.
>> 3. For prototyping, this seems like a potential use of the ExtensionType
>> [3] type mechanism already in the specification.
>> 4. I don't think these should be new types or part of the basic Array data
>> structure.  I think having a different container format in the form of
>> "SparseRecordBatch" (or perhaps it should be renamed to
>> EncodedRecordBatch)
>> and keeping the existing types with alternate encodings is a better
>> option.
>>
>> That being said if you have bandwidth to get this working for C++ and Java
>> we can potentially setup a separate development branch to see how it
>> evolves.  Personally, I've not brought my proposal up for discussion
>> again,
>> because I haven't had bandwidth to work on it, but I still think
>> introducing some level of alternate encodings is a good idea.
>>
>> Cheers,
>> Micah
>>
>>
>> [1]
>>
>> https://15721.courses.cs.cmu.edu/spring2018/papers/22-vectorization2/p31-feng.pdf
>> [2] https://github.com/apache/arrow/pull/4815
>> [3]
>>
>> https://github.com/apache/arrow/blob/master/docs/source/format/Columnar.rst#extension-types
>>
>> On Thu, Jan 23, 2020 at 11:36 AM John Muehlhausen <jg...@jgm.org> wrote:
>>
>> > Wes, what do you think about Arrow supporting a new suite of
>> fixed-length
>> > data types that unshuffle on column->Value(i) calls?  This would allow
>> > memory/swap compressors and memory maps backed by compressing
>> > filesystems (ZFS) or block devices (VDO) to operate more efficiently.
>> >
>> > By doing it with new datatypes there is no separate flag to check?
>> >
>> > On Thu, Jan 23, 2020 at 1:09 PM Wes McKinney <we...@gmail.com>
>> wrote:
>> >
>> > > On Thu, Jan 23, 2020 at 12:42 PM John Muehlhausen <jg...@jgm.org>
>> wrote:
>> > > >
>> > > > Again, I know very little about Parquet, so your patience is
>> > appreciated.
>> > > >
>> > > > At the moment I can Arrow/mmap a file without having anywhere
>> nearly as
>> > > > much available memory as the file size.  I can visit random place in
>> > the
>> > > > file (such as a binary search if it is ordered) and only the
>> locations
>> > > > visited by column->Value(i) are paged in.  Paging them out happens
>> > > without
>> > > > my awareness, if necessary.
>> > > >
>> > > > Does Parquet cover this use-case with the same elegance and at least
>> > > equal
>> > > > efficiency, or are there more copies/conversions?  Perhaps it
>> requires
>> > > the
>> > > > entire file to be transformed into Arrow memory at the beginning? Or
>> > on a
>> > > > batch/block basis? Or to get this I need to use a non-Arrow API for
>> > data
>> > > > element access?  Etc.
>> > >
>> > > Data has to be materialized / deserialized from the Parquet file on a
>> > > batch-wise per-column basis. The APIs we provide allow batches of
>> > > values to be read for a given subset of columns
>> > >
>> > > >
>> > > > IFF it covers the above use-case, which does not mention
>> compression or
>> > > > encoding, then I could consider whether it is interesting on those
>> > > points.
>> > >
>> > > My point really has to do with Parquet's design which is about
>> > > reducing file size. In the following blog post
>> > >
>> > > https://ursalabs.org/blog/2019-10-columnar-perf/
>> > >
>> > > I examined a dataset which is about 4GB as raw Arrow stream/file but
>> > > only 114 MB as a Parquet file. A 30+X compression ratio is a huge deal
>> > > if you are working with filesystems that yield < 500MB/s (which
>> > > includes pretty much all cloud filesystems AFAIK). In clickstream
>> > > analytics this kind of compression ratio is not unusual.
>> > >
>> > > >
>> > > > -John
>> > > >
>> > > > On Thu, Jan 23, 2020 at 12:06 PM Francois Saint-Jacques <
>> > > > fsaintjacques@gmail.com> wrote:
>> > > >
>> > > > > What's the point of having zero copy if the OS is doing the
>> > > > > decompression in kernel (which trumps the zero-copy argument)? You
>> > > > > might as well just use parquet without filesystem compression. I
>> > > > > prefer to have compression algorithm where the columnar engine can
>> > > > > benefit from it [1] than marginally improving a file-system-os
>> > > > > specific feature.
>> > > > >
>> > > > > François
>> > > > >
>> > > > > [1] Section 4.3
>> http://db.csail.mit.edu/pubs/abadi-column-stores.pdf
>> > > > >
>> > > > >
>> > > > >
>> > > > >
>> > > > > On Thu, Jan 23, 2020 at 12:43 PM John Muehlhausen <jg...@jgm.org>
>> > wrote:
>> > > > > >
>> > > > > > This could also have utility in memory via things like
>> zram/zswap,
>> > > right?
>> > > > > > Mac also has a memory compressor?
>> > > > > >
>> > > > > > I don't think Parquet is an option for me unless the integration
>> > with
>> > > > > Arrow
>> > > > > > is tighter than I imagine (i.e. zero-copy).  That said, I
>> confess I
>> > > know
>> > > > > > next to nothing about Parquet.
>> > > > > >
>> > > > > > On Thu, Jan 23, 2020 at 11:23 AM Antoine Pitrou <
>> > antoine@python.org>
>> > > > > wrote:
>> > > > > > >
>> > > > > > >
>> > > > > > > Le 23/01/2020 à 18:16, John Muehlhausen a écrit :
>> > > > > > > > Perhaps related to this thread, are there any current or
>> > proposed
>> > > > > tools
>> > > > > > to
>> > > > > > > > transform columns for fixed-length data types according to a
>> > > > > "shuffle?"
>> > > > > > > >  For precedent see the implementation of the shuffle filter
>> in
>> > > hdf5.
>> > > > > > > >
>> > > > > >
>> > > > >
>> > >
>> >
>> https://support.hdfgroup.org/ftp/HDF5//documentation/doc1.6/TechNotes/shuffling-algorithm-report.pdf
>> > > > > > > >
>> > > > > > > > For example, the column (length 3) would store bytes 00 00
>> 00
>> > 00
>> > > 00
>> > > > > 00
>> > > > > > 00
>> > > > > > > > 00 00 01 02 03 to represent the three 32-bit numbers 00 00
>> 00
>> > 01
>> > > 00
>> > > > > 00
>> > > > > > 00
>> > > > > > > > 02 00 00 00 03  (I'm writing big-endian even if that is not
>> > > actually
>> > > > > the
>> > > > > > > > case).
>> > > > > > > >
>> > > > > > > > Value(1) would return 00 00 00 02 by referring to some
>> metadata
>> > > flag
>> > > > > > that
>> > > > > > > > the column is shuffled, stitching the bytes back together at
>> > call
>> > > > > time.
>> > > > > > > >
>> > > > > > > > Thus if the column pages were backed by a memory map to
>> > something
>> > > > > like
>> > > > > > > > zfs/gzip-9 (my actual use-case), one would expect approx 30%
>> > > savings
>> > > > > in
>> > > > > > > > underlying disk usage due to better run lengths.
>> > > > > > > >
>> > > > > > > > It would enable a space/time tradeoff that could be useful?
>> > The
>> > > > > > filesystem
>> > > > > > > > itself cannot easily do this particular compression
>> transform
>> > > since
>> > > > > it
>> > > > > > > > benefits from knowing the shape of the data.
>> > > > > > >
>> > > > > > > For the record, there's a pull request adding this encoding to
>> > the
>> > > > > > > Parquet C++ specification.
>> > > > > > >
>> > > > > > > Regards
>> > > > > > >
>> > > > > > > Antoine.
>> > > > >
>> > >
>> >
>>
>

Re: [DISCUSS] Format additions for encoding/compression

Posted by John Muehlhausen <jg...@jgm.org>.
Thanks Micah, I will see if I can find some time to explore this further.

On Thu, Jan 23, 2020 at 10:56 PM Micah Kornfield <em...@gmail.com>
wrote:

> Hi John,
> Not Wes, but my thoughts on this are as follows:
>
> 1. Alternate bit/byte arrangements can also be useful for processing [1] in
> addition to compression.
> 2. I think they are quite a bit more complicated then the existing schemes
> proposed in [2], so I think it would be more expedient to get the
> integration hooks necessary to work with simpler encodings before going
> with something more complex.  I believe the proposal is generic enough to
> support this type of encoding.
> 3. For prototyping, this seems like a potential use of the ExtensionType
> [3] type mechanism already in the specification.
> 4. I don't think these should be new types or part of the basic Array data
> structure.  I think having a different container format in the form of
> "SparseRecordBatch" (or perhaps it should be renamed to EncodedRecordBatch)
> and keeping the existing types with alternate encodings is a better option.
>
> That being said if you have bandwidth to get this working for C++ and Java
> we can potentially setup a separate development branch to see how it
> evolves.  Personally, I've not brought my proposal up for discussion again,
> because I haven't had bandwidth to work on it, but I still think
> introducing some level of alternate encodings is a good idea.
>
> Cheers,
> Micah
>
>
> [1]
>
> https://15721.courses.cs.cmu.edu/spring2018/papers/22-vectorization2/p31-feng.pdf
> [2] https://github.com/apache/arrow/pull/4815
> [3]
>
> https://github.com/apache/arrow/blob/master/docs/source/format/Columnar.rst#extension-types
>
> On Thu, Jan 23, 2020 at 11:36 AM John Muehlhausen <jg...@jgm.org> wrote:
>
> > Wes, what do you think about Arrow supporting a new suite of fixed-length
> > data types that unshuffle on column->Value(i) calls?  This would allow
> > memory/swap compressors and memory maps backed by compressing
> > filesystems (ZFS) or block devices (VDO) to operate more efficiently.
> >
> > By doing it with new datatypes there is no separate flag to check?
> >
> > On Thu, Jan 23, 2020 at 1:09 PM Wes McKinney <we...@gmail.com>
> wrote:
> >
> > > On Thu, Jan 23, 2020 at 12:42 PM John Muehlhausen <jg...@jgm.org> wrote:
> > > >
> > > > Again, I know very little about Parquet, so your patience is
> > appreciated.
> > > >
> > > > At the moment I can Arrow/mmap a file without having anywhere nearly
> as
> > > > much available memory as the file size.  I can visit random place in
> > the
> > > > file (such as a binary search if it is ordered) and only the
> locations
> > > > visited by column->Value(i) are paged in.  Paging them out happens
> > > without
> > > > my awareness, if necessary.
> > > >
> > > > Does Parquet cover this use-case with the same elegance and at least
> > > equal
> > > > efficiency, or are there more copies/conversions?  Perhaps it
> requires
> > > the
> > > > entire file to be transformed into Arrow memory at the beginning? Or
> > on a
> > > > batch/block basis? Or to get this I need to use a non-Arrow API for
> > data
> > > > element access?  Etc.
> > >
> > > Data has to be materialized / deserialized from the Parquet file on a
> > > batch-wise per-column basis. The APIs we provide allow batches of
> > > values to be read for a given subset of columns
> > >
> > > >
> > > > IFF it covers the above use-case, which does not mention compression
> or
> > > > encoding, then I could consider whether it is interesting on those
> > > points.
> > >
> > > My point really has to do with Parquet's design which is about
> > > reducing file size. In the following blog post
> > >
> > > https://ursalabs.org/blog/2019-10-columnar-perf/
> > >
> > > I examined a dataset which is about 4GB as raw Arrow stream/file but
> > > only 114 MB as a Parquet file. A 30+X compression ratio is a huge deal
> > > if you are working with filesystems that yield < 500MB/s (which
> > > includes pretty much all cloud filesystems AFAIK). In clickstream
> > > analytics this kind of compression ratio is not unusual.
> > >
> > > >
> > > > -John
> > > >
> > > > On Thu, Jan 23, 2020 at 12:06 PM Francois Saint-Jacques <
> > > > fsaintjacques@gmail.com> wrote:
> > > >
> > > > > What's the point of having zero copy if the OS is doing the
> > > > > decompression in kernel (which trumps the zero-copy argument)? You
> > > > > might as well just use parquet without filesystem compression. I
> > > > > prefer to have compression algorithm where the columnar engine can
> > > > > benefit from it [1] than marginally improving a file-system-os
> > > > > specific feature.
> > > > >
> > > > > François
> > > > >
> > > > > [1] Section 4.3
> http://db.csail.mit.edu/pubs/abadi-column-stores.pdf
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > On Thu, Jan 23, 2020 at 12:43 PM John Muehlhausen <jg...@jgm.org>
> > wrote:
> > > > > >
> > > > > > This could also have utility in memory via things like
> zram/zswap,
> > > right?
> > > > > > Mac also has a memory compressor?
> > > > > >
> > > > > > I don't think Parquet is an option for me unless the integration
> > with
> > > > > Arrow
> > > > > > is tighter than I imagine (i.e. zero-copy).  That said, I
> confess I
> > > know
> > > > > > next to nothing about Parquet.
> > > > > >
> > > > > > On Thu, Jan 23, 2020 at 11:23 AM Antoine Pitrou <
> > antoine@python.org>
> > > > > wrote:
> > > > > > >
> > > > > > >
> > > > > > > Le 23/01/2020 à 18:16, John Muehlhausen a écrit :
> > > > > > > > Perhaps related to this thread, are there any current or
> > proposed
> > > > > tools
> > > > > > to
> > > > > > > > transform columns for fixed-length data types according to a
> > > > > "shuffle?"
> > > > > > > >  For precedent see the implementation of the shuffle filter
> in
> > > hdf5.
> > > > > > > >
> > > > > >
> > > > >
> > >
> >
> https://support.hdfgroup.org/ftp/HDF5//documentation/doc1.6/TechNotes/shuffling-algorithm-report.pdf
> > > > > > > >
> > > > > > > > For example, the column (length 3) would store bytes 00 00 00
> > 00
> > > 00
> > > > > 00
> > > > > > 00
> > > > > > > > 00 00 01 02 03 to represent the three 32-bit numbers 00 00 00
> > 01
> > > 00
> > > > > 00
> > > > > > 00
> > > > > > > > 02 00 00 00 03  (I'm writing big-endian even if that is not
> > > actually
> > > > > the
> > > > > > > > case).
> > > > > > > >
> > > > > > > > Value(1) would return 00 00 00 02 by referring to some
> metadata
> > > flag
> > > > > > that
> > > > > > > > the column is shuffled, stitching the bytes back together at
> > call
> > > > > time.
> > > > > > > >
> > > > > > > > Thus if the column pages were backed by a memory map to
> > something
> > > > > like
> > > > > > > > zfs/gzip-9 (my actual use-case), one would expect approx 30%
> > > savings
> > > > > in
> > > > > > > > underlying disk usage due to better run lengths.
> > > > > > > >
> > > > > > > > It would enable a space/time tradeoff that could be useful?
> > The
> > > > > > filesystem
> > > > > > > > itself cannot easily do this particular compression transform
> > > since
> > > > > it
> > > > > > > > benefits from knowing the shape of the data.
> > > > > > >
> > > > > > > For the record, there's a pull request adding this encoding to
> > the
> > > > > > > Parquet C++ specification.
> > > > > > >
> > > > > > > Regards
> > > > > > >
> > > > > > > Antoine.
> > > > >
> > >
> >
>

Re: [DISCUSS] Format additions for encoding/compression

Posted by Micah Kornfield <em...@gmail.com>.
Hi John,
Not Wes, but my thoughts on this are as follows:

1. Alternate bit/byte arrangements can also be useful for processing [1] in
addition to compression.
2. I think they are quite a bit more complicated then the existing schemes
proposed in [2], so I think it would be more expedient to get the
integration hooks necessary to work with simpler encodings before going
with something more complex.  I believe the proposal is generic enough to
support this type of encoding.
3. For prototyping, this seems like a potential use of the ExtensionType
[3] type mechanism already in the specification.
4. I don't think these should be new types or part of the basic Array data
structure.  I think having a different container format in the form of
"SparseRecordBatch" (or perhaps it should be renamed to EncodedRecordBatch)
and keeping the existing types with alternate encodings is a better option.

That being said if you have bandwidth to get this working for C++ and Java
we can potentially setup a separate development branch to see how it
evolves.  Personally, I've not brought my proposal up for discussion again,
because I haven't had bandwidth to work on it, but I still think
introducing some level of alternate encodings is a good idea.

Cheers,
Micah


[1]
https://15721.courses.cs.cmu.edu/spring2018/papers/22-vectorization2/p31-feng.pdf
[2] https://github.com/apache/arrow/pull/4815
[3]
https://github.com/apache/arrow/blob/master/docs/source/format/Columnar.rst#extension-types

On Thu, Jan 23, 2020 at 11:36 AM John Muehlhausen <jg...@jgm.org> wrote:

> Wes, what do you think about Arrow supporting a new suite of fixed-length
> data types that unshuffle on column->Value(i) calls?  This would allow
> memory/swap compressors and memory maps backed by compressing
> filesystems (ZFS) or block devices (VDO) to operate more efficiently.
>
> By doing it with new datatypes there is no separate flag to check?
>
> On Thu, Jan 23, 2020 at 1:09 PM Wes McKinney <we...@gmail.com> wrote:
>
> > On Thu, Jan 23, 2020 at 12:42 PM John Muehlhausen <jg...@jgm.org> wrote:
> > >
> > > Again, I know very little about Parquet, so your patience is
> appreciated.
> > >
> > > At the moment I can Arrow/mmap a file without having anywhere nearly as
> > > much available memory as the file size.  I can visit random place in
> the
> > > file (such as a binary search if it is ordered) and only the locations
> > > visited by column->Value(i) are paged in.  Paging them out happens
> > without
> > > my awareness, if necessary.
> > >
> > > Does Parquet cover this use-case with the same elegance and at least
> > equal
> > > efficiency, or are there more copies/conversions?  Perhaps it requires
> > the
> > > entire file to be transformed into Arrow memory at the beginning? Or
> on a
> > > batch/block basis? Or to get this I need to use a non-Arrow API for
> data
> > > element access?  Etc.
> >
> > Data has to be materialized / deserialized from the Parquet file on a
> > batch-wise per-column basis. The APIs we provide allow batches of
> > values to be read for a given subset of columns
> >
> > >
> > > IFF it covers the above use-case, which does not mention compression or
> > > encoding, then I could consider whether it is interesting on those
> > points.
> >
> > My point really has to do with Parquet's design which is about
> > reducing file size. In the following blog post
> >
> > https://ursalabs.org/blog/2019-10-columnar-perf/
> >
> > I examined a dataset which is about 4GB as raw Arrow stream/file but
> > only 114 MB as a Parquet file. A 30+X compression ratio is a huge deal
> > if you are working with filesystems that yield < 500MB/s (which
> > includes pretty much all cloud filesystems AFAIK). In clickstream
> > analytics this kind of compression ratio is not unusual.
> >
> > >
> > > -John
> > >
> > > On Thu, Jan 23, 2020 at 12:06 PM Francois Saint-Jacques <
> > > fsaintjacques@gmail.com> wrote:
> > >
> > > > What's the point of having zero copy if the OS is doing the
> > > > decompression in kernel (which trumps the zero-copy argument)? You
> > > > might as well just use parquet without filesystem compression. I
> > > > prefer to have compression algorithm where the columnar engine can
> > > > benefit from it [1] than marginally improving a file-system-os
> > > > specific feature.
> > > >
> > > > François
> > > >
> > > > [1] Section 4.3 http://db.csail.mit.edu/pubs/abadi-column-stores.pdf
> > > >
> > > >
> > > >
> > > >
> > > > On Thu, Jan 23, 2020 at 12:43 PM John Muehlhausen <jg...@jgm.org>
> wrote:
> > > > >
> > > > > This could also have utility in memory via things like zram/zswap,
> > right?
> > > > > Mac also has a memory compressor?
> > > > >
> > > > > I don't think Parquet is an option for me unless the integration
> with
> > > > Arrow
> > > > > is tighter than I imagine (i.e. zero-copy).  That said, I confess I
> > know
> > > > > next to nothing about Parquet.
> > > > >
> > > > > On Thu, Jan 23, 2020 at 11:23 AM Antoine Pitrou <
> antoine@python.org>
> > > > wrote:
> > > > > >
> > > > > >
> > > > > > Le 23/01/2020 à 18:16, John Muehlhausen a écrit :
> > > > > > > Perhaps related to this thread, are there any current or
> proposed
> > > > tools
> > > > > to
> > > > > > > transform columns for fixed-length data types according to a
> > > > "shuffle?"
> > > > > > >  For precedent see the implementation of the shuffle filter in
> > hdf5.
> > > > > > >
> > > > >
> > > >
> >
> https://support.hdfgroup.org/ftp/HDF5//documentation/doc1.6/TechNotes/shuffling-algorithm-report.pdf
> > > > > > >
> > > > > > > For example, the column (length 3) would store bytes 00 00 00
> 00
> > 00
> > > > 00
> > > > > 00
> > > > > > > 00 00 01 02 03 to represent the three 32-bit numbers 00 00 00
> 01
> > 00
> > > > 00
> > > > > 00
> > > > > > > 02 00 00 00 03  (I'm writing big-endian even if that is not
> > actually
> > > > the
> > > > > > > case).
> > > > > > >
> > > > > > > Value(1) would return 00 00 00 02 by referring to some metadata
> > flag
> > > > > that
> > > > > > > the column is shuffled, stitching the bytes back together at
> call
> > > > time.
> > > > > > >
> > > > > > > Thus if the column pages were backed by a memory map to
> something
> > > > like
> > > > > > > zfs/gzip-9 (my actual use-case), one would expect approx 30%
> > savings
> > > > in
> > > > > > > underlying disk usage due to better run lengths.
> > > > > > >
> > > > > > > It would enable a space/time tradeoff that could be useful?
> The
> > > > > filesystem
> > > > > > > itself cannot easily do this particular compression transform
> > since
> > > > it
> > > > > > > benefits from knowing the shape of the data.
> > > > > >
> > > > > > For the record, there's a pull request adding this encoding to
> the
> > > > > > Parquet C++ specification.
> > > > > >
> > > > > > Regards
> > > > > >
> > > > > > Antoine.
> > > >
> >
>

Re: [DISCUSS] Format additions for encoding/compression

Posted by John Muehlhausen <jg...@jgm.org>.
Wes, what do you think about Arrow supporting a new suite of fixed-length
data types that unshuffle on column->Value(i) calls?  This would allow
memory/swap compressors and memory maps backed by compressing
filesystems (ZFS) or block devices (VDO) to operate more efficiently.

By doing it with new datatypes there is no separate flag to check?

On Thu, Jan 23, 2020 at 1:09 PM Wes McKinney <we...@gmail.com> wrote:

> On Thu, Jan 23, 2020 at 12:42 PM John Muehlhausen <jg...@jgm.org> wrote:
> >
> > Again, I know very little about Parquet, so your patience is appreciated.
> >
> > At the moment I can Arrow/mmap a file without having anywhere nearly as
> > much available memory as the file size.  I can visit random place in the
> > file (such as a binary search if it is ordered) and only the locations
> > visited by column->Value(i) are paged in.  Paging them out happens
> without
> > my awareness, if necessary.
> >
> > Does Parquet cover this use-case with the same elegance and at least
> equal
> > efficiency, or are there more copies/conversions?  Perhaps it requires
> the
> > entire file to be transformed into Arrow memory at the beginning? Or on a
> > batch/block basis? Or to get this I need to use a non-Arrow API for data
> > element access?  Etc.
>
> Data has to be materialized / deserialized from the Parquet file on a
> batch-wise per-column basis. The APIs we provide allow batches of
> values to be read for a given subset of columns
>
> >
> > IFF it covers the above use-case, which does not mention compression or
> > encoding, then I could consider whether it is interesting on those
> points.
>
> My point really has to do with Parquet's design which is about
> reducing file size. In the following blog post
>
> https://ursalabs.org/blog/2019-10-columnar-perf/
>
> I examined a dataset which is about 4GB as raw Arrow stream/file but
> only 114 MB as a Parquet file. A 30+X compression ratio is a huge deal
> if you are working with filesystems that yield < 500MB/s (which
> includes pretty much all cloud filesystems AFAIK). In clickstream
> analytics this kind of compression ratio is not unusual.
>
> >
> > -John
> >
> > On Thu, Jan 23, 2020 at 12:06 PM Francois Saint-Jacques <
> > fsaintjacques@gmail.com> wrote:
> >
> > > What's the point of having zero copy if the OS is doing the
> > > decompression in kernel (which trumps the zero-copy argument)? You
> > > might as well just use parquet without filesystem compression. I
> > > prefer to have compression algorithm where the columnar engine can
> > > benefit from it [1] than marginally improving a file-system-os
> > > specific feature.
> > >
> > > François
> > >
> > > [1] Section 4.3 http://db.csail.mit.edu/pubs/abadi-column-stores.pdf
> > >
> > >
> > >
> > >
> > > On Thu, Jan 23, 2020 at 12:43 PM John Muehlhausen <jg...@jgm.org> wrote:
> > > >
> > > > This could also have utility in memory via things like zram/zswap,
> right?
> > > > Mac also has a memory compressor?
> > > >
> > > > I don't think Parquet is an option for me unless the integration with
> > > Arrow
> > > > is tighter than I imagine (i.e. zero-copy).  That said, I confess I
> know
> > > > next to nothing about Parquet.
> > > >
> > > > On Thu, Jan 23, 2020 at 11:23 AM Antoine Pitrou <an...@python.org>
> > > wrote:
> > > > >
> > > > >
> > > > > Le 23/01/2020 à 18:16, John Muehlhausen a écrit :
> > > > > > Perhaps related to this thread, are there any current or proposed
> > > tools
> > > > to
> > > > > > transform columns for fixed-length data types according to a
> > > "shuffle?"
> > > > > >  For precedent see the implementation of the shuffle filter in
> hdf5.
> > > > > >
> > > >
> > >
> https://support.hdfgroup.org/ftp/HDF5//documentation/doc1.6/TechNotes/shuffling-algorithm-report.pdf
> > > > > >
> > > > > > For example, the column (length 3) would store bytes 00 00 00 00
> 00
> > > 00
> > > > 00
> > > > > > 00 00 01 02 03 to represent the three 32-bit numbers 00 00 00 01
> 00
> > > 00
> > > > 00
> > > > > > 02 00 00 00 03  (I'm writing big-endian even if that is not
> actually
> > > the
> > > > > > case).
> > > > > >
> > > > > > Value(1) would return 00 00 00 02 by referring to some metadata
> flag
> > > > that
> > > > > > the column is shuffled, stitching the bytes back together at call
> > > time.
> > > > > >
> > > > > > Thus if the column pages were backed by a memory map to something
> > > like
> > > > > > zfs/gzip-9 (my actual use-case), one would expect approx 30%
> savings
> > > in
> > > > > > underlying disk usage due to better run lengths.
> > > > > >
> > > > > > It would enable a space/time tradeoff that could be useful?  The
> > > > filesystem
> > > > > > itself cannot easily do this particular compression transform
> since
> > > it
> > > > > > benefits from knowing the shape of the data.
> > > > >
> > > > > For the record, there's a pull request adding this encoding to the
> > > > > Parquet C++ specification.
> > > > >
> > > > > Regards
> > > > >
> > > > > Antoine.
> > >
>

Re: [DISCUSS] Format additions for encoding/compression

Posted by Wes McKinney <we...@gmail.com>.
On Thu, Jan 23, 2020 at 12:42 PM John Muehlhausen <jg...@jgm.org> wrote:
>
> Again, I know very little about Parquet, so your patience is appreciated.
>
> At the moment I can Arrow/mmap a file without having anywhere nearly as
> much available memory as the file size.  I can visit random place in the
> file (such as a binary search if it is ordered) and only the locations
> visited by column->Value(i) are paged in.  Paging them out happens without
> my awareness, if necessary.
>
> Does Parquet cover this use-case with the same elegance and at least equal
> efficiency, or are there more copies/conversions?  Perhaps it requires the
> entire file to be transformed into Arrow memory at the beginning? Or on a
> batch/block basis? Or to get this I need to use a non-Arrow API for data
> element access?  Etc.

Data has to be materialized / deserialized from the Parquet file on a
batch-wise per-column basis. The APIs we provide allow batches of
values to be read for a given subset of columns

>
> IFF it covers the above use-case, which does not mention compression or
> encoding, then I could consider whether it is interesting on those points.

My point really has to do with Parquet's design which is about
reducing file size. In the following blog post

https://ursalabs.org/blog/2019-10-columnar-perf/

I examined a dataset which is about 4GB as raw Arrow stream/file but
only 114 MB as a Parquet file. A 30+X compression ratio is a huge deal
if you are working with filesystems that yield < 500MB/s (which
includes pretty much all cloud filesystems AFAIK). In clickstream
analytics this kind of compression ratio is not unusual.

>
> -John
>
> On Thu, Jan 23, 2020 at 12:06 PM Francois Saint-Jacques <
> fsaintjacques@gmail.com> wrote:
>
> > What's the point of having zero copy if the OS is doing the
> > decompression in kernel (which trumps the zero-copy argument)? You
> > might as well just use parquet without filesystem compression. I
> > prefer to have compression algorithm where the columnar engine can
> > benefit from it [1] than marginally improving a file-system-os
> > specific feature.
> >
> > François
> >
> > [1] Section 4.3 http://db.csail.mit.edu/pubs/abadi-column-stores.pdf
> >
> >
> >
> >
> > On Thu, Jan 23, 2020 at 12:43 PM John Muehlhausen <jg...@jgm.org> wrote:
> > >
> > > This could also have utility in memory via things like zram/zswap, right?
> > > Mac also has a memory compressor?
> > >
> > > I don't think Parquet is an option for me unless the integration with
> > Arrow
> > > is tighter than I imagine (i.e. zero-copy).  That said, I confess I know
> > > next to nothing about Parquet.
> > >
> > > On Thu, Jan 23, 2020 at 11:23 AM Antoine Pitrou <an...@python.org>
> > wrote:
> > > >
> > > >
> > > > Le 23/01/2020 à 18:16, John Muehlhausen a écrit :
> > > > > Perhaps related to this thread, are there any current or proposed
> > tools
> > > to
> > > > > transform columns for fixed-length data types according to a
> > "shuffle?"
> > > > >  For precedent see the implementation of the shuffle filter in hdf5.
> > > > >
> > >
> > https://support.hdfgroup.org/ftp/HDF5//documentation/doc1.6/TechNotes/shuffling-algorithm-report.pdf
> > > > >
> > > > > For example, the column (length 3) would store bytes 00 00 00 00 00
> > 00
> > > 00
> > > > > 00 00 01 02 03 to represent the three 32-bit numbers 00 00 00 01 00
> > 00
> > > 00
> > > > > 02 00 00 00 03  (I'm writing big-endian even if that is not actually
> > the
> > > > > case).
> > > > >
> > > > > Value(1) would return 00 00 00 02 by referring to some metadata flag
> > > that
> > > > > the column is shuffled, stitching the bytes back together at call
> > time.
> > > > >
> > > > > Thus if the column pages were backed by a memory map to something
> > like
> > > > > zfs/gzip-9 (my actual use-case), one would expect approx 30% savings
> > in
> > > > > underlying disk usage due to better run lengths.
> > > > >
> > > > > It would enable a space/time tradeoff that could be useful?  The
> > > filesystem
> > > > > itself cannot easily do this particular compression transform since
> > it
> > > > > benefits from knowing the shape of the data.
> > > >
> > > > For the record, there's a pull request adding this encoding to the
> > > > Parquet C++ specification.
> > > >
> > > > Regards
> > > >
> > > > Antoine.
> >

Re: [DISCUSS] Format additions for encoding/compression

Posted by John Muehlhausen <jg...@jgm.org>.
Again, I know very little about Parquet, so your patience is appreciated.

At the moment I can Arrow/mmap a file without having anywhere nearly as
much available memory as the file size.  I can visit random place in the
file (such as a binary search if it is ordered) and only the locations
visited by column->Value(i) are paged in.  Paging them out happens without
my awareness, if necessary.

Does Parquet cover this use-case with the same elegance and at least equal
efficiency, or are there more copies/conversions?  Perhaps it requires the
entire file to be transformed into Arrow memory at the beginning? Or on a
batch/block basis? Or to get this I need to use a non-Arrow API for data
element access?  Etc.

IFF it covers the above use-case, which does not mention compression or
encoding, then I could consider whether it is interesting on those points.

-John

On Thu, Jan 23, 2020 at 12:06 PM Francois Saint-Jacques <
fsaintjacques@gmail.com> wrote:

> What's the point of having zero copy if the OS is doing the
> decompression in kernel (which trumps the zero-copy argument)? You
> might as well just use parquet without filesystem compression. I
> prefer to have compression algorithm where the columnar engine can
> benefit from it [1] than marginally improving a file-system-os
> specific feature.
>
> François
>
> [1] Section 4.3 http://db.csail.mit.edu/pubs/abadi-column-stores.pdf
>
>
>
>
> On Thu, Jan 23, 2020 at 12:43 PM John Muehlhausen <jg...@jgm.org> wrote:
> >
> > This could also have utility in memory via things like zram/zswap, right?
> > Mac also has a memory compressor?
> >
> > I don't think Parquet is an option for me unless the integration with
> Arrow
> > is tighter than I imagine (i.e. zero-copy).  That said, I confess I know
> > next to nothing about Parquet.
> >
> > On Thu, Jan 23, 2020 at 11:23 AM Antoine Pitrou <an...@python.org>
> wrote:
> > >
> > >
> > > Le 23/01/2020 à 18:16, John Muehlhausen a écrit :
> > > > Perhaps related to this thread, are there any current or proposed
> tools
> > to
> > > > transform columns for fixed-length data types according to a
> "shuffle?"
> > > >  For precedent see the implementation of the shuffle filter in hdf5.
> > > >
> >
> https://support.hdfgroup.org/ftp/HDF5//documentation/doc1.6/TechNotes/shuffling-algorithm-report.pdf
> > > >
> > > > For example, the column (length 3) would store bytes 00 00 00 00 00
> 00
> > 00
> > > > 00 00 01 02 03 to represent the three 32-bit numbers 00 00 00 01 00
> 00
> > 00
> > > > 02 00 00 00 03  (I'm writing big-endian even if that is not actually
> the
> > > > case).
> > > >
> > > > Value(1) would return 00 00 00 02 by referring to some metadata flag
> > that
> > > > the column is shuffled, stitching the bytes back together at call
> time.
> > > >
> > > > Thus if the column pages were backed by a memory map to something
> like
> > > > zfs/gzip-9 (my actual use-case), one would expect approx 30% savings
> in
> > > > underlying disk usage due to better run lengths.
> > > >
> > > > It would enable a space/time tradeoff that could be useful?  The
> > filesystem
> > > > itself cannot easily do this particular compression transform since
> it
> > > > benefits from knowing the shape of the data.
> > >
> > > For the record, there's a pull request adding this encoding to the
> > > Parquet C++ specification.
> > >
> > > Regards
> > >
> > > Antoine.
>

Re: [DISCUSS] Format additions for encoding/compression

Posted by Wes McKinney <we...@gmail.com>.
Parquet is most relevant in scenarios filesystem IO is constrained
(spinning rust HDD, network FS, cloud storage / S3 / GCS). For those
use cases memory-mapped Arrow is not viable.

Against local NVMe (> 2000 MB/s read throughput) your mileage may vary.

On Thu, Jan 23, 2020 at 12:06 PM Francois Saint-Jacques
<fs...@gmail.com> wrote:
>
> What's the point of having zero copy if the OS is doing the
> decompression in kernel (which trumps the zero-copy argument)? You
> might as well just use parquet without filesystem compression. I
> prefer to have compression algorithm where the columnar engine can
> benefit from it [1] than marginally improving a file-system-os
> specific feature.
>
> François
>
> [1] Section 4.3 http://db.csail.mit.edu/pubs/abadi-column-stores.pdf
>
>
>
>
> On Thu, Jan 23, 2020 at 12:43 PM John Muehlhausen <jg...@jgm.org> wrote:
> >
> > This could also have utility in memory via things like zram/zswap, right?
> > Mac also has a memory compressor?
> >
> > I don't think Parquet is an option for me unless the integration with Arrow
> > is tighter than I imagine (i.e. zero-copy).  That said, I confess I know
> > next to nothing about Parquet.
> >
> > On Thu, Jan 23, 2020 at 11:23 AM Antoine Pitrou <an...@python.org> wrote:
> > >
> > >
> > > Le 23/01/2020 à 18:16, John Muehlhausen a écrit :
> > > > Perhaps related to this thread, are there any current or proposed tools
> > to
> > > > transform columns for fixed-length data types according to a "shuffle?"
> > > >  For precedent see the implementation of the shuffle filter in hdf5.
> > > >
> > https://support.hdfgroup.org/ftp/HDF5//documentation/doc1.6/TechNotes/shuffling-algorithm-report.pdf
> > > >
> > > > For example, the column (length 3) would store bytes 00 00 00 00 00 00
> > 00
> > > > 00 00 01 02 03 to represent the three 32-bit numbers 00 00 00 01 00 00
> > 00
> > > > 02 00 00 00 03  (I'm writing big-endian even if that is not actually the
> > > > case).
> > > >
> > > > Value(1) would return 00 00 00 02 by referring to some metadata flag
> > that
> > > > the column is shuffled, stitching the bytes back together at call time.
> > > >
> > > > Thus if the column pages were backed by a memory map to something like
> > > > zfs/gzip-9 (my actual use-case), one would expect approx 30% savings in
> > > > underlying disk usage due to better run lengths.
> > > >
> > > > It would enable a space/time tradeoff that could be useful?  The
> > filesystem
> > > > itself cannot easily do this particular compression transform since it
> > > > benefits from knowing the shape of the data.
> > >
> > > For the record, there's a pull request adding this encoding to the
> > > Parquet C++ specification.
> > >
> > > Regards
> > >
> > > Antoine.

Re: [DISCUSS] Format additions for encoding/compression

Posted by Francois Saint-Jacques <fs...@gmail.com>.
What's the point of having zero copy if the OS is doing the
decompression in kernel (which trumps the zero-copy argument)? You
might as well just use parquet without filesystem compression. I
prefer to have compression algorithm where the columnar engine can
benefit from it [1] than marginally improving a file-system-os
specific feature.

François

[1] Section 4.3 http://db.csail.mit.edu/pubs/abadi-column-stores.pdf




On Thu, Jan 23, 2020 at 12:43 PM John Muehlhausen <jg...@jgm.org> wrote:
>
> This could also have utility in memory via things like zram/zswap, right?
> Mac also has a memory compressor?
>
> I don't think Parquet is an option for me unless the integration with Arrow
> is tighter than I imagine (i.e. zero-copy).  That said, I confess I know
> next to nothing about Parquet.
>
> On Thu, Jan 23, 2020 at 11:23 AM Antoine Pitrou <an...@python.org> wrote:
> >
> >
> > Le 23/01/2020 à 18:16, John Muehlhausen a écrit :
> > > Perhaps related to this thread, are there any current or proposed tools
> to
> > > transform columns for fixed-length data types according to a "shuffle?"
> > >  For precedent see the implementation of the shuffle filter in hdf5.
> > >
> https://support.hdfgroup.org/ftp/HDF5//documentation/doc1.6/TechNotes/shuffling-algorithm-report.pdf
> > >
> > > For example, the column (length 3) would store bytes 00 00 00 00 00 00
> 00
> > > 00 00 01 02 03 to represent the three 32-bit numbers 00 00 00 01 00 00
> 00
> > > 02 00 00 00 03  (I'm writing big-endian even if that is not actually the
> > > case).
> > >
> > > Value(1) would return 00 00 00 02 by referring to some metadata flag
> that
> > > the column is shuffled, stitching the bytes back together at call time.
> > >
> > > Thus if the column pages were backed by a memory map to something like
> > > zfs/gzip-9 (my actual use-case), one would expect approx 30% savings in
> > > underlying disk usage due to better run lengths.
> > >
> > > It would enable a space/time tradeoff that could be useful?  The
> filesystem
> > > itself cannot easily do this particular compression transform since it
> > > benefits from knowing the shape of the data.
> >
> > For the record, there's a pull request adding this encoding to the
> > Parquet C++ specification.
> >
> > Regards
> >
> > Antoine.

Re: [DISCUSS] Format additions for encoding/compression

Posted by John Muehlhausen <jg...@jgm.org>.
This could also have utility in memory via things like zram/zswap, right?
Mac also has a memory compressor?

I don't think Parquet is an option for me unless the integration with Arrow
is tighter than I imagine (i.e. zero-copy).  That said, I confess I know
next to nothing about Parquet.

On Thu, Jan 23, 2020 at 11:23 AM Antoine Pitrou <an...@python.org> wrote:
>
>
> Le 23/01/2020 à 18:16, John Muehlhausen a écrit :
> > Perhaps related to this thread, are there any current or proposed tools
to
> > transform columns for fixed-length data types according to a "shuffle?"
> >  For precedent see the implementation of the shuffle filter in hdf5.
> >
https://support.hdfgroup.org/ftp/HDF5//documentation/doc1.6/TechNotes/shuffling-algorithm-report.pdf
> >
> > For example, the column (length 3) would store bytes 00 00 00 00 00 00
00
> > 00 00 01 02 03 to represent the three 32-bit numbers 00 00 00 01 00 00
00
> > 02 00 00 00 03  (I'm writing big-endian even if that is not actually the
> > case).
> >
> > Value(1) would return 00 00 00 02 by referring to some metadata flag
that
> > the column is shuffled, stitching the bytes back together at call time.
> >
> > Thus if the column pages were backed by a memory map to something like
> > zfs/gzip-9 (my actual use-case), one would expect approx 30% savings in
> > underlying disk usage due to better run lengths.
> >
> > It would enable a space/time tradeoff that could be useful?  The
filesystem
> > itself cannot easily do this particular compression transform since it
> > benefits from knowing the shape of the data.
>
> For the record, there's a pull request adding this encoding to the
> Parquet C++ specification.
>
> Regards
>
> Antoine.

Re: [DISCUSS] Format additions for encoding/compression

Posted by Antoine Pitrou <an...@python.org>.
Forgot to give the URL:
https://github.com/apache/arrow/pull/6005

Regards

Antoine.


Le 23/01/2020 à 18:23, Antoine Pitrou a écrit :
> 
> Le 23/01/2020 à 18:16, John Muehlhausen a écrit :
>> Perhaps related to this thread, are there any current or proposed tools to
>> transform columns for fixed-length data types according to a "shuffle?"
>>  For precedent see the implementation of the shuffle filter in hdf5.
>> https://support.hdfgroup.org/ftp/HDF5//documentation/doc1.6/TechNotes/shuffling-algorithm-report.pdf
>>
>> For example, the column (length 3) would store bytes 00 00 00 00 00 00 00
>> 00 00 01 02 03 to represent the three 32-bit numbers 00 00 00 01 00 00 00
>> 02 00 00 00 03  (I'm writing big-endian even if that is not actually the
>> case).
>>
>> Value(1) would return 00 00 00 02 by referring to some metadata flag that
>> the column is shuffled, stitching the bytes back together at call time.
>>
>> Thus if the column pages were backed by a memory map to something like
>> zfs/gzip-9 (my actual use-case), one would expect approx 30% savings in
>> underlying disk usage due to better run lengths.
>>
>> It would enable a space/time tradeoff that could be useful?  The filesystem
>> itself cannot easily do this particular compression transform since it
>> benefits from knowing the shape of the data.
> 
> For the record, there's a pull request adding this encoding to the
> Parquet C++ specification.
> 
> Regards
> 
> Antoine.
>