Posted to dev@arrow.apache.org by Wes McKinney <we...@gmail.com> on 2020/04/01 14:44:56 UTC

Re: [DISCUSS] Adding "trivial" buffer compression option to IPC protocol (ARROW-300)

Several pieces of work got done in the last few days:

* Changing from LZ4 raw to LZ4 frame format (what is recommended for
interoperability)
* Parallelizing both compression and decompression at the field level

Here are the results (using 8 threads on an 8-core laptop). I disabled
the "memory map" feature so that in the uncompressed case all of the
data must be read off disk into memory. This helps illustrate the
compression/IO tradeoff in wall-clock load times.

File size (only LZ4 may be different): https://ibb.co/CP3VQkp
Read time: https://ibb.co/vz9JZMx
Write time: https://ibb.co/H7bb68T

In summary, now with multicore compression and decompression,
LZ4-compressed files are faster both to read and to write even on a
very fast SSD, as are ZSTD-compressed files with a low ZSTD compression
level. I didn't notice a major difference between the LZ4 raw and LZ4
frame formats. The reads and writes could be made faster still by
pipelining the disk read/write and compression/decompression steps so
they run concurrently -- the current implementation performs these
tasks serially. We can improve this in the near future.

I'll update the Format proposal this week so we can move toward
something we can vote on. I would recommend that we await
implementations and integration tests for this before releasing it as
stable, in line with prior discussions about adding things to the IPC
protocol.
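As a rough illustration of what this enables, here is a minimal sketch
of the compressed Feather V2 path from Python (assuming a pyarrow build
that includes the compression support from the PRs above; parameter
names follow pyarrow's feather module):

    import pandas as pd
    import pyarrow.feather as feather

    df = pd.DataFrame({"x": range(1_000_000), "y": [1.5] * 1_000_000})

    # Feather V2 is the Arrow IPC file format; each field's buffers are
    # compressed independently, which is what permits the per-field
    # parallelism described above
    feather.write_feather(df, "data.lz4.feather", compression="lz4",
                          chunksize=64 * 1024)
    feather.write_feather(df, "data.zstd.feather", compression="zstd",
                          compression_level=1, chunksize=64 * 1024)

    # use_threads=True decompresses fields in parallel on read
    result = feather.read_feather("data.lz4.feather", use_threads=True)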

On Thu, Mar 26, 2020 at 4:57 PM Wes McKinney <we...@gmail.com> wrote:
>
> Here are the results:
>
> File size: https://ibb.co/71sBsg3
> Read time: https://ibb.co/4ZncdF8
> Write time: https://ibb.co/xhNkRS2
>
> Code: https://github.com/wesm/notebooks/blob/master/20190919file_benchmarks/FeatherCompression.ipynb
> (based on https://github.com/apache/arrow/pull/6694)
>
> High level summary:
>
> * Chunksize 1024 vs 64K has relatively limited impact on file sizes
>
> * Wall clock read time is impacted by chunksize, with maybe a 30-40%
> difference between 1K-row chunks and 16K-row chunks. One notable
> thing is that you can see clearly the overhead associated with IPC
> reconstruction even when the data is memory mapped. For example, in
> the Fannie Mae dataset there are 21,661 batches (each batch has 31
> fields) when the chunksize is 1024. So a read time of 1.3 seconds
> indicates ~60 microseconds of overhead for each record batch. When you
> consider the amount of business logic involved with reconstructing a
> record batch, 60 microseconds is pretty good. This also shows that
> every microsecond counts and we need to be carefully tracking
> microperformance in this critical operation.
>
> * Small chunksizes result in higher write times for "expensive" codecs
> like ZSTD with a high compression ratio. For "cheap" codecs like LZ4
> it doesn't make as much of a difference.
>
> * Note that the LZ4 compressor results in faster wall-clock time to
> disk, presumably because its compression speed is faster than my SSD's
> write speed.
>
> Implementation notes:
> * There is no parallelization or pipelining of reads or writes. For
> example, on write, all of the buffers are compressed with a single
> thread, and then compression stops until the write to disk completes.
> On read, buffers are decompressed serially.
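For reference, the sweep amounts to a driver loop along these lines (a
hypothetical sketch, not the linked notebook itself; the notebook used
the real Fannie Mae and NYC Taxi datasets in place of the stand-in
DataFrame here):

    import time
    import pandas as pd
    import pyarrow.feather as feather

    # Stand-in data; substitute one of the benchmark datasets
    df = pd.DataFrame({"x": range(1_000_000), "y": [0.5] * 1_000_000})

    for chunksize in [1024, 2048, 4096, 16384, 65536]:
        for codec in ["uncompressed", "lz4", "zstd"]:
            path = "bench-%s-%d.feather" % (codec, chunksize)

            t0 = time.perf_counter()
            feather.write_feather(df, path, compression=codec,
                                  chunksize=chunksize)
            write_s = time.perf_counter() - t0

            t0 = time.perf_counter()
            feather.read_feather(path)
            read_s = time.perf_counter() - t0

            print(codec, chunksize,
                  "write=%.3fs read=%.3fs" % (write_s, read_s))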
>
>
> On Thu, Mar 26, 2020 at 12:24 PM Wes McKinney <we...@gmail.com> wrote:
> >
> > I'll run a grid of batch sizes (from 1024 to 64K or 128K) and let you
> > know the read/write times and compression ratios. Shouldn't take too
> > long
> >
> > On Wed, Mar 25, 2020 at 10:37 PM Fan Liya <li...@gmail.com> wrote:
> > >
> > > Thanks a lot for sharing the good results.
> > >
> > > As investigated by Wes, we have an existing zstd library for Java (zstd-jni) [1] and an lz4 library for Java (lz4-java) [2].
> > > +1 for the 1024 batch size, as it represents an important scenario where the batch fits into the L1 cache (IMO).
> > >
> > > Best,
> > > Liya Fan
> > >
> > > [1] https://github.com/luben/zstd-jni
> > > [2] https://github.com/lz4/lz4-java
> > >
> > > On Thu, Mar 26, 2020 at 2:38 AM Micah Kornfield <em...@gmail.com> wrote:
> > >>
> > >> If it isn't hard, could you run with batch sizes of 1024 or 2048 records?  I
> > >> think there was a question previously raised about whether there was benefit
> > >> for smaller buffer sizes.
> > >>
> > >> Thanks,
> > >> Micah
> > >>
> > >>
> > >> On Wed, Mar 25, 2020 at 8:59 AM Wes McKinney <we...@gmail.com> wrote:
> > >>
> > >> > On Tue, Mar 24, 2020 at 9:22 PM Micah Kornfield <em...@gmail.com>
> > >> > wrote:
> > >> > >
> > >> > > >
> > >> > > > Compression ratios ranging from ~50% with LZ4 and ~75% with ZSTD on
> > >> > > > the Taxi dataset to ~87% with LZ4 and ~90% with ZSTD on the Fannie Mae
> > >> > > > dataset. So that's a huge space savings
> > >> > >
> > >> > > One more question on this.  What was the average row-batch size used?  I
> > >> > > see in the proposal some buffers might not be compressed; did you use this
> > >> > > feature in the test?
> > >> >
> > >> > I used 64K row batch size. I haven't implemented the optional
> > >> > non-compressed buffers (for cases where there is little space savings)
> > >> > so everything is compressed. I can check different batch sizes if you
> > >> > like
> > >> >
> > >> >
> > >> > > On Mon, Mar 23, 2020 at 4:40 PM Wes McKinney <we...@gmail.com>
> > >> > wrote:
> > >> > >
> > >> > > > hi folks,
> > >> > > >
> > >> > > > Sorry it's taken me a little while to produce supporting benchmarks.
> > >> > > >
> > >> > > > * I implemented experimental trivial body buffer compression in
> > >> > > > https://github.com/apache/arrow/pull/6638
> > >> > > > * I hooked up the Arrow IPC file format with compression as the new
> > >> > > > Feather V2 format in
> > >> > > > https://github.com/apache/arrow/pull/6694#issuecomment-602906476
> > >> > > >
> > >> > > > I tested a couple of real-world datasets from a prior blog post
> > >> > > > https://ursalabs.org/blog/2019-10-columnar-perf/ with ZSTD and LZ4
> > >> > > > codecs
> > >> > > >
> > >> > > > The complete results are here
> > >> > > > https://github.com/apache/arrow/pull/6694#issuecomment-602906476
> > >> > > >
> > >> > > > Summary:
> > >> > > >
> > >> > > > * Compression ratios ranging from ~50% with LZ4 and ~75% with ZSTD on
> > >> > > > the Taxi dataset to ~87% with LZ4 and ~90% with ZSTD on the Fannie Mae
> > >> > > > dataset. So that's a huge space savings
> > >> > > > * Single-threaded decompression speeds exceeding 2-4 GB/s with LZ4
> > >> > > > and 1.2-3 GB/s with ZSTD
> > >> > > >
> > >> > > > I would have to do some more engineering to test throughput changes
> > >> > > > with Flight, but given these results, on slower networking (e.g. 1
> > >> > > > Gigabit) my guess is that the compression and decompression overhead
> > >> > > > is small compared with the time savings due to high compression
> > >> > > > ratios. If people would like to see these numbers to help make a
> > >> > > > decision, I can take a closer look.
> > >> > > >
> > >> > > > As far as what Micah said about having a limited number of
> > >> > > > compressors: I would be in favor of having just LZ4 and ZSTD. It seems
> > >> > > > anecdotally that these outperform Snappy in most real-world scenarios
> > >> > > > and generally have > 1 GB/s decompression performance. Some Linux
> > >> > > > distributions (Arch at least) have already started adopting ZSTD over
> > >> > > > LZMA or GZIP [1].
> > >> > > >
> > >> > > > - Wes
> > >> > > >
> > >> > > > [1]: https://www.archlinux.org/news/now-using-zstandard-instead-of-xz-for-package-compression/
> > >> > > >
> > >> > > > On Fri, Mar 6, 2020 at 8:42 AM Fan Liya <li...@gmail.com> wrote:
> > >> > > > >
> > >> > > > > Hi Wes,
> > >> > > > >
> > >> > > > > Thanks a lot for the additional information.
> > >> > > > > Looking forward to see the good results from your experiments.
> > >> > > > >
> > >> > > > > Best,
> > >> > > > > Liya Fan
> > >> > > > >
> > >> > > > > On Thu, Mar 5, 2020 at 11:42 PM Wes McKinney <we...@gmail.com> wrote:
> > >> > > > >
> > >> > > > > > I see, thank you.
> > >> > > > > >
> > >> > > > > > For such a scenario, implementations would need to define a
> > >> > > > > > "UserDefinedCodec" interface to enable codecs to be registered from
> > >> > > > > > third-party code, similar to what is done for extension types [1]
> > >> > > > > >
> > >> > > > > > I'll update this thread when I get my experimental C++ patch up to see
> > >> > > > > > what I'm thinking at least for the built-in codecs we have like ZSTD.
> > >> > > > > >
> > >> > > > > > [1]: https://github.com/apache/arrow/blob/apache-arrow-0.16.0/docs/source/format/Columnar.rst#extension-types
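No such registry exists in Arrow today, but a "UserDefinedCodec"
interface along the lines suggested above might look roughly like this
hypothetical Python sketch (all names here are invented for
illustration):

    # Hypothetical sketch only -- not an existing Arrow API
    class UserDefinedCodec:
        name = "my-codec"  # identifier carried in the message metadata

        def compress(self, buf: bytes) -> bytes:
            raise NotImplementedError

        def decompress(self, buf: bytes, uncompressed_size: int) -> bytes:
            raise NotImplementedError

    _codec_registry = {}

    def register_codec(codec):
        # A receiver looks the codec up by name (e.g. transmitted in
        # Message.custom_metadata) before rehydrating the buffers
        _codec_registry[codec.name] = codec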
> > >> > > > > >
> > >> > > > > > On Thu, Mar 5, 2020 at 7:56 AM Fan Liya <li...@gmail.com> wrote:
> > >> > > > > > >
> > >> > > > > > > Hi Wes,
> > >> > > > > > >
> > >> > > > > > > Thanks a lot for your further clarification.
> > >> > > > > > >
> > >> > > > > > > Some of my preliminary thoughts:
> > >> > > > > > >
> > >> > > > > > > 1. We assign a unique GUID to each pair of compression/decompression
> > >> > > > > > > strategies. The GUID is stored as part of the Message.custom_metadata.
> > >> > > > > > > When receiving the GUID, the receiver knows which decompression
> > >> > > > > > > strategy to use.
> > >> > > > > > >
> > >> > > > > > > 2. We serialize the decompression strategy, and store it into the
> > >> > > > > > > Message.custom_metadata. The receiver can decompress data after
> > >> > > > > > > deserializing the strategy.
> > >> > > > > > >
> > >> > > > > > > Method 1 is generally used in static strategy scenarios, while method
> > >> > > > > > > 2 is generally used in dynamic strategy scenarios.
> > >> > > > > > >
> > >> > > > > > > Best,
> > >> > > > > > > Liya Fan
> > >> > > > > > >
> > >> > > > > > > On Wed, Mar 4, 2020 at 11:39 PM Wes McKinney <wesmckinn@gmail.com> wrote:
> > >> > > > > > >
> > >> > > > > > > > Okay, I guess my question is how the receiver is going to be able to
> > >> > > > > > > > determine how to "rehydrate" the record batch buffers:
> > >> > > > > > > >
> > >> > > > > > > > What I've proposed amounts to the following:
> > >> > > > > > > >
> > >> > > > > > > > * UNCOMPRESSED: the current behavior
> > >> > > > > > > > * ZSTD/LZ4/...: each buffer is compressed and written with an int64
> > >> > > > > > > > length prefix
> > >> > > > > > > >
> > >> > > > > > > > (I'm close to putting up a PR implementing an experimental version of
> > >> > > > > > > > this that uses Message.custom_metadata to transmit the codec, so this
> > >> > > > > > > > will make the implementation details more concrete)
> > >> > > > > > > >
> > >> > > > > > > > So in the USER_DEFINED case, how will the library know how to obtain
> > >> > > > > > > > the uncompressed buffer? Is some additional metadata structure
> > >> > > > > > > > required to provide instructions?
> > >> > > > > > > >
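The per-buffer framing described here reduces to a few lines; this
sketch assumes a little-endian prefix and uses the lz4 Python package
to stand in for whichever codec is negotiated:

    import struct
    import lz4.frame

    def write_compressed_buffer(buf: bytes) -> bytes:
        # 8-byte int64 prefix carrying the *uncompressed* length,
        # followed by the compressed payload
        return struct.pack("<q", len(buf)) + lz4.frame.compress(buf)

    def read_compressed_buffer(framed: bytes) -> bytes:
        (uncompressed_size,) = struct.unpack_from("<q", framed, 0)
        out = lz4.frame.decompress(framed[8:])
        assert len(out) == uncompressed_size
        return out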
> > >> > > > > > > > On Wed, Mar 4, 2020 at 8:05 AM Fan Liya <li...@gmail.com> wrote:
> > >> > > > > > > > >
> > >> > > > > > > > > Hi Wes,
> > >> > > > > > > > >
> > >> > > > > > > > > I am thinking of adding an option named "USER_DEFINED" (or something
> > >> > > > > > > > > similar) to enum CompressionType in your proposal.
> > >> > > > > > > > > IMO, this option should be used primarily in Flight.
> > >> > > > > > > > >
> > >> > > > > > > > > Best,
> > >> > > > > > > > > Liya Fan
> > >> > > > > > > > >
> > >> > > > > > > > > On Wed, Mar 4, 2020 at 11:12 AM Wes McKinney <wesmckinn@gmail.com> wrote:
> > >> > > > > > > > >
> > >> > > > > > > > > > On Tue, Mar 3, 2020, 8:11 PM Fan Liya <liya.fan03@gmail.com> wrote:
> > >> > > > > > > > > >
> > >> > > > > > > > > > > Sure. I agree with you that we should not overdo this.
> > >> > > > > > > > > > > I am wondering if we should provide an option to allow users to plug
> > >> > > > > > > > > > > in their customized compression strategies.
> > >> > > > > > > > > >
> > >> > > > > > > > > > Can you provide a patch showing changes to Message.fbs (or Schema.fbs)
> > >> > > > > > > > > > that make this idea more concrete?
> > >> > > > > > > > > >
> > >> > > > > > > > > > > Best,
> > >> > > > > > > > > > > Liya Fan
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > On Tue, Mar 3, 2020 at 9:47 PM Wes McKinney <wesmckinn@gmail.com> wrote:
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > > On Tue, Mar 3, 2020, 7:36 AM Fan Liya <liya.fan03@gmail.com> wrote:
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > > > > I am so glad to see this discussion, and I am willing to provide help
> > >> > > > > > > > > > > > > from the Java side.
> > >> > > > > > > > > > > > >
> > >> > > > > > > > > > > > > In the proposal, I see the support for basic compression strategies
> > >> > > > > > > > > > > > > (e.g. gzip, snappy).
> > >> > > > > > > > > > > > > IMO, applying a single basic strategy is not likely to achieve
> > >> > > > > > > > > > > > > performance improvement for most scenarios.
> > >> > > > > > > > > > > > > The optimal compression strategy is often obtained by composing basic
> > >> > > > > > > > > > > > > strategies and tuning parameters.
> > >> > > > > > > > > > > > >
> > >> > > > > > > > > > > > > I hope we can support such highly customized compression strategies.
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > > > I think very much beyond trivial one-shot buffer-level compression is
> > >> > > > > > > > > > > > probably out of the question for addition to the current "RecordBatch"
> > >> > > > > > > > > > > > Flatbuffers type, because the additional metadata would add undesirable
> > >> > > > > > > > > > > > bloat (which I would be against). If people have other ideas it would be
> > >> > > > > > > > > > > > great to see exactly what you are thinking as far as changes to the
> > >> > > > > > > > > > > > protocol files.
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > > > I'll try to assemble some examples to show the before/after results of
> > >> > > > > > > > > > > > applying the simple strategy.
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > > > > Best,
> > >> > > > > > > > > > > > > Liya Fan
> > >> > > > > > > > > > > > >
> > >> > > > > > > > > > > > >
> > >> > > > > > > > > > > > >
> > >> > > > > > > > > > > > > On Tue, Mar 3, 2020 at 8:15 PM Antoine Pitrou <antoine@python.org> wrote:
> > >> > > > > > > > > > > > >
> > >> > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > If we want to use an HTTP header, it would be more of an
> > >> > > > > > > > > > > > > > Accept-Encoding header, no?
> > >> > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > In any case, we would have to put non-standard values there (e.g.
> > >> > > > > > > > > > > > > > lz4), so I'm not sure how desirable it is to repurpose HTTP headers
> > >> > > > > > > > > > > > > > for that, rather than add some dedicated field to the Flight
> > >> > > > > > > > > > > > > > messages.
> > >> > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > Regards
> > >> > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > Antoine.
> > >> > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > On 03/03/2020 at 12:52, David Li wrote:
> > >> > > > > > > > > > > > > > > gRPC supports headers, so for Flight, we could send essentially an
> > >> > > > > > > > > > > > > > > Accept header and perhaps a Content-Type header.
> > >> > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > David
> > >> > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > On Mon, Mar 2, 2020, 23:15 Micah Kornfield <emkornfield@gmail.com> wrote:
> > >> > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > >> Hi Wes,
> > >> > > > > > > > > > > > > > >> A few thoughts on this.  In general, I think it is a good idea.  But
> > >> > > > > > > > > > > > > > >> before proceeding, I think the following points are worth discussing:
> > >> > > > > > > > > > > > > > >> 1.  Does this actually improve throughput/latency for Flight? (I think
> > >> > > > > > > > > > > > > > >> you mentioned you would follow up with benchmarks.)
> > >> > > > > > > > > > > > > > >> 2.  I think we should limit the number of supported compression schemes
> > >> > > > > > > > > > > > > > >> to only 1 or 2.  I think the criteria for selection are speed and
> > >> > > > > > > > > > > > > > >> native implementations available across the widest possible range of
> > >> > > > > > > > > > > > > > >> languages.  As far as I can tell zstd only has bindings in Java via
> > >> > > > > > > > > > > > > > >> JNI, but my understanding is it is probably the right type of
> > >> > > > > > > > > > > > > > >> compression for our use cases.  So I think zstd + potentially 1 more.
> > >> > > > > > > > > > > > > > >> 3.  Commitment from someone on the Java side to implement this.
> > >> > > > > > > > > > > > > > >> 4.  This doesn't need to be coupled with this change per se, but for
> > >> > > > > > > > > > > > > > >> something like Flight it would be good to have a standard mechanism for
> > >> > > > > > > > > > > > > > >> negotiating server/client capabilities (e.g. the client doesn't support
> > >> > > > > > > > > > > > > > >> compression or only supports a subset).
> > >> > > > > > > > > > > > > > >>
> > >> > > > > > > > > > > > > > >> Thanks,
> > >> > > > > > > > > > > > > > >> Micah
> > >> > > > > > > > > > > > > > >>
> > >> > > > > > > > > > > > > > >> On Sun, Mar 1, 2020 at 1:24 PM Wes McKinney <wesmckinn@gmail.com> wrote:
> > >> > > > > > > > > > > > > > >>
> > >> > > > > > > > > > > > > > >>> On Sun, Mar 1, 2020 at 3:14 PM Antoine Pitrou <antoine@python.org> wrote:
> > >> > > > > > > > > > > > > > >>>>
> > >> > > > > > > > > > > > > > >>>> On 01/03/2020 at 22:01, Wes McKinney wrote:
> > >> > > > > > > > > > > > > > >>>>> In the context of a "next version of the Feather format" ARROW-5510
> > >> > > > > > > > > > > > > > >>>>> (which is consumed only by Python and R at the moment), I have been
> > >> > > > > > > > > > > > > > >>>>> looking at compressing buffers using fast compressors like ZSTD when
> > >> > > > > > > > > > > > > > >>>>> writing the RecordBatch bodies. This could be handled privately as an
> > >> > > > > > > > > > > > > > >>>>> implementation detail of the Feather file, but since ZSTD compression
> > >> > > > > > > > > > > > > > >>>>> could improve throughput in Flight, for example, I thought I would
> > >> > > > > > > > > > > > > > >>>>> bring it up for discussion.
> > >> > > > > > > > > > > > > > >>>>>
> > >> > > > > > > > > > > > > > >>>>> I can see two simple compression strategies:
> > >> > > > > > > > > > > > > > >>>>>
> > >> > > > > > > > > > > > > > >>>>> * Compress the entire message body in one shot, writing the result out
> > >> > > > > > > > > > > > > > >>>>> with an 8-byte int64 prefix indicating the uncompressed size
> > >> > > > > > > > > > > > > > >>>>> * Compress each non-zero-length constituent Buffer prior to writing to
> > >> > > > > > > > > > > > > > >>>>> the body (and using the same uncompressed-length prefix when writing
> > >> > > > > > > > > > > > > > >>>>> the compressed buffer)
> > >> > > > > > > > > > > > > > >>>>>
> > >> > > > > > > > > > > > > > >>>>> The latter strategy is preferable for scenarios where we may project
> > >> > > > > > > > > > > > > > >>>>> out only a few fields from a larger record batch (such as reading from
> > >> > > > > > > > > > > > > > >>>>> a memory-mapped file).
> > >> > > > > > > > > > > > > > >>>>
> > >> > > > > > > > > > > > > > >>>> Agreed.  It may also allow using different compression strategies for
> > >> > > > > > > > > > > > > > >>>> different kinds of buffers (for example a bytestream splitting strategy
> > >> > > > > > > > > > > > > > >>>> for floats and doubles, or a delta encoding strategy for integers).
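As an aside, the byte-stream-splitting idea mentioned here (analogous
to Parquet's BYTE_STREAM_SPLIT encoding) can be sketched in a few lines
of NumPy; this is illustrative only and not part of the proposal:

    import numpy as np

    def byte_stream_split(values: np.ndarray) -> bytes:
        # Group the i-th byte of every value together so a general-purpose
        # codec sees long runs of similar bytes (sign/exponent bytes, etc.)
        n, itemsize = len(values), values.dtype.itemsize
        raw = values.view(np.uint8).reshape(n, itemsize)
        return raw.T.tobytes()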
> > >> > > > > > > > > > > > > > >>>
> > >> > > > > > > > > > > > > > >>> If we wanted to allow for different compression to apply to different
> > >> > > > > > > > > > > > > > >>> buffers, I think we will need a new Message type because this would
> > >> > > > > > > > > > > > > > >>> inflate metadata sizes in a way that is not likely to be acceptable
> > >> > > > > > > > > > > > > > >>> for the current uncompressed use case.
> > >> > > > > > > > > > > > > > >>>
> > >> > > > > > > > > > > > > > >>> Here is my strawman proposal
> > >> > > > > > > > > > > > > > >>>
> > >> > > > > > > > > > > > > > >>> https://github.com/apache/arrow/compare/master...wesm:compression-strawman
> > >> > > > > > > > > > > > > > >>>
> > >> > > > > > > > > > > > > > >>>>> Implementation could be accomplished by one of the following methods:
> > >> > > > > > > > > > > > > > >>>>>
> > >> > > > > > > > > > > > > > >>>>> * Setting a field in Message.custom_metadata
> > >> > > > > > > > > > > > > > >>>>> * Adding a new field to Message
> > >> > > > > > > > > > > > > > >>>>
> > >> > > > > > > > > > > > > > >>>> I think it has to be a new field in Message.  Making it an ignorable
> > >> > > > > > > > > > > > > > >>>> metadata field means non-supporting receivers will decode and
> > >> > > > > > > > > > > > > > >>>> interpret the data wrongly.
> > >> > > > > > > > > > > > > > >>>>
> > >> > > > > > > > > > > > > > >>>> Regards
> > >> > > > > > > > > > > > > > >>>>
> > >> > > > > > > > > > > > > > >>>> Antoine.

Re: [DISCUSS] Adding "trivial" buffer compression option to IPC protocol (ARROW-300)

Posted by Wes McKinney <we...@gmail.com>.
It seems like there is reasonable consensus in the PR. If there are no
further comments, I'll start a vote on this within the next several
days.

On Mon, Apr 6, 2020 at 10:55 PM Wes McKinney <we...@gmail.com> wrote:
>
> I updated the Format proposal again; please have a look
>
> https://github.com/apache/arrow/pull/6707
>
> On Wed, Apr 1, 2020 at 10:15 AM Wes McKinney <we...@gmail.com> wrote:
> >
> > For uncompressed, memory mapping is disabled, so all of the bytes are
> > being read into RAM. I wanted to show that even when your IO pipe is
> > very fast (as with an NVMe SSD like mine, > 1 GB/s for reads from
> > disk), you can still load faster with compressed files.
> >
> > Here were the prior read results with:
> >
> > * Single-threaded decompression
> > * Memory mapping enabled
> >
> > https://ibb.co/4ZncdF8
> >
> > You can see that for larger chunksizes, because the IPC
> > reconstruction overhead is about 60 microseconds per batch, read time
> > is very low (tens of milliseconds).
> >
> > On Wed, Apr 1, 2020 at 10:10 AM Antoine Pitrou <an...@python.org> wrote:
> > >
> > >
> > > The read times are still with memory mapping for the uncompressed case?
> > >  If so, impressive!
> > >
> > > Regards
> > >
> > > Antoine.

Re: [DISCUSS] Adding "trivial" buffer compression option to IPC protocol (ARROW-300)

Posted by Wes McKinney <we...@gmail.com>.
I updated the Format proposal again; please have a look

https://github.com/apache/arrow/pull/6707

On Wed, Apr 1, 2020 at 10:15 AM Wes McKinney <we...@gmail.com> wrote:
>
> For uncompressed, memory mapping is disabled, so all of the bytes are
> being read into RAM. I wanted to show that even when your IO pipe is
> very fast (in the case with an NVMe SSD like I have, > 1GB/s for read
> from disk) that you can still load faster with compressed files.
>
> Here were the prior Read results with
>
> * Single threaded decompression
> * Memory mapping enabled
>
> https://ibb.co/4ZncdF8
>
> You can see for larger chunksizes, because the IPC reconstruction
> overhead is about 60 microseconds per batch, that read time is very
> low (10s of milliseconds).

Re: [DISCUSS] Adding "trivial" buffer compression option to IPC protocol (ARROW-300)

Posted by Wes McKinney <we...@gmail.com>.
For the uncompressed case, memory mapping is disabled, so all of the bytes
are being read into RAM. I wanted to show that even when your IO pipe is
very fast (in the case of an NVMe SSD like I have, > 1GB/s for reads
from disk), you can still load faster with compressed files.

Here were the prior Read results with

* Single-threaded decompression
* Memory mapping enabled

https://ibb.co/4ZncdF8

You can see that for larger chunksizes, because the IPC reconstruction
overhead is about 60 microseconds per batch, read times are very
low (tens of milliseconds).
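
To put numbers on that: the fixed per-batch reconstruction cost means total
overhead scales linearly with the batch count. A back-of-envelope sketch in
Python, using the ~60 microsecond figure measured above (illustrative
arithmetic, not a benchmark):

    # Fixed IPC reconstruction overhead scales with the number of batches.
    OVERHEAD_PER_BATCH_S = 60e-6  # ~60 microseconds, measured above

    def reconstruction_overhead_s(total_rows, chunksize):
        n_batches = -(-total_rows // chunksize)  # ceiling division
        return n_batches * OVERHEAD_PER_BATCH_S

    rows = 21_661 * 1024  # ~22M rows, as in the Fannie Mae dataset
    print(reconstruction_overhead_s(rows, 1024))    # ~1.3 s (21,661 batches)
    print(reconstruction_overhead_s(rows, 65_536))  # ~0.02 s (~339 batches)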

On Wed, Apr 1, 2020 at 10:10 AM Antoine Pitrou <an...@python.org> wrote:
>
>
> The read times are still with memory mapping for the uncompressed case?
>  If so, impressive!
>
> Regards
>
> Antoine.
>
>
> On 01/04/2020 at 16:44, Wes McKinney wrote:
> > Several pieces of work got done in the last few days:
> >
> > * Changing from LZ4 raw to LZ4 frame format (what is recommended for
> > interoperability)
> > * Parallelizing both compression and decompression at the field level
> >
> > Here are the results (using 8 threads on an 8-core laptop). I disabled
> > the "memory map" feature so that in the uncompressed case all of the
> > data must be read off disk into memory. This helps illustrate the
> > compression/IO tradeoff to wall clock load times
> >
> > File size (only LZ4 may be different): https://ibb.co/CP3VQkp
> > Read time: https://ibb.co/vz9JZMx
> > Write time: https://ibb.co/H7bb68T
> >
> > In summary, now with multicore compression and decompression,
> > LZ4-compressed files are faster both to read and write even on a very
> > fast SSD, as are ZSTD-compressed files with a low ZSTD compression
> > level. I didn't notice a major difference between LZ4 raw and LZ4
> > frame formats. The reads and writes could be made faster still by
> > pipelining / making concurrent the disk read/write and
> > compression/decompression steps -- the current implementation performs
> > these tasks serially. We can improve this in the near future
> >
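To make the field-level parallelism described above concrete, here is a
minimal Python sketch using a thread pool and the lz4 package. This only
shows the shape of the idea -- the implementation under discussion is in
Arrow C++ -- and it assumes the codec releases the GIL while compressing,
as the lz4 C extension does for nontrivial inputs:

    # Compress/decompress a record batch's buffers in parallel,
    # one task per buffer, mirroring per-field parallelism.
    from concurrent.futures import ThreadPoolExecutor
    import lz4.frame  # pip install lz4

    def compress_buffers(buffers, n_threads=8):
        with ThreadPoolExecutor(max_workers=n_threads) as pool:
            return list(pool.map(lz4.frame.compress, buffers))

    def decompress_buffers(compressed, n_threads=8):
        with ThreadPoolExecutor(max_workers=n_threads) as pool:
            return list(pool.map(lz4.frame.decompress, compressed))
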
> > I'll update the Format proposal this week so we can move toward
> > something we can vote on. I would recommend that we await
> > implementations and integration tests for this before releasing this
> > as stable, in line with prior discussions about adding stuff to the
> > IPC protocol
> >
> > On Thu, Mar 26, 2020 at 4:57 PM Wes McKinney <we...@gmail.com> wrote:
> >>
> >> Here are the results:
> >>
> >> File size: https://ibb.co/71sBsg3
> >> Read time: https://ibb.co/4ZncdF8
> >> Write time: https://ibb.co/xhNkRS2
> >>
> >> Code: https://github.com/wesm/notebooks/blob/master/20190919file_benchmarks/FeatherCompression.ipynb
> >> (based on https://github.com/apache/arrow/pull/6694)
> >>
> >> High level summary:
> >>
> >> * Chunksize 1024 vs 64K has relatively limited impact on file sizes
> >>
> >> * Wall clock read time is impacted by chunksize, maybe 30-40%
> >> difference between 1K row chunks versus 16K row chunks. One notable
> >> thing is that you can see clearly the overhead associated with IPC
> >> reconstruction even when the data is memory mapped. For example, in
> >> the Fannie Mae dataset there are 21,661 batches (each batch has 31
> >> fields) when the chunksize is 1024. So a read time of 1.3 seconds
> >> indicates ~60 microseconds of overhead for each record batch. When you
> >> consider the amount of business logic involved with reconstructing a
> >> record batch, 60 microseconds is pretty good. This also shows that
> >> every microsecond counts and we need to be carefully tracking
> >> microperformance in this critical operation.
> >>
> >> * Small chunksize results in higher write times for "expensive" codecs
> >> like ZSTD with a high compression ratio. For "cheap" codecs like LZ4
> >> it doesn't make as much of a difference
> >>
> >> * Note that the LZ4 compressor results in faster wall clock time to disk
> >> presumably because the compression speed is faster than my SSD's write
> >> speed
> >>
> >> Implementation notes:
> >> * There is no parallelization or pipelining of reads or writes. For
> >> example, on write, all of the buffers are compressed with a single
> >> thread and then compression stops until the write to disk completes.
> >> On read, buffers are decompressed serially
> >>
> >>
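The serial compress-then-write behavior noted here is exactly what the
pipelining mentioned earlier would fix. A minimal double-buffering sketch;
compress_batch and write_batch are hypothetical helpers for illustration,
not Arrow APIs:

    # Overlap compression of batch N+1 with the disk write of batch N.
    from concurrent.futures import ThreadPoolExecutor

    def write_pipelined(batches, sink):
        with ThreadPoolExecutor(max_workers=1) as pool:
            pending = pool.submit(compress_batch, batches[0])
            for nxt in batches[1:]:
                compressed = pending.result()
                pending = pool.submit(compress_batch, nxt)  # compress ahead
                write_batch(sink, compressed)               # while writing
            write_batch(sink, pending.result())
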
> >> On Thu, Mar 26, 2020 at 12:24 PM Wes McKinney <we...@gmail.com> wrote:
> >>>
> >>> I'll run a grid of batch sizes (from 1024 to 64K or 128K) and let you
> >>> know the read/write times and compression ratios. Shouldn't take too
> >>> long
> >>>
> >>> On Wed, Mar 25, 2020 at 10:37 PM Fan Liya <li...@gmail.com> wrote:
> >>>>
> >>>> Thanks a lot for sharing the good results.
> >>>>
> >>>> As investigated by Wes, we have an existing zstd library for Java (zstd-jni) [1], and an lz4 library for Java (lz4-java) [2].
> >>>> +1 for the 1024 batch size, as it represents an important scenario where the batch fits into the L1 cache (IMO).
> >>>>
> >>>> Best,
> >>>> Liya Fan
> >>>>
> >>>> [1] https://github.com/luben/zstd-jni
> >>>> [2] https://github.com/lz4/lz4-java
> >>>>
> >>>> On Thu, Mar 26, 2020 at 2:38 AM Micah Kornfield <em...@gmail.com> wrote:
> >>>>>
> >>>>> If it isn't hard, could you run with batch sizes of 1024 or 2048 records?  I
> >>>>> think there was a question previously raised if there was benefit for
> >>>>> smaller-sized buffers.
> >>>>>
> >>>>> Thanks,
> >>>>> Micah
> >>>>>
> >>>>>
> >>>>> On Wed, Mar 25, 2020 at 8:59 AM Wes McKinney <we...@gmail.com> wrote:
> >>>>>
> >>>>>> On Tue, Mar 24, 2020 at 9:22 PM Micah Kornfield <em...@gmail.com>
> >>>>>> wrote:
> >>>>>>>
> >>>>>>>>
> >>>>>>>> Compression ratios ranging from ~50% with LZ4 and ~75% with ZSTD on
> >>>>>>>> the Taxi dataset to ~87% with LZ4 and ~90% with ZSTD on the Fannie Mae
> >>>>>>>> dataset. So that's a huge space savings
> >>>>>>>
> >>>>>>> One more question on this.  What was the average row-batch size used?  I
> >>>>>>> see in the proposal some buffers might not be compressed; did you use this
> >>>>>>> feature in the test?
> >>>>>>
> >>>>>> I used a 64K row batch size. I haven't implemented the optional
> >>>>>> non-compressed buffers (for cases where there is little space savings)
> >>>>>> so everything is compressed. I can check different batch sizes if you
> >>>>>> like
> >>>>>>
> >>>>>>
> >>>>>>> On Mon, Mar 23, 2020 at 4:40 PM Wes McKinney <we...@gmail.com>
> >>>>>> wrote:
> >>>>>>>
> >>>>>>>> hi folks,
> >>>>>>>>
> >>>>>>>> Sorry it's taken me a little while to produce supporting benchmarks.
> >>>>>>>>
> >>>>>>>> * I implemented experimental trivial body buffer compression in
> >>>>>>>> https://github.com/apache/arrow/pull/6638
> >>>>>>>> * I hooked up the Arrow IPC file format with compression as the new
> >>>>>>>> Feather V2 format in
> >>>>>>>> https://github.com/apache/arrow/pull/6694#issuecomment-602906476
> >>>>>>>>
> >>>>>>>> I tested a couple of real-world datasets from a prior blog post
> >>>>>>>> https://ursalabs.org/blog/2019-10-columnar-perf/ with ZSTD and LZ4
> >>>>>>>> codecs
> >>>>>>>>
> >>>>>>>> The complete results are here
> >>>>>>>> https://github.com/apache/arrow/pull/6694#issuecomment-602906476
> >>>>>>>>
> >>>>>>>> Summary:
> >>>>>>>>
> >>>>>>>> * Compression ratios ranging from ~50% with LZ4 and ~75% with ZSTD on
> >>>>>>>> the Taxi dataset to ~87% with LZ4 and ~90% with ZSTD on the Fannie Mae
> >>>>>>>> dataset. So that's a huge space savings
> >>>>>>>> * Single-threaded decompression times exceeding 2-4GByte/s with LZ4
> >>>>>>>> and 1.2-3GByte/s with ZSTD
> >>>>>>>>
> >>>>>>>> I would have to do some more engineering to test throughput changes
> >>>>>>>> with Flight, but given these results on slower networking (e.g. 1
> >>>>>>>> Gigabit) my guess is that the compression and decompression overhead
> >>>>>>>> is small compared with the time savings due to high compression
> >>>>>>>> ratios. If people would like to see these numbers to help make a
> >>>>>>>> decision I can take a closer look
> >>>>>>>>
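A rough sanity check of that guess: a 1 Gigabit link moves roughly 120 MB/s,
far below the decompression rates measured above, so end-to-end time should
be dominated by bytes on the wire. Illustrative arithmetic only, with
round numbers taken from the figures in this thread:

    # Time to move 1 GB of Arrow data over a ~1 Gbit/s link.
    LINK_MB_S = 120.0         # ~1 Gigabit/s
    DECOMPRESS_MB_S = 2000.0  # ~2 GB/s single-threaded LZ4, from above

    def transfer_seconds(data_mb, compression_ratio):
        wire_mb = data_mb * (1.0 - compression_ratio)
        decompress = data_mb / DECOMPRESS_MB_S if compression_ratio else 0.0
        return wire_mb / LINK_MB_S + decompress

    print(transfer_seconds(1024, 0.0))   # uncompressed: ~8.5 s
    print(transfer_seconds(1024, 0.75))  # 75% ratio:    ~2.6 s
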
> >>>>>>>> As far as what Micah said about having a limited number of
> >>>>>>>> compressors: I would be in favor of having just LZ4 and ZSTD. It seems
> >>>>>>>> anecdotally that these outperform Snappy in most real world scenarios
> >>>>>>>> and generally have > 1 GB/s decompression performance. Some Linux
> >>>>>>>> distributions (Arch at least) have already started adopting ZSTD over
> >>>>>>>> LZMA or GZIP [1]
> >>>>>>>>
> >>>>>>>> - Wes
> >>>>>>>>
> >>>>>>>> [1]:
> >>>>>>>> https://www.archlinux.org/news/now-using-zstandard-instead-of-xz-for-package-compression/
> >>>>>>>>
> >>>>>>>> On Fri, Mar 6, 2020 at 8:42 AM Fan Liya <li...@gmail.com> wrote:
> >>>>>>>>>
> >>>>>>>>> Hi Wes,
> >>>>>>>>>
> >>>>>>>>> Thanks a lot for the additional information.
> >>>>>>>>> Looking forward to seeing the good results from your experiments.
> >>>>>>>>>
> >>>>>>>>> Best,
> >>>>>>>>> Liya Fan
> >>>>>>>>>
> >>>>>>>>> On Thu, Mar 5, 2020 at 11:42 PM Wes McKinney <we...@gmail.com>
> >>>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> I see, thank you.
> >>>>>>>>>>
> >>>>>>>>>> For such a scenario, implementations would need to define a
> >>>>>>>>>> "UserDefinedCodec" interface to enable codecs to be registered from
> >>>>>>>>>> third party code, similar to what is done for extension types [1]
> >>>>>>>>>>
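A "UserDefinedCodec" registry could look something like the following;
the names and shapes here are hypothetical, by analogy with how extension
types are registered:

    # Hypothetical user-defined codec registry (illustrative names only).
    _codec_registry = {}

    class UserDefinedCodec:
        name = None  # unique identifier carried in the message metadata
        def compress(self, buf: bytes) -> bytes: ...
        def decompress(self, buf: bytes, uncompressed_size: int) -> bytes: ...

    def register_codec(codec):
        _codec_registry[codec.name] = codec

    def lookup_codec(name):
        return _codec_registry[name]  # KeyError signals an unknown codec
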
> >>>>>>>>>> I'll update this thread when I get my experimental C++ patch up to see
> >>>>>>>>>> what I'm thinking at least for the built-in codecs we have like ZSTD.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> https://github.com/apache/arrow/blob/apache-arrow-0.16.0/docs/source/format/Columnar.rst#extension-types
> >>>>>>>>>>
> >>>>>>>>>> On Thu, Mar 5, 2020 at 7:56 AM Fan Liya <li...@gmail.com>
> >>>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> Hi Wes,
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks a lot for your further clarification.
> >>>>>>>>>>>
> >>>>>>>>>>> Some of my preliminary thoughts:
> >>>>>>>>>>>
> >>>>>>>>>>> 1. We assign a unique GUID to each pair of compression/decompression
> >>>>>>>>>>> strategies. The GUID is stored as part of the Message.custom_metadata.
> >>>>>>>>>>> When receiving the GUID, the receiver knows which decompression
> >>>>>>>>>>> strategy to use.
> >>>>>>>>>>>
> >>>>>>>>>>> 2. We serialize the decompression strategy, and store it into the
> >>>>>>>>>>> Message.custom_metadata. The receiver can decompress data after
> >>>>>>>>>>> deserializing the strategy.
> >>>>>>>>>>>
> >>>>>>>>>>> Method 1 is generally used in static strategy scenarios while method 2 is
> >>>>>>>>>>> generally used in dynamic strategy scenarios.
> >>>>>>>>>>>
> >>>>>>>>>>> Best,
> >>>>>>>>>>> Liya Fan
> >>>>>>>>>>>
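Method 1 amounts to carrying an opaque identifier in the message metadata
and resolving it against a shared registry on the receiving side. A sketch
of that idea; the metadata key and helper names are assumptions for
illustration, not part of any proposal:

    # Sketch of method 1: ship a codec identifier in Message.custom_metadata.
    import uuid

    CODEC_ID = str(uuid.uuid4())  # agreed on out of band by both sides

    def attach_codec_id(custom_metadata):
        custom_metadata["ARROW:codec_id"] = CODEC_ID  # hypothetical key
        return custom_metadata

    def select_strategy(custom_metadata, registry):
        return registry[custom_metadata["ARROW:codec_id"]]
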
> >>>>>>>>>>> On Wed, Mar 4, 2020 at 11:39 PM Wes McKinney <
> >>>>>> wesmckinn@gmail.com>
> >>>>>>>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Okay, I guess my question is how the receiver is going to be able to
> >>>>>>>>>>>> determine how to "rehydrate" the record batch buffers:
> >>>>>>>>>>>>
> >>>>>>>>>>>> What I've proposed amounts to the following:
> >>>>>>>>>>>>
> >>>>>>>>>>>> * UNCOMPRESSED: the current behavior
> >>>>>>>>>>>> * ZSTD/LZ4/...: each buffer is compressed and written with an int64
> >>>>>>>>>>>> length prefix
> >>>>>>>>>>>>
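The proposed per-buffer framing is simple enough to show end to end. A
Python sketch with the zstandard package, assuming a little-endian int64
prefix in keeping with Arrow's conventions:

    # Each compressed buffer: int64 uncompressed length, then the body.
    import struct
    import zstandard  # pip install zstandard

    def frame_buffer(buf):
        body = zstandard.ZstdCompressor().compress(buf)
        return struct.pack("<q", len(buf)) + body

    def unframe_buffer(framed):
        (n,) = struct.unpack_from("<q", framed, 0)
        return zstandard.ZstdDecompressor().decompress(
            framed[8:], max_output_size=n)

    payload = b"x" * 10_000
    assert unframe_buffer(frame_buffer(payload)) == payload
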
> >>>>>>>>>>>> (I'm close to putting up a PR implementing an experimental version of
> >>>>>>>>>>>> this that uses Message.custom_metadata to transmit the codec, so this
> >>>>>>>>>>>> will make the implementation details more concrete)
> >>>>>>>>>>>>
> >>>>>>>>>>>> So in the USER_DEFINED case, how will the library know how to obtain
> >>>>>>>>>>>> the uncompressed buffer? Is some additional metadata structure
> >>>>>>>>>>>> required to provide instructions?
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Wed, Mar 4, 2020 at 8:05 AM Fan Liya <li...@gmail.com>
> >>>>>>>> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Hi Wes,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I am thinking of adding an option named "USER_DEFINED" (or something
> >>>>>>>>>>>>> similar) to enum CompressionType in your proposal.
> >>>>>>>>>>>>> IMO, this option should be used primarily in Flight.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Best,
> >>>>>>>>>>>>> Liya Fan
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Wed, Mar 4, 2020 at 11:12 AM Wes McKinney <
> >>>>>>>> wesmckinn@gmail.com>
> >>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Tue, Mar 3, 2020, 8:11 PM Fan Liya <
> >>>>>> liya.fan03@gmail.com>
> >>>>>>>>>> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Sure. I agree with you that we should not overdo this.
> >>>>>>>>>>>>>>> I am wondering if we should provide an option to allow users to plug in
> >>>>>>>>>>>>>>> their customized compression strategies.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Can you provide a patch showing changes to Message.fbs (or Schema.fbs)
> >>>>>>>>>>>>>> that make this idea more concrete?
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Best,
> >>>>>>>>>>>>>>> Liya Fan
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On Tue, Mar 3, 2020 at 9:47 PM Wes McKinney <
> >>>>>>>> wesmckinn@gmail.com
> >>>>>>>>>>>
> >>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> On Tue, Mar 3, 2020, 7:36 AM Fan Liya <
> >>>>>>>> liya.fan03@gmail.com>
> >>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> I am so glad to see this discussion, and I am willing to provide help
> >>>>>>>>>>>>>>>>> from the Java side.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> In the proposal, I see the support for basic compression strategies
> >>>>>>>>>>>>>>>>> (e.g. gzip, snappy).
> >>>>>>>>>>>>>>>>> IMO, applying a single basic strategy is not likely to achieve a
> >>>>>>>>>>>>>>>>> performance improvement for most scenarios.
> >>>>>>>>>>>>>>>>> The optimal compression strategy is often obtained by composing basic
> >>>>>>>>>>>>>>>>> strategies and tuning parameters.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> I hope we can support such highly customized compression strategies.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> I think anything much beyond trivial one-shot buffer-level compression is
> >>>>>>>>>>>>>>>> probably out of the question for addition to the current "RecordBatch"
> >>>>>>>>>>>>>>>> Flatbuffers type, because the additional metadata would add undesirable
> >>>>>>>>>>>>>>>> bloat (which I would be against). If people have other ideas it would be
> >>>>>>>>>>>>>>>> great to see exactly what you are thinking as far as changes to the
> >>>>>>>>>>>>>>>> protocol files.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> I'll try to assemble some examples to show the before/after results of
> >>>>>>>>>>>>>>>> applying the simple strategy.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Best,
> >>>>>>>>>>>>>>>>> Liya Fan
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> On Tue, Mar 3, 2020 at 8:15 PM Antoine Pitrou <
> >>>>>>>>>>>> antoine@python.org>
> >>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> If we want to use an HTTP header, it would be more of an
> >>>>>>>>>>>>>>>>>> Accept-Encoding header, no?
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> In any case, we would have to put non-standard values there (e.g. lz4),
> >>>>>>>>>>>>>>>>>> so I'm not sure how desirable it is to repurpose HTTP headers for that,
> >>>>>>>>>>>>>>>>>> rather than add some dedicated field to the Flight messages.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Regards
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Antoine.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> On 03/03/2020 at 12:52, David Li wrote:
> >>>>>>>>>>>>>>>>>>> gRPC supports headers, so for Flight we could send essentially an
> >>>>>>>>>>>>>>>>>>> Accept header and perhaps a Content-Type header.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> David
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> On Mon, Mar 2, 2020, 23:15 Micah Kornfield <
> >>>>>>>>>>>>>> emkornfield@gmail.com>
> >>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Hi Wes,
> >>>>>>>>>>>>>>>>>>>> A few thoughts on this.  In general, I think it is a good idea.  But
> >>>>>>>>>>>>>>>>>>>> before proceeding, I think the following points are worth discussing:
> >>>>>>>>>>>>>>>>>>>> 1.  Does this actually improve throughput/latency for Flight? (I think
> >>>>>>>>>>>>>>>>>>>> you mentioned you would follow up with benchmarks).
> >>>>>>>>>>>>>>>>>>>> 2.  I think we should limit the number of supported compression schemes
> >>>>>>>>>>>>>>>>>>>> to only 1 or 2.  I think the criteria for selection should be speed and
> >>>>>>>>>>>>>>>>>>>> native implementations available across the widest possible range of
> >>>>>>>>>>>>>>>>>>>> languages.  As far as I can tell zstd only has bindings in Java via
> >>>>>>>>>>>>>>>>>>>> JNI, but my understanding is it is probably the right type of
> >>>>>>>>>>>>>>>>>>>> compression for our use-cases.  So I think zstd + potentially 1 more.
> >>>>>>>>>>>>>>>>>>>> 3.  Commitment from someone on the Java side to implement this.
> >>>>>>>>>>>>>>>>>>>> 4.  This doesn't need to be coupled with this change per se, but for
> >>>>>>>>>>>>>>>>>>>> something like Flight it would be good to have a standard mechanism for
> >>>>>>>>>>>>>>>>>>>> negotiating server/client capabilities (e.g. the client doesn't support
> >>>>>>>>>>>>>>>>>>>> compression or only supports a subset).
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>>>>>> Micah
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> On Sun, Mar 1, 2020 at 1:24 PM Wes McKinney <
> >>>>>>>>>>>>>> wesmckinn@gmail.com>
> >>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> On Sun, Mar 1, 2020 at 3:14 PM Antoine Pitrou <
> >>>>>>>>>>>>>>> antoine@python.org>
> >>>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> On 01/03/2020 at 22:01, Wes McKinney wrote:
> >>>>>>>>>>>>>>>>>>>>>>> In the context of a "next version of the Feather format" ARROW-5510
> >>>>>>>>>>>>>>>>>>>>>>> (which is consumed only by Python and R at the moment), I have been
> >>>>>>>>>>>>>>>>>>>>>>> looking at compressing buffers using fast compressors like ZSTD when
> >>>>>>>>>>>>>>>>>>>>>>> writing the RecordBatch bodies. This could be handled privately as an
> >>>>>>>>>>>>>>>>>>>>>>> implementation detail of the Feather file, but since ZSTD compression
> >>>>>>>>>>>>>>>>>>>>>>> could improve throughput in Flight, for example, I thought I would
> >>>>>>>>>>>>>>>>>>>>>>> bring it up for discussion.
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> I can see two simple compression strategies:
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> * Compress the entire message body in one shot, writing the result out
> >>>>>>>>>>>>>>>>>>>>>>> with an 8-byte int64 prefix indicating the uncompressed size
> >>>>>>>>>>>>>>>>>>>>>>> * Compress each non-zero-length constituent Buffer prior to writing to
> >>>>>>>>>>>>>>>>>>>>>>> the body (and using the same uncompressed-length prefix when writing
> >>>>>>>>>>>>>>>>>>>>>>> the compressed buffer)
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> The latter strategy is preferable for scenarios where we may project
> >>>>>>>>>>>>>>>>>>>>>>> out only a few fields from a larger record batch (such as reading from
> >>>>>>>>>>>>>>>>>>>>>>> a memory-mapped file).
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> Agreed.  It may also allow using different compression strategies for
> >>>>>>>>>>>>>>>>>>>>>> different kinds of buffers (for example a bytestream splitting strategy
> >>>>>>>>>>>>>>>>>>>>>> for floats and doubles, or a delta encoding strategy for integers).
> >>>>>>>>>>>>>>>>>>>>>
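Bytestream splitting of the kind mentioned above is easy to illustrate:
transpose the bytes of a fixed-width buffer so that every value's first
byte is stored together, then every second byte, and so on, which typically
makes float data far more compressible by a generic codec. A numpy sketch
of the idea (not any particular implementation):

    # Byte-stream splitting for a fixed-width (e.g. float64) buffer.
    import numpy as np

    def split_bytes(values):
        itemsize = values.dtype.itemsize
        raw = values.view(np.uint8).reshape(-1, itemsize)
        return raw.T.tobytes()  # one contiguous stream per byte position

    def unsplit_bytes(data, dtype):
        itemsize = np.dtype(dtype).itemsize
        raw = np.frombuffer(data, np.uint8).reshape(itemsize, -1)
        return raw.T.copy().view(dtype).ravel()

    x = np.random.default_rng(0).normal(size=1000)
    assert np.array_equal(unsplit_bytes(split_bytes(x), x.dtype), x)
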
> >>>>>>>>>>>>>>>>>>>>> If we wanted to allow for different compression to apply to different
> >>>>>>>>>>>>>>>>>>>>> buffers, I think we will need a new Message type because this would
> >>>>>>>>>>>>>>>>>>>>> inflate metadata sizes in a way that is not likely to be acceptable
> >>>>>>>>>>>>>>>>>>>>> for the current uncompressed use case.
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> Here is my strawman proposal
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> https://github.com/apache/arrow/compare/master...wesm:compression-strawman
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> Implementation could be accomplished by one of the following methods:
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> * Setting a field in Message.custom_metadata
> >>>>>>>>>>>>>>>>>>>>>>> * Adding a new field to Message
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> I think it has to be a new field in Message.  Making it an ignorable
> >>>>>>>>>>>>>>>>>>>>>> metadata field means non-supporting receivers will decode and interpret
> >>>>>>>>>>>>>>>>>>>>>> the data wrongly.
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> Regards
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> Antoine.
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>

Re: [DISCUSS] Adding "trivial" buffer compression option to IPC protocol (ARROW-300)

Posted by Antoine Pitrou <an...@python.org>.
The read times are still with memory mapping for the uncompressed case?
If so, impressive!

Regards

Antoine.


Le 01/04/2020 à 16:44, Wes McKinney a écrit :
> Several pieces of work got done in the last few days:
> 
> * Changing from LZ4 raw to LZ4 frame format (what is recommended for
> interoperability)
> * Parallelizing both compression and decompression at the field level
> 
> Here are the results (using 8 threads on an 8-core laptop). I disabled
> the "memory map" feature so that in the uncompressed case all of the
> data must be read off disk into memory. This helps illustrate the
> compression/IO tradeoff to wall clock load times
> 
> File size (only LZ4 may be different): https://ibb.co/CP3VQkp
> Read time: https://ibb.co/vz9JZMx
> Write time: https://ibb.co/H7bb68T
> 
> In summary, now with multicore compression and decompression,
> LZ4-compressed files are faster both to read and write even on a very
> fast SSD, as are ZSTD-compressed files with a low ZSTD compression
> level. I didn't notice a major difference between LZ4 raw and LZ4
> frame formats. The reads and writes could be made faster still by
> pipelining / making concurrent the disk read/write and
> compression/decompression steps -- the current implementation performs
> these tasks serially. We can improve this in the near future
> 
> I'll update the Format proposal this week so we can move toward
> something we can vote on. I would recommend that we await
> implementations and integration tests for this before releasing this
> as stable, in line with prior discussions about adding stuff to the
> IPC protocol
> 
> On Thu, Mar 26, 2020 at 4:57 PM Wes McKinney <we...@gmail.com> wrote:
>>
>> Here are the results:
>>
>> File size: https://ibb.co/71sBsg3
>> Read time: https://ibb.co/4ZncdF8
>> Write time: https://ibb.co/xhNkRS2
>>
>> Code: https://github.com/wesm/notebooks/blob/master/20190919file_benchmarks/FeatherCompression.ipynb
>> (based on https://github.com/apache/arrow/pull/6694)
>>
>> High level summary:
>>
>> * Chunksize 1024 vs 64K has relatively limited impact on file sizes
>>
>> * Wall clock read time is impacted by chunksize, maybe 30-40%
>> difference between 1K row chunks versus 16K row chunks. One notable
>> thing is that you can see clearly the overhead associated with IPC
>> reconstruction even when the data is memory mapped. For example, in
>> the Fannie Mae dataset there are 21,661 batches (each batch has 31
>> fields) when the chunksize is 1024. So a read time of 1.3 seconds
>> indicates ~60 microseconds of overhead for each record batch. When you
>> consider the amount of business logic involved with reconstructing a
>> record batch, 60 microseconds is pretty good. This also shows that
>> every microsecond counts and we need to be carefully tracking
>> microperformance in this critical operation.
>>
>> * Small chunksize results in higher write times for "expensive" codecs
>> like ZSTD with a high compression ratio. For "cheap" codecs like LZ4
>> it doesn't make as much of a difference
>>
>> * Note that LZ4 compressor results in faster wall clock time to disk
>> presumably because the compression speed is faster than my SSD's write
>> speed
>>
>> Implementation notes:
>> * There is no parallelization or pipelining of reads or writes. For
>> example, on write, all of the buffers are compressed with a single
>> thread and then compression stops until the write to disk completes.
>> On read, buffers are decompressed serially
>>
>>
>> On Thu, Mar 26, 2020 at 12:24 PM Wes McKinney <we...@gmail.com> wrote:
>>>
>>> I'll run a grid of batch sizes (from 1024 to 64K or 128K) and let you
>>> know the read/write times and compression ratios. Shouldn't take too
>>> long
>>>
>>> On Wed, Mar 25, 2020 at 10:37 PM Fan Liya <li...@gmail.com> wrote:
>>>>
>>>> Thanks a lot for sharing the good results.
>>>>
>>>> As investigated by Wes, we have existing zstd library for Java (zstd-jni) [1], and lz4 library for Java (lz4-java) [2].
>>>> +1 for the 1024 batch size, as it represents an important scenario where the batch fits into the L1 cache (IMO).
>>>>
>>>> Best,
>>>> Liya Fan
>>>>
>>>> [1] https://github.com/luben/zstd-jni
>>>> [2] https://github.com/lz4/lz4-java
>>>>
>>>> On Thu, Mar 26, 2020 at 2:38 AM Micah Kornfield <em...@gmail.com> wrote:
>>>>>
>>>>> If it isn't hard could you run with batch sizes of 1024 or 2048 records?  I
>>>>> think there was a question previously raised if there was benefit for
>>>>> smaller sizes buffers.
>>>>>
>>>>> Thanks,
>>>>> Micah
>>>>>
>>>>>
>>>>> On Wed, Mar 25, 2020 at 8:59 AM Wes McKinney <we...@gmail.com> wrote:
>>>>>
>>>>>> On Tue, Mar 24, 2020 at 9:22 PM Micah Kornfield <em...@gmail.com>
>>>>>> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>> Compression ratios ranging from ~50% with LZ4 and ~75% with ZSTD on
>>>>>>>> the Taxi dataset to ~87% with LZ4 and ~90% with ZSTD on the Fannie Mae
>>>>>>>> dataset. So that's a huge space savings
>>>>>>>
>>>>>>> One more question on this.  What was the average row-batch size used?  I
>>>>>>> see in the proposal some buffers might not be compressed, did you this
>>>>>>> feature in the test?
>>>>>>
>>>>>> I used 64K row batch size. I haven't implemented the optional
>>>>>> non-compressed buffers (for cases where there is little space savings)
>>>>>> so everything is compressed. I can check different batch sizes if you
>>>>>> like
>>>>>>
>>>>>>
>>>>>>> On Mon, Mar 23, 2020 at 4:40 PM Wes McKinney <we...@gmail.com>
>>>>>> wrote:
>>>>>>>
>>>>>>>> hi folks,
>>>>>>>>
>>>>>>>> Sorry it's taken me a little while to produce supporting benchmarks.
>>>>>>>>
>>>>>>>> * I implemented experimental trivial body buffer compression in
>>>>>>>> https://github.com/apache/arrow/pull/6638
>>>>>>>> * I hooked up the Arrow IPC file format with compression as the new
>>>>>>>> Feather V2 format in
>>>>>>>> https://github.com/apache/arrow/pull/6694#issuecomment-602906476
>>>>>>>>
>>>>>>>> I tested a couple of real-world datasets from a prior blog post
>>>>>>>> https://ursalabs.org/blog/2019-10-columnar-perf/ with ZSTD and LZ4
>>>>>>>> codecs
>>>>>>>>
>>>>>>>> The complete results are here
>>>>>>>> https://github.com/apache/arrow/pull/6694#issuecomment-602906476
>>>>>>>>
>>>>>>>> Summary:
>>>>>>>>
>>>>>>>> * Compression ratios ranging from ~50% with LZ4 and ~75% with ZSTD on
>>>>>>>> the Taxi dataset to ~87% with LZ4 and ~90% with ZSTD on the Fannie Mae
>>>>>>>> dataset. So that's a huge space savings
>>>>>>>> * Single-threaded decompression times exceeding 2-4GByte/s with LZ4
>>>>>>>> and 1.2-3GByte/s with ZSTD
>>>>>>>>
>>>>>>>> I would have to do some more engineering to test throughput changes
>>>>>>>> with Flight, but given these results on slower networking (e.g. 1
>>>>>>>> Gigabit) my guess is that the compression and decompression overhead
>>>>>>>> is little compared with the time savings due to high compression
>>>>>>>> ratios. If people would like to see these numbers to help make a
>>>>>>>> decision I can take a closer look
>>>>>>>>
>>>>>>>> As far as what Micah said about having a limited number of
>>>>>>>> compressors: I would be in favor of having just LZ4 and ZSTD. It seems
>>>>>>>> anecdotally that these outperform Snappy in most real world scenarios
>>>>>>>> and generally have > 1 GB/s decompression performance. Some Linux
>>>>>>>> distributions (Arch at least) have already started adopting ZSTD over
>>>>>>>> LZMA or GZIP [1]
>>>>>>>>
>>>>>>>> - Wes
>>>>>>>>
>>>>>>>> [1]:
>>>>>>>>
>>>>>> https://www.archlinux.org/news/now-using-zstandard-instead-of-xz-for-package-compression/
>>>>>>>>
>>>>>>>> On Fri, Mar 6, 2020 at 8:42 AM Fan Liya <li...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> Hi Wes,
>>>>>>>>>
>>>>>>>>> Thanks a lot for the additional information.
>>>>>>>>> Looking forward to see the good results from your experiments.
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Liya Fan
>>>>>>>>>
>>>>>>>>> On Thu, Mar 5, 2020 at 11:42 PM Wes McKinney <we...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> I see, thank you.
>>>>>>>>>>
>>>>>>>>>> For such a scenario, implementations would need to define a
>>>>>>>>>> "UserDefinedCodec" interface to enable codecs to be registered from
>>>>>>>>>> third party code, similar to what is done for extension types [1]
>>>>>>>>>>
>>>>>>>>>> I'll update this thread when I get my experimental C++ patch up to
>>>>>> see
>>>>>>>>>> what I'm thinking at least for the built-in codecs we have like
>>>>>> ZSTD.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>> https://github.com/apache/arrow/blob/apache-arrow-0.16.0/docs/source/format/Columnar.rst#extension-types
>>>>>>>>>>
>>>>>>>>>> On Thu, Mar 5, 2020 at 7:56 AM Fan Liya <li...@gmail.com>
>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi Wes,
>>>>>>>>>>>
>>>>>>>>>>> Thanks a lot for your further clarification.
>>>>>>>>>>>
>>>>>>>>>>> Some of my prelimiary thoughts:
>>>>>>>>>>>
>>>>>>>>>>> 1. We assign a unique GUID to each pair of
>>>>>> compression/decompression
>>>>>>>>>>> strategies. The GUID is stored as part of the
>>>>>>>> Message.custom_metadata.
>>>>>>>>>> When
>>>>>>>>>>> receiving the GUID, the receiver knows which decompression
>>>>>> strategy
>>>>>>>> to
>>>>>>>>>> use.
>>>>>>>>>>>
>>>>>>>>>>> 2. We serialize the decompression strategy, and store it into the
>>>>>>>>>>> Message.custom_metadata. The receiver can decompress data after
>>>>>>>>>>> deserializing the strategy.
>>>>>>>>>>>
>>>>>>>>>>> Method 1 is generally used in static strategy scenarios while
>>>>>> method
>>>>>>>> 2 is
>>>>>>>>>>> generally used in dynamic strategy scenarios.
>>>>>>>>>>>
>>>>>>>>>>> Best,
>>>>>>>>>>> Liya Fan
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Mar 4, 2020 at 11:39 PM Wes McKinney <
>>>>>> wesmckinn@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Okay, I guess my question is how the receiver is going to be
>>>>>> able
>>>>>>>> to
>>>>>>>>>>>> determine how to "rehydrate" the record batch buffers:
>>>>>>>>>>>>
>>>>>>>>>>>> What I've proposed amounts to the following:
>>>>>>>>>>>>
>>>>>>>>>>>> * UNCOMPRESSED: the current behavior
>>>>>>>>>>>> * ZSTD/LZ4/...: each buffer is compressed and written with an
>>>>>> int64
>>>>>>>>>>>> length prefix
>>>>>>>>>>>>
>>>>>>>>>>>> (I'm close to putting up a PR implementing an experimental
>>>>>> version
>>>>>>>> of
>>>>>>>>>>>> this that uses Message.custom_metadata to transmit the codec,
>>>>>> so
>>>>>>>> this
>>>>>>>>>>>> will make the implementation details more concrete)
>>>>>>>>>>>>
>>>>>>>>>>>> So in the USER_DEFINED case, how will the library know how to
>>>>>>>> obtain
>>>>>>>>>>>> the uncompressed buffer? Is some additional metadata structure
>>>>>>>>>>>> required to provide instructions?
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Mar 4, 2020 at 8:05 AM Fan Liya <li...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Wes,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I am thinking of adding an option named "USER_DEFINED" (or
>>>>>>>> something
>>>>>>>>>>>>> similar) to enum CompressionType in your proposal.
>>>>>>>>>>>>> IMO, this option should be used primarily in Flight.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Best,
>>>>>>>>>>>>> Liya Fan
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Mar 4, 2020 at 11:12 AM Wes McKinney <
>>>>>>>> wesmckinn@gmail.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, Mar 3, 2020, 8:11 PM Fan Liya <
>>>>>> liya.fan03@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Sure. I agree with you that we should not overdo this.
>>>>>>>>>>>>>>> I am wondering if we should provide an option to allow
>>>>>> users
>>>>>>>> to
>>>>>>>>>>>> plugin
>>>>>>>>>>>>>>> their customized compression strategies.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Can you provide a patch showing changes to Message.fbs (or
>>>>>>>>>> Schema.fbs)
>>>>>>>>>>>> that
>>>>>>>>>>>>>> make this idea more concrete?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>> Liya Fan
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Tue, Mar 3, 2020 at 9:47 PM Wes McKinney <
>>>>>>>> wesmckinn@gmail.com
>>>>>>>>>>>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Tue, Mar 3, 2020, 7:36 AM Fan Liya <liya.fan03@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I am so glad to see this discussion, and I am willing to
>>>>>>>>>>>>>>>>> provide help from the Java side.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> In the proposal, I see the support for basic compression
>>>>>>>>>>>>>>>>> strategies (e.g. gzip, snappy).
>>>>>>>>>>>>>>>>> IMO, applying a single basic strategy is not likely to achieve a
>>>>>>>>>>>>>>>>> performance improvement for most scenarios.
>>>>>>>>>>>>>>>>> The optimal compression strategy is often obtained by composing
>>>>>>>>>>>>>>>>> basic strategies and tuning parameters.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I hope we can support such highly customized compression
>>>>>>>>>>>>>>>>> strategies.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I think anything much beyond trivial one-shot buffer-level
>>>>>>>>>>>>>>>> compression is probably out of the question for addition to the
>>>>>>>>>>>>>>>> current "RecordBatch" Flatbuffers type, because the additional
>>>>>>>>>>>>>>>> metadata would add undesirable bloat (which I would be against).
>>>>>>>>>>>>>>>> If people have other ideas, it would be great to see exactly
>>>>>>>>>>>>>>>> what you are thinking as far as changes to the protocol files.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I'll try to assemble some examples to show the before/after
>>>>>>>>>>>>>>>> results of applying the simple strategy.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>> Liya Fan
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Tue, Mar 3, 2020 at 8:15 PM Antoine Pitrou <antoine@python.org> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> If we want to use an HTTP header, it would be more of an
>>>>>>>>>>>>>>>>>> Accept-Encoding header, no?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> In any case, we would have to put non-standard values there
>>>>>>>>>>>>>>>>>> (e.g. lz4), so I'm not sure how desirable it is to repurpose
>>>>>>>>>>>>>>>>>> HTTP headers for that, rather than add some dedicated field to
>>>>>>>>>>>>>>>>>> the Flight messages.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Regards
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Antoine.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On 03/03/2020 at 12:52, David Li wrote:
>>>>>>>>>>>>>>>>>>> gRPC supports headers, so for Flight we could send essentially
>>>>>>>>>>>>>>>>>>> an Accept header and perhaps a Content-Type header.
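>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> As a hypothetical illustration (the header name below is made
>>>>>>>>>>>>>>>>>>> up, not an agreed convention), codec preferences could travel
>>>>>>>>>>>>>>>>>>> as per-call gRPC metadata in Python:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>   import grpc  # requires the grpcio package
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>   channel = grpc.insecure_channel("localhost:8815")
>>>>>>>>>>>>>>>>>>>   # codecs the client can decode, like Accept-Encoding
>>>>>>>>>>>>>>>>>>>   metadata = [("x-arrow-accept-compression", "zstd, lz4")]
>>>>>>>>>>>>>>>>>>>   # passed per call on any generated stub method, e.g.
>>>>>>>>>>>>>>>>>>>   # stub.DoGet(request, metadata=metadata)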
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> David
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Mon, Mar 2, 2020, 23:15 Micah Kornfield <emkornfield@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Hi Wes,
>>>>>>>>>>>>>>>>>>>> A few thoughts on this.  In general, I think it is a good
>>>>>>>>>>>>>>>>>>>> idea.  But before proceeding, I think the following points
>>>>>>>>>>>>>>>>>>>> are worth discussing:
>>>>>>>>>>>>>>>>>>>> 1.  Does this actually improve throughput/latency for
>>>>>>>>>>>>>>>>>>>> Flight?  (I think you mentioned you would follow up with
>>>>>>>>>>>>>>>>>>>> benchmarks.)
>>>>>>>>>>>>>>>>>>>> 2.  I think we should limit the number of supported
>>>>>>>>>>>>>>>>>>>> compression schemes to only 1 or 2.  I think the criteria
>>>>>>>>>>>>>>>>>>>> for selection should be speed and the availability of native
>>>>>>>>>>>>>>>>>>>> implementations across the widest possible set of languages.
>>>>>>>>>>>>>>>>>>>> As far as I can tell, zstd only has bindings in Java via
>>>>>>>>>>>>>>>>>>>> JNI, but my understanding is that it is probably the right
>>>>>>>>>>>>>>>>>>>> type of compression for our use-cases.  So I think zstd +
>>>>>>>>>>>>>>>>>>>> potentially 1 more.
>>>>>>>>>>>>>>>>>>>> 3.  Commitment from someone on the Java side to implement
>>>>>>>>>>>>>>>>>>>> this.
>>>>>>>>>>>>>>>>>>>> 4.  This doesn't need to be coupled with this change per se,
>>>>>>>>>>>>>>>>>>>> but for something like Flight it would be good to have a
>>>>>>>>>>>>>>>>>>>> standard mechanism for negotiating server/client
>>>>>>>>>>>>>>>>>>>> capabilities (e.g. the client doesn't support compression or
>>>>>>>>>>>>>>>>>>>> only supports a subset).
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>> Micah
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Sun, Mar 1, 2020 at 1:24 PM Wes McKinney <wesmckinn@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Sun, Mar 1, 2020 at 3:14 PM Antoine Pitrou <antoine@python.org> wrote:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On 01/03/2020 at 22:01, Wes McKinney wrote:
>>>>>>>>>>>>>>>>>>>>>>> In the context of a "next version of the Feather format"
>>>>>>>>>>>>>>>>>>>>>>> ARROW-5510 (which is consumed only by Python and R at the
>>>>>>>>>>>>>>>>>>>>>>> moment), I have been looking at compressing buffers using
>>>>>>>>>>>>>>>>>>>>>>> fast compressors like ZSTD when writing the RecordBatch
>>>>>>>>>>>>>>>>>>>>>>> bodies. This could be handled privately as an
>>>>>>>>>>>>>>>>>>>>>>> implementation detail of the Feather file, but since ZSTD
>>>>>>>>>>>>>>>>>>>>>>> compression could improve throughput in Flight, for
>>>>>>>>>>>>>>>>>>>>>>> example, I thought I would bring it up for discussion.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> I can see two simple compression strategies:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> * Compress the entire message body in one shot, writing
>>>>>>>>>>>>>>>>>>>>>>> the result out with an 8-byte int64 prefix indicating the
>>>>>>>>>>>>>>>>>>>>>>> uncompressed size
>>>>>>>>>>>>>>>>>>>>>>> * Compress each non-zero-length constituent Buffer prior
>>>>>>>>>>>>>>>>>>>>>>> to writing to the body (and using the same
>>>>>>>>>>>>>>>>>>>>>>> uncompressed-length prefix when writing the compressed
>>>>>>>>>>>>>>>>>>>>>>> buffer)
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> The latter strategy is preferable for scenarios where we
>>>>>>>>>>>>>>>>>>>>>>> may project out only a few fields from a larger record
>>>>>>>>>>>>>>>>>>>>>>> batch (such as reading from a memory-mapped file).
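>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> A toy contrast of the two strategies in Python, with zlib
>>>>>>>>>>>>>>>>>>>>>>> standing in for ZSTD/LZ4 (illustrative only, not the
>>>>>>>>>>>>>>>>>>>>>>> proposed wire format):
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>   import struct, zlib
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>   def compress_body_one_shot(buffers):
>>>>>>>>>>>>>>>>>>>>>>>       # whole body must be decompressed even if only one
>>>>>>>>>>>>>>>>>>>>>>>       # field is projected
>>>>>>>>>>>>>>>>>>>>>>>       body = b"".join(buffers)
>>>>>>>>>>>>>>>>>>>>>>>       return struct.pack("<q", len(body)) + zlib.compress(body)
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>   def compress_per_buffer(buffers):
>>>>>>>>>>>>>>>>>>>>>>>       # each buffer keeps its own uncompressed-length
>>>>>>>>>>>>>>>>>>>>>>>       # prefix, so a reader can decompress only the
>>>>>>>>>>>>>>>>>>>>>>>       # buffers of the projected fields
>>>>>>>>>>>>>>>>>>>>>>>       return [struct.pack("<q", len(b)) + zlib.compress(b)
>>>>>>>>>>>>>>>>>>>>>>>               for b in buffers]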
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Agreed.  It may also allow using different compression
>>>>>>>>>>>>>>>>>>>>>> strategies for different kinds of buffers (for example a
>>>>>>>>>>>>>>>>>>>>>> byte-stream splitting strategy for floats and doubles, or a
>>>>>>>>>>>>>>>>>>>>>> delta encoding strategy for integers).
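>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> For reference, a minimal sketch of those two transforms in
>>>>>>>>>>>>>>>>>>>>>> Python/NumPy (illustrative, not part of any proposal here):
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>   import numpy as np
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>   def byte_stream_split(values: np.ndarray) -> bytes:
>>>>>>>>>>>>>>>>>>>>>>       # gather the i-th byte of every float64 into stream
>>>>>>>>>>>>>>>>>>>>>>       # i; the grouped streams usually compress better
>>>>>>>>>>>>>>>>>>>>>>       b = values.astype("<f8").view(np.uint8).reshape(-1, 8)
>>>>>>>>>>>>>>>>>>>>>>       return b.T.tobytes()
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>   def delta_encode(values: np.ndarray) -> np.ndarray:
>>>>>>>>>>>>>>>>>>>>>>       # keep the first value, then successive differences;
>>>>>>>>>>>>>>>>>>>>>>       # invert with np.cumsum
>>>>>>>>>>>>>>>>>>>>>>       return np.concatenate([values[:1], np.diff(values)])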
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> If we wanted to allow different compression to apply to
>>>>>>>>>>>>>>>>>>>>> different buffers, I think we would need a new Message type,
>>>>>>>>>>>>>>>>>>>>> because this would inflate metadata sizes in a way that is
>>>>>>>>>>>>>>>>>>>>> not likely to be acceptable for the current uncompressed use
>>>>>>>>>>>>>>>>>>>>> case.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Here is my strawman proposal:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> https://github.com/apache/arrow/compare/master...wesm:compression-strawman
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Implementation could be accomplished by one of the
>>>>>>>>>>>>>>>>>>>>>>> following methods:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> * Setting a field in Message.custom_metadata
>>>>>>>>>>>>>>>>>>>>>>> * Adding a new field to Message
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> I think it has to be a new field in Message.  Making it an
>>>>>>>>>>>>>>>>>>>>>> ignorable metadata field means non-supporting receivers will
>>>>>>>>>>>>>>>>>>>>>> decode and interpret the data wrongly.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Regards
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Antoine.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>>