Posted to user@arrow.apache.org by Chris Nuernberger <ch...@techascent.com> on 2022/01/13 22:55:02 UTC

examples of using new compression scheme

Upgrading to 6.0.X, I noticed that record batches can have body
compression, which I think is great.

I had trouble finding examples in Python or R (or Java) of writing an IPC
file with various types of compression used for the record batch.

Is the compression applied per column, or to the record batch after the
buffers have been serialized?  If it is applied per column, which buffers?
Given that text, for example, can consist of three buffers (validity,
offset, data), is compression applied to all three, just the data buffer,
or the data and offset buffers?
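
A minimal pyarrow sketch of the kind of example being asked for (assuming
pyarrow 4.0 or newer, which is when IPC body compression shipped; the table
contents and file name are illustrative):

    import pyarrow as pa

    table = pa.table({"x": [1, 2, 3], "s": ["a", "b", None]})

    # Ask for zstd body compression once; it then applies to every record
    # batch this writer emits ("lz4" is the other codec in the spec).
    options = pa.ipc.IpcWriteOptions(compression="zstd")

    with pa.OSFile("data.arrow", "wb") as sink:
        with pa.ipc.new_file(sink, table.schema, options=options) as writer:
            writer.write_table(table)

    # Reading is transparent; the codec is recorded per batch in the file.
    roundtrip = pa.ipc.open_file("data.arrow").read_all()
    assert roundtrip.equals(table)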

Re: examples of using new compression scheme

Posted by Micah Kornfield <em...@gmail.com>.
>
> If better compression speed in Java Arrow is desired, then opening a
> JIRA + a Github PR would be most welcome.


https://issues.apache.org/jira/browse/ARROW-11901 is the one covering it.

Re: examples of using new compression scheme

Posted by Antoine Pitrou <an...@python.org>.
On Thu, 20 Jan 2022 14:42:25 -0700
Chris Nuernberger <ch...@techascent.com> wrote:
> OK, well, for the record, tech.ml.dataset supports three major features
> the official SDK does not: mmap, JDK-17 support, and (just now)
> compression as a user-accessible option during write
> - file:///home/chrisn/dev/tech.all/tech.ml.dataset/docs/tech.v3.libs.arrow.html.
>
> Other lz4 compressors are faster than Apache's, but regardless zstd gets
> the best compression ratio for the very simple test files I tested.

If better compression speed in Java Arrow is desired, then opening a
JIRA + a Github PR would be most welcome.

Regards

Antoine.



Re: examples of using new compression scheme

Posted by Chris Nuernberger <ch...@techascent.com>.
OK, well, for the record, tech.ml.dataset supports three major features
the official SDK does not: mmap, JDK-17 support, and (just now) compression
as a user-accessible option during write
- file:///home/chrisn/dev/tech.all/tech.ml.dataset/docs/tech.v3.libs.arrow.html.

Other lz4 compressors are faster than Apache's, but regardless zstd gets the
best compression ratio for the very simple test files I tested.
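
As a rough illustration of that comparison, a sketch that writes the same
table once per codec and prints the resulting IPC file sizes (the column
contents are illustrative, and the ratios will vary wildly with real data):

    import pyarrow as pa

    # A deliberately compressible table; results depend heavily on the data.
    table = pa.table({"s": ["some repetitive text"] * 100_000})

    for codec in (None, "lz4", "zstd"):
        sink = pa.BufferOutputStream()
        options = pa.ipc.IpcWriteOptions(compression=codec)
        with pa.ipc.new_file(sink, table.schema, options=options) as writer:
            writer.write_table(table)
        print(codec, sink.getvalue().size, "bytes")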

Re: examples of using new compression scheme

Posted by Chris Nuernberger <ch...@techascent.com>.
Makes sense.  The Apache lz4 compressor is very slow according to my
timings -- at least 10x and more like 50x slower than zstd -- so I can
totally understand the basis of the questions.

Re: examples of using new compression scheme

Posted by Micah Kornfield <em...@gmail.com>.
Hi Chris,

> Are there compression constants for snappy and brotli that I am not
> seeing?  The flatbuffer definition of the constants only contains lz4 and
> zstd.

These are not in the spec.  We chose to limit the number of compression
standards supported unless there is real demand for more.  When I last
looked, I believe lz4 is pretty much strictly better than snappy with a
proper implementation.  I believe Brotli might provide some space advantages
over ZSTD but is generally slower.  If there is a strong use case for other
codecs, I would suggest discussing it on the dev@ mailing list as an
addition.
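
For what it's worth, those trade-offs are easy to poke at with pyarrow's
general-purpose Codec wrapper (a buffer-level sketch, not the IPC write
path; snappy and brotli availability depends on how pyarrow was built):

    import pyarrow as pa

    data = ("some repetitive text " * 10_000).encode()
    for name in ("lz4", "zstd", "snappy", "brotli"):
        if pa.Codec.is_available(name):
            compressed = pa.Codec(name).compress(data)
            print(name, compressed.size, "bytes")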


> That performance discussion is interesting.  It is disappointing that, as
> far as Java libraries are concerned, tech.ml.dataset isn't brought up, as
> it is both faster and supports more features than the base Arrow Java SDK.

Generally, outside of LZ4, performance of the Java library hasn't been
brought up much.  The only reason I mentioned javacpp was because the
question about native bindings was asked (and you of course know about
tech.ml.dataset).

Cheers,
Micah


Re: examples of using new compression scheme

Posted by Chris Nuernberger <ch...@techascent.com>.
Are there compression constants for snappy and brotli that I am not seeing?
The flatbuffer definition of the constants only contains lz4 and zstd.

That performance discussion is interesting.  It is disappointing that, as
far as Java libraries are concerned, tech.ml.dataset isn't brought up, as it
is both faster and supports more features than the base Arrow Java SDK.
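
One way to confirm the supported set from Python (a small probe; pyarrow is
expected to reject anything outside the spec's lz4/zstd pair for IPC body
compression, though the exact error may vary by version):

    import pyarrow as pa

    for codec in ("lz4", "zstd", "snappy", "brotli"):
        try:
            pa.ipc.IpcWriteOptions(compression=codec)
            print(codec, "accepted")
        except Exception as exc:
            print(codec, "rejected:", exc)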

Re: examples of using new compression scheme

Posted by Micah Kornfield <em...@gmail.com>.
Hi Chris,

> Looking through the code, it appears that this isn't exposed to users:
> ArrowWriter doesn't use the VectorUnloader constructor that includes a
> compressor, so no one is using this in Java.  I found some great comments
> in the Go code that are *super* helpful about the compressed buffer's
> format.

Unfortunately, the addition of compression didn't go through the normal
path for integrating new features (integration tests actively running
between two or more languages).  Right now only the read path is tested,
from a statically generated file in C++, so this gap wasn't caught.  A
contribution to fix this would be welcome.

> Who is using compression?  Are you using it via the C++ dataset pathways
> or one of the various language wrappers?

We use the decode path in Java (and other languages) to connect to a
service my team owns that serves Arrow data with optional compression.
Note that LZ4 is very slow in Java today [1].

> What about putting a nice C interface on top of all the C++, and then
> basing R, Python, Julia, and Java on top of that one C interface via JNA,
> JNR, or JDK-17's FFI pathway?  It seems like a hell of a lot less work
> than the bespoke language wrappers - and everyone gets access to all the
> features at the same time.

There is already a GObject [2] interface on top of Arrow C++ that is used
in the Ruby bindings.  R and Python bind directly to C++ already.  In terms
of other implementations, there is value in not having every implementation
share the same core, as it helps ensure the specification is understandable
and can be implemented outside of the project if necessary.  Also, for some
languages it makes prebuilt distribution easier if native code is not
required.

If you are looking for auto-generated bindings for C++ Arrow in Java, there
is a project [3] that does that.  I have never used it, so I can't comment
on its quality.

-Micah


[1] https://issues.apache.org/jira/browse/ARROW-11901
[2] https://en.wikipedia.org/wiki/GObject
[3] https://github.com/bytedeco/javacpp-presets/tree/master/arrow


Re: examples of using new compression scheme

Posted by Chris Nuernberger <ch...@techascent.com>.
Looking through the code, it appears that this isn't exposed to users:
ArrowWriter doesn't use the VectorUnloader constructor that includes a
compressor, so no one is using this in Java.  I found some great comments
in the Go code that are *super* helpful about the compressed buffer's
format.

Who is using compression?  Are you using it via the C++ dataset pathways
or one of the various language wrappers?

What about putting a nice C interface on top of all the C++, and then
basing R, Python, Julia, and Java on top of that one C interface via JNA,
JNR, or JDK-17's FFI pathway?  It seems like a hell of a lot less work
than the bespoke language wrappers - and everyone gets access to all the
features at the same time.

Re: examples of using new compression scheme

Posted by Chris Nuernberger <ch...@techascent.com>.
Great - I just hadn't noticed until now.  Thanks!

Re: examples of using new compression scheme

Posted by Micah Kornfield <em...@gmail.com>.
Hi Chris,

> Upgrading to 6.0.X, I noticed that record batches can have body
> compression, which I think is great.


Small nit: this was released in Arrow 4.

> I had trouble finding examples in Python or R (or Java) of writing an IPC
> file with various types of compression used for the record batch.


Java code is at [1], with implementations of the compression codecs living
in [2].

> Is the compression applied per column, or to the record batch after the
> buffers have been serialized?  If it is applied per column, which buffers?
> Given that text, for example, can consist of three buffers (validity,
> offset, data), is compression applied to all three, just the data buffer,
> or the data and offset buffers?

It is applied per buffer; all buffers are compressed.

Cheers,
Micah


[1]
https://github.com/apache/arrow/blob/8e43f23dcc6a9e630516228f110c48b64d13cec6/java/vector/src/main/java/org/apache/arrow/vector/VectorUnloader.java#L100
[2]
https://github.com/apache/arrow/tree/971a9d352e456882aa5b70ac722607840cdb9df7/java/compression/src
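
To make the per-buffer answer concrete, the three buffers behind a string
column are visible from Python; on write, each one is compressed
independently (a small illustration only):

    import pyarrow as pa

    arr = pa.array(["alpha", None, "gamma"])
    # A string array carries a validity bitmap, int32 offsets, and the
    # character data; compression is applied to each buffer separately.
    for name, buf in zip(("validity", "offsets", "data"), arr.buffers()):
        print(name, buf.size, "bytes")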
