Posted to user@arrow.apache.org by Chris Nuernberger <ch...@techascent.com> on 2020/08/13 20:22:00 UTC

Blogpost on Arrow's binary format & memory mapping

Arrow Users -

We took some time and wrote a blogpost on arrow's binary format and memory
mapping on the JVM.  We are happy with how succinctly we broke down the
binary format in a visual way and think Arrow users looking to do
interesting/unsupported things with Arrow may be interested in the
presentation.

https://techascent.com/blog/memory-mapping-arrow.html

Chris

Re: Blogpost on Arrow's binary format & memory mapping

Posted by Chris Nuernberger <ch...@techascent.com>.
Micah,

I checked and you are correct: the VectorLoader does not copy anything, so
as long as you can create an ArrowBuf you can initialize a batch of
vectors with that ArrowBuf.  I had thought the VectorLoader did another
copy itself.
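
To make that concrete, here is a rough sketch of the sort of thing I mean
(assuming a single nullable Float8Vector column; names and buffer layout
here are illustrative, not code from either library):

(import '[org.apache.arrow.vector VectorSchemaRoot VectorLoader]
        '[org.apache.arrow.vector.ipc.message ArrowRecordBatch ArrowFieldNode])

;; Hand pre-existing ArrowBufs (e.g. slices of an mmapped file) to the
;; vectors of `root` via VectorLoader; the element data itself is not copied.
(defn load-batch-in-place [^VectorSchemaRoot root row-count validity-buf data-buf]
  (let [node  (ArrowFieldNode. (long row-count) 0)
        batch (ArrowRecordBatch. (int row-count) [node] [validity-buf data-buf])]
    (.load (VectorLoader. root) batch)
    root))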

The allocators on vectors don't pose a meaningful issue; they just
seem like mild overengineering to me.  The state machine incorporated into
the vectors definitely caused a few WTF moments.

The pathway from an Arrow vector into a tech reader goes through the
underlying ArrowBuf, from which we create a native buffer, which is the same
thing the mmap pathway operates on.  It is probably tough to follow even for
Clojure people due to the compile-time programming used to support the many
Arrow vector types, but the outline is:

For each datatype supported, implement a conversion from the vector
datatype to a tech reader
<https://github.com/techascent/tech.ml.dataset/blob/ffbf40b6f5e3e4c916bb905c28dccaaef5d9e4cc/src/tech/libs/arrow/copying.clj#L378>
via its underlying arrow buffer
<https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/libs/arrow/copying.clj#L378>,
which converts to the same native buffer struct the mmap pathway uses
<https://github.com/techascent/tech.datatype/blob/b621dbe8ad94d42e4bd0db261e75fb1c8e03ace1/src/tech/v2/datatype/mmap.clj#L136>.
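
Spelled out for a single datatype, that conversion amounts to something like
this (a sketch; the map keys are illustrative and not the actual native
buffer struct from tech.datatype):

(import '[org.apache.arrow.vector Float8Vector])

;; Grab the vector's data ArrowBuf and describe it the same way the mmap
;; pathway describes a mapped file: a raw address, a byte length, a datatype.
(defn float8-vector->native-buffer [^Float8Vector v]
  (let [buf (.getDataBuffer v)]
    {:address  (.memoryAddress buf)
     :n-bytes  (* 8 (.getValueCount v))
     :datatype :float64}))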

This is done with a technique called protocols
<https://clojure.org/reference/protocols>, a language feature of
Clojure that allows you to map interfaces onto a type after the fact,
precisely for situations like this where I want to bind the Arrow vectors
to the TechAscent numerics system.  Protocols can cause a noticeable
performance penalty, so I use them once to convert into a different,
efficient representation but not for per-element access
<https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/libs/arrow/copying.clj#L378>.
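
If you have not seen protocols before, the mechanism itself looks like this
(a toy sketch only; these are not our actual protocol names):

(defprotocol PElementCount
  (element-count [item] "How many elements does this thing hold?"))

;; Extend the protocol onto pre-existing Java types after the fact,
;; without touching their source.
(extend-protocol PElementCount
  org.apache.arrow.vector.Float8Vector
  (element-count [v] (.getValueCount v))
  java.util.List
  (element-count [l] (.size l)))
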
So it is unlikely that anything in the performance tuning guide is going to
make a difference; I don't use the arrow vector accessors in the first
place, but rather do a one-off conversion of the vector into its data memory
address and then use the memory address directly.  I did check the
getSafe accessors, however, and they added a small amount of extra overhead,
but not enough to really make a point about.

This means the mmap pathway and the copying pathway boil down to the
exact same code for elementwise access; the timing cannot change between
them.  I was interested in file loading time in a specific case where you
only wanted 1 column out of many, not getSafe/etc. timings, which can be
avoided in multiple ways.

Thanks for both of your responses :-).

Chris

On Sat, Aug 15, 2020 at 9:26 PM Micah Kornfield <em...@gmail.com>
wrote:

> Hi Chris,
>
>> The deserialization system should not assume a copy is necessary
>>> <https://github.com/apache/arrow/blob/ecba35cac76185f59c55048b82e895e5f8300140/java/vector/src/main/java/org/apache/arrow/vector/ipc/message/MessageSerializer.java#L381>.
>>>
>>>
>> This is one of many ways to reconstruct an arrow record batch. We
>> frequently reconstruct without any copies. It'd be great if you looked to
>> contribute some of the improvements you believe are needed back to the
>> project.
>>
>
> +1, if I didn't say this on the previous thread: IIRC, there is nothing
> about the VectorLoader [1] that assumes copies; this just needs to be
> pushed further down the stack.
>
> My opinion is that a better design for the Arrow JVM bindings would be to
>> have each record batch be potentially allocated but remove allocators from
>> the vectors themselves.
>
>
> Could you expand on this?  What problems do allocators on Vectors present?
>
> Lastly, if you are running benchmarks, please check out the performance tuning
> section of the README [2], which includes environment variables that would
> be set under production scenarios (I had a little trouble following the
> clojure call but it does look like it is calling "get" on the
> Float8Vector?).
>
> -Micah
>
> [1]
> https://github.com/apache/arrow/blob/ecba35cac76185f59c55048b82e895e5f8300140/java/vector/src/main/java/org/apache/arrow/vector/VectorLoader.java
> [2] https://github.com/apache/arrow/tree/master/java#performance-tuning
>
>
>
>

Re: Blogpost on Arrow's binary format & memory mapping

Posted by Micah Kornfield <em...@gmail.com>.
Hi Chris,

> The deserialization system should not assume a copy is necessary
>> <https://github.com/apache/arrow/blob/ecba35cac76185f59c55048b82e895e5f8300140/java/vector/src/main/java/org/apache/arrow/vector/ipc/message/MessageSerializer.java#L381>.
>>
>>
> This is one of many ways to reconstruct an arrow record batch. We
> frequently reconstruct without any copies. It'd be great if you looked to
> contribute some of the improvements you believe are needed back to the
> project.
>

+1, if I didn't say this on the previous thread: IIRC, there is nothing
about the VectorLoader [1] that assumes copies; this just needs to be
pushed further down the stack.

My opinion is that a better design for the Arrow JVM bindings would be to
> have each record batch be potentially allocated but remove allocators from
> the vectors themselves.


Could you expand on this?  What problems do allocators on Vectors present?

Lastly, if you are running benchmarks, please check out the performance tuning
section of the README [2], which includes environment variables that would
be set under production scenarios (I had a little trouble following the
clojure call but it does look like it is calling "get" on the
Float8Vector?).
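
(If memory serves, the specific settings are the JVM properties
-Darrow.enable_null_check_for_get=false and
-Darrow.enable_unsafe_memory_access=true.)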

-Micah

[1]
https://github.com/apache/arrow/blob/ecba35cac76185f59c55048b82e895e5f8300140/java/vector/src/main/java/org/apache/arrow/vector/VectorLoader.java
[2] https://github.com/apache/arrow/tree/master/java#performance-tuning

Re: Blogpost on Arrow's binary format & memory mapping

Posted by Jacques Nadeau <ja...@apache.org>.
>
> The deserialization system should not assume a copy is necessary
> <https://github.com/apache/arrow/blob/ecba35cac76185f59c55048b82e895e5f8300140/java/vector/src/main/java/org/apache/arrow/vector/ipc/message/MessageSerializer.java#L381>.
>
>
This is one of many ways to reconstruct an arrow record batch. We
frequently reconstruct without any copies. It'd be great if you looked to
contribute some of the improvements you believe are needed back to the
project.

>

Re: Blogpost on Arrow's binary format & memory mapping

Posted by Chris Nuernberger <ch...@techascent.com>.
Micah,

Thanks for taking the time to check out the post! We will have more
performance comparisons later but I wanted to address your question about
buffer allocators.

Here is the code for loading a record in-place using our numerics
stack:
https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/libs/arrow/in_place.clj#L262

We have, at the base level of our numerics stack, a set of typed pure
interfaces called readers:
https://github.com/techascent/tech.datatype/blob/master/java/tech/v2/datatype/DoubleReader.java.
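
A from-scratch illustration of the idea (this is not the real interface;
see the DoubleReader.java link above for that):

;; A typed, pure read-by-index interface, here reified over an Arrow
;; vector's data buffer.
(definterface IDoubleReader
  (^double readDouble [^long idx]))

(defn arrow-buf->double-reader [^org.apache.arrow.vector.Float8Vector v]
  (let [buf (.getDataBuffer v)]
    (reify IDoubleReader
      (readDouble [_ idx] (.getDouble buf (* 8 idx))))))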


The mmap/set-native-datatype
<https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/libs/arrow/in_place.clj#L293>
call simply constructs a reader of the appropriate datatype that uses
unsafe under the covers to read bytes *but* also implements interfaces so
that I can get back to the native buffer for bulk copies to/from java
arrays or other native buffers.
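
Paraphrased, and eliding the per-datatype code generation, what comes back
is morally equivalent to this (names are illustrative only):

;; Obtain sun.misc.Unsafe the usual reflective way.
(def ^sun.misc.Unsafe unsafe
  (let [field (doto (.getDeclaredField sun.misc.Unsafe "theUnsafe")
                (.setAccessible true))]
    (.get field nil)))

(definterface INativeBacked
  (^long nativeAddress [])
  (^long nativeByteLength []))

;; A float64 reader over a raw native address that also lets callers get
;; back to the underlying buffer for bulk copies.
(defn native-double-reader [^long address ^long n-elems]
  (reify
    IDoubleReader                       ;; from the sketch above
    (readDouble [_ idx] (.getDouble unsafe (+ address (* 8 idx))))
    INativeBacked
    (nativeAddress [_] address)
    (nativeByteLength [_] (* 8 n-elems))))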

So, since we have an entire numerics stack meant for working with both JVM
heap and native heap buffers, it definitely wasn't worth it to construct an
allocator; it is far less code to just effectively cast the pointer to
exactly the right type, and those abstract readers are what the dataset
system works off of anyway.  In fact, if I construct an actual
tech.ml.dataset from the copying pathway instead of using the Arrow vectors
themselves, I just get the underlying buffer and work from that, thus
bypassing most of the allocator design (and the rest of the Arrow codebase)
entirely.

My opinion is that a better design for the Arrow JVM bindings would be to
have each record batch be potentially allocated but remove allocators from
the vectors themselves.  The deserialization system should not assume a
copy is necessary
<https://github.com/apache/arrow/blob/ecba35cac76185f59c55048b82e895e5f8300140/java/vector/src/main/java/org/apache/arrow/vector/ipc/message/MessageSerializer.java#L381>.
This sets you up, when it makes sense, for mmapping the entire file, in
which case the record batches themselves won't have allocators.  Note this
doesn't preclude copying the batch as happens now; it just doesn't force
it.

As an aside, similar to Gandiva, we have bindings built on this numerics
stack to an AST-based binary code generation system, but one with a much
more powerful optimization stack and backends for CPU, GPU, wasm, FPGAs,
OpenGL, and lots of other pathways:
https://github.com/techascent/tvm-clj.  TVM could be an interesting
direction to research for really high-performance work, or perhaps a
JVM-specific version of TVM that supports some of the new vector
instructions <https://openjdk.java.net/jeps/338>.

Chris


On Thu, Aug 13, 2020 at 11:43 PM Micah Kornfield <em...@gmail.com>
wrote:

> I'd also add that your point:
>
> There are certainly other situations such as small files where the copying
>> pathway is indeed faster, but for these pathways it is not even close.
>
> This is pretty much the intended design of the java library.  Not small
> files per se, but small batches streamed through processing pipelines.
>
> On Thu, Aug 13, 2020 at 7:59 PM Micah Kornfield <em...@gmail.com>
> wrote:
>
>> Hi Chris,
>> Nice write-up.  I'm curious if you did more analysis on where time was
>> spent for each method?
>>
>> It seems to confirm that investing in zero copy read from disk provides a
>> nice speedup.  I'm curious, did you attempt to create a buffer allocator
>> based on memory-mapped files for comparison?
>>
>> Thanks,
>> Micah
>>
>> On Thursday, August 13, 2020, Chris Nuernberger <ch...@techascent.com>
>> wrote:
>>
>>> Arrow Users -
>>>
>>> We took some time and wrote a blogpost on arrow's binary format and
>>> memory mapping on the JVM.  We are happy with how succinctly we broke down
>>> the binary format in a visual way and think Arrow users looking to do
>>> interesting/unsupported things with Arrow may be interested in the
>>> presentation.
>>>
>>> https://techascent.com/blog/memory-mapping-arrow.html
>>>
>>> Chris
>>>
>>

Re: Blogpost on Arrow's binary format & memory mapping

Posted by Micah Kornfield <em...@gmail.com>.
I'd also add that your point:

There are certainly other situations such as small files where the copying
> pathway is indeed faster, but for these pathways it is not even close.

This is pretty much the intended design of the java library.  Not small
files per se, but small batches streamed through processing pipelines.

On Thu, Aug 13, 2020 at 7:59 PM Micah Kornfield <em...@gmail.com>
wrote:

> Hi Chris,
> Nice write-up.  I'm curious if you did more analysis on where time was
> spent for each method?
>
> It seems to confirm that investing in zero copy read from disk provides a
> nice speedup.  I'm curious, did you attempt to create a buffer allocator
> based on memory-mapped files for comparison?
>
> Thanks,
> Micah
>
> On Thursday, August 13, 2020, Chris Nuernberger <ch...@techascent.com>
> wrote:
>
>> Arrow Users -
>>
>> We took some time and wrote a blogpost on arrow's binary format and
>> memory mapping on the JVM.  We are happy with how succinctly we broke down
>> the binary format in a visual way and think Arrow users looking to do
>> interesting/unsupported things with Arrow may be interested in the
>> presentation.
>>
>> https://techascent.com/blog/memory-mapping-arrow.html
>>
>> Chris
>>
>

Re: Blogpost on Arrow's binary format & memory mapping

Posted by Micah Kornfield <em...@gmail.com>.
Hi Chris,
Nice write-up.  I'm curious if you did more analysis on where time was
spent for each method?

It seems to confirm that investing in zero copy read from disk provides a
nice speedup.  I'm curious, did you attempt to create a buffer allocator
based on memory-mapped files for comparison?

Thanks,
Micah

On Thursday, August 13, 2020, Chris Nuernberger <ch...@techascent.com>
wrote:

> Arrow Users -
>
> We took some time and wrote a blogpost on arrow's binary format and memory
> mapping on the JVM.  We are happy with how succinctly we broke down the
> binary format in a visual way and think Arrow users looking to do
> interesting/unsupported things with Arrow may be interested in the
> presentation.
>
> https://techascent.com/blog/memory-mapping-arrow.html
>
> Chris
>