Posted to user@arrow.apache.org by Andrew Melo <an...@gmail.com> on 2020/01/23 13:02:32 UTC

(java) Producing an in-memory Arrow buffer from a file

Hello all,

I work in particle physics, which has standardized on the ROOT (
http://root.cern) file format to store/process our data. The format itself
is quite complicated, but the relevant part here is that after
parsing/decompression, we end up with value and offset buffers holding our
data.

What I'd like to do is represent these data in-memory in the Arrow format.
I've written a very rough POC where I manually put an Arrow stream into a
ByteBuffer, then replaced the value and offset buffers with the bytes from
my files, and I'm wondering what the "proper" way to do this is. From my
reading of the code, it appears (?) that what I want to do is produce a
org.apache.arrow.vector.types.pojo.Schema object and N ArrowRecordBatch
objects, then use MessageSerializer to stick them into a ByteBuffer one
after another.
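
To make that concrete, here's roughly what I have in mind -- an untested
sketch, with class and method names from my reading of the sources (and
"toArrowStream" is just a placeholder name of mine):

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.Channels;
    import java.util.List;
    import org.apache.arrow.vector.ipc.WriteChannel;
    import org.apache.arrow.vector.ipc.message.ArrowRecordBatch;
    import org.apache.arrow.vector.ipc.message.MessageSerializer;
    import org.apache.arrow.vector.types.pojo.Schema;

    // Serialize the schema header, then each record batch, back to back.
    static ByteBuffer toArrowStream(Schema schema, List<ArrowRecordBatch> batches)
        throws IOException {
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      WriteChannel channel = new WriteChannel(Channels.newChannel(out));
      MessageSerializer.serialize(channel, schema);
      for (ArrowRecordBatch batch : batches) {  // e.g. one batch per row range
        MessageSerializer.serialize(channel, batch);
      }
      return ByteBuffer.wrap(out.toByteArray());
    }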

Is this correct? Or, is there another API I'm missing?

Thanks!
Andrew

Re: (java) Producing an in-memory Arrow buffer from a file

Posted by Sebastien Binet <bi...@cern.ch>.
hi Andrew,

slightly related, but probably also slightly off-topic:
(for inspiration) you may want to look at how this is done in groot/rarrow,
where tools are exported to:
- expose a ROOT "schema" as an Arrow Schema
- expose a ROOT Tree as an Arrow Table

groot/rarrow doesn't do zero-copy of ROOT data, though.

hth,
-s

Re: (java) Producing an in-memory Arrow buffer from a file

Posted by Micah Kornfield <em...@gmail.com>.
Hi Andrew,
Sorry for the late reply.


> I have the data stored in a hierarchy that is roughly table->columns->row
> ranges->ByteBuffer, so I presume ArrowBuf is the right direction. Since
> each column's row range is stored and compressed separately, I could
> decompress them directly into an ArrowBuf (?) and then skip having to
> iterate over the values.

Yes, based on the description this sounds like the right approach.

>> Depending on your end goal, you might want to stream the values through a
>> VectorSchemaRoot instead.
>
> It appears (?) that this option also involves iterating over all the
> values.

Yes.

>
> Looking at your examples and thinking about it conceptually, is there much
> of a difference between constructing a large ByteBuffer (or ArrowBuf) with
> the various messages inside it and handing that to Arrow to parse, versus
> building the Java object representation myself?


IMO (I'm not an expert in the Java library), if you already have separate
ByteBuffers then constructing the object representation yourself probably
makes sense.
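
Something along these lines, perhaps -- an untested sketch from memory, where
"loadBatch" and the *Buf parameters are placeholders for your wrapped buffers,
and a single variable-width field is assumed (adjust the node/buffer lists to
your schema):

    import java.util.Arrays;
    import io.netty.buffer.ArrowBuf;
    import org.apache.arrow.vector.VectorLoader;
    import org.apache.arrow.vector.VectorSchemaRoot;
    import org.apache.arrow.vector.ipc.message.ArrowFieldNode;
    import org.apache.arrow.vector.ipc.message.ArrowRecordBatch;

    // root: a VectorSchemaRoot whose schema matches the buffer layout.
    // For a varlen field the buffers go in validity/offset/value order.
    static void loadBatch(VectorSchemaRoot root, int rowCount, int nullCount,
        ArrowBuf validityBuf, ArrowBuf offsetBuf, ArrowBuf valueBuf) {
      try (ArrowRecordBatch batch = new ArrowRecordBatch(
          rowCount,
          Arrays.asList(new ArrowFieldNode(rowCount, nullCount)),
          Arrays.asList(validityBuf, offsetBuf, valueBuf))) {
        new VectorLoader(root).load(batch);
      }
    }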

Re: (java) Producing an in-memory Arrow buffer from a file

Posted by Andrew Melo <an...@gmail.com>.
Hi Micah,

On Fri, Jan 24, 2020 at 6:17 AM Micah Kornfield <em...@gmail.com>
wrote:

> Hi Andrew,
> It might help to provide a little more detail on where you are starting
> from and what you want to do once you have the data in arrow format.
>

Of course! Like I mentioned, particle physics data is processed in ROOT,
which is a whole-stack solution -- from file I/O all the way up to plotting
routines. There are a few different groups working on adopting non-physics
tools like Spark or the scientific Python ecosystem to process these data
(so, still reading ROOT files, but doing the higher-level interaction with
different applications). I want to analyze these data with Spark, so I've
implemented a (Java-based) Spark DataSource which reads ROOT files. Some of
my colleagues are experimenting with Kafka and were wondering if the same
code could be re-used for them (they would like to put ROOT data into Kafka
topics, as I understand it).

Currently, I parse the ROOT metadata to find where the value/offset buffers
are within the file, then decompress the buffers and store them in an
object hierarchy which I then use to implement the Spark API. I'd like to
replace the intermediate object hierarchy with Arrow because:

1) I could re-use the existing Spark code [1] to do the drudgework of
extracting values from the buffers. That code is ~25% of my codebase.
2) Adapting this code for different Java-based applications becomes quite a
bit easier. For example, Kafka supports Arrow-based sources, so adding
Kafka support would be relatively straightforward.


>
>  If you have the data already available in some sort of off-heap
> data structure, you can potentially avoid copies by wrapping it with the
> existing ArrowBuf machinery [1]. If you have an iterator over the data, you
> can also directly build a ListVector [2].
>

I have the data stored in a hierarchy that is roughly table->columns->row
ranges->ByteBuffer, so I presume ArrowBuf is the right direction. Since
each column's row range is stored and compressed separately, I could
decompress them directly into an ArrowBuf (?) and then skip having to
iterate over the values.
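
i.e., something like this (untested -- "decompressed" stands in for the
ByteBuffer I get back from my decompression step):

    import java.nio.ByteBuffer;
    import io.netty.buffer.ArrowBuf;
    import org.apache.arrow.memory.BufferAllocator;

    // Copy one decompressed column/row-range into Arrow-managed memory.
    static ArrowBuf toArrowBuf(BufferAllocator allocator, ByteBuffer decompressed) {
      int length = decompressed.remaining();
      ArrowBuf buf = allocator.buffer(length);
      buf.setBytes(0, decompressed);  // copies from position to limit
      buf.writerIndex(length);        // mark those bytes as written
      return buf;
    }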


>
> Depending on your end goal, you might want to stream the values through a
> VectorSchemaRoot instead.
>

It appears (?) that this option also involves iterating over all the values.


>
> There is some documentation that gives an overview of the Java libraries
> [3]; it will be published with the next release and might be helpful.
>
>
I'll take a look at that, thanks!

Looking at your examples and thinking about it conceptually, is there much
of a difference between constructing a large ByteBuffer (or ArrowBuf) with
the various messages inside it and handing that to Arrow to parse, versus
building the Java object representation myself?

Thanks for your patience,
Andrew

[1]
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/vectorized/ArrowColumnVector.java


> Cheers,
> Micah
>
> [1]
> https://javadoc.io/static/org.apache.arrow/arrow-memory/0.15.1/io/netty/buffer/ArrowBuf.html
> [2]
> https://github.com/apache/arrow/blob/master/java/vector/src/main/java/org/apache/arrow/vector/complex/ListVector.java
> [3] https://github.com/apache/arrow/tree/master/docs/source/java
>

Re: (java) Producing an in-memory Arrow buffer from a file

Posted by Micah Kornfield <em...@gmail.com>.
Hi Andrew,
It might help to provide a little more detail on where you are starting
from and what you want to do once you have the data in arrow format.

If you have the data already available in some sort of off-heap
data structure, you can potentially avoid copies by wrapping it with the
existing ArrowBuf machinery [1]. If you have an iterator over the data, you
can also directly build a ListVector [2].
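
For the iterator case, the writer API looks roughly like this (untested
sketch; "buildListVector" is my name for it, and the Iterable<int[]> input
is a stand-in for however you iterate your data):

    import org.apache.arrow.memory.BufferAllocator;
    import org.apache.arrow.vector.complex.ListVector;
    import org.apache.arrow.vector.complex.impl.UnionListWriter;

    static ListVector buildListVector(BufferAllocator allocator,
        Iterable<int[]> rowRanges) {
      ListVector vector = ListVector.empty("ranges", allocator);
      UnionListWriter writer = vector.getWriter();
      int row = 0;
      for (int[] range : rowRanges) {  // one list entry per row range
        writer.setPosition(row++);
        writer.startList();
        for (int v : range) {
          writer.writeInt(v);
        }
        writer.endList();
      }
      vector.setValueCount(row);
      return vector;  // caller is responsible for closing the vector
    }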

Depending on your end goal, you might want to stream the values through a
VectorSchemaRoot instead.
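
e.g., something like this (untested; "fillVectorsForBatch" is a placeholder
for however you populate the vectors for each batch):

    import java.io.IOException;
    import java.io.OutputStream;
    import org.apache.arrow.memory.BufferAllocator;
    import org.apache.arrow.vector.VectorSchemaRoot;
    import org.apache.arrow.vector.ipc.ArrowStreamWriter;
    import org.apache.arrow.vector.types.pojo.Schema;

    static void streamBatches(Schema schema, BufferAllocator allocator,
        OutputStream out, int numBatches) throws IOException {
      try (VectorSchemaRoot root = VectorSchemaRoot.create(schema, allocator);
           ArrowStreamWriter writer = new ArrowStreamWriter(root, null, out)) {
        writer.start();
        for (int i = 0; i < numBatches; i++) {
          int rows = fillVectorsForBatch(root, i);  // placeholder: fill root's vectors
          root.setRowCount(rows);
          writer.writeBatch();  // the same root is reused for every batch
        }
        writer.end();
      }
    }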

There is some documentation that gives an overview of the Java libraries
[3]; it will be published with the next release and might be helpful.

Cheers,
Micah

[1]
https://javadoc.io/static/org.apache.arrow/arrow-memory/0.15.1/io/netty/buffer/ArrowBuf.html
[2]
https://github.com/apache/arrow/blob/master/java/vector/src/main/java/org/apache/arrow/vector/complex/ListVector.java
[3] https://github.com/apache/arrow/tree/master/docs/source/java
