Posted to dev@arrow.apache.org by Li Jin <ic...@gmail.com> on 2017/04/25 22:22:36 UTC

Serialize/deserialize ArrowRecordBatch to/from bytes?

Hello,

I am trying to serialize/deserialize ArrowRecordBatch in Java, but since
the API has changed quite a bit since 0.2.0, I am struggling to find how to
do it correctly. I checked the tests for ArrowFileWriter and ArrowFileReader,
but it's still not clear to me. Could someone please give an example or a
pointer? It would be very helpful.

def serialize(records: ArrowRecordBatch): Array[Byte]

def deserialize(bytes: Array[Byte]): ArrowRecordBatch

Thank you,
Li

Re: Serialize/deserialize ArrowRecordBatch to/from bytes?

Posted by Li Jin <ic...@gmail.com>.
Thanks Julien and Bryan.

Bryan, perfect, this is super helpful. I will check your recent update to
https://github.com/BryanCutler/spark/commits/wip-toPandas_with_arrow-SPARK-13534
and rebase on top of it.


Re: Serialize/deserialize ArrowRecordBatch to/from bytes?

Posted by Bryan Cutler <cu...@gmail.com>.
I just updated my PR for SPARK-13534
https://github.com/apache/spark/pull/15821 so that it uses the latest from
Arrow; hopefully that helps. I have also been playing around with Python
UDFs in Spark with Arrow. I have something sort of working; there are
still some issues and the branch is kind of messy right now, but feel
free to check it out:
https://github.com/BryanCutler/spark/tree/wip-arrow-stream-serializer
I mention this because I saw you created a related Spark PR and I'd be
glad to help out if you want.

Bryan


Re: Serialize/deserialize ArrowRecordBatch to/from bytes?

Posted by Julien Le Dem <ju...@dremio.com>.
Example of writing to and reading from a file:
https://github.com/apache/arrow/blob/master/java/vector/src/test/java/org/apache/arrow/vector/file/TestArrowFile.java
Similarly, in case you don't want to go through a file:
Unloading a vector into buffers and loading from buffers:
https://github.com/apache/arrow/blob/master/java/vector/src/test/java/org/apache/arrow/vector/TestVectorUnloadLoad.java
VectorLoader and VectorUnloader are what's used to read/write files.




-- 
Julien

Re: Serialize/deserialize ArrowRecordBatch to/from bytes?

Posted by Li Jin <ic...@gmail.com>.
Thanks for the various pointers. I was looking at ArrowFileWriter/Reader
and got a little bit confused.

So what I am trying to do is convert a list of Spark rows into some
Arrow format in Java (I will probably go with the file format for now),
send the bytes to Python, and deserialize them into a pyarrow Table.

Here is what I currently plan to do:
(1) convert the rows to one or more Arrow record batches (using
ValueVectors)
(2) serialize the Arrow record batches and send them over to Python (not
sure what to use here; ArrowFileWriter?)
(3) deserialize the bytes into a pyarrow.Table using pyarrow.FileReader

I *think* ArrowFileWriter is what I should use to send data over in (2),
but:
(1) I would need to turn the Arrow record batches into a VectorSchemaRoot
by doing something like this:
https://github.com/icexelloss/spark/blob/pandas-udf/sql/core/src/test/scala/org/apache/spark/sql/ArrowConvertersSuite.scala#L226
(2) I am not sure how to write all the data in a VectorSchemaRoot using
ArrowFileWriter.

Does this sound like the right thing to do?

Thanks,
Li
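Step (2) above boils down to getting the serialized batches into a single
byte stream that the Python side can split apart again. Below is a minimal,
stdlib-only sketch of that framing, with plain byte[] payloads standing in
for real serialized record batches; the Arrow file format written by
ArrowFileWriter does its own framing, so treat this only as an illustration
of the wire shape, not as the actual Arrow protocol.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Sketch: length-prefix framing so several batch payloads can travel
// as one byte stream and be split back apart on the receiving side.
public class BatchFraming {

    // Concatenate the payloads, each preceded by its length.
    static byte[] frame(List<byte[]> batches) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeInt(batches.size());      // number of batches
        for (byte[] batch : batches) {
            out.writeInt(batch.length);    // length prefix
            out.write(batch);              // batch payload
        }
        out.flush();
        return bytes.toByteArray();
    }

    // Split the framed stream back into individual payloads.
    static List<byte[]> unframe(byte[] data) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
        int count = in.readInt();
        List<byte[]> batches = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            byte[] batch = new byte[in.readInt()];
            in.readFully(batch);
            batches.add(batch);
        }
        return batches;
    }
}
```

On the Python side the same length prefixes can be read back with
struct.unpack before handing each payload to pyarrow.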


Re: Serialize/deserialize ArrowRecordBatch to/from bytes?

Posted by Wes McKinney <we...@gmail.com>.
Also, now that we have a website that is easier to write content for (in
Markdown), it would be great if some Java developers could volunteer some
time to write user-facing documentation to go with the Javadocs.


Re: Serialize/deserialize ArrowRecordBatch to/from bytes?

Posted by Wes McKinney <we...@gmail.com>.
There is also
https://github.com/apache/arrow/blob/master/java/vector/src/test/java/org/apache/arrow/vector/file/TestArrowStreamPipe.java


Re: Serialize/deserialize ArrowRecordBatch to/from bytes?

Posted by Li Jin <ic...@gmail.com>.
Thanks Julien. I will follow
https://github.com/apache/arrow/blob/990e2bde758ac8bc6e4497ae1bc37f89b71bb5cf/java/vector/src/test/java/org/apache/arrow/vector/stream/MessageSerializerTest.java#L91

Re: Serialize/deserialize ArrowRecordBatch to/from bytes?

Posted by Julien Le Dem <ju...@dremio.com>.
Look at org.apache.arrow.vector.stream.MessageSerializer.

There are methods to serialize/deserialize to/from channels; these could
be adapted to byte arrays. The APIs are usually in terms of ByteBuffers.
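Adapting a channel-based API to byte arrays is mostly a matter of wrapping
in-memory streams as channels. The sketch below shows that pattern with
plain java.nio; the commented-out MessageSerializer calls show where the
Arrow-specific serialization would slot in, but their exact names and
signatures should be checked against the MessageSerializerTest linked
above for your Arrow version.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;
import java.nio.channels.WritableByteChannel;

// Sketch: turning channel-oriented serialize/deserialize into the
// byte[]-oriented signatures from the original question.
public class ChannelByteAdapter {

    // Write through a channel into an in-memory buffer, then grab the bytes.
    static byte[] serialize(ByteBuffer payload) throws Exception {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (WritableByteChannel channel = Channels.newChannel(out)) {
            // With Arrow, something like (verify against your version):
            // MessageSerializer.serialize(new WriteChannel(channel), batch);
            while (payload.hasRemaining()) {
                channel.write(payload);
            }
        }
        return out.toByteArray();
    }

    // Wrap the bytes in a readable channel and read them back.
    static ByteBuffer deserialize(byte[] bytes) throws Exception {
        ByteBuffer result = ByteBuffer.allocate(bytes.length);
        try (ReadableByteChannel channel =
                 Channels.newChannel(new ByteArrayInputStream(bytes))) {
            // With Arrow, something like (verify against your version):
            // MessageSerializer.deserializeRecordBatch(new ReadChannel(channel), allocator);
            while (result.hasRemaining() && channel.read(result) >= 0) {
                // keep reading until the buffer is full or the stream ends
            }
        }
        result.flip();
        return result;
    }
}
```

The same wrapping works for any sink: swap the ByteArrayOutputStream for a
socket's output stream and the bytes go over the wire instead.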







-- 
Julien