Posted to dev@arrow.apache.org by Pierre Avérous <pi...@gmail.com> on 2020/02/26 13:02:22 UTC

[Java] Help needed for a project

Yo,

I'm Pierre, a French student currently working on a cross-runtime project,
and I would like to use the Apache Arrow Plasma store. The project would get
data in a Python runtime, store it in the Plasma store, and a job in a
Java runtime would then process the data asynchronously.

I'm having trouble with the Java part of this, whereas the Python
implementation is very well documented. I managed to read and write byte
arrays into the Plasma store from my Java runtime, but I could not quite
figure out how to process more complex objects. For instance, I'd like to
dump a Pandas dataframe into the Plasma store from the Python runtime, and
read it in Java. I struggle with the metadata of the object put in the
Plasma store. I tried setting up a VectorSchemaRoot with a predefined Schema
in Java, but could not figure out how to write it to the Plasma store, or
how to read from the Plasma store into a VectorSchemaRoot.

Would you be able to help me out with this? A small code sample of how it
should be used in Java would help a lot.

Best,
Pierre Averous

Re: [Java] Help needed for a project

Posted by Micah Kornfield <em...@gmail.com>.
Hi Pierre,
If you are sharing the schema out of band from the actual object storage,
you probably need to use VectorLoader/VectorUnloader [1] to get
an ArrowRecordBatch, and the corresponding methods on MessageSerializer [2]
to read/write bytes. A simpler approach (with larger objects) would be to use
ArrowStreamWriter/ArrowStreamReader [3], which are analogous to the stream
readers/writers in Python.
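
As a rough illustration of the simpler stream-based approach, here is a
minimal sketch that round-trips a VectorSchemaRoot through a plain byte
array. It assumes the arrow-vector and arrow-memory artifacts are on the
classpath; the class name, the helper names and the single int32 column
"value" are made up for the example. The Plasma calls are deliberately left
out: the byte[] is what you would put into and get from the store with the
byte-array code you already have. It also assumes the Python side writes the
table in the Arrow IPC stream format rather than with a plain put of the
dataframe.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.IntVector;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.ipc.ArrowStreamReader;
import org.apache.arrow.vector.ipc.ArrowStreamWriter;
import org.apache.arrow.vector.types.pojo.ArrowType;
import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.types.pojo.Schema;

// Hypothetical example class, not part of Arrow.
class ArrowStreamRoundTrip {

    // Column name chosen for this sketch only.
    static final String COLUMN = "value";

    // Serialize one record batch (schema included) to the IPC stream format.
    static byte[] writeInts(int[] values) throws Exception {
        Schema schema = new Schema(Collections.singletonList(
                Field.nullable(COLUMN, new ArrowType.Int(32, true))));
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE);
             VectorSchemaRoot root = VectorSchemaRoot.create(schema, allocator);
             // No dictionary-encoded columns here, so the provider is null.
             ArrowStreamWriter writer = new ArrowStreamWriter(root, null, out)) {
            IntVector vector = (IntVector) root.getVector(COLUMN);
            vector.allocateNew(values.length);
            for (int i = 0; i < values.length; i++) {
                vector.set(i, values[i]);
            }
            root.setRowCount(values.length);
            writer.start();      // writes the schema message
            writer.writeBatch(); // writes the current contents of root
            writer.end();        // writes the end-of-stream marker
        }
        // These bytes are what would be stored in Plasma.
        return out.toByteArray();
    }

    // Read every batch of an IPC stream back into Java values.
    static int[] readInts(byte[] bytes) throws Exception {
        try (BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE);
             ArrowStreamReader reader =
                     new ArrowStreamReader(new ByteArrayInputStream(bytes), allocator)) {
            // The reader owns this root and refills it on each loadNextBatch().
            VectorSchemaRoot root = reader.getVectorSchemaRoot();
            List<Integer> values = new ArrayList<>();
            while (reader.loadNextBatch()) {
                IntVector vector = (IntVector) root.getVector(COLUMN);
                for (int i = 0; i < root.getRowCount(); i++) {
                    values.add(vector.get(i));
                }
            }
            return values.stream().mapToInt(Integer::intValue).toArray();
        }
    }

    public static void main(String[] args) throws Exception {
        byte[] bytes = writeInts(new int[] {1, 2, 3});
        System.out.println(Arrays.toString(readInts(bytes)));
    }
}
```

Because the stream format carries the schema with the data, the Java side
does not need a predefined Schema to read it; the VectorLoader/VectorUnloader
route only becomes necessary when the schema is shared separately.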

Hope this helps.

-Micah

[1]
https://arrow.apache.org/docs/java/org/apache/arrow/vector/VectorUnloader.html
[2]
https://arrow.apache.org/docs/java/org/apache/arrow/vector/ipc/message/MessageSerializer.html#deserializeRecordBatch-org.apache.arrow.flatbuf.Message-io.netty.buffer.ArrowBuf-
[3]
https://arrow.apache.org/docs/java/org/apache/arrow/vector/ipc/ArrowStreamReader.html

