Posted to dev@arrow.apache.org by Bryan Cutler <cu...@gmail.com> on 2016/11/08 23:58:54 UTC

Error Sending ArrowRecordBatch from Java to Python

Hi Devs,

I'm currently working on SPARK-13534 to use Arrow for Spark DataFrame
toPandas conversion, and I'm stuck on an invalid metadata size error when
trying to send a simple ArrowRecordBatch created in Java over a socket to
Python.  The strategy so far looks like this:

Java side:
- make a simple ArrowRecordBatch (1 field of Ints)
- create an ArrowWriter using a ByteArrayOutputStream
- call writer.writeRecordBatch() and writer.close() to write to a ByteArray
- send the ByteArray (framed with size) over a socket (rough sketch below)
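
Condensed, the Java side looks roughly like this (the ArrowWriter
constructor and import paths are from memory of the current Java API, so
treat them as approximate; `schema`, `batch`, and `socket` are assumed to
already be set up):

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.nio.channels.Channels;

ByteArrayOutputStream out = new ByteArrayOutputStream();
// ArrowWriter writes the file format: magic, record batches, footer, magic
ArrowWriter writer = new ArrowWriter(Channels.newChannel(out), schema);
writer.writeRecordBatch(batch);  // the single batch of Ints
writer.close();                  // writes the footer and trailing magic

byte[] bytes = out.toByteArray();
DataOutputStream sockOut = new DataOutputStream(socket.getOutputStream());
sockOut.writeInt(bytes.length);  // frame with a 4-byte big-endian size
sockOut.write(bytes);
sockOut.flush();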

Python side:
- read the ByteArray over the socket
- create an ArrowFileReader with the read bytes
- call reader.get_record_batch(0) to convert the bytes to a RecordBatch (sketch below)
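
And the Python side, roughly (I'm assuming the ArrowFileReader import
path here, so adjust to wherever it lives in your build; `sock` is the
connected socket, and the framing matches the big-endian int that
DataOutputStream writes on the Java side):

import struct
from pyarrow import ipc  # assumed location of ArrowFileReader

def read_framed(sock):
    # read the 4-byte big-endian size prefix, then the full payload
    (size,) = struct.unpack('!i', sock.recv(4))
    buf = b''
    while len(buf) < size:
        buf += sock.recv(size - len(buf))
    return buf

data = read_framed(sock)
reader = ipc.ArrowFileReader(data)  # wraps the file-format bytes
batch = reader.get_record_batch(0)  # this is where the error is raised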

This results in "pyarrow.error.ArrowException: Invalid: metadata size
invalid" and debugging shows the ArrowFileReader getting a metadata size of
269422093.  This is obviously way off, but it does seem to read some things
correctly like number of batches and offset.  Here is some debug output

ArrowWriter.java log output
16/11/07 12:04:18 DEBUG ArrowWriter: magic written, now at 6
16/11/07 12:04:18 DEBUG ArrowWriter: RecordBatch at 8, metadata: 104, body: 24
16/11/07 12:04:18 DEBUG ArrowWriter: Footer starts at 136, length: 224
16/11/07 12:04:18 DEBUG ArrowWriter: magic written, now at 370

Arrow-cpp printouts
read length 370
num batches 1
metadata size 269422093
offset 136

From what I can tell by looking through the code, Java uses Flatbuffers
to write the metadata, but I don't see the Cython side using Flatbuffers
to read the metadata back.
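
If it helps narrow things down, here is a quick sanity check I can run
against the raw bytes on the Python side, based on the layout the writer
log implies (footer at 136 for 224 bytes, then a 4-byte footer length,
then the 6-byte magic, ending at 370; I'm assuming the length is written
little-endian):

import struct

# `data` is the full 370 bytes read from the socket
assert data[-6:] == b'ARROW1'                      # trailing magic
(footer_len,) = struct.unpack('<i', data[-10:-6])  # expect 224
print('footer length:', footer_len)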

Should this work with the classes I'm using on both sides, or am I way off
with the above strategy?  I made a simplified example that mimics the Spark
integration and reproduces the error; here is the gist:
https://gist.github.com/BryanCutler/930ecd1de1c6d9484931505dcb6bb321

Sorry for the barrage of info, but any help would be much appreciated!

Thanks,
Bryan

Re: Error Sending ArrowRecordBatch from Java to Python

Posted by Bryan Cutler <cu...@gmail.com>.
Thanks Wes!  I'd be happy to help out with the effort - I'll look into the
refs you mentioned.  I updated SPARK-13534 with what we have so far, but
not much has been done with the internal format conversion yet.

Bryan


Re: Error Sending ArrowRecordBatch from Java to Python

Posted by Wes McKinney <we...@gmail.com>.
Until we have integration tests proving otherwise, you can assume that
the IPC wire/file formats are currently incompatible (not on purpose).
Julien and I are both working on corresponding efforts in Java and C++
this week (e.g. see https://github.com/apache/arrow/pull/201 and
ARROW-373), but it's going to be a little while until they're completed.
We would certainly welcome help (including code review). This is a
necessary step toward moving forward with any hybrid Java/C++ Arrow
applications.

Also, can you please update SPARK-13534 when you have a WIP branch, or
otherwise note there that you've started work on the project (the issue
isn't assigned to anyone)? I have also been looking into this with my
colleagues at Two Sigma, and I want to make sure we don't duplicate
efforts. The general approach you've described makes sense to me.

Thanks
Wes
