Posted to user@arrow.apache.org by Igor <ig...@upgini.com> on 2020/12/30 15:07:53 UTC

Apache Arrow Java

Hello Apache Arrow developers!

We are using the Apache Arrow library in Java and Python: arrow-vector and arrow-memory-unsafe in Java, and pyarrow in Python.

We are trying to implement an in-memory, zero-copy DataFrame, but we can't find an appropriate API in the Java libraries to get the memory address of our vectors so that Python can read them. I have found such an API in the pyarrow library, but not in the Java libraries.

What we need:
1) Create a vector in Java and collect data in memory, using Arrow as a memory-map API
2) Get its memory address or descriptor in Java
3) Pass it to the Python library pyarrow
4) Read the vector data

Our problem is with step 2.

Could you please tell us how we can do that? Thank you!


Best regards,
Eshtyganov Igor
https://www.upgini.com

Re: Apache Arrow Java

Posted by Chris Nuernberger <ch...@techascent.com>.
Igor,

I am not an Arrow developer, but to my knowledge the only Java pathway that
can use mmap is the one I wrote for Clojure:

https://techascent.com/blog/memory-mapping-arrow.html

The underlying library is tech.ml.dataset
<https://github.com/techascent/tech.ml.dataset>, and we also have generic
Python bindings <https://github.com/clj-python/libpython-clj>.

I do wonder what the pointer actually points at with pyarrow.  Columns
themselves may point to up to 3 buffers (data, validity, offsets) in the
case of text, and usually have 2 buffers, one for data and one for validity.
Potentially the pointer you get back is a pointer to the low-level record
batch, but this specifically cannot have a pointer to a dictionary.
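
To make the three-buffer point concrete, here is a minimal sketch in plain
Python (no pyarrow) of how a text column is reassembled from its validity,
offsets, and data buffers. It assumes 32-bit little-endian offsets and an
LSB-first validity bitmap; decode_utf8_column is a hypothetical helper name
used only for illustration, not part of any Arrow API.

```python
import struct


def decode_utf8_column(validity: bytes, offsets: bytes, data: bytes, length: int):
    """Rebuild a list of strings (or None) from three separate buffers."""
    # The offsets buffer holds length + 1 little-endian int32 values;
    # value i spans data[offs[i]:offs[i + 1]].
    offs = struct.unpack_from(f"<{length + 1}i", offsets)
    values = []
    for i in range(length):
        # Validity bitmap is LSB-first: bit i lives in byte i // 8 at position i % 8.
        is_valid = (validity[i // 8] >> (i % 8)) & 1
        if is_valid:
            values.append(data[offs[i]:offs[i + 1]].decode("utf-8"))
        else:
            values.append(None)
    return values


# Hand-built buffers for the column ["ab", None, "cde"].
validity = bytes([0b00000101])             # rows 0 and 2 are valid, row 1 is null
offsets = struct.pack("<4i", 0, 2, 2, 5)   # the null row has zero length
data = b"abcde"

column = decode_utf8_column(validity, offsets, data, 3)
print(column)
```

The point is that a reader needs all three addresses (plus the row count) to
interpret the column, so one bare pointer is not enough for text data.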

Just considering the actual Arrow file format, a single pointer cannot point
to both the schema information (which contains the dictionary) and the
record batch column data.

There isn't a single-column interchange format that I am aware of, aside
from potentially writing the streaming format with a single column.


Re: Apache Arrow Java

Posted by Micah Kornfield <em...@gmail.com>.
There are two approaches that might help:
1.  Use the JPype functionality in pyarrow [1][2].
2.  Obtain direct memory addresses from ArrowBuf objects [3].
Gandiva [4] uses this approach to pass the address to C++; the Python code
would potentially look similar.
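
A rough sketch of what the Python side of approach 2 might look like, using
only ctypes. It assumes the address and the size in bytes have been handed
over from Java (for example from ArrowBuf.memoryAddress()) and that Python
runs in the same process as the JVM, since a raw address is meaningless in
another process. Here a ctypes array stands in for the JVM-owned buffer, and
view_int32_values is a hypothetical helper name, not a pyarrow function.

```python
import ctypes


def view_int32_values(address: int, size_in_bytes: int):
    """Return a ctypes int32 array that aliases the memory at `address` (no copy)."""
    count = size_in_bytes // ctypes.sizeof(ctypes.c_int32)
    array_type = ctypes.c_int32 * count
    # from_address does not copy; the caller must keep the owner alive.
    return array_type.from_address(address)


# Stand-in for an off-heap buffer owned by Java (assumption for this sketch).
owner = (ctypes.c_int32 * 4)(10, 20, 30, 40)
address = ctypes.addressof(owner)
size_in_bytes = ctypes.sizeof(owner)

view = view_int32_values(address, size_in_bytes)
values = list(view)

# Writing through the owner is visible through the view, which shows
# that both refer to the same memory rather than to a copy.
owner[0] = 99
first_after_write = view[0]
print(values, first_after_write)
```

With pyarrow installed, pyarrow.foreign_buffer(address, size) should be able
to wrap the same address as an Arrow buffer instead of a ctypes array. In
either case the Java side has to keep the ArrowBuf alive for as long as
Python reads from it.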



[1] https://github.com/apache/arrow/blob/master/python/pyarrow/jvm.py
[2] https://uwekorn.com/2019/11/17/fast-jdbc-access-in-python-using-pyarrow-jvm.html
[3] https://github.com/apache/arrow/blob/f7d47a37f0418a5e615702dd974d4231184b4c70/java/memory/memory-core/src/main/java/org/apache/arrow/memory/ArrowBuf.java#L231
[4] https://github.com/apache/arrow/blob/master/java/gandiva/src/main/java/org/apache/arrow/gandiva/evaluator/Filter.java#L139

A side note: as far as I know, the Arrow Java library doesn't currently
support memory-mapped files.
