Posted to dev@arrow.apache.org by Chris Withers <ch...@withers.org> on 2018/04/12 07:22:02 UTC

new user question about cross-language use

Hi All,

Apologies if I'm on the wrong list or struggle to get my question
across. I'm very new to Arrow, so please point me to the best place if
there's somewhere better to ask these kinds of questions...

So, in my mind, Arrow provides a single in-memory model that supports
access from a bunch of different languages/environments (Pandas, Go,
C++, etc., from looking at https://github.com/apache/arrow). That gives
me hope that, as someone just starting out on a project to go from a
proprietary C++ trading framework's market data archive to Pandas
dataframes, Arrow would be a good way to go and, if things go through
Arrow in the middle, potentially a way for other environments (Go,
Julia?) to make use of the same thing.

That left me wondering, however: if I write a "to arrow" thing in
C++, how would a Go or Python user then wire things up to get access to
the Arrow data structures?
Somewhat important bonus point: how would that happen without memory
copies? (Datasets here are many GB in most cases.)

cheers,

Chris

Re: new user question about cross-language use

Posted by Wes McKinney <we...@gmail.com>.

hi Chris,

To add to Uwe's e-mail:

> In this case, the sharing is zero-serialization but not zero-copy.

This depends. If an implementation supports shared memory, then
zero-copy access is possible. So if you generated data in C++, you
could access it in another C++ program or Python program without
copying it into memory. So if you had a 50GB dataset in Arrow format
on disk, you could access any column, row, or single value without any
deserialization or copying if you are using the C++ libraries (or any
bindings thereof, like C, Python, Ruby, etc.)

Not all implementations support accessing Arrow via shared memory yet.
For example, Java does not yet.

- Wes

On Sun, Apr 15, 2018 at 5:05 AM, Uwe L. Korn <uw...@xhochy.com> wrote:
> Hello Chris,
>
> at the moment, we have focused on sharing Arrow structures via inter-process communication (IPC). In this case, the sharing is zero-serialization but not zero-copy. Given that we now have good integration tests for a good subset of all implementations, sharing memory between different implementations with no copy of the data is the next step.
>
> As each Arrow implementation has its own user-facing data structures on top of the same backing memory layout, we will have to write some APIs that can convert one interface to another. A very simple example that takes the Java Arrow structures and makes them available to Python is included in this PR (comment): https://github.com/apache/arrow/pull/1693
>
> Note that this is not needed for all languages. For example, the Python, Ruby and GLib implementations are all backed by the C++ implementation. Here you can simply extract the backing C++ object and use it in the other language. Thus a pyarrow.Array created in Python already contains a C++ arrow::Array object, which could then be used directly as a backing object for Ruby.
>
> Uwe

Re: new user question about cross-language use

Posted by "Uwe L. Korn" <uw...@xhochy.com>.

Hello Chris,

at the moment, we have focused on sharing Arrow structures via inter-process communication (IPC). In this case, the sharing is zero-serialization but not zero-copy. Given that we now have good integration tests for a good subset of all implementations, sharing memory between different implementations with no copy of the data is the next step.

As each Arrow implementation has its own user-facing data structures on top of the same backing memory layout, we will have to write some APIs that can convert one interface to another. A very simple example that takes the Java Arrow structures and makes them available to Python is included in this PR (comment): https://github.com/apache/arrow/pull/1693

Note that this is not needed for all languages. For example, the Python, Ruby and GLib implementations are all backed by the C++ implementation. Here you can simply extract the backing C++ object and use it in the other language. Thus a pyarrow.Array created in Python already contains a C++ arrow::Array object, which could then be used directly as a backing object for Ruby.

Uwe
