Posted to dev@arrow.apache.org by Antoine Pitrou <an...@python.org> on 2019/10/03 09:25:05 UTC

Re: [DISCUSS] raw pointers and FFI (C-level in-process array protocol)

Hi Jacques,

On 03/10/2019 at 02:46, Jacques Nadeau wrote:
> 
> I think it is reasonable to argue that keeping any ABI (or header/struct
> pattern) as narrow as possible would allow us to minimize overlap with the
> existing in-memory specification. In Arrow's case, this could be as simple
> as a single memory pointer for schema (backed by flatbuffers) and a single
> memory location for data (that references the record batch header, which in
> turn provides pointers into the actual arrow data). [...]
> 
> [...] (For example, in a JVM
> view of the world, working with a plain struct in java rather than a set of
> memory pointers against our existing IPC formats would be quite painful and
> we'd definitely need to create some glue code for users. I worry the same
> pattern would occur in many other languages.)

I'm trying to understand the point you're making.  Here you say that it
was difficult for the JVM to deal with raw pointers.  But above you seem
to argue for a flatbuffers-based serialization containing raw pointers.

Here's another way to frame the question: how do you propose to do
zero-copy between different languages if not by passing raw pointers to
the Arrow data?  And if passing raw pointers is acceptable, what is
wrong with the spec as proposed?


As for creating glue code: yes, of course, that would be needed in most
languages that want to provide this interface (including C++).  You do
need a C FFI for that.  I'm quite sure it would be possible to implement
this proposal in pure Python with ctypes / cffi, for example (as a toy
example, since PyArrow exists :-)).  When writing the spec, I also took
a look at the Go and Rust FFIs, and they seem good enough to interact
with it.  I tried to take a look at JNI, but of course I got lost in the
documentation :-)
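
To make the ctypes remark above concrete, here is a minimal sketch of how
a struct of raw pointers could be produced and consumed in pure Python.
The `ToyArray` layout and the `export_int32` / `import_int32` helpers are
assumptions invented for illustration; they are not the struct from the
proposal under discussion.

```python
import ctypes


# Hypothetical layout, invented for this sketch (not the proposed spec).
class ToyArray(ctypes.Structure):
    _fields_ = [
        ("length", ctypes.c_int64),                    # number of elements
        ("n_buffers", ctypes.c_int64),                 # entries in `buffers`
        ("buffers", ctypes.POINTER(ctypes.c_void_p)),  # raw buffer addresses
    ]


def export_int32(values):
    """Producer side: put values in C memory and describe them by pointer."""
    data = (ctypes.c_int32 * len(values))(*values)
    ptrs = (ctypes.c_void_p * 1)(ctypes.addressof(data))
    arr = ToyArray(
        len(values), 1, ctypes.cast(ptrs, ctypes.POINTER(ctypes.c_void_p))
    )
    # Also return the owning objects so the memory stays alive in this toy.
    return arr, (data, ptrs)


def import_int32(arr):
    """Consumer side: view the first buffer in place, without copying."""
    addr = arr.buffers[0]
    return (ctypes.c_int32 * arr.length).from_address(addr)


arr, owners = export_int32([1, 2, 3])
view = import_int32(arr)
owners[0][1] = 42          # a write through the producer's buffer...
print(list(view))          # ...is visible through the consumer's view
```

The consumer never copies the values: it only reinterprets the address it
was handed, which is why lifetime management (here, the `owners` tuple) has
to be agreed on by both sides.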

If you are worried that people will start thinking that this proposal is
part of the Arrow specification, perhaps we can make it clear that
exposing this interface is optional for implementations.

Regards

Antoine.

Re: [DISCUSS] raw pointers and FFI (C-level in-process array protocol)

Posted by Sutou Kouhei <ko...@clear-code.com>.

Hi,

I think that the FFI use case is misleading. Normally,
language bindings for this API alone are not useful for
processing Apache Arrow data, because such bindings can only
import/export Apache Arrow data. The target language may not
have a useful/fast API for processing the imported Apache
Arrow data. For example, Julia may be able to process
imported Apache Arrow data with its built-in features. Other
scripting languages may not, not even LuaJIT.

In-process use requires multiple languages in one process.
There are some approaches for this situation, and some of
them are actually used, but these approaches are minor. (I
think.)

I think that interacting with Apache Arrow ready libraries
is a useful use case for this API.

If SQLite used this API to return result sets in Apache
Arrow format, that would be useful. SQLite would not need an
additional dependency to support exporting in Apache Arrow
format. SQLite would return the schema through its existing
API, such as sqlite3_column_type(), and return the data with
this API. SQLite bindings could add an Apache Arrow data
export API easily because it's just a raw C API. (FFI may be
used to bind the Apache Arrow data export API.)

SQLite doesn't need to process Apache Arrow data. It just
exports Apache Arrow data. So this API is enough.

This API will be useful for libraries that want to support
just Apache Arrow data import/export.
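
The export-only idea above can be sketched as follows. This is only an
illustration under stated assumptions: Python's `sqlite3` module stands in
for SQLite itself, and the `ExportedColumn` struct and both helper
functions are hypothetical names, not part of SQLite or of the proposed
API. Null handling (a validity bitmap) is left out.

```python
import ctypes
import sqlite3


# Hypothetical export struct: a length plus one raw data pointer.
class ExportedColumn(ctypes.Structure):
    _fields_ = [("length", ctypes.c_int64), ("data", ctypes.c_void_p)]


def export_int_column(conn, query):
    """Producer: run a one-column integer query and expose it by pointer."""
    rows = conn.execute(query).fetchall()
    buf = (ctypes.c_int64 * len(rows))(*(row[0] for row in rows))
    col = ExportedColumn(len(rows), ctypes.addressof(buf))
    return col, buf  # `buf` is returned so the memory stays alive


def consume(col):
    """Consumer: view the exported values in place, without copying."""
    return (ctypes.c_int64 * col.length).from_address(col.data)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(10,), (20,), (30,)])
col, owner = export_int_column(conn, "SELECT x FROM t ORDER BY rowid")
print(list(consume(col)))  # [10, 20, 30]
```

The producer here never processes columnar data; it only fills a buffer
and hands out its address, which is the whole job of an export-only
library in this scenario.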

Thanks,
--
kou

In <CA...@mail.gmail.com>
  "Re: [DISCUSS] raw pointers and FFI (C-level in-process array protocol)" on Thu, 3 Oct 2019 11:17:29 -0500,
  Wes McKinney <we...@gmail.com> wrote:

> Related: Gandiva invented its own particular way of passing memory
> addresses through the JNI boundary rather than using Flatbuffers
> messages
> 
> https://github.com/apache/arrow/blob/master/cpp/src/gandiva/jni/jni_common.cc#L505
> 
> I'm all for language-agnostic in-memory data passing, but there is a
> use case for a C API to pass pointers at call sites while avoiding
> flattening (disassembly) and unflattening (reassembly) steps.

Re: [DISCUSS] raw pointers and FFI (C-level in-process array protocol)

Posted by Wes McKinney <we...@gmail.com>.

Related: Gandiva invented its own particular way of passing memory
addresses through the JNI boundary rather than using Flatbuffers
messages

https://github.com/apache/arrow/blob/master/cpp/src/gandiva/jni/jni_common.cc#L505

I'm all for language-agnostic in-memory data passing, but there is a
use case for a C API to pass pointers at call sites while avoiding
flattening (disassembly) and unflattening (reassembly) steps.
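
As a rough sketch of what passing pointers at a call site can look like
when the boundary only understands integers, the following example hands
over buffer addresses and sizes as plain numbers and rebuilds a view on
the other side. It is not Gandiva's actual code; `to_addresses` and
`view_int32` are hypothetical helpers written for this illustration.

```python
import ctypes


def to_addresses(buffers):
    """Caller side: describe each buffer as (address, size in bytes)."""
    addrs = [ctypes.addressof(b) for b in buffers]
    sizes = [ctypes.sizeof(b) for b in buffers]
    return addrs, sizes


def view_int32(addr, size):
    """Callee side: rebuild a zero-copy int32 view from two integers."""
    count = size // ctypes.sizeof(ctypes.c_int32)
    return (ctypes.c_int32 * count).from_address(addr)


validity = (ctypes.c_uint8 * 1)(0b00000111)  # toy validity bitmap
values = (ctypes.c_int32 * 3)(7, 8, 9)       # toy data buffer

addrs, sizes = to_addresses([validity, values])
ints = view_int32(addrs[1], sizes[1])
print(sizes, list(ints))  # [1, 12] [7, 8, 9]
```

Nothing is serialized or reassembled: only integers cross the boundary,
and the caller must keep `validity` and `values` alive while the callee
uses them.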
