Posted to dev@arrow.apache.org by Antoine Pitrou <an...@python.org> on 2019/10/01 09:19:17 UTC

Re: [DISCUSS] C-level in-process array protocol

On 01/10/2019 at 00:39, Wes McKinney wrote:
> A couple things:
> 
> * I think a C protocol / FFI for Arrow array/vectors would be better
> to have the same "shape" as an assembled array. Note that the C
> structs here have very nearly the same "shape" as the data structure
> representing a C++ Array object [1]. The disassembly and reassembly
> here is substantially simpler than the IPC protocol. A recursive
> structure in Flatbuffers would make RecordBatch messages much larger,
> so the flattened / disassembled representation we use for serialized
> record batches is the correct one

I'm not sure I agree:

- indeed, it's not a coincidence that the ArrowArray struct looks quite
closely like the C++ ArrayData object :-)  We have good experience with
that abstraction and it has proven to work quite well

- the IPC format is meant for serialization while the C data protocol is
meant for in-memory communication, so different concerns apply

- the fact that this makes the layout slightly larger doesn't seem
important at all; we're not talking about transferring data over the wire

There's also another argument for having a recursive struct: it
simplifies how the data type is represented, since we can encode each
child type individually instead of encoding it in the parent's format
string (same applies for metadata and individual flags).
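
A minimal sketch of the recursive-struct idea, written with Python's
ctypes rather than C. ToyArray, make_node, describe and the format
names ("struct", "int32", "utf8") are all hypothetical and chosen for
illustration; they are not the struct or the format strings from the
proposal. The point is only that each node carries its own type, so the
parent's format never has to encode the types of its children.

```python
import ctypes


class ToyArray(ctypes.Structure):
    """Hypothetical recursive array descriptor (illustration only)."""


# Fields are assigned after the class exists so the struct can refer to itself.
ToyArray._fields_ = [
    ("format", ctypes.c_char_p),  # type of this node only
    ("length", ctypes.c_int64),
    ("n_children", ctypes.c_int64),
    ("children", ctypes.POINTER(ctypes.POINTER(ToyArray))),
]

# Keeps child pointer arrays alive for as long as the module is loaded.
_KEEPALIVE = []


def make_node(fmt, length, children=()):
    """Build one node; children are other ToyArray instances."""
    node = ToyArray()
    node.format = fmt
    node.length = length
    node.n_children = len(children)
    if children:
        ptr_array = (ctypes.POINTER(ToyArray) * len(children))(
            *[ctypes.pointer(child) for child in children]
        )
        _KEEPALIVE.append((ptr_array, list(children)))
        node.children = ctypes.cast(
            ptr_array, ctypes.POINTER(ctypes.POINTER(ToyArray))
        )
    return node


def describe(node):
    """Walk the tree and return (format, [child descriptions])."""
    kids = [describe(node.children[i].contents) for i in range(node.n_children)]
    return (node.format.decode(), kids)


ints = make_node(b"int32", 3)
strs = make_node(b"utf8", 3)
parent = make_node(b"struct", 3, [ints, strs])
print(describe(parent))
```

A consumer that only understands "int32" can still walk the tree and
skip the children it does not recognize, since no parsing of a combined
parent string is involved.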

> * The "formal" C protocol having the "assembled" shape means that many
> minimal Arrow users won't have to implement any separate data
> structures. They can just use the C struct directly or a slightly
> wrapped version thereof with some convenience functions.

Yes, but the same applies to the current proposal.

> * I think that requiring building a Flatbuffer for minimal use cases
> (e.g. communicating simple record batches with primitive types) passes
> on implementation burden to minimal users.

It certainly does.

> I think the mantra of the C protocol should be the following:
> 
> * Users of the protocol have to write little to no code to use it. For
> example, populating an INT32 array should require only a few lines of
> code

Agreed.  As a sidenote, the spec should have an example of doing this in
raw C.

Regards

Antoine.

Re: [DISCUSS] raw pointers and FFI (C-level in-process array protocol)

Posted by Sutou Kouhei <ko...@clear-code.com>.

Hi,

I think that describing this as an FFI use case is misleading.
Normally, language bindings for this API are not useful for
processing Apache Arrow data, because bindings for this API can
only import/export Apache Arrow data. The target language may not
have a useful/fast API for processing the imported Apache Arrow
data. For example, Julia may be able to process imported Apache
Arrow data with its built-in features. Other scripting
languages may not, even LuaJIT.

In-process use requires multiple languages in one process.
There are some approaches for this situation. Some of them
are actually used, but they are not common. (I think.)

I think that interacting with an Apache Arrow ready library is a
useful use case of this API.

If SQLite used this API to return result sets in Apache Arrow
format, it would be useful. SQLite wouldn't need an additional
dependency to add support for exporting in Apache Arrow
format. SQLite would return the schema through its existing API,
such as sqlite3_column_type(), and return the data with this API.
SQLite bindings could add an Apache Arrow data export API easily
because it's just a raw C API. (FFI may be used to bind the
Apache Arrow data export API.)

SQLite doesn't need to process Apache Arrow data. It just
exports Apache Arrow data. So this API is enough.

This API will be useful for libraries that want to support
just Apache Arrow data import/export.

Thanks,
--
kou

In <CA...@mail.gmail.com>
  "Re: [DISCUSS] raw pointers and FFI (C-level in-process array protocol)" on Thu, 3 Oct 2019 11:17:29 -0500,
  Wes McKinney <we...@gmail.com> wrote:

> Related: Gandiva invented its own particular way of passing memory
> addresses through the JNI boundary rather than using Flatbuffers
> messages
> 
> https://github.com/apache/arrow/blob/master/cpp/src/gandiva/jni/jni_common.cc#L505
> 
> I'm all for language-agnostic in-memory data passing, but there is a
> use case for a C API to pass pointers at call sites while avoiding
> flattening (disassembly) and unflattening (reassembly) steps.
> 
> On Thu, Oct 3, 2019 at 4:34 AM Antoine Pitrou <an...@python.org> wrote:
>>
>>
>> Hi Jacques,
>>
>> On 03/10/2019 at 02:46, Jacques Nadeau wrote:
>> >
>> > I think it is reasonable to argue that keeping any ABI (or header/struct
>> > pattern) as narrow as possible would allow us to minimize overlap with the
>> > existing in-memory specification. In Arrow's case, this could be as simple
>> > as a single memory pointer for schema (backed by flatbuffers) and a single
>> > memory location for data (that references the record batch header, which in
>> > turn provides pointers into the actual arrow data). [...]
>> >
>> > [...] (For example, in a JVM
>> > view of the world, working with a plain struct in java rather than a set of
>> > memory pointers against our existing IPC formats would be quite painful and
>> > we'd definitely need to create some glue code for users. I worry the same
>> > pattern would occur in many other languages.)
>>
>> I'm trying to understand the point you're making.  Here you say that it
>> was difficult for the JVM to deal with raw pointers.  But above you seem
>> to argue for a flatbuffers-based serialization containing raw pointers.
>>
>> Here's another way to frame the question: how do you propose to do
>> zero-copy between different languages if not by passing raw pointers to
>> the Arrow data?  And if passing raw pointers is acceptable, what is
>> wrong with the spec as proposed?
>>
>>
>> As for creating glue code: yes, of course, that would be needed in most
>> languages that want to provide this interface (including C++).  You do
>> need a C FFI for that.  I'm quite sure it would be possible to implement
>> this proposal in pure Python with ctypes / cffi, for example (as a toy
>> example, since PyArrow exists :-)).  When writing the spec, I also took
>> a look at the Go and Rust FFIs, and they seem good enough to interact
>> with it.  I tried to take a look at JNI, but of course I got lost in the
>> documentation :-)
>>
>> If you are worried that people start thinking that this proposal is part
>> of the Arrow specification, perhaps we can make it clear that exposing
>> this interface is optional for implementations.
>>
>> Regards
>>
>> Antoine.

Re: [DISCUSS] raw pointers and FFI (C-level in-process array protocol)

Posted by Wes McKinney <we...@gmail.com>.

Related: Gandiva invented its own particular way of passing memory
addresses through the JNI boundary rather than using Flatbuffers
messages

https://github.com/apache/arrow/blob/master/cpp/src/gandiva/jni/jni_common.cc#L505

I'm all for language-agnostic in-memory data passing, but there is a
use case for a C API to pass pointers at call sites while avoiding
flattening (disassembly) and unflattening (reassembly) steps.

On Thu, Oct 3, 2019 at 4:34 AM Antoine Pitrou <an...@python.org> wrote:
>
>
> Hi Jacques,
>
> On 03/10/2019 at 02:46, Jacques Nadeau wrote:
> >
> > I think it is reasonable to argue that keeping any ABI (or header/struct
> > pattern) as narrow as possible would allow us to minimize overlap with the
> > existing in-memory specification. In Arrow's case, this could be as simple
> > as a single memory pointer for schema (backed by flatbuffers) and a single
> > memory location for data (that references the record batch header, which in
> > turn provides pointers into the actual arrow data). [...]
> >
> > [...] (For example, in a JVM
> > view of the world, working with a plain struct in java rather than a set of
> > memory pointers against our existing IPC formats would be quite painful and
> > we'd definitely need to create some glue code for users. I worry the same
> > pattern would occur in many other languages.)
>
> I'm trying to understand the point you're making.  Here you say that it
> was difficult for the JVM to deal with raw pointers.  But above you seem
> to argue for a flatbuffers-based serialization containing raw pointers.
>
> Here's another way to frame the question: how do you propose to do
> zero-copy between different languages if not by passing raw pointers to
> the Arrow data?  And if passing raw pointers is acceptable, what is
> wrong with the spec as proposed?
>
>
> As for creating glue code: yes, of course, that would be needed in most
> languages that want to provide this interface (including C++).  You do
> need a C FFI for that.  I'm quite sure it would be possible to implement
> this proposal in pure Python with ctypes / cffi, for example (as a toy
> example, since PyArrow exists :-)).  When writing the spec, I also took
> a look at the Go and Rust FFIs, and they seem good enough to interact
> with it.  I tried to take a look at JNI, but of course I got lost in the
> documentation :-)
>
> If you are worried that people start thinking that this proposal is part
> of the Arrow specification, perhaps we can make it clear that exposing
> this interface is optional for implementations.
>
> Regards
>
> Antoine.

Re: [DISCUSS] raw pointers and FFI (C-level in-process array protocol)

Posted by Antoine Pitrou <an...@python.org>.

Hi Jacques,

On 03/10/2019 at 02:46, Jacques Nadeau wrote:
> 
> I think it is reasonable to argue that keeping any ABI (or header/struct
> pattern) as narrow as possible would allow us to minimize overlap with the
> existing in-memory specification. In Arrow's case, this could be as simple
> as a single memory pointer for schema (backed by flatbuffers) and a single
> memory location for data (that references the record batch header, which in
> turn provides pointers into the actual arrow data). [...]
> 
> [...] (For example, in a JVM
> view of the world, working with a plain struct in java rather than a set of
> memory pointers against our existing IPC formats would be quite painful and
> we'd definitely need to create some glue code for users. I worry the same
> pattern would occur in many other languages.)

I'm trying to understand the point you're making.  Here you say that it
was difficult for the JVM to deal with raw pointers.  But above you seem
to argue for a flatbuffers-based serialization containing raw pointers.

Here's another way to frame the question: how do you propose to do
zero-copy between different languages if not by passing raw pointers to
the Arrow data?  And if passing raw pointers is acceptable, what is
wrong with the spec as proposed?


As for creating glue code: yes, of course, that would be needed in most
languages that want to provide this interface (including C++).  You do
need a C FFI for that.  I'm quite sure it would be possible to implement
this proposal in pure Python with ctypes / cffi, for example (as a toy
example, since PyArrow exists :-)).  When writing the spec, I also took
a look at the Go and Rust FFIs, and they seem good enough to interact
with it.  I tried to take a look at JNI, but of course I got lost in the
documentation :-)
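
To make the ctypes remark concrete, here is a toy sketch of passing a
raw pointer between a "producer" and a "consumer" without copying.
ToyInt32Array, export_int32 and import_int32 are hypothetical names
used only for this illustration; the struct is deliberately much
smaller than the one in the proposal and has no release callback, so
lifetime is handled by simply keeping the owning buffer alive.

```python
import ctypes


class ToyInt32Array(ctypes.Structure):
    """Hypothetical flat descriptor: a length plus a raw data pointer."""

    _fields_ = [
        ("length", ctypes.c_int64),
        ("data", ctypes.c_void_p),  # raw address of the int32 values
    ]


def export_int32(values):
    """Producer side: allocate a buffer and describe it.

    Returns (descriptor, owner). The caller must keep `owner` alive for
    as long as the descriptor is in use.
    """
    owner = (ctypes.c_int32 * len(values))(*values)
    desc = ToyInt32Array(length=len(values), data=ctypes.addressof(owner))
    return desc, owner


def import_int32(desc):
    """Consumer side: view the same memory through the pointer, no copy."""
    ptr = ctypes.cast(desc.data, ctypes.POINTER(ctypes.c_int32))
    return ptr, desc.length


desc, owner = export_int32([1, 2, 3])
ptr, n = import_int32(desc)
print([ptr[i] for i in range(n)])

# Writing through the producer's buffer is visible to the consumer,
# which shows that both sides share the same memory.
owner[0] = 99
print(ptr[0])
```

Populating the int32 array takes only a few lines on the producer side,
and the consumer needs nothing beyond the struct definition.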

If you are worried that people start thinking that this proposal is part
of the Arrow specification, perhaps we can make it clear that exposing
this interface is optional for implementations.

Regards

Antoine.

Re: [DISCUSS] C-level in-process array protocol

Posted by Wes McKinney <we...@gmail.com>.

On Wed, Oct 2, 2019 at 10:19 PM Wes McKinney <we...@gmail.com> wrote:
>
> On Wed, Oct 2, 2019 at 7:46 PM Jacques Nadeau <ja...@apache.org> wrote:
> >
> > I'd like to hear more opinions from others on this topic. This conversation
> > seems mostly dominated by comments from myself, Wes and Antoine.
> >
> > I think it is reasonable to argue that keeping any ABI (or header/struct
> > pattern) as narrow as possible would allow us to minimize overlap with the
> > existing in-memory specification. In Arrow's case, this could be as simple
> > as a single memory pointer for schema (backed by flatbuffers) and a single
> > memory location for data (that references the record batch header, which in
> > turn provides pointers into the actual arrow data). Extensions would need
> > to be added for reference management as done here but I continue to think
> > we should defer discussion of that until the base data structures are
> > resolved. I see the comments here as arguing for a much broader ABI, in
> > part to support having people build "Arrow" components that interconnect
> > using this new interface. I understand the desire to expand the ABI to be
> > driven by needs to reduce dependencies and ease usability.
> >
> > The representation within the related patch is being presented as a way for
> > applications to share Arrow data but is not easily accessible to all
> > languages. I want to avoid a situation where someone says "I produced an
> > Arrow API" when what they've really done is created a C interface which
> > only a small subset of languages can actually leverage. For example, every
> > language now knows how to parse the existing schema definition as rendered
> > in flatbuf. In order to interact with something that implements this new
> > pattern one would also be required to implement completely new schema
> > consumption code. In the proposal itself it suggests this (for example
> > enhancing the C++ library to consume structures produced this way).
>
> I think we are creating a C-based in-memory representation of Arrow
> (significantly simpler than what we have in C++, which involves smart
> pointers and other C++ concepts) and how people use these structs is
> up to them.
>
> > As I said, I really want to hear more opinions. Running this past various
> > developers I know, many have echoed my concerns but that really doesn't
> > matter (and who knows how much of that is colored by my presentation of the
> > issue). What do people here think? If someone builds an "Arrow" library
> > that implements this set of structures, how does one use it in Node? In
> > Java? Does it drive creation of a secondary set of interfaces in each of
> > those languages to work with this kind of pattern? (For example, in a JVM
> > view of the world, working with a plain struct in java rather than a set of
> > memory pointers against our existing IPC formats would be quite painful and
> > we'd definitely need to create some glue code for users. I worry the same
> > pattern would occur in many other languages.)
> >
>
> I'm fine to wait for more opinions, but I don't think that creating a
> strict C programming interface means that all languages have to figure
> out how to do FFI with it.
>
> > To respond directly to some of Wes's most recent comments from the email
> > below. I struggle to map your description of the situation to the rest of
> > the thread and the proposed patch.  For example, you say that a non-goal is
> > "creating a new canonical way to serialize metadata" but the patch
> > proposes a concrete string based encoding system to describe data types.
> > Aren't those things in conflict?
> >
>
> Each language implementation represents in-memory schemas in a
> different way. In C++ we have the arrow::DataType classes. If the goal
> is to create a very compact C-based data model for Arrow, why is using
> a string representation of types instead of a more verbose object
> model inappropriate?
>

FWIW, the string-style representation of types is widespread. It's
used in the so-called C "buffer protocol" in Python and the "struct"
standard library module

https://docs.python.org/3/library/struct.html#module-struct
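
For readers unfamiliar with that style, a short illustration using only
the Python standard library. The format string "<3i" means three
little-endian 32-bit signed integers; the memoryview line shows the
buffer protocol exposing a similar one-character type code. This only
illustrates the Python conventions mentioned above, not the format
strings defined in the patch.

```python
import array
import struct

# "<3i": little-endian, three 32-bit signed integers.
packed = struct.pack("<3i", 1, 2, 3)
size = struct.calcsize("<3i")
values = struct.unpack("<3i", packed)

# The buffer protocol reports a format string for typed memory as well.
view = memoryview(array.array("i", [1, 2, 3]))

print(size, values, view.format)
```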

> > I'll also think more on this and challenge my own perspective. This isn't
> > where my focus is so my comments aren't as developed/thoughtful as I'd like.
> >
> >
> > On Tue, Oct 1, 2019 at 7:33 PM Wes McKinney <we...@gmail.com> wrote:
> >
> > > hi Jacques,
> > >
> > > I think we've veered off course a bit and maybe we could reframe the
> > > discussion.
> > >
> > > Goals
> > > * A "drop-in" header-only C file that projects can use as a
> > > programming interface either internally only or to expose in-memory
> > > data structures between C functions at call sites. Ideally little to
> > > no disassembly/reassembly should be required on either "side" of the
> > > call site.
> > > * Simplifying adoption of Arrow for C programmers, or languages based
> > > around C FFI
> > >
> > > Non-goals
> > > * Expanding the columnar format or creating an alternative canonical
> > > in-memory representation
> > > * Creating a new canonical way to serialize metadata
> > >
> > > Note that this use case has been on my mind for more than 2 years:
> > > https://issues.apache.org/jira/browse/ARROW-1058
> > >
> > > I think there are a couple of potentially misleading things at play here
> > >
> > > 1. The use of the word "protocol". In C, a struct has a well-defined
> > > binary layout, so a C API is also an ABI. Using C structs to
> > > communicate data can be considered to be a protocol, but it means
> > > something different in the context of the "Arrow protocol". I think we
> > > need to call this a "C API"
> > >
> > > 2. The documentation for this in Antoine's PR is in the format/
> > > directory. It would probably be better to have a "C API" section in
> > > the documentation.
> > >
> > > The header file under discussion and the documentation about it is
> > > best considered as a "library".
> > >
> > > It might be useful at some point to create a C99 implementation of the
> > > IPC protocol as well using FlatCC with the goal of having a complete
> > > implementation of the columnar format in C with minimal binary
> > > footprint. This is analogous to the NanoPB project which is an
> > > implementation of Protocol Buffers with small code size
> > >
> > > https://github.com/nanopb/nanopb
> > >
> > > Let me know if this makes more sense.
> > >
> > > I think it's important to communicate clearly about this primarily for
> > > the benefit of the outside world which can confuse easily as we have
> > > observed over the last few years =)
> > >
> > > Wes
> > >
> > > On Tue, Oct 1, 2019 at 2:55 PM Jacques Nadeau <ja...@apache.org> wrote:
> > > >
> > > > I disagree with this statement:
> > > >
> > > > - the IPC format is meant for serialization while the C data protocol is
> > > > meant for in-memory communication, so different concerns apply
> > > >
> > > > If that is how a particular implementation presents it, that is a
> > > > weakness of the implementation, not the format. The primary use case I
> > > > was focused on when working on the initial format was communication within
> > > > the same process. It seems like this is being used as a basis for the
> > > > introduction of new things when the premise is inconsistent with the
> > > > intention of the creation. The specific reason we used flatbuffers in the
> > > > project was to collapse the separation of in-process and out-of-process
> > > > communication. It means the same thing it does with the Arrow data itself:
> > > > that a consumer doesn't have to use a particular library to interact with
> > > > and use the data.
> > > >
> > > > It seems like there are two ideas here:
> > > >
> > > > 1) How do we make it easier for people to use Arrow?
> > > > 2) Should we implement a new in-memory representation of Arrow that is
> > > > language specific?
> > > >
> > > > I'm entirely in support of number one. If for a particular type of domain,
> > > > people want an easier way to interact with Arrow, let's make a new library
> > > > that helps with that. In each of our current libraries, we do many things
> > > > to make it easier to work with Arrow. None of those require a change to the
> > > > core format or are formalized as a new in-memory standard. The in-memory
> > > > representations of Rust, JavaScript or Java objects are implementation
> > > > details.
> > > >
> > > > I'm against number two as it creates a fragmentation problem. Arrow is
> > > > about having a single canonical format for memory for both metadata and
> > > > data. Having multiple in-memory formats (especially when some are not
> > > > language independent) is counter to the goals of the project.
> > >
> > > I don't think anyone is proposing anything that would cause fragmentation.
> > >
> > > A central question is whether it is useful to define a reusable C ABI
> > > for the Arrow columnar format, and if there is sufficient interest, a
> > > tiny C implementation of the IPC protocol (which uses the Flatbuffers
> > > message) that assembles and disassembles the data structures defined
> > > in the C ABI.
> > >
> > > We could separately create a tiny implementation of the Arrow IPC
> > > protocol using FlatCC that could be dropped into applications
> > > requiring only a C compiler and nothing else.
> > >
> > >
> > > >
> > > > Two other, separate comments:
> > > > 1) I don't understand the idea that we need to change the way Arrow
> > > > fundamentally works so that people can avoid using a dependency. If the
> > > > dependency is small, open source and easy to build, people can fork it and
> > > > include it directly if they want to. Let's not violate project principles
> > > > because DuckDB has a religious perspective on dependencies. If the problem
> > > > is people have to swallow too large of a pill to do basic things with Arrow
> > > > in C, let's focus on fixing that (to our definition of ease, not someone
> > > > else's). If FlatCC solves some of those things, great. If we need to build a
> > > > baby integration library that is more C centric, great. Neither of those
> > > > things require implementing something at the format level.
> > > >
> > > > 2) It seems like we should discuss the data structure problem separately
> > > > from the reference management concern.
> > > >
> > > >
> > > > On Tue, Oct 1, 2019 at 5:42 AM Wes McKinney <we...@gmail.com> wrote:
> > > >
> > > > > hi Antoine,
> > > > >
> > > > > > On Tue, Oct 1, 2019 at 4:29 AM Antoine Pitrou <an...@python.org> wrote:
> > > > > >
> > > > > >
> > > > > > > On 01/10/2019 at 00:39, Wes McKinney wrote:
> > > > > > > > A couple things:
> > > > > > > >
> > > > > > > > * I think a C protocol / FFI for Arrow array/vectors would be better
> > > > > > > > to have the same "shape" as an assembled array. Note that the C
> > > > > > > > structs here have very nearly the same "shape" as the data structure
> > > > > > > > representing a C++ Array object [1]. The disassembly and reassembly
> > > > > > > > here is substantially simpler than the IPC protocol. A recursive
> > > > > > > > structure in Flatbuffers would make RecordBatch messages much larger,
> > > > > > > > so the flattened / disassembled representation we use for serialized
> > > > > > > > record batches is the correct one
> > > > > > >
> > > > > > > I'm not sure I agree:
> > > > > > >
> > > > > > > - indeed, it's not a coincidence that the ArrowArray struct looks quite
> > > > > > > closely like the C++ ArrayData object :-)  We have good experience with
> > > > > > > that abstraction and it has proven to work quite well
> > > > > > >
> > > > > > > - the IPC format is meant for serialization while the C data protocol is
> > > > > > > meant for in-memory communication, so different concerns apply
> > > > > > >
> > > > > > > - the fact that this makes the layout slightly larger doesn't seem
> > > > > > > important at all; we're not talking about transferring data over the wire
> > > > > > >
> > > > > > > There's also another argument for having a recursive struct: it
> > > > > > > simplifies how the data type is represented, since we can encode each
> > > > > > > child type individually instead of encoding it in the parent's format
> > > > > > > string (same applies for metadata and individual flags).
> > > > > >
> > > > >
> > > > > I was saying something different here. I was making an argument about
> > > > > why we use the flattened array-of-structs in the IPC protocol. One
> > > > > reason is that it's a more compact representation. That is not very
> > > > > important here because this protocol is only for *in-process* (for
> > > > > languages that have a C FFI facility) rather than *inter-process*
> > > > > communication.
> > > > >
> > > > > I agree also that the type encoding is simple, here, too, since we
> > > > > aren't having to split the schema and record batch between different
> > > > > serialized messages. There is some potential waste with having to
> > > > > populate the type fields multiple times when communicating a sequence
> > > > > of "chunks" from the same logical dataset.
> > > > >
> > > > > > > > * The "formal" C protocol having the "assembled" shape means that many
> > > > > > > > minimal Arrow users won't have to implement any separate data
> > > > > > > > structures. They can just use the C struct directly or a slightly
> > > > > > > > wrapped version thereof with some convenience functions.
> > > > > > >
> > > > > > > Yes, but the same applies to the current proposal.
> > > > > > >
> > > > > > > > * I think that requiring building a Flatbuffer for minimal use cases
> > > > > > > > (e.g. communicating simple record batches with primitive types) passes
> > > > > > > > on implementation burden to minimal users.
> > > > > > >
> > > > > > > It certainly does.
> > > > > > >
> > > > > > > > I think the mantra of the C protocol should be the following:
> > > > > > > >
> > > > > > > > * Users of the protocol have to write little to no code to use it. For
> > > > > > > > example, populating an INT32 array should require only a few lines of
> > > > > > > > code
> > > > > > >
> > > > > > > Agreed.  As a sidenote, the spec should have an example of doing this in
> > > > > > > raw C.
> > > > > >
> > > > > > Regards
> > > > > >
> > > > > > Antoine.
> > > > >
> > >

Re: [DISCUSS] C-level in-process array protocol

Posted by Wes McKinney <we...@gmail.com>.

On Wed, Oct 2, 2019 at 7:46 PM Jacques Nadeau <ja...@apache.org> wrote:
>
> I'd like to hear more opinions from others on this topic. This conversation
> seems mostly dominated by comments from myself, Wes and Antoine.
>
> I think it is reasonable to argue that keeping any ABI (or header/struct
> pattern) as narrow as possible would allow us to minimize overlap with the
> existing in-memory specification. In Arrow's case, this could be as simple
> as a single memory pointer for schema (backed by flatbuffers) and a single
> memory location for data (that references the record batch header, which in
> turn provides pointers into the actual arrow data). Extensions would need
> to be added for reference management as done here but I continue to think
> we should defer discussion of that until the base data structures are
> resolved. I see the comments here as arguing for a much broader ABI, in
> part to support having people build "Arrow" components that interconnect
> using this new interface. I understand the desire to expand the ABI to be
> driven by needs to reduce dependencies and ease usability.
>
> The representation within the related patch is being presented as a way for
> applications to share Arrow data but is not easily accessible to all
> languages. I want to avoid a situation where someone says "I produced an
> Arrow API" when what they've really done is created a C interface which
> only a small subset of languages can actually leverage. For example, every
> language now knows how to parse the existing schema definition as rendered
> in flatbuf. In order to interact with something that implements this new
> pattern one would also be required to implement completely new schema
> consumption code. In the proposal itself it suggests this (for example
> enhancing the C++ library to consume structures produced this way).

I think we are creating a C-based in-memory representation of Arrow
(significantly simpler than what we have in C++, which involves smart
pointers and other C++ concepts) and how people use these structs is
up to them.

> As I said, I really want to hear more opinions. Running this past various
> developers I know, many have echoed my concerns but that really doesn't
> matter (and who knows how much of that is colored by my presentation of the
> issue). What do people here think? If someone builds an "Arrow" library
> that implements this set of structures, how does one use it in Node? In
> Java? Does it drive creation of a secondary set of interfaces in each of
> those languages to work with this kind of pattern? (For example, in a JVM
> view of the world, working with a plain struct in java rather than a set of
> memory pointers against our existing IPC formats would be quite painful and
> we'd definitely need to create some glue code for users. I worry the same
> pattern would occur in many other languages.)
>

I'm fine to wait for more opinions, but I don't think that creating a
strict C programming interface means that all languages have to figure
out how to do FFI with it.

> To respond directly to some of Wes's most recent comments from the email
> below. I struggle to map your description of the situation to the rest of
> the thread and the proposed patch.  For example, you say that a non-goal is
> "creating a new canonical way to serialize metadata" but the patch
> proposes a concrete string based encoding system to describe data types.
> Aren't those things in conflict?
>

Each language implementation represents in-memory schemas in a
different way. In C++ we have the arrow::DataType classes. If the goal
is to create a very compact C-based data model for Arrow, why is using
a string representation of types instead of a more verbose object
model inappropriate?

> I'll also think more on this and challenge my own perspective. This isn't
> where my focus is so my comments aren't as developed/thoughtful as I'd like.
>
>
> On Tue, Oct 1, 2019 at 7:33 PM Wes McKinney <we...@gmail.com> wrote:
>
> > hi Jacques,
> >
> > I think we've veered off course a bit and maybe we could reframe the
> > discussion.
> >
> > Goals
> > * A "drop-in" header-only C file that projects can use as a
> > programming interface either internally only or to expose in-memory
> > data structures between C functions at call sites. Ideally little to
> > no disassembly/reassembly should be required on either "side" of the
> > call site.
> > * Simplifying adoption of Arrow for C programmers, or languages based
> > around C FFI
> >
> > Non-goals
> > * Expanding the columnar format or creating an alternative canonical
> > in-memory representation
> > * Creating a new canonical way to serialize metadata
> >
> > Note that this use case has been on my mind for more than 2 years:
> > https://issues.apache.org/jira/browse/ARROW-1058
> >
> > I think there are a couple of potentially misleading things at play here
> >
> > 1. The use of the word "protocol". In C, a struct has a well-defined
> > binary layout, so a C API is also an ABI. Using C structs to
> > communicate data can be considered to be a protocol, but it means
> > something different in the context of the "Arrow protocol". I think we
> > need to call this a "C API"
> >
> > 2. The documentation for this in Antoine's PR is in the format/
> > directory. It would probably be better to have a "C API" section in
> > the documentation.
> >
> > The header file under discussion and the documentation about it is
> > best considered as a "library".
> >
> > It might be useful at some point to create a C99 implementation of the
> > IPC protocol as well using FlatCC with the goal of having a complete
> > implementation of the columnar format in C with minimal binary
> > footprint. This is analogous to the NanoPB project which is an
> > implementation of Protocol Buffers with small code size
> >
> > https://github.com/nanopb/nanopb
> >
> > Let me know if this makes more sense.
> >
> > I think it's important to communicate clearly about this primarily for
> > the benefit of the outside world which can confuse easily as we have
> > observed over the last few years =)
> >
> > Wes
> >
> > On Tue, Oct 1, 2019 at 2:55 PM Jacques Nadeau <ja...@apache.org> wrote:
> > >
> > > I disagree with this statement:
> > >
> > > - the IPC format is meant for serialization while the C data protocol is
> > > meant for in-memory communication, so different concerns apply
> > >
> > > If that is how a particular implementation presents it, that is a
> > > weakness of the implementation, not the format. The primary use case I
> > > was focused on when working on the initial format was communication
> > within
> > > the same process. It seems like this is being used as a basis for the
> > > introduction of new things when the premise is inconsistent with the
> > > intention of the creation. The specific reason we used flatbuffers in the
> > > project was to collapse the separation of in-process and out-of-process
> > > communication. It means the same thing it does with the Arrow data
> > itself:
> > > that a consumer doesn't have to use a particular library to interact with
> > > and use the data.
> > >
> > > It seems like there are two ideas here:
> > >
> > > 1) How do we make it easier for people to use Arrow?
> > > 2) Should we implement a new in memory representation of Arrow that is
> > > language specific.
> > >
> > > I'm entirely in support of number one. If for a particular type of
> > domain,
> > > people want an easier way to interact with Arrow, let's make a new
> > library
> > > that helps with that. In each of our current libraries, we do many things
> > > to make it easier to work with Arrow. None of those require a change to
> > the
> > > core format or are formalized as a new in-memory standard. The in-memory
> > > representation of rust or javascript or java objects are implementation
> > > details.
> > >
> > > I'm against number two as it creates a fragmentation problem. Arrow is
> > > about having a single canonical format for memory for both metadata and
> > > data. Having multiple in-memory formats (especially when some are not
> > > language independent) is counter to the goals of the project.
> >
> > I don't think anyone is proposing anything that would cause fragmentation.
> >
> > A central question is whether it is useful to define a reusable C ABI
> > for the Arrow columnar format, and if there is sufficient interest, a
> > tiny C implementation of the IPC protocol (which uses the Flatbuffers
> > message) that assembles and disassembles the data structures defined
> > in the C ABI.
> >
> > We could separately create a tiny implementation of the Arrow IPC
> > protocol using FlatCC that could be dropped into applications
> > requiring only a C compiler and nothing else.
> >
> >
> > >
> > > Two other, separate comments:
> > > 1) I don't understand the idea that we need to change the way Arrow
> > > fundamentally works so that people can avoid using a dependency. If the
> > > dependency is small, open source and easy to build, people can fork it
> > and
> > > include directly if they want to. Let's not violate project principles
> > > because DuckDB has a religious perspective on dependencies. If the
> > problem
> > > is people have to swallow too large of a pill to do basic things with
> > Arrow
> > > in C, let's focus on fixing that (to our definition of ease, not someone
> > > else's). If FlatCC solves some of those things, great. If we need to build a
> > > baby integration library that is more C centric, great. Neither of those
> > > things require implementing something at the format level.
> > >
> > > 2) It seems like we should discuss the data structure problem separately
> > > from the reference management concern.
> > >
> > >
> > > On Tue, Oct 1, 2019 at 5:42 AM Wes McKinney <we...@gmail.com> wrote:
> > >
> > > > hi Antoine,
> > > >
> > > > On Tue, Oct 1, 2019 at 4:29 AM Antoine Pitrou <an...@python.org>
> > wrote:
> > > > >
> > > > >
> > > > > Le 01/10/2019 à 00:39, Wes McKinney a écrit :
> > > > > > A couple things:
> > > > > >
> > > > > > * I think a C protocol / FFI for Arrow array/vectors would be
> > better
> > > > > > to have the same "shape" as an assembled array. Note that the C
> > > > > > structs here have very nearly the same "shape" as the data
> > structure
> > > > > > representing a C++ Array object [1]. The disassembly and reassembly
> > > > > > here is substantially simpler than the IPC protocol. A recursive
> > > > > > structure in Flatbuffers would make RecordBatch messages much
> > larger,
> > > > > > so the flattened / disassembled representation we use for
> > serialized
> > > > > > record batches is the correct one
> > > > >
> > > > > I'm not sure I agree:
> > > > >
> > > > > - indeed, it's not a coincidence that the ArrowArray struct looks
> > quite
> > > > > closely like the C++ ArrayData object :-)  We have good experience
> > with
> > > > > that abstraction and it has proven to work quite well
> > > > >
> > > > > - the IPC format is meant for serialization while the C data
> > protocol is
> > > > > > meant for in-memory communication, so different concerns apply
> > > > >
> > > > > - the fact that this makes the layout slightly larger doesn't seem
> > > > > important at all; we're not talking about transferring data over the
> > wire
> > > > >
> > > > > There's also another argument for having a recursive struct: it
> > > > > simplifies how the data type is represented, since we can encode each
> > > > > child type individually instead of encoding it in the parent's format
> > > > > string (same applies for metadata and individual flags).
> > > > >
> > > >
> > > > I was saying something different here. I was making an argument about
> > > > why we use the flattened array-of-structs in the IPC protocol. One
> > > > reason is that it's a more compact representation. That is not very
> > > > important here because this protocol is only for *in-process* (for
> > > > languages that have a C FFI facility) rather than *inter-process*
> > > > communication.
> > > >
> > > > I agree also that the type encoding is simple, here, too, since we
> > > > aren't having to split the schema and record batch between different
> > > > serialized messages. There is some potential waste with having to
> > > > populate the type fields multiple times when communicating a sequence
> > > > of "chunks" from the same logical dataset.
> > > >
> > > > > > * The "formal" C protocol having the "assembled" shape means that
> > many
> > > > > > minimal Arrow users won't have to implement any separate data
> > > > > > structures. They can just use the C struct directly or a slightly
> > > > > > wrapped version thereof with some convenience functions.
> > > > >
> > > > > Yes, but the same applies to the current proposal.
> > > > >
> > > > > > * I think that requiring building a Flatbuffer for minimal use
> > cases
> > > > > > (e.g. communicating simple record batches with primitive types)
> > passes
> > > > > > on implementation burden to minimal users.
> > > > >
> > > > > It certainly does.
> > > > >
> > > > > > I think the mantra of the C protocol should be the following:
> > > > > >
> > > > > > * Users of the protocol have to write little to no code to use it.
> > For
> > > > > > example, populating an INT32 array should require only a few lines
> > of
> > > > > > code
> > > > >
> > > > > Agreed.  As a sidenote, the spec should have an example of doing
> > this in
> > > > > raw C.
> > > > >
> > > > > Regards
> > > > >
> > > > > Antoine.
> > > >
> >

Re: [DISCUSS] C-level in-process array protocol

Posted by Wes McKinney <we...@gmail.com>.

On Tue, Oct 8, 2019 at 3:34 PM Wes McKinney <we...@gmail.com> wrote:
>
> hi Jacques,
>
> On Tue, Oct 8, 2019 at 1:54 PM Jacques Nadeau <ja...@apache.org> wrote:
> >
> > I am removing all my objections to this work.
> >
> > I wish there was more feedback from additional community members. I continue to be concerned about fragmentation. I don't agree with the arguments here that we need to add a new api to make it easy for people to *not* use Arrow codebase. It seems like a punt on building useful libraries within the project that will ultimately hurt the interoperability story.
> >
>
> I think we'll have to take a "wait and see" approach. I believe the
> community needs to build accessible libraries that offer value to
> third party users, and we will continue to do that. I think there are
> use cases here that fall outside of which library to use, but time
> will tell.
>
> > As a side note, it seems like much of this is about people's distaste for flatbuffers. I know I regret using it. If we had a chance to do it over again, I would have chosen to use protobuf for everything except the data header, where I would hand write the encoding (since it is so simple anyway). If it is such a problem that people are contorting to work around it, maybe we should address that? Just a thought.
> >
>
> I think that using a Protobuf-like format with an IDL and a compiler presents a problem.

To clarify some inarticulate language of mine, since people reading may misinterpret it:

Using an IDL-based metadata representation _in this C API_ presents a
potential roadblock for users.
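
A minimal sketch of what an IDL-free interface could look like for the simplest case, an INT32 array with no nulls. Everything here is an assumption made for illustration: `ExampleArray`, `example_make_int32` and `example_release` are hypothetical names, the field set is only loosely modeled on the "assembled" shape discussed in this thread (it is not the struct from the PR), and the `"i"` type string is a placeholder.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical struct for illustration only; NOT the struct from the PR. */
struct ExampleArray {
  const char* format;    /* type description; "i" for int32 is an assumption */
  int64_t length;        /* number of logical values */
  int64_t null_count;    /* number of null values */
  int64_t n_buffers;     /* number of entries in `buffers` */
  const void** buffers;  /* [0] validity bitmap (NULL if no nulls), [1] values */
  void (*release)(struct ExampleArray*); /* producer-supplied cleanup */
};

/* Frees the memory owned by the array and marks it as released. */
static void example_release(struct ExampleArray* array) {
  if (array->buffers != NULL) {
    free((void*)array->buffers[1]);
    free((void*)array->buffers);
  }
  array->buffers = NULL;
  array->release = NULL;
}

/* Copies `length` int32 values into a new array.
 * Returns 0 on success, -1 on allocation failure. */
static int example_make_int32(struct ExampleArray* out, const int32_t* values,
                              int64_t length) {
  size_t n_bytes = (size_t)length * sizeof(int32_t);
  int32_t* data = malloc(n_bytes > 0 ? n_bytes : 1);
  const void** buffers = malloc(2 * sizeof(*buffers));
  if (data == NULL || buffers == NULL) {
    free(data);
    free((void*)buffers);
    return -1;
  }
  if (n_bytes > 0) memcpy(data, values, n_bytes);
  buffers[0] = NULL; /* no validity bitmap: every value is valid */
  buffers[1] = data;
  out->format = "i";
  out->length = length;
  out->null_count = 0;
  out->n_buffers = 2;
  out->buffers = buffers;
  out->release = example_release;
  return 0;
}
```

A consumer would read the values through `buffers[1]` and call `release` when done. No schema compiler or generated code is involved, which is the property at stake here.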

For a canonical metadata representation with backward and forward
compatibility guarantees, it would be ill-advised not to use
Protobuf/Flatbuffers/Thrift.

> Note that Flatbuffers is much better for C/C++ programmers and I still
> think it was the right choice for the project. Unlike with Flatbuffers,
> C/C++ applications using Protobuf must link either libprotobuf.so or libprotobuf.a.
> Flatbuffers in C++ is a header-only dependency that is trivial to
> bundle with a project [1]. The same is true for Thrift, and this came
> up in the TF discussion [2]
>
> [1]: https://github.com/apache/arrow/tree/master/cpp/thirdparty/flatbuffers/include/flatbuffers
> [2]: https://github.com/tensorflow/community/pull/162#discussion_r332610486
>
> > Thanks for the discourse and patience.
> >
> > On Wed, Oct 2, 2019 at 10:12 PM Micah Kornfield <em...@gmail.com> wrote:
> >>
> >> Hi Wes,
> >> I agree for third-parties "A" (Field data structures) is the most useful.
> >>
> >> At least in my mind the discussion was for both first and third-parties.  I
> >> was trying to point out that "A" is less necessary as a first step for
> >> first-party integrations and could potentially require more effort if we
> >> already have the code that does "B" (field reassembly).
> >>
> >> Thanks,
> >> Micah
> >>
> >> On Wed, Oct 2, 2019 at 10:28 PM Wes McKinney <we...@gmail.com> wrote:
> >>
> >> > On Wed, Oct 2, 2019 at 11:05 PM Micah Kornfield <em...@gmail.com>
> >> > wrote:
> >> > >
> >> > > I've tried to summarize my understanding of the debate so far and give
> >> > > some initial thoughts. I think there are two potentially different sets
> >> > > of users that we are targeting with a stable C API/ABI: ourselves and
> >> > > external parties.
> >> > >
> >> > > 1.  Different language implementations within the Arrow project that want
> >> > > to call into each other's code.  We still don't have a great story around
> >> > > this in terms of reusable libraries, and questions like [1] are
> >> > > motivating examples for making something better in this context.
> >> > > 2.  third-parties wishing to support/integrate with Arrow.  Some
> >> > > conjectures about these users:
> >> > >   - Users in this group are NOT necessarily familiar with existing
> >> > > technologies Arrow uses (i.e. flatbuffers)
> >> > >   - The stability of the API is the primary concern (consumers don't want
> >> > > to change when a new version of the library ships)
> >> > >   - An important secondary concern is additional libraries that need to
> >> > be
> >> > > integrated in addition to the API
> >> > >
> >> > > The main debate points seem to be:
> >> > >
> >> > > 1.  Vector/Array oriented API vs existing Record Batch.  Will an
> >> > > additional column oriented API become too much of a maintenance
> >> > > headache/cause fragmentation?
> >> > >
> >> > >  - In my mind the question here is which set of users we are
> >> > > prioritizing. IMO the combination of flatbuffers and translation to/from
> >> > > RecordBatch format offers too much friction to make it easy for a
> >> > > third-party implementer to use. If we are prioritizing for our own
> >> > > internal use-cases I think we should try out a RecordBatch+Flatbuffers
> >> > > based C-API. We already have all the necessary building blocks.
> >> > >
> >> >
> >> > If a C function passes you a string containing a RecordBatch
> >> > Flatbuffers message, what happens next? This message has to be
> >> > reassembled into a recursive data structure before you can "do"
> >> > anything with it. Are we expecting every third party project to
> >> > implement:
> >> >
> >> > A. Data structures appropriate to represent a logical "field" in a
> >> > record batch (which have to be recursive to account for nested types'
> >> > children)
> >> > B. The logic to convert from the flattened Flatbuffers representation
> >> > to some implementation of A
> >> >
> >> > I'm arguing that we should provide both to third parties. To build B,
> >> > you need A. Some consumers will only use A. This discussion is
> >> > essentially about developing an ultraminimalist "drop-in" C
> >> > implementation of A.
> >> >
> >> > > 2.  How onerous is the dependency on flat-buffers both from a learning
> >> > > curve perspective and as dependency for third-party integrators?
> >> > > - Flatbuffers aren't entirely straight-forward and I think if we do move
> >> > > forward with an API based on Column/Array we should consider alternatives
> >> > > as long as the necessary parsing code can be done in a small amount of
> >> > code
> >> > > (I'm personally against JSON for this, but can see the arguments for it).
> >> > >
> >> > > 3.  Do all existing library implementations need to support both
> >> > > Column/Array a ABI?  How will compliance be checked for the new API/ABI?
> >> > >
> >> > > - I'm still thinking this through.
> >> > >
> >> > > [1]
> >> > >
> >> > https://lists.apache.org/thread.html/18244b294d0b9bd568b5cfd1b1ac2b6a25088383a08202cc7a8a3563@%3Cuser.arrow.apache.org%3E
> >> > >
> >> > > On Wed, Oct 2, 2019 at 6:46 PM Jacques Nadeau <ja...@apache.org>
> >> > wrote:
> >> > >
> >> > > > I'd like to hear more opinions from others on this topic. This
> >> > conversation
> >> > > > seems mostly dominated by comments from myself, Wes and Antoine.
> >> > > >
> >> > > > I think it is reasonable to argue that keeping any ABI (or
> >> > header/struct
> >> > > > pattern) as narrow as possible would allow us to minimize overlap with
> >> > the
> >> > > > existing in-memory specification. In Arrow's case, this could be as
> >> > simple
> >> > > > as a single memory pointer for schema (backed by flatbuffers) and a
> >> > single
> >> > > > memory location for data (that references the record batch header,
> >> > which in
> >> > > > turn provides pointers into the actual arrow data). Extensions would
> >> > need
> >> > > > to be added for reference management as done here but I continue to
> >> > think
> >> > > > we should defer discussion of that until the base data structures are
> >> > > > resolved. I see the comments here as arguing for a much broader ABI, in
> >> > > > part to support having people build "Arrow" components that
> >> > interconnect
> >> > > > using this new interface. I understand the desire to expand the ABI to
> >> > be
> >> > > > driven by needs to reduce dependencies and ease usability.
> >> > > >
> >> > > > The representation within the related patch is being presented as a
> >> > way for
> >> > > > applications to share Arrow data but is not easily accessible to all
> >> > > > languages. I want to avoid a situation where someone says "I produced
> >> > an
> >> > > > Arrow API" when what they've really done is created a C interface which
> >> > > > only a small subset of languages can actually leverage. For example,
> >> > every
> >> > > > language now knows how to parse the existing schema definition as
> >> > rendered
> >> > > > in flatbuf. In order to interact with something that implements this
> >> > new
> >> > > > pattern one would also be required to implement completely new schema
> >> > > > consumption code. In the proposal itself it suggests this (for example
> >> > > > enhancing the C++ library to consume structures produced this way).
> >> > > >
> >> > > > As I said, I really want to hear more opinions. Running this past
> >> > various
> >> > > > developers I know, many have echoed my concerns but that really doesn't
> >> > > > matter (and who knows how much of that is colored by my presentation
> >> > of the
> >> > > > issue). What do people here think? If someone builds an "Arrow" library
> >> > > > that implements this set of structures, how does one use it in Node? In
> >> > > > Java? Does it drive creation of a secondary set of interfaces in each
> >> > of
> >> > > > those languages to work with this kind of pattern? (For example, in a
> >> > JVM
> >> > > > view of the world, working with a plain struct in java rather than a
> >> > set of
> >> > > > memory pointers against our existing IPC formats would be quite
> >> > painful and
> >> > > > we'd definitely need to create some glue code for users. I worry the
> >> > same
> >> > > > pattern would occur in many other languages.)
> >> > > >
> >> > > > To respond directly to some of Wes's most recent comments from the
> >> > email
> >> > > > below. I struggle to map your description of the situation to the rest
> >> > of
> >> > > > the thread and the proposed patch.  For example, you say that a
> >> > non-goal is
> >> > > > "creating a new canonical way to serialize metadata" but the patch
> >> > > > proposes a concrete string based encoding system to describe data
> >> > types.
> >> > > > Aren't those things in conflict?
> >> > > >
> >> > > > I'll also think more on this and challenge my own perspective. This
> >> > isn't
> >> > > > where my focus is so my comments aren't as developed/thoughtful as I'd
> >> > > > like.
> >> > > >
> >> > > >

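The "A" and "B" pieces from the quoted exchange above can be sketched as follows. This is only an illustration under stated assumptions: `ExampleField`, `ExampleFlatNode`, `example_assemble` and `example_free` are hypothetical names, and the flattened node list here carries its child counts inline to keep the sketch short, which is a simplification and not how the Flatbuffers RecordBatch message is laid out.

```c
#include <stdint.h>
#include <stdlib.h>

/* "A": a recursive in-memory field (hypothetical, for illustration only). */
struct ExampleField {
  int64_t length;
  int64_t n_children;
  struct ExampleField** children;
};

/* One entry of a flattened, depth-first (pre-order) node list.
 * Storing the child count inline is a simplifying assumption. */
struct ExampleFlatNode {
  int64_t length;
  int64_t n_children;
};

/* Recursively frees a field and the children built so far. */
static void example_free(struct ExampleField* field) {
  if (field == NULL) return;
  for (int64_t i = 0; i < field->n_children; ++i) {
    example_free(field->children[i]);
  }
  free(field->children);
  free(field);
}

/* "B": rebuilds the recursive structure from the flattened list.
 * `*pos` is a cursor into `nodes`, advanced as nodes are consumed.
 * Returns NULL on allocation failure or truncated input. */
static struct ExampleField* example_assemble(const struct ExampleFlatNode* nodes,
                                             size_t n_nodes, size_t* pos) {
  if (*pos >= n_nodes) return NULL;
  const struct ExampleFlatNode* node = &nodes[(*pos)++];
  struct ExampleField* field = calloc(1, sizeof(*field));
  if (field == NULL) return NULL;
  field->length = node->length;
  if (node->n_children > 0) {
    field->children =
        calloc((size_t)node->n_children, sizeof(*field->children));
    if (field->children == NULL) {
      free(field);
      return NULL;
    }
    for (int64_t i = 0; i < node->n_children; ++i) {
      field->children[i] = example_assemble(nodes, n_nodes, pos);
      if (field->children[i] == NULL) {
        example_free(field); /* frees only the children built so far */
        return NULL;
      }
      field->n_children = i + 1;
    }
  }
  return field;
}
```

A consumer that only needs "A" can use the struct directly; "B" is needed only when the data arrives in flattened form.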
Re: [DISCUSS] C-level in-process array protocol

Posted by Wes McKinney <we...@gmail.com>.

hi Jacques,

On Tue, Oct 8, 2019 at 1:54 PM Jacques Nadeau <ja...@apache.org> wrote:
>
> I am removing all my objections to this work.
>
> I wish there was more feedback from additional community members. I continue to be concerned about fragmentation. I don't agree with the arguments here that we need to add a new api to make it easy for people to *not* use Arrow codebase. It seems like a punt on building useful libraries within the project that will ultimately hurt the interoperability story.
>

I think we'll have to take a "wait and see" approach. I believe the
community needs to build accessible libraries that offer value to
third party users, and we will continue to do that. I think there are
use cases here that fall outside of which library to use, but time
will tell.

> As a side note, it seems like much of this is about people's distaste for flatbuffers. I know I regret using it. If we had a chance to do it over again, I would have chosen to use protobuf for everything except the data header, where I would hand write the encoding (since it is so simple anyway). If it is such a problem that people are contorting to work around it, maybe we should address that? Just a thought.
>

I think that using a Protobuf-like format with an IDL and a compiler presents a problem.

Note that Flatbuffers is much better for C/C++ programmers and I still
think it was the right choice for the project. Unlike with Flatbuffers,
C/C++ applications using Protobuf must link either libprotobuf.so or libprotobuf.a.
Flatbuffers in C++ is a header-only dependency that is trivial to
bundle with a project [1]. The same is true for Thrift, and this came
up in the TF discussion [2]

[1]: https://github.com/apache/arrow/tree/master/cpp/thirdparty/flatbuffers/include/flatbuffers
[2]: https://github.com/tensorflow/community/pull/162#discussion_r332610486

> Thanks for the discourse and patience.
>
> On Wed, Oct 2, 2019 at 10:12 PM Micah Kornfield <em...@gmail.com> wrote:
>>
>> Hi Wes,
>> I agree for third-parties "A" (Field data structures) is the most useful.
>>
>> At least in my mind the discussion was for both first and third-parties.  I
>> was trying to point out that "A" is less necessary as a first step for
>> first-party integrations and could potentially require more effort if we
>> already have the code that does "B" (field reassembly).
>>
>> Thanks,
>> Micah
>>
>> On Wed, Oct 2, 2019 at 10:28 PM Wes McKinney <we...@gmail.com> wrote:
>>
>> > On Wed, Oct 2, 2019 at 11:05 PM Micah Kornfield <em...@gmail.com>
>> > wrote:
>> > >
>> > > I've tried to summarize my understanding of the debate so far and give
>> > some
>> > > initial thoughts. I think there are two potentially different sets of
>> > users
>> > > that we are targeting with a stable C API/ABI: ourselves and external
>> > > parties.
>> > >
>> > > 1.  Different language implementations within the Arrow project that want
>> > > to call into each other's code.  We still don't have a great story around
>> > > this in terms of reusable libraries and questions like [1] are a
>> > motivating
>> > > example of making something better in this context.
>> > > 2.  third-parties wishing to support/integrate with Arrow.  Some
>> > > conjectures about these users:
>> > >   - Users in this group are NOT necessarily familiar with existing
>> > > technologies Arrow uses (e.g. flatbuffers)
>> > >   - The stability of the API is the primary concern (consumers don't want
>> > > to change when a new version of the library ships)
>> > >   - An important secondary concern is additional libraries that need to
>> > be
>> > > integrated in addition to the API
>> > >
>> > > The main debate points seem to be:
>> > >
>> > > 1.  Vector/Array oriented API vs existing Record Batch.  Will an
>> > additional
>> > > column oriented API become too much of a maintenance headache/cause
>> > > fragmentation?
>> > >
>> > >  - In my mind the question here is which set of users we are
>> > prioritizing.
>> > > IMO the combination of flatbuffers and translation to/from RecordBatch
>> > > format offers too much friction to make it easy for a third-party
>> > > implementer to use. If we are prioritizing for our own internal
>> > use-cases I
>> > > think we should try out a RecordBatch+Flatbuffers based C-API. We already
>> > > have all the necessary building blocks.
>> > >
>> >
>> > If a C function passes you a string containing a RecordBatch
>> > Flatbuffers message, what happens next? This message has to be
>> > reassembled into a recursive data structure before you can "do"
>> > anything with it. Are we expecting every third party project to
>> > implement:
>> >
>> > A. Data structures appropriate to represent a logical "field" in a
>> > record batch (which have to be recursive to account for nested types'
>> > children)
>> > B. The logic to convert from the flattened Flatbuffers representation
>> > to some implementation of A
>> >
>> > I'm arguing that we should provide both to third parties. To build B,
>> > you need A. Some consumers will only use A. This discussion is
>> > essentially about developing an ultraminimalist "drop-in" C
>> > implementation of A.
>> >
>> > > 2.  How onerous is the dependency on flat-buffers both from a learning
>> > > curve perspective and as dependency for third-party integrators?
>> > > - Flatbuffers aren't entirely straight-forward and I think if we do move
>> > > forward with an API based on Column/Array we should consider alternatives
>> > > as long as the necessary parsing code can be done in a small amount of
>> > code
>> > > (I'm personally against JSON for this, but can see the arguments for it).
>> > >
>> > > 3.  Do all existing library implementations need to support both
>> > > Column/Array a ABI?  How will compliance be checked for the new API/ABI?
>> > >
>> > > - I'm still thinking this through.
>> > >
>> > > [1]
>> > >
>> > https://lists.apache.org/thread.html/18244b294d0b9bd568b5cfd1b1ac2b6a25088383a08202cc7a8a3563@%3Cuser.arrow.apache.org%3E
>> > >
>> > > On Wed, Oct 2, 2019 at 6:46 PM Jacques Nadeau <ja...@apache.org>
>> > wrote:
>> > >
>> > > > I'd like to hear more opinions from others on this topic. This
>> > conversation
>> > > > seems mostly dominated by comments from myself, Wes and Antoine.
>> > > >
>> > > > I think it is reasonable to argue that keeping any ABI (or
>> > header/struct
>> > > > pattern) as narrow as possible would allow us to minimize overlap with
>> > the
>> > > > existing in-memory specification. In Arrow's case, this could be as
>> > simple
>> > > > as a single memory pointer for schema (backed by flatbuffers) and a
>> > single
>> > > > memory location for data (that references the record batch header,
>> > which in
>> > > > turn provides pointers into the actual arrow data). Extensions would
>> > need
>> > > > to be added for reference management as done here but I continue to
>> > think
>> > > > we should defer discussion of that until the base data structures are
>> > > > resolved. I see the comments here as arguing for a much broader ABI, in
>> > > > part to support having people build "Arrow" components that
>> > interconnect
>> > > > using this new interface. I understand the desire to expand the ABI to
>> > be
>> > > > driven by needs to reduce dependencies and ease usability.
>> > > >
>> > > > The representation within the related patch is being presented as a
>> > way for
>> > > > applications to share Arrow data but is not easily accessible to all
>> > > > languages. I want to avoid a situation where someone says "I produced
>> > an
>> > > > Arrow API" when what they've really done is created a C interface which
>> > > > only a small subset of languages can actually leverage. For example,
>> > every
>> > > > language now knows how to parse the existing schema definition as
>> > rendered
>> > > > in flatbuf. In order to interact with something that implements this
>> > new
>> > > > pattern one would also be required to implement completely new schema
>> > > > consumption code. In the proposal itself it suggests this (for example
>> > > > enhancing the C++ library to consume structures produced this way).
>> > > >
>> > > > As I said, I really want to hear more opinions. Running this past
>> > various
>> > > > developers I know, many have echoed my concerns but that really doesn't
>> > > > matter (and who knows how much of that is colored by my presentation
>> > of the
>> > > > issue). What do people here think? If someone builds an "Arrow" library
>> > > > that implements this set of structures, how does one use it in Node? In
>> > > > Java? Does it drive creation of a secondary set of interfaces in each
>> > of
>> > > > those languages to work with this kind of pattern? (For example, in a
>> > JVM
>> > > > view of the world, working with a plain struct in java rather than a
>> > set of
>> > > > memory pointers against our existing IPC formats would be quite
>> > painful and
>> > > > we'd definitely need to create some glue code for users. I worry the
>> > same
>> > > > pattern would occur in many other languages.)
>> > > >
>> > > > To respond directly to some of Wes's most recent comments from the
>> > email
>> > > > below. I struggle to map your description of the situation to the rest
>> > of
>> > > > the thread and the proposed patch.  For example, you say that a
>> > non-goal is
>> > > > "creating a new canonical way to serialize metadata" but the patch
>> > > > proposes a concrete string based encoding system to describe data
>> > types.
>> > > > Aren't those things in conflict?
>> > > >
>> > > > I'll also think more on this and challenge my own perspective. This
>> > isn't
>> > > > where my focus is so my comments aren't as developed/thoughtful as I'd
>> > > > like.
>> > > >
>> > > >
>> > > > On Tue, Oct 1, 2019 at 7:33 PM Wes McKinney <we...@gmail.com>
>> > wrote:
>> > > >
>> > > > > hi Jacques,
>> > > > >
>> > > > > I think we've veered off course a bit and maybe we could reframe the
>> > > > > discussion.
>> > > > >
>> > > > > Goals
>> > > > > * A "drop-in" header-only C file that projects can use as a
>> > > > > programming interface either internally only or to expose in-memory
>> > > > > data structures between C functions at call sites. Ideally little to
>> > > > > no disassembly/reassembly should be required on either "side" of the
>> > > > > call site.
>> > > > > * Simplifying adoption of Arrow for C programmers, or languages based
>> > > > > around C FFI
>> > > > >
>> > > > > Non-goals
>> > > > > * Expanding the columnar format or creating an alternative canonical
>> > > > > in-memory representation
>> > > > > * Creating a new canonical way to serialize metadata
>> > > > >
>> > > > > Note that this use case has been on my mind for more than 2 years:
>> > > > > https://issues.apache.org/jira/browse/ARROW-1058
>> > > > >
>> > > > > I think there are a couple of potentially misleading things at play
>> > here
>> > > > >
>> > > > > 1. The use of the word "protocol". In C, a struct has a well-defined
>> > > > > binary layout, so a C API is also an ABI. Using C structs to
>> > > > > communicate data can be considered to be a protocol, but it means
>> > > > > something different in the context of the "Arrow protocol". I think
>> > we
>> > > > > need to call this a "C API"
>> > > > >
>> > > > > 2. The documentation for this in Antoine's PR is in the format/
>> > > > > directory. It would probably be better to have a "C API" section in
>> > > > > the documentation.
>> > > > >
>> > > > > The header file under discussion and the documentation about it is
>> > > > > best considered as a "library".
>> > > > >
>> > > > > It might be useful at some point to create a C99 implementation of
>> > the
>> > > > > IPC protocol as well using FlatCC with the goal of having a complete
>> > > > > implementation of the columnar format in C with minimal binary
>> > > > > footprint. This is analogous to the NanoPB project which is an
>> > > > > implementation of Protocol Buffers with small code size
>> > > > >
>> > > > > https://github.com/nanopb/nanopb
>> > > > >
>> > > > > Let me know if this makes more sense.
>> > > > >
>> > > > > I think it's important to communicate clearly about this primarily
>> > for
>> > > > > the benefit of the outside world which can confuse easily as we have
>> > > > > observed over the last few years =)
>> > > > >
>> > > > > Wes
>> > > > >
>> > > > > On Tue, Oct 1, 2019 at 2:55 PM Jacques Nadeau <ja...@apache.org>
>> > > > wrote:
>> > > > > >
>> > > > > > I disagree with this statement:
>> > > > > >
>> > > > > > - the IPC format is meant for serialization while the C data
>> > protocol
>> > > > is
>> > > > > > meant for in-memory communication, so different concerns apply
>> > > > > >
>> > > > > > If that is how a particular implementation presents it, that
>> > is a
>> > > > > > weakness of the implementation, not the format. The primary use
>> > case
>> > > > I
>> > > > > > was focused on when working on the initial format was communication
>> > > > > within
>> > > > > > the same process. It seems like this is being used as a basis for
>> > the
>> > > > > > introduction of new things when the premise is inconsistent with
>> > the
>> > > > > > intention of the creation. The specific reason we used flatbuffers
>> > in
>> > > > the
>> > > > > > project was to collapse the separation of in-process and
>> > out-of-process
>> > > > > > communication. It means the same thing it does with the Arrow data
>> > > > > itself:
>> > > > > > that a consumer doesn't have to use a particular library to
>> > interact
>> > > > with
>> > > > > > and use the data.
>> > > > > >
>> > > > > > It seems like there are two ideas here:
>> > > > > >
>> > > > > > 1) How do we make it easier for people to use Arrow?
>> > > > > > 2) Should we implement a new in memory representation of Arrow
>> > that is
>> > > > > > language specific.
>> > > > > >
>> > > > > > I'm entirely in support of number one. If for a particular type of
>> > > > > domain,
>> > > > > > people want an easier way to interact with Arrow, let's make a new
>> > > > > library
>> > > > > > that helps with that. In each of our current libraries, we do many
>> > > > things
>> > > > > > to make it easier to work with Arrow. None of those require a
>> > change to
>> > > > > the
>> > > > > > core format or are formalized as a new in-memory standard. The
>> > > > in-memory
>> > > > > > representation of rust or javascript or java objects are
>> > implementation
>> > > > > > details.
>> > > > > >
>> > > > > > I'm against number two as it creates a fragmentation problem.
>> > Arrow is
>> > > > > > about having a single canonical format for memory for both
>> > metadata and
>> > > > > > data. Having multiple in-memory formats (especially when some are
>> > not
>> > > > > > language independent) is counter to the goals of the project.
>> > > > >
>> > > > > I don't think anyone is proposing anything that would cause
>> > > > fragmentation.
>> > > > >
>> > > > > A central question is whether it is useful to define a reusable C ABI
>> > > > > for the Arrow columnar format, and if there is sufficient interest, a
>> > > > > tiny C implementation of the IPC protocol (which uses the Flatbuffers
>> > > > > message) that assembles and disassembles the data structures defined
>> > > > > in the C ABI.
>> > > > >
>> > > > > We could separately create a tiny implementation of the Arrow IPC
>> > > > > protocol using FlatCC that could be dropped into applications
>> > > > > requiring only a C compiler and nothing else.
>> > > > >
>> > > > >
>> > > > > >
>> > > > > > Two other, separate comments:
>> > > > > > 1) I don't understand the idea that we need to change the way Arrow
>> > > > > > fundamentally works so that people can avoid using a dependency.
>> > If the
>> > > > > > dependency is small, open source and easy to build, people can
>> > fork it
>> > > > > and
>> > > > > > include directly if they want to. Let's not violate project
>> > principles
>> > > > > > because DuckDB has a religious perspective on dependencies. If the
>> > > > > problem
>> > > > > > is people have to swallow too large of a pill to do basic things
>> > with
>> > > > > Arrow
>> > > > > > in C, let's focus on fixing that (to our definition of ease, not
>> > > > someone
>> > > > > > else's). If FlatCC solves some of those things, great. If we need to
>> > > > build a
>> > > > > > baby integration library that is more C centric, great. Neither of
>> > > > those
>> > > > > > things require implementing something at the format level.
>> > > > > >
>> > > > > > 2) It seems like we should discuss the data structure problem
>> > > > separately
>> > > > > > from the reference management concern.
>> > > > > >
>> > > > > >
>> > > > > > On Tue, Oct 1, 2019 at 5:42 AM Wes McKinney <we...@gmail.com>
>> > > > wrote:
>> > > > > >
>> > > > > > > hi Antoine,
>> > > > > > >
>> > > > > > > On Tue, Oct 1, 2019 at 4:29 AM Antoine Pitrou <
>> > antoine@python.org>
>> > > > > wrote:
>> > > > > > > >
>> > > > > > > >
>> > > > > > > > Le 01/10/2019 à 00:39, Wes McKinney a écrit :
>> > > > > > > > > A couple things:
>> > > > > > > > >
>> > > > > > > > > * I think a C protocol / FFI for Arrow array/vectors would be
>> > > > > better
>> > > > > > > > > to have the same "shape" as an assembled array. Note that
>> > the C
>> > > > > > > > > structs here have very nearly the same "shape" as the data
>> > > > > structure
>> > > > > > > > > representing a C++ Array object [1]. The disassembly and
>> > > > reassembly
>> > > > > > > > > here is substantially simpler than the IPC protocol. A
>> > recursive
>> > > > > > > > > structure in Flatbuffers would make RecordBatch messages much
>> > > > > larger,
>> > > > > > > > > so the flattened / disassembled representation we use for
>> > > > > serialized
>> > > > > > > > > record batches is the correct one
>> > > > > > > >
>> > > > > > > > I'm not sure I agree:
>> > > > > > > >
>> > > > > > > > - indeed, it's not a coincidence that the ArrowArray struct
>> > looks
>> > > > > quite
>> > > > > > > > closely like the C++ ArrayData object :-)  We have good
>> > experience
>> > > > > with
>> > > > > > > > that abstraction and it has proven to work quite well
>> > > > > > > >
>> > > > > > > > - the IPC format is meant for serialization while the C data
>> > > > > protocol is
>> > > > > > > > meant for in-memory communication, so different concerns apply
>> > > > > > > >
>> > > > > > > > - the fact that this makes the layout slightly larger doesn't
>> > seem
>> > > > > > > > important at all; we're not talking about transferring data
>> > over
>> > > > the
>> > > > > wire
>> > > > > > > >
>> > > > > > > > There's also another argument for having a recursive struct: it
>> > > > > > > > simplifies how the data type is represented, since we can
>> > encode
>> > > > each
>> > > > > > > > child type individually instead of encoding it in the parent's
>> > > > format
>> > > > > > > > string (same applies for metadata and individual flags).
>> > > > > > > >
>> > > > > > >
>> > > > > > > I was saying something different here. I was making an argument
>> > about
>> > > > > > > why we use the flattened array-of-structs in the IPC protocol.
>> > One
>> > > > > > > reason is that it's a more compact representation. That is not
>> > very
>> > > > > > > important here because this protocol is only for *in-process*
>> > (for
>> > > > > > > languages that have a C FFI facility) rather than *inter-process*
>> > > > > > > communication.
>> > > > > > >
>> > > > > > > I agree also that the type encoding is simple, here, too, since
>> > we
>> > > > > > > aren't having to split the schema and record batch between
>> > different
>> > > > > > > serialized messages. There is some potential waste with having to
>> > > > > > > populate the type fields multiple times when communicating a
>> > sequence
>> > > > > > > of "chunks" from the same logical dataset.
>> > > > > > >
>> > > > > > > > > * The "formal" C protocol having the "assembled" shape means
>> > that
>> > > > > many
>> > > > > > > > > minimal Arrow users won't have to implement any separate data
>> > > > > > > > > structures. They can just use the C struct directly or a
>> > slightly
>> > > > > > > > > wrapped version thereof with some convenience functions.
>> > > > > > > >
>> > > > > > > > Yes, but the same applies to the current proposal.
>> > > > > > > >
>> > > > > > > > > * I think that requiring building a Flatbuffer for minimal
>> > use
>> > > > > cases
>> > > > > > > > > (e.g. communicating simple record batches with primitive
>> > types)
>> > > > > passes
>> > > > > > > > > on implementation burden to minimal users.
>> > > > > > > >
>> > > > > > > > It certainly does.
>> > > > > > > >
>> > > > > > > > > I think the mantra of the C protocol should be the following:
>> > > > > > > > >
>> > > > > > > > > * Users of the protocol have to write little to no code to
>> > use
>> > > > it.
>> > > > > For
>> > > > > > > > > example, populating an INT32 array should require only a few
>> > > > lines
>> > > > > of
>> > > > > > > > > code
>> > > > > > > >
>> > > > > > > > Agreed.  As a sidenote, the spec should have an example of
>> > doing
>> > > > > this in
>> > > > > > > > raw C.
>> > > > > > > >
>> > > > > > > > Regards
>> > > > > > > >
>> > > > > > > > Antoine.
>> > > > > > >
>> > > > >
>> > > >
>> >

Re: [DISCUSS] C-level in-process array protocol

Posted by "Uwe L. Korn" <uw...@xhochy.com>.

I'm not sure whether flatbuffers is actually an issue in the end, but keeping it out of the C API definitely simplifies things a bit adoption-wise. I don't think, though, that using protobuf would make a difference here.

In general, I really like the C-interface work as, sadly, C APIs are still the most accessible ones. Even when using the official Arrow C++ library, I often want access to the underlying data from some other non-C++ processing library, and having the C interface makes my life easier. In my case, I'm working with Numba (an LLVM-based JIT for a subset of numerical Python), which does not easily support interfacing with C++ but can work with C FFI calls directly.

Uwe
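
As a rough illustration of the FFI use case described above, here is a minimal sketch of a plain C function that a non-C++ consumer could call on raw buffers. The function name and signature are hypothetical; the only convention assumed from Arrow is that a validity bitmap uses least-significant-bit numbering, with a set bit meaning the slot is valid.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sums the non-null int32 values of a column.
   `validity` may be NULL, which is treated as "every slot is valid". */
int64_t example_sum_int32(const int32_t *values, const uint8_t *validity,
                          int64_t length) {
    assert(length >= 0 && (values != NULL || length == 0));
    int64_t total = 0;
    for (int64_t i = 0; i < length; i++) {
        /* bit i lives in byte i / 8, at position i % 8 from the low end */
        int valid = validity == NULL || ((validity[i / 8] >> (i % 8)) & 1);
        if (valid) total += values[i];
    }
    return total;
}
```

Because the function takes only pointers and integers, it needs no C++ name mangling or class layout knowledge on the caller's side.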

On Tue, Oct 8, 2019, at 8:54 PM, Jacques Nadeau wrote:
> I am removing all my objections to this work.
> 
> I wish there was more feedback from additional community members. I
> continue to be concerned about fragmentation. I don't agree with the
> arguments here that we need to add a new API to make it easy for people to
> *not* use the Arrow codebase. It seems like a punt on building useful libraries
> within the project that will ultimately hurt the interoperability story.
> 
> As a side note, it seems like much of this is about people's distaste for
> flatbuffers. I know I regret using it. If we had a chance to do it over
> again, I would have chosen to use protobuf for everything except the data
> header, where I would hand write the encoding (since it is so simple
> anyway). If it is such a problem that people are contorting to work around
> it, maybe we should address that? Just a thought.
> 
> Thanks for the discourse and patience.
> 
> On Wed, Oct 2, 2019 at 10:12 PM Micah Kornfield <em...@gmail.com>
> wrote:
> 
> > Hi Wes,
> > I agree for third-parties "A" (Field data structures) is the most useful.
> >
> > At least in my mind the discussion was for both first and third-parties.  I
> > was trying to point out that "A" is less necessary as a first step for
> > first-party integrations and could potentially require more effort if we
> > already have the code that does "B" (field reassembly).
> >
> > Thanks,
> > Micah
> >
> > On Wed, Oct 2, 2019 at 10:28 PM Wes McKinney <we...@gmail.com> wrote:
> >
> > > On Wed, Oct 2, 2019 at 11:05 PM Micah Kornfield <em...@gmail.com>
> > > wrote:
> > > >
> > > > I've tried to summarize my understanding of the debate so far and give
> > > some
> > > > initial thoughts. I think there are two potentially different sets of
> > > users
> > > > that we are targeting with a stable C API/ABI: ourselves and external
> > > > parties.
> > > >
> > > > 1.  Different language implementations within the Arrow project that
> > want
> > > > to call into each other's code.  We still don't have a great story
> > around
> > > > this in terms of reusable libraries and questions like [1] are a
> > > motivating
> > > > example of making something better in this context.
> > > > 2.  third-parties wishing to support/integrate with Arrow.  Some
> > > > conjectures about these users:
> > > >   - Users in this group are NOT necessarily familiar with existing
> > > > technologies Arrow uses (e.g. flatbuffers)
> > > >   - The stability of the API is the primary concern (consumers don't
> > want
> > > > to change when a new version of the library ships)
> > > >   - An important secondary concern is additional libraries that need to
> > > be
> > > > integrated in addition to the API
> > > >
> > > > The main debate points seem to be:
> > > >
> > > > 1.  Vector/Array oriented API vs existing Record Batch.  Will an
> > > additional
> > > > column oriented API become too much of a maintenance headache/cause
> > > > fragmentation?
> > > >
> > > >  - In my mind the question here is which set of users we are
> > > prioritizing.
> > > > IMO the combination of flatbuffers and translation to/from RecordBatch
> > > > format offers too much friction to make it easy for a third-party
> > > > implementer to use. If we are prioritizing for our own internal
> > > use-cases I
> > > > think we should try out a RecordBatch+Flatbuffers based C-API. We
> > already
> > > > have all the necessary building blocks.
> > > >
> > >
> > > If a C function passes you a string containing a RecordBatch
> > > Flatbuffers message, what happens next? This message has to be
> > > reassembled into a recursive data structure before you can "do"
> > > anything with it. Are we expecting every third party project to
> > > implement:
> > >
> > > A. Data structures appropriate to represent a logical "field" in a
> > > record batch (which have to be recursive to account for nested types'
> > > children)
> > > B. The logic to convert from the flattened Flatbuffers representation
> > > to some implementation of A
> > >
> > > I'm arguing that we should provide both to third parties. To build B,
> > > you need A. Some consumers will only use A. This discussion is
> > > essentially about developing an ultraminimalist "drop-in" C
> > > implementation of A.
> > >
> > > > 2.  How onerous is the dependency on flat-buffers both from a learning
> > > > curve perspective and as dependency for third-party integrators?
> > > > - Flatbuffers aren't entirely straight-forward and I think if we do
> > move
> > > > forward with an API based on Column/Array we should consider
> > alternatives
> > > > as long as the necessary parsing code can be done in a small amount of
> > > code
> > > > (I'm personally against JSON for this, but can see the arguments for
> > it).
> > > >
> > > > 3.  Do all existing library implementations need to support both
> > > > Column/Array a ABI?  How will compliance be checked for the new
> > API/ABI?
> > > >
> > > > - I'm still thinking this through.
> > > >
> > > > [1]
> > > >
> > >
> > https://lists.apache.org/thread.html/18244b294d0b9bd568b5cfd1b1ac2b6a25088383a08202cc7a8a3563@%3Cuser.arrow.apache.org%3E
> > > >
> > > > On Wed, Oct 2, 2019 at 6:46 PM Jacques Nadeau <ja...@apache.org>
> > > wrote:
> > > >
> > > > > I'd like to hear more opinions from others on this topic. This
> > > conversation
> > > > > seems mostly dominated by comments from myself, Wes and Antoine.
> > > > >
> > > > > I think it is reasonable to argue that keeping any ABI (or
> > > header/struct
> > > > > pattern) as narrow as possible would allow us to minimize overlap
> > with
> > > the
> > > > > existing in-memory specification. In Arrow's case, this could be as
> > > simple
> > > > > as a single memory pointer for schema (backed by flatbuffers) and a
> > > single
> > > > > memory location for data (that references the record batch header,
> > > which in
> > > > > turn provides pointers into the actual arrow data). Extensions would
> > > need
> > > > > to be added for reference management as done here but I continue to
> > > think
> > > > > we should defer discussion of that until the base data structures are
> > > > > resolved. I see the comments here as arguing for a much broader ABI,
> > in
> > > > > part to support having people build "Arrow" components that
> > > interconnect
> > > > > using this new interface. I understand the desire to expand the ABI
> > to
> > > be
> > > > > driven by needs to reduce dependencies and ease usability.
> > > > >
> > > > > The representation within the related patch is being presented as a
> > > way for
> > > > > applications to share Arrow data but is not easily accessible to all
> > > > > languages. I want to avoid a situation where someone says "I produced
> > > an
> > > > > Arrow API" when what they've really done is created a C interface
> > which
> > > > > only a small subset of languages can actually leverage. For example,
> > > every
> > > > > language now knows how to parse the existing schema definition as
> > > rendered
> > > > > in flatbuf. In order to interact with something that implements this
> > > new
> > > > > pattern one would also be required to implement completely new schema
> > > > > consumption code. In the proposal itself it suggests this (for
> > example
> > > > > enhancing the C++ library to consume structures produced this way).
> > > > >
> > > > > As I said, I really want to hear more opinions. Running this past
> > > various
> > > > > developers I know, many have echoed my concerns but that really
> > doesn't
> > > > > matter (and who knows how much of that is colored by my presentation
> > > of the
> > > > > issue). What do people here think? If someone builds an "Arrow"
> > library
> > > > > that implements this set of structures, how does one use it in Node?
> > In
> > > > > Java? Does it drive creation of a secondary set of interfaces in each
> > > of
> > > > > those languages to work with this kind of pattern? (For example, in a
> > > JVM
> > > > > view of the world, working with a plain struct in java rather than a
> > > set of
> > > > > memory pointers against our existing IPC formats would be quite
> > > painful and
> > > > > we'd definitely need to create some glue code for users. I worry the
> > > same
> > > > > pattern would occur in many other languages.)
> > > > >
> > > > > To respond directly to some of Wes's most recent comments from the
> > > email
> > > > > below. I struggle to map your description of the situation to the
> > rest
> > > of
> > > > > the thread and the proposed patch.  For example, you say that a
> > > non-goal is
> > > > > "creating a new canonical way to serialize metadata" but the patch
> > > > > proposes a concrete string based encoding system to describe data
> > > types.
> > > > > Aren't those things in conflict?
> > > > >
> > > > > I'll also think more on this and challenge my own perspective. This
> > > isn't
> > > > > where my focus is so my comments aren't as developed/thoughtful as
> > I'd
> > > > > like.
> > > > >
> > > > >
> > > > > On Tue, Oct 1, 2019 at 7:33 PM Wes McKinney <we...@gmail.com>
> > > wrote:
> > > > >
> > > > > > hi Jacques,
> > > > > >
> > > > > > I think we've veered off course a bit and maybe we could reframe
> > the
> > > > > > discussion.
> > > > > >
> > > > > > Goals
> > > > > > * A "drop-in" header-only C file that projects can use as a
> > > > > > programming interface either internally only or to expose in-memory
> > > > > > data structures between C functions at call sites. Ideally little
> > to
> > > > > > no disassembly/reassembly should be required on either "side" of
> > the
> > > > > > call site.
> > > > > > * Simplifying adoption of Arrow for C programmers, or languages
> > based
> > > > > > around C FFI
> > > > > >
> > > > > > Non-goals
> > > > > > * Expanding the columnar format or creating an alternative
> > canonical
> > > > > > in-memory representation
> > > > > > * Creating a new canonical way to serialize metadata
> > > > > >
> > > > > > Note that this use case has been on my mind for more than 2 years:
> > > > > > https://issues.apache.org/jira/browse/ARROW-1058
> > > > > >
> > > > > > I think there are a couple of potentially misleading things at play
> > > here
> > > > > >
> > > > > > 1. The use of the word "protocol". In C, a struct has a
> > well-defined
> > > > > > binary layout, so a C API is also an ABI. Using C structs to
> > > > > > communicate data can be considered to be a protocol, but it means
> > > > > > something different in the context of the "Arrow protocol". I think
> > > we
> > > > > > need to call this a "C API"
> > > > > >
> > > > > > 2. The documentation for this in Antoine's PR is in the format/
> > > > > > directory. It would probably be better to have a "C API" section in
> > > > > > the documentation.
> > > > > >
> > > > > > The header file under discussion and the documentation about it is
> > > > > > best considered as a "library".
> > > > > >
> > > > > > It might be useful at some point to create a C99 implementation of
> > > the
> > > > > > IPC protocol as well using FlatCC with the goal of having a
> > complete
> > > > > > implementation of the columnar format in C with minimal binary
> > > > > > footprint. This is analogous to the NanoPB project which is an
> > > > > > implementation of Protocol Buffers with small code size
> > > > > >
> > > > > > https://github.com/nanopb/nanopb
> > > > > >
> > > > > > Let me know if this makes more sense.
> > > > > >
> > > > > > I think it's important to communicate clearly about this primarily
> > > for
> > > > > > the benefit of the outside world which can confuse easily as we
> > have
> > > > > > observed over the last few years =)
> > > > > >
> > > > > > Wes
> > > > > >
> > > > > > On Tue, Oct 1, 2019 at 2:55 PM Jacques Nadeau <ja...@apache.org>
> > > > > wrote:
> > > > > > >
> > > > > > > I disagree with this statement:
> > > > > > >
> > > > > > > - the IPC format is meant for serialization while the C data
> > > protocol
> > > > > is
> > > > > > > > meant for in-memory communication, so different concerns apply
> > > > > > >
> > > > > > > > If that is how a particular implementation presents it, that
> > > is a
> > > > > > > > weakness of the implementation, not the format. The primary use
> > > case
> > > > > I
> > > > > > > was focused on when working on the initial format was
> > communication
> > > > > > within
> > > > > > > the same process. It seems like this is being used as a basis for
> > > the
> > > > > > > introduction of new things when the premise is inconsistent with
> > > the
> > > > > > > intention of the creation. The specific reason we used
> > flatbuffers
> > > in
> > > > > the
> > > > > > > project was to collapse the separation of in-process and
> > > out-of-process
> > > > > > > communication. It means the same thing it does with the Arrow
> > data
> > > > > > itself:
> > > > > > > that a consumer doesn't have to use a particular library to
> > > interact
> > > > > with
> > > > > > > and use the data.
> > > > > > >
> > > > > > > It seems like there are two ideas here:
> > > > > > >
> > > > > > > 1) How do we make it easier for people to use Arrow?
> > > > > > > 2) Should we implement a new in memory representation of Arrow
> > > that is
> > > > > > > language specific.
> > > > > > >
> > > > > > > I'm entirely in support of number one. If for a particular type
> > of
> > > > > > domain,
> > > > > > > people want an easier way to interact with Arrow, let's make a
> > new
> > > > > > library
> > > > > > > > that helps with that. In each of our current libraries, we do
> > many
> > > > > things
> > > > > > > to make it easier to work with Arrow. None of those require a
> > > change to
> > > > > > the
> > > > > > > core format or are formalized as a new in-memory standard. The
> > > > > in-memory
> > > > > > > representation of rust or javascript or java objects are
> > > implementation
> > > > > > > details.
> > > > > > >
> > > > > > > I'm against number two as it creates a fragmentation problem.
> > > Arrow is
> > > > > > > about having a single canonical format for memory for both
> > > metadata and
> > > > > > > data. Having multiple in-memory formats (especially when some are
> > > not
> > > > > > > language independent) is counter to the goals of the project.
> > > > > >
> > > > > > I don't think anyone is proposing anything that would cause
> > > > > fragmentation.
> > > > > >
> > > > > > A central question is whether it is useful to define a reusable C
> > ABI
> > > > > > for the Arrow columnar format, and if there is sufficient
> > interest, a
> > > > > > tiny C implementation of the IPC protocol (which uses the
> > Flatbuffers
> > > > > > message) that assembles and disassembles the data structures
> > defined
> > > > > > in the C ABI.
> > > > > >
> > > > > > We could separately create a tiny implementation of the Arrow IPC
> > > > > > protocol using FlatCC that could be dropped into applications
> > > > > > requiring only a C compiler and nothing else.
> > > > > >
> > > > > >
> > > > > > >
> > > > > > > Two other, separate comments:
> > > > > > > 1) I don't understand the idea that we need to change the way
> > Arrow
> > > > > > > fundamentally works so that people can avoid using a dependency.
> > > If the
> > > > > > > dependency is small, open source and easy to build, people can
> > > fork it
> > > > > > and
> > > > > > > include directly if they want to. Let's not violate project
> > > principles
> > > > > > > because DuckDB has a religious perspective on dependencies. If
> > the
> > > > > > problem
> > > > > > > is people have to swallow too large of a pill to do basic things
> > > with
> > > > > > Arrow
> > > > > > > in C, let's focus on fixing that (to our definition of ease, not
> > > > > someone
> > > > > > > > else's). If FlatCC solves some of those things, great. If we need to
> > > > > build a
> > > > > > > baby integration library that is more C centric, great. Neither
> > of
> > > > > those
> > > > > > > things require implementing something at the format level.
> > > > > > >
> > > > > > > 2) It seems like we should discuss the data structure problem
> > > > > separately
> > > > > > > from the reference management concern.
> > > > > > >
> > > > > > >
> > > > > > > On Tue, Oct 1, 2019 at 5:42 AM Wes McKinney <wesmckinn@gmail.com
> > >
> > > > > wrote:
> > > > > > >
> > > > > > > > hi Antoine,
> > > > > > > >
> > > > > > > > On Tue, Oct 1, 2019 at 4:29 AM Antoine Pitrou <
> > > antoine@python.org>
> > > > > > wrote:
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Le 01/10/2019 à 00:39, Wes McKinney a écrit :
> > > > > > > > > > A couple things:
> > > > > > > > > >
> > > > > > > > > > * I think a C protocol / FFI for Arrow array/vectors would
> > be
> > > > > > better
> > > > > > > > > > to have the same "shape" as an assembled array. Note that
> > > the C
> > > > > > > > > > structs here have very nearly the same "shape" as the data
> > > > > > structure
> > > > > > > > > > representing a C++ Array object [1]. The disassembly and
> > > > > reassembly
> > > > > > > > > > here is substantially simpler than the IPC protocol. A
> > > recursive
> > > > > > > > > > structure in Flatbuffers would make RecordBatch messages
> > much
> > > > > > larger,
> > > > > > > > > > so the flattened / disassembled representation we use for
> > > > > > serialized
> > > > > > > > > > record batches is the correct one
> > > > > > > > >
> > > > > > > > > I'm not sure I agree:
> > > > > > > > >
> > > > > > > > > - indeed, it's not a coincidence that the ArrowArray struct
> > > looks
> > > > > > quite
> > > > > > > > > closely like the C++ ArrayData object :-)  We have good
> > > experience
> > > > > > with
> > > > > > > > > that abstraction and it has proven to work quite well
> > > > > > > > >
> > > > > > > > > - the IPC format is meant for serialization while the C data
> > > > > > protocol is
> > > > > > > > > > meant for in-memory communication, so different concerns
> > apply
> > > > > > > > >
> > > > > > > > > - the fact that this makes the layout slightly larger doesn't
> > > seem
> > > > > > > > > important at all; we're not talking about transferring data
> > > over
> > > > > the
> > > > > > wire
> > > > > > > > >
> > > > > > > > > There's also another argument for having a recursive struct:
> > it
> > > > > > > > > simplifies how the data type is represented, since we can
> > > encode
> > > > > each
> > > > > > > > > child type individually instead of encoding it in the
> > parent's
> > > > > format
> > > > > > > > > string (same applies for metadata and individual flags).
> > > > > > > > >
> > > > > > > >
> > > > > > > > I was saying something different here. I was making an argument
> > > about
> > > > > > > > why we use the flattened array-of-structs in the IPC protocol.
> > > One
> > > > > > > > reason is that it's a more compact representation. That is not
> > > very
> > > > > > > > important here because this protocol is only for *in-process*
> > > (for
> > > > > > > > languages that have a C FFI facility) rather than
> > *inter-process*
> > > > > > > > communication.
> > > > > > > >
> > > > > > > > I agree also that the type encoding is simple, here, too, since
> > > we
> > > > > > > > aren't having to split the schema and record batch between
> > > different
> > > > > > > > serialized messages. There is some potential waste with having
> > to
> > > > > > > > populate the type fields multiple times when communicating a
> > > sequence
> > > > > > > > of "chunks" from the same logical dataset.
> > > > > > > >
> > > > > > > > > > * The "formal" C protocol having the "assembled" shape
> > means
> > > that
> > > > > > many
> > > > > > > > > > minimal Arrow users won't have to implement any separate
> > data
> > > > > > > > > > structures. They can just use the C struct directly or a
> > > slightly
> > > > > > > > > > wrapped version thereof with some convenience functions.
> > > > > > > > >
> > > > > > > > > Yes, but the same applies to the current proposal.
> > > > > > > > >
> > > > > > > > > > * I think that requiring building a Flatbuffer for minimal
> > > use
> > > > > > cases
> > > > > > > > > > (e.g. communicating simple record batches with primitive
> > > types)
> > > > > > passes
> > > > > > > > > > on implementation burden to minimal users.
> > > > > > > > >
> > > > > > > > > It certainly does.
> > > > > > > > >
> > > > > > > > > > I think the mantra of the C protocol should be the
> > following:
> > > > > > > > > >
> > > > > > > > > > * Users of the protocol have to write little to no code to
> > > use
> > > > > it.
> > > > > > For
> > > > > > > > > > example, populating an INT32 array should require only a
> > few
> > > > > lines
> > > > > > of
> > > > > > > > > > code
> > > > > > > > >
> > > > > > > > > Agreed.  As a sidenote, the spec should have an example of
> > > doing
> > > > > > this in
> > > > > > > > > raw C.
> > > > > > > > >
> > > > > > > > > Regards
> > > > > > > > >
> > > > > > > > > Antoine.
> > > > > > > >
> > > > > >
> > > > >
> > >
> >
>
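
The quoted exchange above asks that populating an INT32 array take only a few lines of code, and that the spec show how to do it in raw C. A minimal sketch of what that could look like follows. It assumes a hypothetical `ExampleArray` struct: the field names, the "i" format string, and the two-buffer layout are illustrative assumptions, not the struct from the PR under discussion.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical struct, for illustration only (not the layout from the PR). */
struct ExampleArray {
  const char* format;    /* type string; "i" meaning int32 is an assumption */
  int64_t length;        /* number of elements */
  int64_t null_count;    /* number of null elements */
  int64_t n_buffers;     /* number of entries in `buffers` */
  const void** buffers;  /* [0] validity bitmap (NULL: no nulls), [1] values */
  void (*release)(struct ExampleArray*); /* frees what the producer owns */
};

/* Free the buffers allocated by example_make_int32 and mark as released. */
void example_release(struct ExampleArray* array) {
  if (array->buffers != NULL) {
    free((void*)array->buffers[1]);
    free((void*)array->buffers);
  }
  array->buffers = NULL;
  array->release = NULL;
}

/* Fill `out` with the int32 values 0..n-1 (n > 0). Returns 0 on success. */
int example_make_int32(int64_t n, struct ExampleArray* out) {
  int32_t* values = malloc((size_t)n * sizeof(int32_t));
  const void** buffers = malloc(2 * sizeof(*buffers));
  if (values == NULL || buffers == NULL) {
    free(values);
    free((void*)buffers);
    return -1;
  }
  for (int64_t i = 0; i < n; ++i) values[i] = (int32_t)i;
  buffers[0] = NULL;   /* no validity bitmap: every value is valid */
  buffers[1] = values;
  out->format = "i";
  out->length = n;
  out->null_count = 0;
  out->n_buffers = 2;
  out->buffers = buffers;
  out->release = example_release;
  return 0;
}
```

A consumer would read the values through `buffers[1]` and call `release` when done. Putting the release callback inside the struct is one possible way to handle the reference-management concern raised in the thread; it is shown here only as an assumption.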

Re: [DISCUSS] C-level in-process array protocol

Posted by Jacques Nadeau <ja...@apache.org>.

I am removing all my objections to this work.

I wish there were more feedback from additional community members. I
continue to be concerned about fragmentation. I don't agree with the
arguments here that we need to add a new API to make it easy for people to
*not* use the Arrow codebase. It seems like a punt on building useful
libraries within the project that will ultimately hurt the interoperability
story.

As a side note, it seems like much of this is about people's distaste for
flatbuffers. I know I regret using it. If we had a chance to do it over
again, I would have chosen to use protobuf for everything except the data
header, where I would hand-write the encoding (since it is so simple
anyway). If it is such a problem that people are contorting to work around
it, maybe we should address that? Just a thought.

Thanks for the discourse and patience.

On Wed, Oct 2, 2019 at 10:12 PM Micah Kornfield <em...@gmail.com>
wrote:

> Hi Wes,
> I agree that for third parties, "A" (Field data structures) is the most useful.
>
> At least in my mind the discussion was for both first and third-parties.  I
> was trying to point out that "A" is less necessary as a first step for
> first-party integrations and could potentially require more effort if we
> already have the code that does "B" (field reassembly).
>
> Thanks,
> Micah
>
> On Wed, Oct 2, 2019 at 10:28 PM Wes McKinney <we...@gmail.com> wrote:
>
> > On Wed, Oct 2, 2019 at 11:05 PM Micah Kornfield <em...@gmail.com>
> > wrote:
> > >
> > > I've tried to summarize my understanding of the debate so far and give
> > some
> > > initial thoughts. I think there are two potentially different sets of
> > users
> > > that we are targeting with a stable C API/ABI: ourselves and external
> > > parties.
> > >
> > > 1.  Different language implementations within the Arrow project that
> want
> > > to call into each other's code.  We still don't have a great story
> around
> > > this in terms of reusable libraries and questions like [1] are a
> > motivating
> > > example of making something better in this context.
> > > 2.  third-parties wishing to support/integrate with Arrow.  Some
> > > conjectures about these users:
> > >   - Users in this group are NOT necessarily familiar with existing
> > > technologies Arrow uses (i.e. flatbuffers)
> > >   - The stability of the API is the primary concern (consumers don't
> want
> > > to change when a new version of the library ships)
> > >   - An important secondary concern is additional libraries that need to
> > be
> > > integrated in addition to the API
> > >
> > > The main debate points seem to be:
> > >
> > > 1.  Vector/Array oriented API vs existing Record Batch.  Will an
> > additional
> > > column oriented API become too much of a maintenance headache/cause
> > > fragmentation?
> > >
> > >  - In my mind the question here is which set of users we are
> > prioritizing.
> > > IMO the combination of flatbuffers and translation to/from RecordBatch
> > > format offers too much friction to make it easy for a third-party
> > > implementer to use. If we are prioritizing for our own internal
> > use-cases I
> > > think we should try out a RecordBatch+Flatbuffers based C-API. We
> already
> > > have all the necessary building blocks.
> > >
> >
> > If a C function passes you a string containing a RecordBatch
> > Flatbuffers message, what happens next? This message has to be
> > reassembled into a recursive data structure before you can "do"
> > anything with it. Are we expecting every third party project to
> > implement:
> >
> > A. Data structures appropriate to represent a logical "field" in a
> > record batch (which have to be recursive to account for nested types'
> > children)
> > B. The logic to convert from the flattened Flatbuffers representation
> > to some implementation of A
> >
> > I'm arguing that we should provide both to third parties. To build B,
> > you need A. Some consumers will only use A. This discussion is
> > essentially about developing an ultraminimalist "drop-in" C
> > implementation of A.
> >
> > > 2.  How onerous is the dependency on flat-buffers both from a learning
> > > curve perspective and as dependency for third-party integrators?
> > > - Flatbuffers aren't entirely straight-forward and I think if we do
> move
> > > forward with an API based on Column/Array we should consider
> alternatives
> > > as long as the necessary parsing code can be done in a small amount of
> > code
> > > (I'm personally against JSON for this, but can see the arguments for
> it).
> > >
> > > 3.  Do all existing library implementations need to support both
> > > Column/Array a ABI?  How will compliance be checked for the new
> API/ABI?
> > >
> > > - I'm still thinking this through.
> > >
> > > [1]
> > >
> >
> https://lists.apache.org/thread.html/18244b294d0b9bd568b5cfd1b1ac2b6a25088383a08202cc7a8a3563@%3Cuser.arrow.apache.org%3E
> > >
> > > On Wed, Oct 2, 2019 at 6:46 PM Jacques Nadeau <ja...@apache.org>
> > wrote:
> > >
> > > > I'd like to hear more opinions from others on this topic. This
> > conversation
> > > > seems mostly dominated by comments from myself, Wes and Antoine.
> > > >
> > > > I think it is reasonable to argue that keeping any ABI (or
> > header/struct
> > > > pattern) as narrow as possible would allow us to minimize overlap
> with
> > the
> > > > existing in-memory specification. In Arrow's case, this could be as
> > simple
> > > > as a single memory pointer for schema (backed by flatbuffers) and a
> > single
> > > > memory location for data (that references the record batch header,
> > which in
> > > > turn provides pointers into the actual arrow data). Extensions would
> > need
> > > > to be added for reference management as done here but I continue to
> > think
> > > > we should defer discussion of that until the base data structures are
> > > > resolved. I see the comments here as arguing for a much broader ABI,
> in
> > > > part to support having people build "Arrow" components that
> > interconnect
> > > > using this new interface. I understand the desire to expand the ABI
> to
> > be
> > > > driven by needs to reduce dependencies and ease usability.
> > > >
> > > > The representation within the related patch is being presented as a
> > way for
> > > > applications to share Arrow data but is not easily accessible to all
> > > > languages. I want to avoid a situation where someone says "I produced
> > an
> > > > Arrow API" when what they've really done is created a C interface
> which
> > > > only a small subset of languages can actually leverage. For example,
> > every
> > > > language now knows how to parse the existing schema definition as
> > rendered
> > > > in flatbuf. In order to interact with something that implements this
> > new
> > > > pattern one would also be required to implement completely new schema
> > > > consumption code. In the proposal itself it suggests this (for
> example
> > > > enhancing the C++ library to consume structures produced this way).
> > > >
> > > > As I said, I really want to hear more opinions. Running this past
> > various
> > > > developers I know, many have echoed my concerns but that really
> doesn't
> > > > matter (and who knows how much of that is colored by my presentation
> > of the
> > > > issue). What do people here think? If someone builds an "Arrow"
> library
> > > > that implements this set of structures, how does one use it in Node?
> In
> > > > Java? Does it drive creation of a secondary set of interfaces in each
> > of
> > > > those languages to work with this kind of pattern? (For example, in a
> > JVM
> > > > view of the world, working with a plain struct in java rather than a
> > set of
> > > > memory pointers against our existing IPC formats would be quite
> > painful and
> > > > we'd definitely need to create some glue code for users. I worry the
> > same
> > > > pattern would occur in many other languages.)
> > > >
> > > > To respond directly to some of Wes's most recent comments from the
> > email
> > > > below. I struggle to map your description of the situation to the
> rest
> > of
> > > > the thread and the proposed patch.  For example, you say that a
> > non-goal is
> > > > "creating a new canonical way to serialize metadata" but the patch
> > > > proposes a concrete string based encoding system to describe data
> > types.
> > > > Aren't those things in conflict?
> > > >
> > > > I'll also think more on this and challenge my own perspective. This
> > isn't
> > > > where my focus is so my comments aren't as developed/thoughtful as
> I'd
> > > > like.
> > > >
> > > >
> > > > On Tue, Oct 1, 2019 at 7:33 PM Wes McKinney <we...@gmail.com>
> > wrote:
> > > >
> > > > > hi Jacques,
> > > > >
> > > > > I think we've veered off course a bit and maybe we could reframe
> the
> > > > > discussion.
> > > > >
> > > > > Goals
> > > > > * A "drop-in" header-only C file that projects can use as a
> > > > > programming interface either internally only or to expose in-memory
> > > > > data structures between C functions at call sites. Ideally little
> to
> > > > > no disassembly/reassembly should be required on either "side" of
> the
> > > > > call site.
> > > > > * Simplifying adoption of Arrow for C programmers, or languages
> based
> > > > > around C FFI
> > > > >
> > > > > Non-goals
> > > > > * Expanding the columnar format or creating an alternative
> canonical
> > > > > in-memory representation
> > > > > * Creating a new canonical way to serialize metadata
> > > > >
> > > > > Note that this use case has been on my mind for more than 2 years:
> > > > > https://issues.apache.org/jira/browse/ARROW-1058
> > > > >
> > > > > I think there are a couple of potentially misleading things at play
> > here
> > > > >
> > > > > 1. The use of the word "protocol". In C, a struct has a
> well-defined
> > > > > binary layout, so a C API is also an ABI. Using C structs to
> > > > > communicate data can be considered to be a protocol, but it means
> > > > > something different in the context of the "Arrow protocol". I think
> > we
> > > > > need to call this a "C API"
> > > > >
> > > > > 2. The documentation for this in Antoine's PR is in the format/
> > > > > directory. It would probably be better to have a "C API" section in
> > > > > the documentation.
> > > > >
> > > > > The header file under discussion and the documentation about it is
> > > > > best considered as a "library".
> > > > >
> > > > > It might be useful at some point to create a C99 implementation of
> > the
> > > > > IPC protocol as well using FlatCC with the goal of having a
> complete
> > > > > implementation of the columnar format in C with minimal binary
> > > > > footprint. This is analogous to the NanoPB project which is an
> > > > > implementation of Protocol Buffers with small code size
> > > > >
> > > > > https://github.com/nanopb/nanopb
> > > > >
> > > > > Let me know if this makes more sense.
> > > > >
> > > > > I think it's important to communicate clearly about this primarily
> > for
> > > > > the benefit of the outside world which can confuse easily as we
> have
> > > > > observed over the last few years =)
> > > > >
> > > > > Wes
> > > > >
> > > > > On Tue, Oct 1, 2019 at 2:55 PM Jacques Nadeau <ja...@apache.org>
> > > > wrote:
> > > > > >
> > > > > > I disagree with this statement:
> > > > > >
> > > > > > - the IPC format is meant for serialization while the C data
> > protocol
> > > > is
> > > > > > meant for in-memory communication, so different concerns apply
> > > > > >
> > > > > > If that is how a particular implementation presents it, that
> > is a
> > > > > > weakness of the implementation, not the format. The primary use
> > case
> > > > I
> > > > > > was focused on when working on the initial format was
> communication
> > > > > within
> > > > > > the same process. It seems like this is being used as a basis for
> > the
> > > > > > introduction of new things when the premise is inconsistent with
> > the
> > > > > > intention of the creation. The specific reason we used
> flatbuffers
> > in
> > > > the
> > > > > > project was to collapse the separation of in-process and
> > out-of-process
> > > > > > communication. It means the same thing it does with the Arrow
> data
> > > > > itself:
> > > > > > that a consumer doesn't have to use a particular library to
> > interact
> > > > with
> > > > > > and use the data.
> > > > > >
> > > > > > It seems like there are two ideas here:
> > > > > >
> > > > > > 1) How do we make it easier for people to use Arrow?
> > > > > > 2) Should we implement a new in memory representation of Arrow
> > that is
> > > > > > language specific.
> > > > > >
> > > > > > I'm entirely in support of number one. If for a particular type
> of
> > > > > domain,
> > > > > > people want an easier way to interact with Arrow, let's make a
> new
> > > > > library
> > > > > > that helps with that. In each of our current libraries, we do
> many
> > > > things
> > > > > > to make it easier to work with Arrow. None of those require a
> > change to
> > > > > the
> > > > > > core format or are formalized as a new in-memory standard. The
> > > > in-memory
> > > > > > representation of rust or javascript or java objects are
> > implementation
> > > > > > details.
> > > > > >
> > > > > > I'm against number two as it creates a fragmentation problem.
> > Arrow is
> > > > > > about having a single canonical format for memory for both
> > metadata and
> > > > > > data. Having multiple in-memory formats (especially when some are
> > not
> > > > > > language independent) is counter to the goals of the project.
> > > > >
> > > > > I don't think anyone is proposing anything that would cause
> > > > fragmentation.
> > > > >
> > > > > A central question is whether it is useful to define a reusable C
> ABI
> > > > > for the Arrow columnar format, and if there is sufficient
> interest, a
> > > > > tiny C implementation of the IPC protocol (which uses the
> Flatbuffers
> > > > > message) that assembles and disassembles the data structures
> defined
> > > > > in the C ABI.
> > > > >
> > > > > We could separately create a tiny implementation of the Arrow IPC
> > > > > protocol using FlatCC that could be dropped into applications
> > > > > requiring only a C compiler and nothing else.
> > > > >
> > > > >
> > > > > >
> > > > > > Two other, separate comments:
> > > > > > 1) I don't understand the idea that we need to change the way
> Arrow
> > > > > > fundamentally works so that people can avoid using a dependency.
> > If the
> > > > > > dependency is small, open source and easy to build, people can
> > fork it
> > > > > and
> > > > > > include directly if they want to. Let's not violate project
> > principles
> > > > > > because DuckDB has a religious perspective on dependencies. If
> the
> > > > > problem
> > > > > > is people have to swallow too large of a pill to do basic things
> > with
> > > > > Arrow
> > > > > > in C, let's focus on fixing that (to our definition of ease, not
> > > > someone
> > > > > > else's). If FlatCC solves some of those things, great. If we need to
> > > > build a
> > > > > > baby integration library that is more C centric, great. Neither
> of
> > > > those
> > > > > > things require implementing something at the format level.
> > > > > >
> > > > > > 2) It seems like we should discuss the data structure problem
> > > > separately
> > > > > > from the reference management concern.
> > > > > >
> > > > > >
> > > > > > On Tue, Oct 1, 2019 at 5:42 AM Wes McKinney <wesmckinn@gmail.com
> >
> > > > wrote:
> > > > > >
> > > > > > > hi Antoine,
> > > > > > >
> > > > > > > On Tue, Oct 1, 2019 at 4:29 AM Antoine Pitrou <
> > antoine@python.org>
> > > > > wrote:
> > > > > > > >
> > > > > > > >
> > > > > > > > Le 01/10/2019 à 00:39, Wes McKinney a écrit :
> > > > > > > > > A couple things:
> > > > > > > > >
> > > > > > > > > * I think a C protocol / FFI for Arrow array/vectors would
> be
> > > > > better
> > > > > > > > > to have the same "shape" as an assembled array. Note that
> > the C
> > > > > > > > > structs here have very nearly the same "shape" as the data
> > > > > structure
> > > > > > > > > representing a C++ Array object [1]. The disassembly and
> > > > reassembly
> > > > > > > > > here is substantially simpler than the IPC protocol. A
> > recursive
> > > > > > > > > structure in Flatbuffers would make RecordBatch messages
> much
> > > > > larger,
> > > > > > > > > so the flattened / disassembled representation we use for
> > > > > serialized
> > > > > > > > > record batches is the correct one
> > > > > > > >
> > > > > > > > I'm not sure I agree:
> > > > > > > >
> > > > > > > > - indeed, it's not a coincidence that the ArrowArray struct
> > looks
> > > > > quite
> > > > > > > > closely like the C++ ArrayData object :-)  We have good
> > experience
> > > > > with
> > > > > > > > that abstraction and it has proven to work quite well
> > > > > > > >
> > > > > > > > - the IPC format is meant for serialization while the C data
> > > > > protocol is
> > > > > > > > meant for in-memory communication, so different concerns
> apply
> > > > > > > >
> > > > > > > > - the fact that this makes the layout slightly larger doesn't
> > seem
> > > > > > > > important at all; we're not talking about transferring data
> > over
> > > > the
> > > > > wire
> > > > > > > >
> > > > > > > > There's also another argument for having a recursive struct:
> it
> > > > > > > > simplifies how the data type is represented, since we can
> > encode
> > > > each
> > > > > > > > child type individually instead of encoding it in the
> parent's
> > > > format
> > > > > > > > string (same applies for metadata and individual flags).
> > > > > > > >
> > > > > > >
> > > > > > > I was saying something different here. I was making an argument
> > about
> > > > > > > why we use the flattened array-of-structs in the IPC protocol.
> > One
> > > > > > > reason is that it's a more compact representation. That is not
> > very
> > > > > > > important here because this protocol is only for *in-process*
> > (for
> > > > > > > languages that have a C FFI facility) rather than
> *inter-process*
> > > > > > > communication.
> > > > > > >
> > > > > > > I agree also that the type encoding is simple, here, too, since
> > we
> > > > > > > aren't having to split the schema and record batch between
> > different
> > > > > > > serialized messages. There is some potential waste with having
> to
> > > > > > > populate the type fields multiple times when communicating a
> > sequence
> > > > > > > of "chunks" from the same logical dataset.
> > > > > > >
> > > > > > > > > * The "formal" C protocol having the "assembled" shape
> means
> > that
> > > > > many
> > > > > > > > > minimal Arrow users won't have to implement any separate
> data
> > > > > > > > > structures. They can just use the C struct directly or a
> > slightly
> > > > > > > > > wrapped version thereof with some convenience functions.
> > > > > > > >
> > > > > > > > Yes, but the same applies to the current proposal.
> > > > > > > >
> > > > > > > > > * I think that requiring building a Flatbuffer for minimal
> > use
> > > > > cases
> > > > > > > > > (e.g. communicating simple record batches with primitive
> > types)
> > > > > passes
> > > > > > > > > on implementation burden to minimal users.
> > > > > > > >
> > > > > > > > It certainly does.
> > > > > > > >
> > > > > > > > > I think the mantra of the C protocol should be the
> following:
> > > > > > > > >
> > > > > > > > > * Users of the protocol have to write little to no code to
> > use
> > > > it.
> > > > > For
> > > > > > > > > example, populating an INT32 array should require only a
> few
> > > > lines
> > > > > of
> > > > > > > > > code
> > > > > > > >
> > > > > > > > Agreed.  As a sidenote, the spec should have an example of
> > doing
> > > > > this in
> > > > > > > > raw C.
> > > > > > > >
> > > > > > > > Regards
> > > > > > > >
> > > > > > > > Antoine.
> > > > > > >
> > > > >
> > > >
> >
>

Re: [DISCUSS] C-level in-process array protocol

Posted by Micah Kornfield <em...@gmail.com>.

Hi Wes,
I agree that for third parties, "A" (Field data structures) is the most useful.

At least in my mind the discussion was for both first and third-parties.  I
was trying to point out that "A" is less necessary as a first step for
first-party integrations and could potentially require more effort if we
already have the code that does "B" (field reassembly).

Thanks,
Micah
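
The "A" / "B" split referred to here can be sketched in C as follows: "A" is a recursive per-field structure, and "B" is the logic that rebuilds it from a flattened, depth-first list of nodes. All names (`FlatNode`, `Field`, `field_assemble`, `field_free`) are hypothetical. As a simplifying assumption, each flattened node carries its own child count; in the actual IPC format the tree shape comes from the schema rather than from the record batch nodes.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Flattened form: one entry per field, in depth-first (pre-order) order.
   The n_children member is a simplification made for this sketch. */
struct FlatNode {
  int64_t length;
  int64_t null_count;
  int32_t n_children;
};

/* "A": a recursive structure representing one logical field. */
struct Field {
  int64_t length;
  int64_t null_count;
  int32_t n_children;
  struct Field* children;  /* array of n_children fields, or NULL */
};

/* Release the children owned by `f` (but not `f` itself). */
void field_free(struct Field* f) {
  for (int32_t i = 0; i < f->n_children; ++i) field_free(&f->children[i]);
  free(f->children);
  f->children = NULL;
  f->n_children = 0;
}

/* "B": rebuild the recursive form from the flattened list.
   `*pos` is the index of the next unread node and is advanced as nodes
   are consumed. Returns 0 on success, -1 on truncated input or
   allocation failure; on failure, call field_free on the root. */
int field_assemble(const struct FlatNode* nodes, int64_t n_nodes,
                   int64_t* pos, struct Field* out) {
  out->n_children = 0;
  out->children = NULL;
  if (*pos >= n_nodes) return -1;
  const struct FlatNode* node = &nodes[(*pos)++];
  out->length = node->length;
  out->null_count = node->null_count;
  if (node->n_children > 0) {
    /* calloc zeroes the children, so field_free is safe on partial trees */
    out->children = calloc((size_t)node->n_children, sizeof(struct Field));
    if (out->children == NULL) return -1;
    out->n_children = node->n_children;
    for (int32_t i = 0; i < node->n_children; ++i) {
      if (field_assemble(nodes, n_nodes, pos, &out->children[i]) != 0) {
        return -1;
      }
    }
  }
  return 0;
}
```

A consumer that only needs "A" would use `struct Field` directly and never touch `field_assemble`; only a consumer that receives the flattened form needs "B" as well.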

On Wed, Oct 2, 2019 at 10:28 PM Wes McKinney <we...@gmail.com> wrote:

> On Wed, Oct 2, 2019 at 11:05 PM Micah Kornfield <em...@gmail.com>
> wrote:
> >
> > I've tried to summarize my understanding of the debate so far and give
> some
> > initial thoughts. I think there are two potentially different sets of
> users
> > that we are targeting with a stable C API/ABI: ourselves and external
> > parties.
> >
> > 1.  Different language implementations within the Arrow project that want
> > to call into each other's code.  We still don't have a great story around
> > this in terms of reusable libraries and questions like [1] are a
> motivating
> > example of making something better in this context.
> > 2.  third-parties wishing to support/integrate with Arrow.  Some
> > conjectures about these users:
> >   - Users in this group are NOT necessarily familiar with existing
> > technologies Arrow uses (i.e. flatbuffers)
> >   - The stability of the API is the primary concern (consumers don't want
> > to change when a new version of the library ships)
> >   - An important secondary concern is additional libraries that need to
> be
> > integrated in addition to the API
> >
> > The main debate points seem to be:
> >
> > 1.  Vector/Array oriented API vs existing Record Batch.  Will an
> additional
> > column oriented API become too much of a maintenance headache/cause
> > fragmentation?
> >
> >  - In my mind the question here is which set of users we are
> prioritizing.
> > IMO the combination of flatbuffers and translation to/from RecordBatch
> > format offers too much friction to make it easy for a third-party
> > implementer to use. If we are prioritizing for our own internal
> use-cases I
> > think we should try out a RecordBatch+Flatbuffers based C-API. We already
> > have all the necessary building blocks.
> >
>
> If a C function passes you a string containing a RecordBatch
> Flatbuffers message, what happens next? This message has to be
> reassembled into a recursive data structure before you can "do"
> anything with it. Are we expecting every third party project to
> implement:
>
> A. Data structures appropriate to represent a logical "field" in a
> record batch (which have to be recursive to account for nested types'
> children)
> B. The logic to convert from the flattened Flatbuffers representation
> to some implementation of A
>
> I'm arguing that we should provide both to third parties. To build B,
> you need A. Some consumers will only use A. This discussion is
> essentially about developing an ultraminimalist "drop-in" C
> implementation of A.
>
> > 2.  How onerous is the dependency on flat-buffers both from a learning
> > curve perspective and as dependency for third-party integrators?
> > - Flatbuffers aren't entirely straight-forward and I think if we do move
> > forward with an API based on Column/Array we should consider alternatives
> > as long as the necessary parsing code can be done in a small amount of
> code
> > (I'm personally against JSON for this, but can see the arguments for it).
> >
> > 3.  Do all existing library implementations need to support both
> > Column/Array a ABI?  How will compliance be checked for the new API/ABI?
> >
> > - I'm still thinking this through.
> >
> > [1]
> >
> https://lists.apache.org/thread.html/18244b294d0b9bd568b5cfd1b1ac2b6a25088383a08202cc7a8a3563@%3Cuser.arrow.apache.org%3E
> >
> > On Wed, Oct 2, 2019 at 6:46 PM Jacques Nadeau <ja...@apache.org>
> wrote:
> >
> > > I'd like to hear more opinions from others on this topic. This
> conversation
> > > seems mostly dominated by comments from myself, Wes and Antoine.
> > >
> > > I think it is reasonable to argue that keeping any ABI (or
> header/struct
> > > pattern) as narrow as possible would allow us to minimize overlap with
> the
> > > existing in-memory specification. In Arrow's case, this could be as
> simple
> > > as a single memory pointer for schema (backed by flatbuffers) and a
> single
> > > memory location for data (that references the record batch header,
> which in
> > > turn provides pointers into the actual arrow data). Extensions would
> need
> > > to be added for reference management as done here but I continue to
> think
> > > we should defer discussion of that until the base data structures are
> > > resolved. I see the comments here as arguing for a much broader ABI, in
> > > part to support having people build "Arrow" components that
> interconnect
> > > using this new interface. I understand the desire to expand the ABI to
> be
> > > driven by needs to reduce dependencies and ease usability.
> > >
> > > The representation within the related patch is being presented as a
> way for
> > > applications to share Arrow data but is not easily accessible to all
> > > languages. I want to avoid a situation where someone says "I produced
> an
> > > Arrow API" when what they've really done is created a C interface which
> > > only a small subset of languages can actually leverage. For example,
> every
> > > language now knows how to parse the existing schema definition as
> rendered
> > > in flatbuf. In order to interact with something that implements this
> new
> > > pattern one would also be required to implement completely new schema
> > > consumption code. In the proposal itself it suggests this (for example
> > > enhancing the C++ library to consume structures produced this way).
> > >
> > > As I said, I really want to hear more opinions. Running this past
> various
> > > developers I know, many have echoed my concerns but that really doesn't
> > > matter (and who knows how much of that is colored by my presentation
> of the
> > > issue). What do people here think? If someone builds an "Arrow" library
> > > that implements this set of structures, how does one use it in Node? In
> > > Java? Does it drive creation of a secondary set of interfaces in each
> of
> > > those languages to work with this kind of pattern? (For example, in a
> JVM
> > > view of the world, working with a plain struct in java rather than a
> set of
> > > memory pointers against our existing IPC formats would be quite
> painful and
> > > we'd definitely need to create some glue code for users. I worry the
> same
> > > pattern would occur in many other languages.)
> > >
> > > To respond directly to some of Wes's most recent comments from the
> email
> > > below. I struggle to map your description of the situation to the rest
> of
> > > the thread and the proposed patch.  For example, you say that a
> non-goal is
> > > "creating a new canonical way to serialize metadata" bute the patch
> > > proposes a concrete string based encoding system to describe data
> types.
> > > Aren't those things in conflict?
> > >
> > > I'll also think more on this and challenge my own perspective. This
> isn't
> > > where my focus is so my comments aren't as developed/thoughtful as I'd
> > > like.
> > >
> > >
> > > On Tue, Oct 1, 2019 at 7:33 PM Wes McKinney <we...@gmail.com>
> wrote:
> > >
> > > > hi Jacques,
> > > >
> > > > I think we've veered off course a bit and maybe we could reframe the
> > > > discussion.
> > > >
> > > > Goals
> > > > * A "drop-in" header-only C file that projects can use as a
> > > > programming interface either internally only or to expose in-memory
> > > > data structures between C functions at call sites. Ideally little to
> > > > no disassembly/reassembly should be required on either "side" of the
> > > > call site.
> > > > * Simplifying adoption of Arrow for C programmers, or languages based
> > > > around C FFI
> > > >
> > > > Non-goals
> > > > * Expanding the columnar format or creating an alternative canonical
> > > > in-memory representation
> > > > * Creating a new canonical way to serialize metadata
> > > >
> > > > Note that this use case has been on my mind for more than 2 years:
> > > > https://issues.apache.org/jira/browse/ARROW-1058
> > > >
> > > > I think there are a couple of potentially misleading things at play
> here
> > > >
> > > > 1. The use of the word "protocol". In C, a struct has a well-defined
> > > > binary layout, so a C API is also an ABI. Using C structs to
> > > > communicate data can be considered to be a protocol, but it means
> > > > something different in the context of the "Arrow protocol". I think
> we
> > > > need to call this a "C API"
> > > >
> > > > 2. The documentation for this in Antoine's PR is in the format/
> > > > directory. It would probably be better to have a "C API" section in
> > > > the documentation.
> > > >
> > > > The header file under discussion and the documentation about it is
> > > > best considered as a "library".
> > > >
> > > > It might be useful at some point to create a C99 implementation of
> the
> > > > IPC protocol as well using FlatCC with the goal of having a complete
> > > > implementation of the columnar format in C with minimal binary
> > > > footprint. This is analogous to the NanoPB project which is an
> > > > implementation of Protocol Buffers with small code size
> > > >
> > > > https://github.com/nanopb/nanopb
> > > >
> > > > Let me know if this makes more sense.
> > > >
> > > > I think it's important to communicate clearly about this primarily
> for
> > > > the benefit of the outside world which can confuse easily as we have
> > > > observed over the last few years =)
> > > >
> > > > Wes
> > > >
> > > > On Tue, Oct 1, 2019 at 2:55 PM Jacques Nadeau <ja...@apache.org>
> > > wrote:
> > > > >
> > > > > I disagree with this statement:
> > > > >
> > > > > - the IPC format is meant for serialization while the C data
> protocol
> > > is
> > > > > meants for in-memory communication, so different concerns apply
> > > > >
> > > > > If that is how the a particular implementation presents it, that
> is a
> > > > > weaknesses of the implementation, not the format. The primary use
> case
> > > I
> > > > > was focused on when working on the initial format was communication
> > > > within
> > > > > the same process. It seems like this is being used as a basis for
> the
> > > > > introduction of new things when the premise is inconsistent with
> the
> > > > > intention of the creation. The specific reason we used flatbuffers
> in
> > > the
> > > > > project was to collapse the separation of in-process and
> out-of-process
> > > > > communication. It means the same thing it does with the Arrow data
> > > > itself:
> > > > > that a consumer doesn't have to use a particular library to
> interact
> > > with
> > > > > and use the data.
> > > > >
> > > > > It seems like there are two ideas here:
> > > > >
> > > > > 1) How do we make it easier for people to use Arrow?
> > > > > 2) Should we implement a new in memory representation of Arrow
> that is
> > > > > language specific.
> > > > >
> > > > > I'm entirely in support of number one. If for a particular type of
> > > > domain,
> > > > > people want an easier way to interact with Arrow, let's make a new
> > > > library
> > > > > that helps with that. In easy of our current libraries, we do many
> > > things
> > > > > to make it easier to work with Arrow. None of those require a
> change to
> > > > the
> > > > > core format or are formalized as a new in-memory standard. The
> > > in-memory
> > > > > representation of rust or javascript or java objects are
> implementation
> > > > > details.
> > > > >
> > > > > I'm against number two as it creates a fragmentation problem.
> Arrow is
> > > > > about having a single canonical format for memory for both
> metadata and
> > > > > data. Having multiple in-memory formats (especially when some are
> not
> > > > > language independent) is counter to the goals of the project.
> > > >
> > > > I don't think anyone is proposing anything that would cause
> > > fragmentation.
> > > >
> > > > A central question is whether it is useful to define a reusable C ABI
> > > > for the Arrow columnar format, and if there is sufficient interest, a
> > > > tiny C implementation of the IPC protocol (which uses the Flatbuffers
> > > > message) that assembles and disassembles the data structures defined
> > > > in the C ABI.
> > > >
> > > > We could separately create a tiny implementation of the Arrow IPC
> > > > protocol using FlatCC that could be dropped into applications
> > > > requiring only a C compiler and nothing else.
> > > >
> > > >
> > > > >
> > > > > Two other, separate comments:
> > > > > 1) I don't understand the idea that we need to change the way Arrow
> > > > > fundamentally works so that people can avoid using a dependency.
> If the
> > > > > dependency is small, open source and easy to build, people can
> fork it
> > > > and
> > > > > include directly if they want to. Let's not violate project
> principles
> > > > > because DuckDB has a religious perspective on dependencies. If the
> > > > problem
> > > > > is people have to swallow too large of a pill to do basic things
> with
> > > > Arrow
> > > > > in C, let's focus on fixing that (to our definition of ease, not
> > > someone
> > > > > else's). If FlatCC solves some those things, great. If we need to
> > > build a
> > > > > baby integration library that is more C centric, great. Neither of
> > > those
> > > > > things require implementing something at the format level.
> > > > >
> > > > > 2) It seems like we should discuss the data structure problem
> > > separately
> > > > > from the reference management concern.
> > > > >
> > > > >
> > > > > On Tue, Oct 1, 2019 at 5:42 AM Wes McKinney <we...@gmail.com>
> > > wrote:
> > > > >
> > > > > > hi Antoine,
> > > > > >
> > > > > > On Tue, Oct 1, 2019 at 4:29 AM Antoine Pitrou <
> antoine@python.org>
> > > > wrote:
> > > > > > >
> > > > > > >
> > > > > > > Le 01/10/2019 à 00:39, Wes McKinney a écrit :
> > > > > > > > A couple things:
> > > > > > > >
> > > > > > > > * I think a C protocol / FFI for Arrow array/vectors would be
> > > > better
> > > > > > > > to have the same "shape" as an assembled array. Note that
> the C
> > > > > > > > structs here have very nearly the same "shape" as the data
> > > > structure
> > > > > > > > representing a C++ Array object [1]. The disassembly and
> > > reassembly
> > > > > > > > here is substantially simpler than the IPC protocol. A
> recursive
> > > > > > > > structure in Flatbuffers would make RecordBatch messages much
> > > > larger,
> > > > > > > > so the flattened / disassembled representation we use for
> > > > serialized
> > > > > > > > record batches is the correct one
> > > > > > >
> > > > > > > I'm not sure I agree:
> > > > > > >
> > > > > > > - indeed, it's not a coincidence that the ArrowArray struct
> looks
> > > > quite
> > > > > > > closely like the C++ ArrayData object :-)  We have good
> experience
> > > > with
> > > > > > > that abstraction and it has proven to work quite well
> > > > > > >
> > > > > > > - the IPC format is meant for serialization while the C data
> > > > protocol is
> > > > > > > meants for in-memory communication, so different concerns apply
> > > > > > >
> > > > > > > - the fact that this makes the layout slightly larger doesn't
> seem
> > > > > > > important at all; we're not talking about transferring data
> over
> > > the
> > > > wire
> > > > > > >
> > > > > > > There's also another argument for having a recursive struct: it
> > > > > > > simplifies how the data type is represented, since we can
> encode
> > > each
> > > > > > > child type individually instead of encoding it in the parent's
> > > format
> > > > > > > string (same applies for metadata and individual flags).
> > > > > > >
> > > > > >
> > > > > > I was saying something different here. I was making an argument
> about
> > > > > > why we use the flattened array-of-structs in the IPC protocol.
> One
> > > > > > reason is that it's a more compact representation. That is not
> very
> > > > > > important here because this protocol is only for *in-process*
> (for
> > > > > > languages that have a C FFI facility) rather than *inter-process*
> > > > > > communication.
> > > > > >
> > > > > > I agree also that the type encoding is simple, here, too, since
> we
> > > > > > aren't having to split the schema and record batch between
> different
> > > > > > serialized messages. There is some potential waste with having to
> > > > > > populate the type fields multiple times when communicating a
> sequence
> > > > > > of "chunks" from the same logical dataset.
> > > > > >
> > > > > > > > * The "formal" C protocol having the "assembled" shape means
> that
> > > > many
> > > > > > > > minimal Arrow users won't have to implement any separate data
> > > > > > > > structures. They can just use the C struct directly or a
> slightly
> > > > > > > > wrapped version thereof with some convenience functions.
> > > > > > >
> > > > > > > Yes, but the same applies to the current proposal.
> > > > > > >
> > > > > > > > * I think that requiring building a Flatbuffer for minimal
> use
> > > > cases
> > > > > > > > (e.g. communicating simple record batches with primitive
> types)
> > > > passes
> > > > > > > > on implementation burden to minimal users.
> > > > > > >
> > > > > > > It certainly does.
> > > > > > >
> > > > > > > > I think the mantra of the C protocol should be the following:
> > > > > > > >
> > > > > > > > * Users of the protocol have to write little to no code to
> use
> > > it.
> > > > For
> > > > > > > > example, populating an INT32 array should require only a few
> > > lines
> > > > of
> > > > > > > > code
> > > > > > >
> > > > > > > Agreed.  As a sidenote, the spec should have an example of
> doing
> > > > this in
> > > > > > > raw C.
> > > > > > >
> > > > > > > Regards
> > > > > > >
> > > > > > > Antoine.
> > > > > >
> > > >
> > >
>

Re: [DISCUSS] C-level in-process array protocol

Posted by Wes McKinney <we...@gmail.com>.

On Wed, Oct 2, 2019 at 11:05 PM Micah Kornfield <em...@gmail.com> wrote:
>
> I've tried to summarize my understanding of the debate so far and give some
> initial thoughts. I think there are two potentially different sets of users
> that we are targeting with a stable C API/ABI: ourselves and external
> parties.
>
> 1.  Different language implementations within the Arrow project that want
> to call into each other's code.  We still don't have a great story around
> this in terms of reusable libraries and questions like [1] are motivating
> examples of making something better in this context.
> 2.  Third parties wishing to support/integrate with Arrow.  Some
> conjectures about these users:
>   - Users in this group are NOT necessarily familiar with existing
> technologies Arrow uses (e.g. flatbuffers)
>   - The stability of the API is the primary concern (consumers don't want
> to change when a new version of the library ships)
>   - An important secondary concern is additional libraries that need to be
> integrated in addition to the API
>
> The main debate points seem to be:
>
> 1.  Vector/Array oriented API vs existing Record Batch.  Will an additional
> column oriented API become too much of a maintenance headache/cause
> fragmentation?
>
>  - In my mind the question here is which set of users we are prioritizing.
> IMO the combination of flatbuffers and translation to/from RecordBatch
> format offers too much friction to make it easy for a third-party
> implementer to use. If we are prioritizing for our own internal use-cases I
> think we should try out a RecordBatch+Flatbuffers based C-API. We already
> have all the necessary building blocks.
>

If a C function passes you a string containing a RecordBatch
Flatbuffers message, what happens next? This message has to be
reassembled into a recursive data structure before you can "do"
anything with it. Are we expecting every third party project to
implement:

A. Data structures appropriate to represent a logical "field" in a
record batch (which have to be recursive to account for nested types'
children)
B. The logic to convert from the flattened Flatbuffers representation
to some implementation of A

I'm arguing that we should provide both to third parties. To build B,
you need A. Some consumers will only use A. This discussion is
essentially about developing an ultraminimalist "drop-in" C
implementation of A.
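
To make the A/B split concrete, here is a rough sketch in C. Everything
in it is hypothetical: the struct names (field_array, flat_node), the
function names, and the idea of carrying n_children on each flattened
node (in the real IPC metadata the child count would come from the
schema) are assumptions for illustration, not the structs in the patch.
"A" is the recursive struct; "B" is the function that rebuilds it from a
depth-first list of nodes.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical "A": one assembled field. It is recursive so that nested
   types can hold their children. Buffers are left out for brevity. */
struct field_array {
  int64_t length;
  int64_t null_count;
  int64_t n_children;
  struct field_array **children;
};

/* Hypothetical flattened node, loosely modeled on a depth-first list of
   field nodes. Storing n_children here is a simplifying assumption. */
struct flat_node {
  int64_t length;
  int64_t null_count;
  int64_t n_children;
};

/* Free a field and, recursively, all of its children. */
static void release_field(struct field_array *f) {
  if (f == NULL) return;
  if (f->children != NULL) {
    for (int64_t i = 0; i < f->n_children; i++) release_field(f->children[i]);
    free(f->children);
  }
  free(f);
}

/* Hypothetical "B": rebuild the tree from a pre-order node list.
   *pos is advanced past every node consumed. Returns NULL if the list
   is too short or an allocation fails. */
static struct field_array *reassemble(const struct flat_node *nodes,
                                      size_t n_nodes, size_t *pos) {
  if (*pos >= n_nodes) return NULL;
  const struct flat_node *n = &nodes[(*pos)++];
  struct field_array *f = calloc(1, sizeof *f);
  if (f == NULL) return NULL;
  f->length = n->length;
  f->null_count = n->null_count;
  f->n_children = n->n_children;
  if (n->n_children > 0) {
    f->children = calloc((size_t)n->n_children, sizeof *f->children);
    if (f->children == NULL) { free(f); return NULL; }
    for (int64_t i = 0; i < n->n_children; i++) {
      f->children[i] = reassemble(nodes, n_nodes, pos);
      if (f->children[i] == NULL) { release_field(f); return NULL; }
    }
  }
  return f;
}
```

A consumer that only receives assembled data needs just the struct (A).
Only a consumer that starts from a flattened message also needs
something like reassemble() (B), and B cannot be written without A.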

> 2.  How onerous is the dependency on flat-buffers both from a learning
> curve perspective and as dependency for third-party integrators?
> - Flatbuffers aren't entirely straight-forward and I think if we do move
> forward with an API based on Column/Array we should consider alternatives
> as long as the necessary parsing code can be done in a small amount of code
> (I'm personally against JSON for this, but can see the arguments for it).
>
> 3.  Do all existing library implementations need to support both
> Column/Array a ABI?  How will compliance be checked for the new API/ABI?
>
> - I'm still thinking this through.
>
> [1]
> https://lists.apache.org/thread.html/18244b294d0b9bd568b5cfd1b1ac2b6a25088383a08202cc7a8a3563@%3Cuser.arrow.apache.org%3E
>
> On Wed, Oct 2, 2019 at 6:46 PM Jacques Nadeau <ja...@apache.org> wrote:
>
> > I'd like to hear more opinions from others on this topic. This conversation
> > seems mostly dominated by comments from myself, Wes and Antoine.
> >
> > I think it is reasonable to argue that keeping any ABI (or header/struct
> > pattern) as narrow as possible would allow us to minimize overlap with the
> > existing in-memory specification. In Arrow's case, this could be as simple
> > as a single memory pointer for schema (backed by flatbuffers) and a single
> > memory location for data (that references the record batch header, which in
> > turn provides pointers into the actual arrow data). Extensions would need
> > to be added for reference management as done here but I continue to think
> > we should defer discussion of that until the base data structures are
> > resolved. I see the comments here as arguing for a much broader ABI, in
> > part to support having people build "Arrow" components that interconnect
> > using this new interface. I understand the desire to expand the ABI to be
> > driven by needs to reduce dependencies and ease usability.
> >
> > The representation within the related patch is being presented as a way for
> > applications to share Arrow data but is not easily accessible to all
> > languages. I want to avoid a situation where someone says "I produced an
> > Arrow API" when what they've really done is created a C interface which
> > only a small subset of languages can actually leverage. For example, every
> > language now knows how to parse the existing schema definition as rendered
> > in flatbuf. In order to interact with something that implements this new
> > pattern one would also be required to implement completely new schema
> > consumption code. In the proposal itself it suggests this (for example
> > enhancing the C++ library to consume structures produced this way).
> >
> > As I said, I really want to hear more opinions. Running this past various
> > developers I know, many have echoed my concerns but that really doesn't
> > matter (and who knows how much of that is colored by my presentation of the
> > issue). What do people here think? If someone builds an "Arrow" library
> > that implements this set of structures, how does one use it in Node? In
> > Java? Does it drive creation of a secondary set of interfaces in each of
> > those languages to work with this kind of pattern? (For example, in a JVM
> > view of the world, working with a plain struct in java rather than a set of
> > memory pointers against our existing IPC formats would be quite painful and
> > we'd definitely need to create some glue code for users. I worry the same
> > pattern would occur in many other languages.)
> >
> > To respond directly to some of Wes's most recent comments from the email
> > below. I struggle to map your description of the situation to the rest of
> > the thread and the proposed patch.  For example, you say that a non-goal is
> > "creating a new canonical way to serialize metadata" but the patch
> > proposes a concrete string based encoding system to describe data types.
> > Aren't those things in conflict?
> >
> > I'll also think more on this and challenge my own perspective. This isn't
> > where my focus is so my comments aren't as developed/thoughtful as I'd
> > like.
> >
> >
> > On Tue, Oct 1, 2019 at 7:33 PM Wes McKinney <we...@gmail.com> wrote:
> >
> > > hi Jacques,
> > >
> > > I think we've veered off course a bit and maybe we could reframe the
> > > discussion.
> > >
> > > Goals
> > > * A "drop-in" header-only C file that projects can use as a
> > > programming interface either internally only or to expose in-memory
> > > data structures between C functions at call sites. Ideally little to
> > > no disassembly/reassembly should be required on either "side" of the
> > > call site.
> > > * Simplifying adoption of Arrow for C programmers, or languages based
> > > around C FFI
> > >
> > > Non-goals
> > > * Expanding the columnar format or creating an alternative canonical
> > > in-memory representation
> > > * Creating a new canonical way to serialize metadata
> > >
> > > Note that this use case has been on my mind for more than 2 years:
> > > https://issues.apache.org/jira/browse/ARROW-1058
> > >
> > > I think there are a couple of potentially misleading things at play here
> > >
> > > 1. The use of the word "protocol". In C, a struct has a well-defined
> > > binary layout, so a C API is also an ABI. Using C structs to
> > > communicate data can be considered to be a protocol, but it means
> > > something different in the context of the "Arrow protocol". I think we
> > > need to call this a "C API"
> > >
> > > 2. The documentation for this in Antoine's PR is in the format/
> > > directory. It would probably be better to have a "C API" section in
> > > the documentation.
> > >
> > > The header file under discussion and the documentation about it is
> > > best considered as a "library".
> > >
> > > It might be useful at some point to create a C99 implementation of the
> > > IPC protocol as well using FlatCC with the goal of having a complete
> > > implementation of the columnar format in C with minimal binary
> > > footprint. This is analogous to the NanoPB project which is an
> > > implementation of Protocol Buffers with small code size
> > >
> > > https://github.com/nanopb/nanopb
> > >
> > > Let me know if this makes more sense.
> > >
> > > I think it's important to communicate clearly about this primarily for
> > > the benefit of the outside world which can confuse easily as we have
> > > observed over the last few years =)
> > >
> > > Wes
> > >
> > > On Tue, Oct 1, 2019 at 2:55 PM Jacques Nadeau <ja...@apache.org>
> > wrote:
> > > >
> > > > I disagree with this statement:
> > > >
> > > > - the IPC format is meant for serialization while the C data protocol
> > is
> > > > meant for in-memory communication, so different concerns apply
> > > >
> > > > If that is how a particular implementation presents it, that is a
> > > > weakness of the implementation, not the format. The primary use case
> > I
> > > > was focused on when working on the initial format was communication
> > > within
> > > > the same process. It seems like this is being used as a basis for the
> > > > introduction of new things when the premise is inconsistent with the
> > > > intention of the creation. The specific reason we used flatbuffers in
> > the
> > > > project was to collapse the separation of in-process and out-of-process
> > > > communication. It means the same thing it does with the Arrow data
> > > itself:
> > > > that a consumer doesn't have to use a particular library to interact
> > with
> > > > and use the data.
> > > >
> > > > It seems like there are two ideas here:
> > > >
> > > > 1) How do we make it easier for people to use Arrow?
> > > > 2) Should we implement a new in memory representation of Arrow that is
> > > > language specific.
> > > >
> > > > I'm entirely in support of number one. If for a particular type of
> > > domain,
> > > > people want an easier way to interact with Arrow, let's make a new
> > > library
> > > > that helps with that. In each of our current libraries, we do many
> > things
> > > > to make it easier to work with Arrow. None of those require a change to
> > > the
> > > > core format or are formalized as a new in-memory standard. The
> > in-memory
> > > > representation of rust or javascript or java objects are implementation
> > > > details.
> > > >
> > > > I'm against number two as it creates a fragmentation problem. Arrow is
> > > > about having a single canonical format for memory for both metadata and
> > > > data. Having multiple in-memory formats (especially when some are not
> > > > language independent) is counter to the goals of the project.
> > >
> > > I don't think anyone is proposing anything that would cause
> > fragmentation.
> > >
> > > A central question is whether it is useful to define a reusable C ABI
> > > for the Arrow columnar format, and if there is sufficient interest, a
> > > tiny C implementation of the IPC protocol (which uses the Flatbuffers
> > > message) that assembles and disassembles the data structures defined
> > > in the C ABI.
> > >
> > > We could separately create a tiny implementation of the Arrow IPC
> > > protocol using FlatCC that could be dropped into applications
> > > requiring only a C compiler and nothing else.
> > >
> > >
> > > >
> > > > Two other, separate comments:
> > > > 1) I don't understand the idea that we need to change the way Arrow
> > > > fundamentally works so that people can avoid using a dependency. If the
> > > > dependency is small, open source and easy to build, people can fork it
> > > and
> > > > include directly if they want to. Let's not violate project principles
> > > > because DuckDB has a religious perspective on dependencies. If the
> > > problem
> > > > is people have to swallow too large of a pill to do basic things with
> > > Arrow
> > > > in C, let's focus on fixing that (to our definition of ease, not
> > someone
> > > > else's). If FlatCC solves some of those things, great. If we need to
> > build a
> > > > baby integration library that is more C centric, great. Neither of
> > those
> > > > things require implementing something at the format level.
> > > >
> > > > 2) It seems like we should discuss the data structure problem
> > separately
> > > > from the reference management concern.
> > > >
> > > >
> > > > On Tue, Oct 1, 2019 at 5:42 AM Wes McKinney <we...@gmail.com>
> > wrote:
> > > >
> > > > > hi Antoine,
> > > > >
> > > > > On Tue, Oct 1, 2019 at 4:29 AM Antoine Pitrou <an...@python.org>
> > > wrote:
> > > > > >
> > > > > >
> > > > > > Le 01/10/2019 à 00:39, Wes McKinney a écrit :
> > > > > > > A couple things:
> > > > > > >
> > > > > > > * I think a C protocol / FFI for Arrow array/vectors would be
> > > better
> > > > > > > to have the same "shape" as an assembled array. Note that the C
> > > > > > > structs here have very nearly the same "shape" as the data
> > > structure
> > > > > > > representing a C++ Array object [1]. The disassembly and
> > reassembly
> > > > > > > here is substantially simpler than the IPC protocol. A recursive
> > > > > > > structure in Flatbuffers would make RecordBatch messages much
> > > larger,
> > > > > > > so the flattened / disassembled representation we use for
> > > serialized
> > > > > > > record batches is the correct one
> > > > > >
> > > > > > I'm not sure I agree:
> > > > > >
> > > > > > - indeed, it's not a coincidence that the ArrowArray struct looks
> > > quite
> > > > > > closely like the C++ ArrayData object :-)  We have good experience
> > > with
> > > > > > that abstraction and it has proven to work quite well
> > > > > >
> > > > > > - the IPC format is meant for serialization while the C data
> > > protocol is
> > > > > > meant for in-memory communication, so different concerns apply
> > > > > >
> > > > > > - the fact that this makes the layout slightly larger doesn't seem
> > > > > > important at all; we're not talking about transferring data over
> > the
> > > wire
> > > > > >
> > > > > > There's also another argument for having a recursive struct: it
> > > > > > simplifies how the data type is represented, since we can encode
> > each
> > > > > > child type individually instead of encoding it in the parent's
> > format
> > > > > > string (same applies for metadata and individual flags).
> > > > > >
> > > > >
> > > > > I was saying something different here. I was making an argument about
> > > > > why we use the flattened array-of-structs in the IPC protocol. One
> > > > > reason is that it's a more compact representation. That is not very
> > > > > important here because this protocol is only for *in-process* (for
> > > > > languages that have a C FFI facility) rather than *inter-process*
> > > > > communication.
> > > > >
> > > > > I agree also that the type encoding is simple, here, too, since we
> > > > > aren't having to split the schema and record batch between different
> > > > > serialized messages. There is some potential waste with having to
> > > > > populate the type fields multiple times when communicating a sequence
> > > > > of "chunks" from the same logical dataset.
> > > > >
> > > > > > > > * The "formal" C protocol having the "assembled" shape means
> > > > > > > > that many minimal Arrow users won't have to implement any
> > > > > > > > separate data structures. They can just use the C struct
> > > > > > > > directly or a slightly wrapped version thereof with some
> > > > > > > > convenience functions.
> > > > > > >
> > > > > > > Yes, but the same applies to the current proposal.
> > > > > > >
> > > > > > > > * I think that requiring building a Flatbuffer for minimal use
> > > > > > > > cases (e.g. communicating simple record batches with primitive
> > > > > > > > types) passes on implementation burden to minimal users.
> > > > > > >
> > > > > > > It certainly does.
> > > > > > >
> > > > > > > > I think the mantra of the C protocol should be the following:
> > > > > > > >
> > > > > > > > * Users of the protocol have to write little to no code to use
> > > > > > > > it. For example, populating an INT32 array should require only
> > > > > > > > a few lines of code
> > > > > > >
> > > > > > > Agreed.  As a sidenote, the spec should have an example of doing
> > > > > > > this in raw C.
> > > > > >
> > > > > > Regards
> > > > > >
> > > > > > Antoine.
> > > > >
> > >
> >

Re: [DISCUSS] C-level in-process array protocol

Posted by Micah Kornfield <em...@gmail.com>.

I've tried to summarize my understanding of the debate so far and give some
initial thoughts. I think there are two potentially different sets of users
that we are targeting with a stable C API/ABI: ourselves and external
parties.

1.  Different language implementations within the Arrow project that want
to call into each other's code.  We still don't have a great story around
this in terms of reusable libraries, and questions like [1] are motivating
examples for making something better in this context.
2.  Third parties wishing to support/integrate with Arrow.  Some
conjectures about these users:
  - Users in this group are NOT necessarily familiar with existing
technologies Arrow uses (e.g. Flatbuffers)
  - The stability of the API is the primary concern (consumers don't want
to change when a new version of the library ships)
  - An important secondary concern is additional libraries that need to be
integrated in addition to the API

The main debate points seem to be:

1.  Vector/Array oriented API vs existing Record Batch.  Will an additional
column oriented API become too much of a maintenance headache/cause
fragmentation?

 - In my mind the question here is which set of users we are prioritizing.
IMO the combination of flatbuffers and translation to/from RecordBatch
format offers too much friction to make it easy for a third-party
implementer to use. If we are prioritizing for our own internal use-cases I
think we should try out a RecordBatch+Flatbuffers based C-API. We already
have all the necessary building blocks.

2.  How onerous is the dependency on Flatbuffers, both from a learning
curve perspective and as a dependency for third-party integrators?
- Flatbuffers aren't entirely straightforward and I think if we do move
forward with an API based on Column/Array we should consider alternatives
as long as the necessary parsing can be done in a small amount of code
(I'm personally against JSON for this, but can see the arguments for it).

3.  Do all existing library implementations need to support both
Column/Array a ABI?  How will compliance be checked for the new API/ABI?

- I'm still thinking this through.

[1]
https://lists.apache.org/thread.html/18244b294d0b9bd568b5cfd1b1ac2b6a25088383a08202cc7a8a3563@%3Cuser.arrow.apache.org%3E

On Wed, Oct 2, 2019 at 6:46 PM Jacques Nadeau <ja...@apache.org> wrote:

> I'd like to hear more opinions from others on this topic. This conversation
> seems mostly dominated by comments from myself, Wes and Antoine.
>
> I think it is reasonable to argue that keeping any ABI (or header/struct
> pattern) as narrow as possible would allow us to minimize overlap with the
> existing in-memory specification. In Arrow's case, this could be as simple
> as a single memory pointer for schema (backed by flatbuffers) and a single
> memory location for data (that references the record batch header, which in
> turn provides pointers into the actual arrow data). Extensions would need
> to be added for reference management as done here but I continue to think
> we should defer discussion of that until the base data structures are
> resolved. I see the comments here as arguing for a much broader ABI, in
> part to support having people build "Arrow" components that interconnect
> using this new interface. I understand the desire to expand the ABI to be
> driven by needs to reduce dependencies and ease usability.
>
> The representation within the related patch is being presented as a way for
> applications to share Arrow data but is not easily accessible to all
> languages. I want to avoid a situation where someone says "I produced an
> Arrow API" when what they've really done is created a C interface which
> only a small subset of languages can actually leverage. For example, every
> language now knows how to parse the existing schema definition as rendered
> in flatbuf. In order to interact with something that implements this new
> pattern one would also be required to implement completely new schema
> consumption code. In the proposal itself it suggests this (for example
> enhancing the C++ library to consume structures produced this way).
>
> As I said, I really want to hear more opinions. Running this past various
> developers I know, many have echoed my concerns but that really doesn't
> matter (and who knows how much of that is colored by my presentation of the
> issue). What do people here think? If someone builds an "Arrow" library
> that implements this set of structures, how does one use it in Node? In
> Java? Does it drive creation of a secondary set of interfaces in each of
> those languages to work with this kind of pattern? (For example, in a JVM
> view of the world, working with a plain struct in java rather than a set of
> memory pointers against our existing IPC formats would be quite painful and
> we'd definitely need to create some glue code for users. I worry the same
> pattern would occur in many other languages.)
>
> To respond directly to some of Wes's most recent comments from the email
> below. I struggle to map your description of the situation to the rest of
> the thread and the proposed patch.  For example, you say that a non-goal is
> "creating a new canonical way to serialize metadata" but the patch
> proposes a concrete string-based encoding system to describe data types.
> Aren't those things in conflict?
>
> I'll also think more on this and challenge my own perspective. This isn't
> where my focus is so my comments aren't as developed/thoughtful as I'd
> like.
>
>
> On Tue, Oct 1, 2019 at 7:33 PM Wes McKinney <we...@gmail.com> wrote:
>
> > hi Jacques,
> >
> > I think we've veered off course a bit and maybe we could reframe the
> > discussion.
> >
> > Goals
> > * A "drop-in" header-only C file that projects can use as a
> > programming interface either internally only or to expose in-memory
> > data structures between C functions at call sites. Ideally little to
> > no disassembly/reassembly should be required on either "side" of the
> > call site.
> > * Simplifying adoption of Arrow for C programmers, or languages based
> > around C FFI
> >
> > Non-goals
> > * Expanding the columnar format or creating an alternative canonical
> > in-memory representation
> > * Creating a new canonical way to serialize metadata
> >
> > Note that this use case has been on my mind for more than 2 years:
> > https://issues.apache.org/jira/browse/ARROW-1058
> >
> > I think there are a couple of potentially misleading things at play here
> >
> > 1. The use of the word "protocol". In C, a struct has a well-defined
> > binary layout, so a C API is also an ABI. Using C structs to
> > communicate data can be considered to be a protocol, but it means
> > something different in the context of the "Arrow protocol". I think we
> > need to call this a "C API"
> >
> > 2. The documentation for this in Antoine's PR is in the format/
> > directory. It would probably be better to have a "C API" section in
> > the documentation.
> >
> > The header file under discussion and the documentation about it is
> > best considered as a "library".
> >
> > It might be useful at some point to create a C99 implementation of the
> > IPC protocol as well using FlatCC with the goal of having a complete
> > implementation of the columnar format in C with minimal binary
> > footprint. This is analogous to the NanoPB project which is an
> > implementation of Protocol Buffers with small code size
> >
> > https://github.com/nanopb/nanopb
> >
> > Let me know if this makes more sense.
> >
> > I think it's important to communicate clearly about this primarily for
> > the benefit of the outside world which can confuse easily as we have
> > observed over the last few years =)
> >
> > Wes
> >
> > On Tue, Oct 1, 2019 at 2:55 PM Jacques Nadeau <ja...@apache.org>
> wrote:
> > >
> > > I disagree with this statement:
> > >
> > > - the IPC format is meant for serialization while the C data protocol
> is
> > > > meant for in-memory communication, so different concerns apply
> > > >
> > > > If that is how a particular implementation presents it, that is a
> > > > weakness of the implementation, not the format. The primary use case
> I
> > > was focused on when working on the initial format was communication
> > within
> > > the same process. It seems like this is being used as a basis for the
> > > introduction of new things when the premise is inconsistent with the
> > > intention of the creation. The specific reason we used flatbuffers in
> the
> > > project was to collapse the separation of in-process and out-of-process
> > > communication. It means the same thing it does with the Arrow data
> > itself:
> > > that a consumer doesn't have to use a particular library to interact
> with
> > > and use the data.
> > >
> > > It seems like there are two ideas here:
> > >
> > > 1) How do we make it easier for people to use Arrow?
> > > > 2) Should we implement a new in-memory representation of Arrow that is
> > > > language-specific?
> > >
> > > I'm entirely in support of number one. If for a particular type of
> > domain,
> > > people want an easier way to interact with Arrow, let's make a new
> > library
> > > > that helps with that. In each of our current libraries, we do many
> things
> > > to make it easier to work with Arrow. None of those require a change to
> > the
> > > core format or are formalized as a new in-memory standard. The
> in-memory
> > > representation of rust or javascript or java objects are implementation
> > > details.
> > >
> > > I'm against number two as it creates a fragmentation problem. Arrow is
> > > about having a single canonical format for memory for both metadata and
> > > data. Having multiple in-memory formats (especially when some are not
> > > language independent) is counter to the goals of the project.
> >
> > I don't think anyone is proposing anything that would cause
> fragmentation.
> >
> > A central question is whether it is useful to define a reusable C ABI
> > for the Arrow columnar format, and if there is sufficient interest, a
> > tiny C implementation of the IPC protocol (which uses the Flatbuffers
> > message) that assembles and disassembles the data structures defined
> > in the C ABI.
> >
> > We could separately create a tiny implementation of the Arrow IPC
> > protocol using FlatCC that could be dropped into applications
> > requiring only a C compiler and nothing else.
> >
> >
> > >
> > > Two other, separate comments:
> > > 1) I don't understand the idea that we need to change the way Arrow
> > > fundamentally works so that people can avoid using a dependency. If the
> > > dependency is small, open source and easy to build, people can fork it
> > and
> > > include directly if they want to. Let's not violate project principles
> > > because DuckDB has a religious perspective on dependencies. If the
> > problem
> > > is people have to swallow too large of a pill to do basic things with
> > Arrow
> > > in C, let's focus on fixing that (to our definition of ease, not
> someone
> > > > else's). If FlatCC solves some of those things, great. If we need to
> build a
> > > baby integration library that is more C centric, great. Neither of
> those
> > > things require implementing something at the format level.
> > >
> > > 2) It seems like we should discuss the data structure problem
> separately
> > > from the reference management concern.
> > >
> > >
> > > On Tue, Oct 1, 2019 at 5:42 AM Wes McKinney <we...@gmail.com>
> wrote:
> > >
> > > > hi Antoine,
> > > >
> > > > On Tue, Oct 1, 2019 at 4:29 AM Antoine Pitrou <an...@python.org>
> > wrote:
> > > > >
> > > > >
> > > > > Le 01/10/2019 à 00:39, Wes McKinney a écrit :
> > > > > > A couple things:
> > > > > >
> > > > > > * I think a C protocol / FFI for Arrow array/vectors would be
> > better
> > > > > > to have the same "shape" as an assembled array. Note that the C
> > > > > > structs here have very nearly the same "shape" as the data
> > structure
> > > > > > representing a C++ Array object [1]. The disassembly and
> reassembly
> > > > > > here is substantially simpler than the IPC protocol. A recursive
> > > > > > structure in Flatbuffers would make RecordBatch messages much
> > larger,
> > > > > > so the flattened / disassembled representation we use for
> > serialized
> > > > > > record batches is the correct one
> > > > >
> > > > > I'm not sure I agree:
> > > > >
> > > > > - indeed, it's not a coincidence that the ArrowArray struct looks
> > quite
> > > > > closely like the C++ ArrayData object :-)  We have good experience
> > with
> > > > > that abstraction and it has proven to work quite well
> > > > >
> > > > > - the IPC format is meant for serialization while the C data
> > protocol is
> > > > > > meant for in-memory communication, so different concerns apply
> > > > >
> > > > > - the fact that this makes the layout slightly larger doesn't seem
> > > > > important at all; we're not talking about transferring data over
> the
> > wire
> > > > >
> > > > > There's also another argument for having a recursive struct: it
> > > > > simplifies how the data type is represented, since we can encode
> each
> > > > > child type individually instead of encoding it in the parent's
> format
> > > > > string (same applies for metadata and individual flags).
> > > > >
> > > >
> > > > I was saying something different here. I was making an argument about
> > > > why we use the flattened array-of-structs in the IPC protocol. One
> > > > reason is that it's a more compact representation. That is not very
> > > > important here because this protocol is only for *in-process* (for
> > > > languages that have a C FFI facility) rather than *inter-process*
> > > > communication.
> > > >
> > > > I agree also that the type encoding is simple, here, too, since we
> > > > aren't having to split the schema and record batch between different
> > > > serialized messages. There is some potential waste with having to
> > > > populate the type fields multiple times when communicating a sequence
> > > > of "chunks" from the same logical dataset.
> > > >
> > > > > > * The "formal" C protocol having the "assembled" shape means that
> > many
> > > > > > minimal Arrow users won't have to implement any separate data
> > > > > > structures. They can just use the C struct directly or a slightly
> > > > > > wrapped version thereof with some convenience functions.
> > > > >
> > > > > Yes, but the same applies to the current proposal.
> > > > >
> > > > > > * I think that requiring building a Flatbuffer for minimal use
> > cases
> > > > > > (e.g. communicating simple record batches with primitive types)
> > passes
> > > > > > on implementation burden to minimal users.
> > > > >
> > > > > It certainly does.
> > > > >
> > > > > > I think the mantra of the C protocol should be the following:
> > > > > >
> > > > > > * Users of the protocol have to write little to no code to use
> it.
> > For
> > > > > > example, populating an INT32 array should require only a few
> lines
> > of
> > > > > > code
> > > > >
> > > > > Agreed.  As a sidenote, the spec should have an example of doing
> > this in
> > > > > raw C.
> > > > >
> > > > > Regards
> > > > >
> > > > > Antoine.
> > > >
> >
>

Re: [DISCUSS] C-level in-process array protocol

Posted by Jacques Nadeau <ja...@apache.org>.

I'd like to hear more opinions from others on this topic. This conversation
seems mostly dominated by comments from myself, Wes and Antoine.

I think it is reasonable to argue that keeping any ABI (or header/struct
pattern) as narrow as possible would allow us to minimize overlap with the
existing in-memory specification. In Arrow's case, this could be as simple
as a single memory pointer for schema (backed by flatbuffers) and a single
memory location for data (that references the record batch header, which in
turn provides pointers into the actual arrow data). Extensions would need
to be added for reference management as done here but I continue to think
we should defer discussion of that until the base data structures are
resolved. I see the comments here as arguing for a much broader ABI, in
part to support having people build "Arrow" components that interconnect
using this new interface. I understand the desire to expand the ABI to be
driven by needs to reduce dependencies and ease usability.
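
A minimal sketch of what such a narrow handoff could look like in C, under
stated assumptions: the struct name, the field names and the explicit size
fields are hypothetical and appear nowhere in the patch or in this thread.
The struct only carries two memory locations; decoding the Flatbuffers schema
and the record batch header is left entirely to the consumer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical "narrow" handoff struct, for illustration only.
 * The names and the size fields are assumptions, not part of any proposal. */
struct ExampleNarrowBatch {
  const uint8_t* schema;  /* assumed: Flatbuffers-encoded schema bytes */
  int64_t schema_size;    /* assumed: size of the schema region in bytes */
  const uint8_t* batch;   /* assumed: record batch header that locates the data */
  int64_t batch_size;     /* assumed: size of the batch region in bytes */
};

/* Bundle two existing memory regions without copying or parsing them.
 * The caller keeps ownership of both regions. */
struct ExampleNarrowBatch example_narrow_batch(const uint8_t* schema,
                                               int64_t schema_size,
                                               const uint8_t* batch,
                                               int64_t batch_size) {
  struct ExampleNarrowBatch out;
  assert(schema != NULL && batch != NULL);
  assert(schema_size > 0 && batch_size > 0);
  out.schema = schema;
  out.schema_size = schema_size;
  out.batch = batch;
  out.batch_size = batch_size;
  return out;
}
```

The struct itself is trivial, which is the point of keeping the ABI narrow;
the cost moves to the consumer, who needs Flatbuffers parsing code to make
sense of either region.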

The representation within the related patch is being presented as a way for
applications to share Arrow data but is not easily accessible to all
languages. I want to avoid a situation where someone says "I produced an
Arrow API" when what they've really done is created a C interface which
only a small subset of languages can actually leverage. For example, every
language now knows how to parse the existing schema definition as rendered
in flatbuf. In order to interact with something that implements this new
pattern one would also be required to implement completely new schema
consumption code. The proposal itself suggests this (for example,
enhancing the C++ library to consume structures produced this way).

As I said, I really want to hear more opinions. Running this past various
developers I know, many have echoed my concerns but that really doesn't
matter (and who knows how much of that is colored by my presentation of the
issue). What do people here think? If someone builds an "Arrow" library
that implements this set of structures, how does one use it in Node? In
Java? Does it drive creation of a secondary set of interfaces in each of
those languages to work with this kind of pattern? (For example, in a JVM
view of the world, working with a plain struct in Java rather than a set of
memory pointers against our existing IPC formats would be quite painful and
we'd definitely need to create some glue code for users. I worry the same
pattern would occur in many other languages.)

To respond directly to some of Wes's most recent comments from the email
below: I struggle to map your description of the situation to the rest of
the thread and the proposed patch.  For example, you say that a non-goal is
"creating a new canonical way to serialize metadata" but the patch
proposes a concrete string-based encoding system to describe data types.
Aren't those things in conflict?

I'll also think more on this and challenge my own perspective. This isn't
where my focus is so my comments aren't as developed/thoughtful as I'd like.


On Tue, Oct 1, 2019 at 7:33 PM Wes McKinney <we...@gmail.com> wrote:

> hi Jacques,
>
> I think we've veered off course a bit and maybe we could reframe the
> discussion.
>
> Goals
> * A "drop-in" header-only C file that projects can use as a
> programming interface either internally only or to expose in-memory
> data structures between C functions at call sites. Ideally little to
> no disassembly/reassembly should be required on either "side" of the
> call site.
> * Simplifying adoption of Arrow for C programmers, or languages based
> around C FFI
>
> Non-goals
> * Expanding the columnar format or creating an alternative canonical
> in-memory representation
> * Creating a new canonical way to serialize metadata
>
> Note that this use case has been on my mind for more than 2 years:
> https://issues.apache.org/jira/browse/ARROW-1058
>
> I think there are a couple of potentially misleading things at play here
>
> 1. The use of the word "protocol". In C, a struct has a well-defined
> binary layout, so a C API is also an ABI. Using C structs to
> communicate data can be considered to be a protocol, but it means
> something different in the context of the "Arrow protocol". I think we
> need to call this a "C API"
>
> 2. The documentation for this in Antoine's PR is in the format/
> directory. It would probably be better to have a "C API" section in
> the documentation.
>
> The header file under discussion and the documentation about it is
> best considered as a "library".
>
> It might be useful at some point to create a C99 implementation of the
> IPC protocol as well using FlatCC with the goal of having a complete
> implementation of the columnar format in C with minimal binary
> footprint. This is analogous to the NanoPB project which is an
> implementation of Protocol Buffers with small code size
>
> https://github.com/nanopb/nanopb
>
> Let me know if this makes more sense.
>
> I think it's important to communicate clearly about this primarily for
> the benefit of the outside world which can confuse easily as we have
> observed over the last few years =)
>
> Wes
>
> On Tue, Oct 1, 2019 at 2:55 PM Jacques Nadeau <ja...@apache.org> wrote:
> >
> > I disagree with this statement:
> >
> > - the IPC format is meant for serialization while the C data protocol is
> > meant for in-memory communication, so different concerns apply
> >
> > If that is how a particular implementation presents it, that is a
> > weakness of the implementation, not the format. The primary use case I
> > was focused on when working on the initial format was communication
> within
> > the same process. It seems like this is being used as a basis for the
> > introduction of new things when the premise is inconsistent with the
> > intention of the creation. The specific reason we used flatbuffers in the
> > project was to collapse the separation of in-process and out-of-process
> > communication. It means the same thing it does with the Arrow data
> itself:
> > that a consumer doesn't have to use a particular library to interact with
> > and use the data.
> >
> > It seems like there are two ideas here:
> >
> > 1) How do we make it easier for people to use Arrow?
> > 2) Should we implement a new in-memory representation of Arrow that is
> > language-specific?
> >
> > I'm entirely in support of number one. If for a particular type of
> domain,
> > people want an easier way to interact with Arrow, let's make a new
> library
> > that helps with that. In each of our current libraries, we do many things
> > to make it easier to work with Arrow. None of those require a change to
> the
> > core format or are formalized as a new in-memory standard. The in-memory
> > representation of rust or javascript or java objects are implementation
> > details.
> >
> > I'm against number two as it creates a fragmentation problem. Arrow is
> > about having a single canonical format for memory for both metadata and
> > data. Having multiple in-memory formats (especially when some are not
> > language independent) is counter to the goals of the project.
>
> I don't think anyone is proposing anything that would cause fragmentation.
>
> A central question is whether it is useful to define a reusable C ABI
> for the Arrow columnar format, and if there is sufficient interest, a
> tiny C implementation of the IPC protocol (which uses the Flatbuffers
> message) that assembles and disassembles the data structures defined
> in the C ABI.
>
> We could separately create a tiny implementation of the Arrow IPC
> protocol using FlatCC that could be dropped into applications
> requiring only a C compiler and nothing else.
>
>
> >
> > Two other, separate comments:
> > 1) I don't understand the idea that we need to change the way Arrow
> > fundamentally works so that people can avoid using a dependency. If the
> > dependency is small, open source and easy to build, people can fork it
> and
> > include directly if they want to. Let's not violate project principles
> > because DuckDB has a religious perspective on dependencies. If the
> problem
> > is people have to swallow too large of a pill to do basic things with
> Arrow
> > in C, let's focus on fixing that (to our definition of ease, not someone
> > else's). If FlatCC solves some of those things, great. If we need to build a
> > baby integration library that is more C centric, great. Neither of those
> > things require implementing something at the format level.
> >
> > 2) It seems like we should discuss the data structure problem separately
> > from the reference management concern.
> >
> >
> > On Tue, Oct 1, 2019 at 5:42 AM Wes McKinney <we...@gmail.com> wrote:
> >
> > > hi Antoine,
> > >
> > > On Tue, Oct 1, 2019 at 4:29 AM Antoine Pitrou <an...@python.org>
> wrote:
> > > >
> > > >
> > > > Le 01/10/2019 à 00:39, Wes McKinney a écrit :
> > > > > A couple things:
> > > > >
> > > > > * I think a C protocol / FFI for Arrow array/vectors would be
> better
> > > > > to have the same "shape" as an assembled array. Note that the C
> > > > > structs here have very nearly the same "shape" as the data
> structure
> > > > > representing a C++ Array object [1]. The disassembly and reassembly
> > > > > here is substantially simpler than the IPC protocol. A recursive
> > > > > structure in Flatbuffers would make RecordBatch messages much
> larger,
> > > > > so the flattened / disassembled representation we use for
> serialized
> > > > > record batches is the correct one
> > > >
> > > > I'm not sure I agree:
> > > >
> > > > - indeed, it's not a coincidence that the ArrowArray struct looks
> quite
> > > > closely like the C++ ArrayData object :-)  We have good experience
> with
> > > > that abstraction and it has proven to work quite well
> > > >
> > > > - the IPC format is meant for serialization while the C data
> protocol is
> > > > meant for in-memory communication, so different concerns apply
> > > >
> > > > - the fact that this makes the layout slightly larger doesn't seem
> > > > important at all; we're not talking about transferring data over the
> wire
> > > >
> > > > There's also another argument for having a recursive struct: it
> > > > simplifies how the data type is represented, since we can encode each
> > > > child type individually instead of encoding it in the parent's format
> > > > string (same applies for metadata and individual flags).
> > > >
> > >
> > > I was saying something different here. I was making an argument about
> > > why we use the flattened array-of-structs in the IPC protocol. One
> > > reason is that it's a more compact representation. That is not very
> > > important here because this protocol is only for *in-process* (for
> > > languages that have a C FFI facility) rather than *inter-process*
> > > communication.
> > >
> > > I agree also that the type encoding is simple, here, too, since we
> > > aren't having to split the schema and record batch between different
> > > serialized messages. There is some potential waste with having to
> > > populate the type fields multiple times when communicating a sequence
> > > of "chunks" from the same logical dataset.
> > >
> > > > > * The "formal" C protocol having the "assembled" shape means that
> many
> > > > > minimal Arrow users won't have to implement any separate data
> > > > > structures. They can just use the C struct directly or a slightly
> > > > > wrapped version thereof with some convenience functions.
> > > >
> > > > Yes, but the same applies to the current proposal.
> > > >
> > > > > * I think that requiring building a Flatbuffer for minimal use
> cases
> > > > > (e.g. communicating simple record batches with primitive types)
> passes
> > > > > on implementation burden to minimal users.
> > > >
> > > > It certainly does.
> > > >
> > > > > I think the mantra of the C protocol should be the following:
> > > > >
> > > > > * Users of the protocol have to write little to no code to use it.
> For
> > > > > example, populating an INT32 array should require only a few lines
> of
> > > > > code
> > > >
> > > > Agreed.  As a sidenote, the spec should have an example of doing
> this in
> > > > raw C.
> > > >
> > > > Regards
> > > >
> > > > Antoine.
> > >
>

Re: [DISCUSS] C-level in-process array protocol

Posted by Wes McKinney <we...@gmail.com>.

I had an e-mail editing snafu so you can ignore the bottom "inline"
portion since it's just a restatement of what is written more clearly
above

On Tue, Oct 1, 2019 at 9:32 PM Wes McKinney <we...@gmail.com> wrote:
>
> hi Jacques,
>
> I think we've veered off course a bit and maybe we could reframe the discussion.
>
> Goals
> * A "drop-in" header-only C file that projects can use as a
> programming interface either internally only or to expose in-memory
> data structures between C functions at call sites. Ideally little to
> no disassembly/reassembly should be required on either "side" of the
> call site.
> * Simplifying adoption of Arrow for C programmers, or languages based
> around C FFI
>
> Non-goals
> * Expanding the columnar format or creating an alternative canonical
> in-memory representation
> * Creating a new canonical way to serialize metadata
>
> Note that this use case has been on my mind for more than 2 years:
> https://issues.apache.org/jira/browse/ARROW-1058
>
> I think there are a couple of potentially misleading things at play here
>
> 1. The use of the word "protocol". In C, a struct has a well-defined
> binary layout, so a C API is also an ABI. Using C structs to
> communicate data can be considered to be a protocol, but it means
> something different in the context of the "Arrow protocol". I think we
> need to call this a "C API"
>
> 2. The documentation for this in Antoine's PR is in the format/
> directory. It would probably be better to have a "C API" section in
> the documentation.
>
> The header file under discussion and the documentation about it is
> best considered as a "library".
>
> It might be useful at some point to create a C99 implementation of the
> IPC protocol as well using FlatCC with the goal of having a complete
> implementation of the columnar format in C with minimal binary
> footprint. This is analogous to the NanoPB project which is an
> implementation of Protocol Buffers with small code size
>
> https://github.com/nanopb/nanopb
>
> Let me know if this makes more sense.
>
> I think it's important to communicate clearly about this primarily for
> the benefit of the outside world which can confuse easily as we have
> observed over the last few years =)
>
> Wes
>

Re: [DISCUSS] C-level in-process array protocol

Posted by Wes McKinney <we...@gmail.com>.

hi Jacques,

I think we've veered off course a bit and maybe we could reframe the discussion.

Goals
* A "drop-in" header-only C file that projects can use as a
programming interface either internally only or to expose in-memory
data structures between C functions at call sites. Ideally little to
no disassembly/reassembly should be required on either "side" of the
call site.
* Simplifying adoption of Arrow for C programmers, or languages based
around C FFI
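
As a rough illustration of the first goal, here is a minimal sketch of
exposing an existing int32 buffer through a C struct with no copying. The
struct below is hypothetical: it is NOT the struct from the PR, and the field
names and the "i" type code are assumptions made only for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical struct, for illustration only. It is NOT the struct from the
 * PR under discussion; field names and the "i" type code are assumptions. */
struct ExampleArray {
  const char* format;    /* assumed type code; "i" stands in for INT32 */
  int64_t length;        /* number of logical values */
  int64_t null_count;    /* number of null values */
  int64_t n_buffers;     /* number of entries in `buffers` */
  const void** buffers;  /* [0] validity bitmap (NULL if no nulls), [1] values */
};

/* Producer side: describe an existing int32 buffer without copying it.
 * The caller owns `values` and the two-slot `buffers` array and must keep
 * both alive for as long as `out` is in use. */
void example_fill_int32(struct ExampleArray* out, const void** buffers,
                        const int32_t* values, int64_t length) {
  assert(out != NULL && buffers != NULL && length >= 0);
  buffers[0] = NULL;    /* no validity bitmap: every value is valid */
  buffers[1] = values;  /* the data buffer is shared as-is */
  out->format = "i";
  out->length = length;
  out->null_count = 0;
  out->n_buffers = 2;
  out->buffers = buffers;
}

/* Consumer side: read value `i` straight out of the shared buffer. */
int32_t example_get_int32(const struct ExampleArray* arr, int64_t i) {
  assert(i >= 0 && i < arr->length);
  return ((const int32_t*)arr->buffers[1])[i];
}
```

Neither side serializes or reassembles anything here; the producer fills in a
handful of fields and the consumer indexes the same memory.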

Non-goals
* Expanding the columnar format or creating an alternative canonical
in-memory representation
* Creating a new canonical way to serialize metadata

Note that this use case has been on my mind for more than 2 years:
https://issues.apache.org/jira/browse/ARROW-1058

I think there are a couple of potentially misleading things at play here

1. The use of the word "protocol". In C, a struct has a well-defined
binary layout, so a C API is also an ABI. Using C structs to
communicate data can be considered to be a protocol, but it means
something different in the context of the "Arrow protocol". I think we
need to call this a "C API"
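
The "a C API is also an ABI" point can be sketched as follows. The struct
names are hypothetical and chosen only for this example: two separately
written declarations with the same member types in the same order share one
layout on a given platform ABI, so a consumer can read the producer's bytes
without any translation step.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical producer-side declaration (names are assumptions). */
struct ProducerChunk {
  int64_t length;
  int64_t null_count;
  const int32_t* values;
};

/* A consumer compiled separately can declare the same members in the same
 * order under a different name; on a given platform ABI the layouts match. */
struct ConsumerChunk {
  int64_t length;
  int64_t null_count;
  const int32_t* values;
};

/* Returns 1 if the two declarations agree on size and member offsets. */
int example_layouts_match(void) {
  return sizeof(struct ProducerChunk) == sizeof(struct ConsumerChunk) &&
         offsetof(struct ProducerChunk, length) ==
             offsetof(struct ConsumerChunk, length) &&
         offsetof(struct ProducerChunk, null_count) ==
             offsetof(struct ConsumerChunk, null_count) &&
         offsetof(struct ProducerChunk, values) ==
             offsetof(struct ConsumerChunk, values);
}

/* Read the producer's bytes through the consumer's declaration. memcpy is
 * used to stay clear of aliasing rules; no field is converted or parsed. */
struct ConsumerChunk example_receive(const void* producer_bytes) {
  struct ConsumerChunk view;
  assert(example_layouts_match());
  memcpy(&view, producer_bytes, sizeof view);
  return view;
}
```

Changing the order or type of a member would break every consumer built
against the old declaration, which is why the struct definition has to be
treated as a stable contract and not just as a convenience header.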

2. The documentation for this in Antoine's PR is in the format/
directory. It would probably be better to have a "C API" section in
the documentation.

The header file under discussion and the documentation about it are
best considered a "library".

It might be useful at some point to create a C99 implementation of the
IPC protocol as well using FlatCC with the goal of having a complete
implementation of the columnar format in C with minimal binary
footprint. This is analogous to the NanoPB project which is an
implementation of Protocol Buffers with small code size

https://github.com/nanopb/nanopb

Let me know if this makes more sense.

I think it's important to communicate clearly about this primarily for
the benefit of the outside world, which can be confused easily as we have
observed over the last few years =)

Wes

On Tue, Oct 1, 2019 at 2:55 PM Jacques Nadeau <ja...@apache.org> wrote:
>
> I disagree with this statement:
>
> - the IPC format is meant for serialization while the C data protocol is
> meant for in-memory communication, so different concerns apply
>
> If that is how a particular implementation presents it, that is a
> weakness of the implementation, not the format. The primary use case I
> was focused on when working on the initial format was communication within
> the same process. It seems like this is being used as a basis for the
> introduction of new things when the premise is inconsistent with the
> intention of the creation. The specific reason we used flatbuffers in the
> project was to collapse the separation of in-process and out-of-process
> communication. It means the same thing it does with the Arrow data itself:
> that a consumer doesn't have to use a particular library to interact with
> and use the data.
>
> It seems like there are two ideas here:
>
> 1) How do we make it easier for people to use Arrow?
> 2) Should we implement a new in-memory representation of Arrow that is
> language-specific?
>
> I'm entirely in support of number one. If for a particular type of domain,
> people want an easier way to interact with Arrow, let's make a new library
> that helps with that. In each of our current libraries, we do many things
> to make it easier to work with Arrow. None of those require a change to the
> core format or are formalized as a new in-memory standard. The in-memory
> representation of rust or javascript or java objects are implementation
> details.
>
> I'm against number two as it creates a fragmentation problem. Arrow is
> about having a single canonical format for memory for both metadata and
> data. Having multiple in-memory formats (especially when some are not
> language independent) is counter to the goals of the project.

I don't think anyone is proposing anything that would cause fragmentation.

A central question is whether it is useful to define a reusable C ABI
for the Arrow columnar format, and if there is sufficient interest, a
tiny C implementation of the IPC protocol (which uses the Flatbuffers
message) that assembles and disassembles the data structures defined
in the C ABI.

We could separately create a tiny implementation of the Arrow IPC
protocol using FlatCC that could be dropped into applications
requiring only a C compiler and nothing else.


>
> Two other, separate comments:
> 1) I don't understand the idea that we need to change the way Arrow
> fundamentally works so that people can avoid using a dependency. If the
> dependency is small, open source and easy to build, people can fork it and
> include directly if they want to. Let's not violate project principles
> because DuckDB has a religious perspective on dependencies. If the problem
> is people have to swallow too large of a pill to do basic things with Arrow
> in C, let's focus on fixing that (to our definition of ease, not someone
> else's). If FlatCC solves some of those things, great. If we need to build a
> baby integration library that is more C centric, great. Neither of those
> things require implementing something at the format level.
>
> 2) It seems like we should discuss the data structure problem separately
> from the reference management concern.
>
>
> On Tue, Oct 1, 2019 at 5:42 AM Wes McKinney <we...@gmail.com> wrote:
>
> > hi Antoine,
> >
> > On Tue, Oct 1, 2019 at 4:29 AM Antoine Pitrou <an...@python.org> wrote:
> > >
> > >
> > > On 01/10/2019 at 00:39, Wes McKinney wrote:
> > > > A couple things:
> > > >
> > > > * I think a C protocol / FFI for Arrow array/vectors would be better
> > > > to have the same "shape" as an assembled array. Note that the C
> > > > structs here have very nearly the same "shape" as the data structure
> > > > representing a C++ Array object [1]. The disassembly and reassembly
> > > > here is substantially simpler than the IPC protocol. A recursive
> > > > structure in Flatbuffers would make RecordBatch messages much larger,
> > > > so the flattened / disassembled representation we use for serialized
> > > > record batches is the correct one
> > >
> > > I'm not sure I agree:
> > >
> > > - indeed, it's not a coincidence that the ArrowArray struct looks quite
> > > closely like the C++ ArrayData object :-)  We have good experience with
> > > that abstraction and it has proven to work quite well
> > >
> > > - the IPC format is meant for serialization while the C data protocol is
> > > meant for in-memory communication, so different concerns apply
> > >
> > > - the fact that this makes the layout slightly larger doesn't seem
> > > important at all; we're not talking about transferring data over the wire
> > >
> > > There's also another argument for having a recursive struct: it
> > > simplifies how the data type is represented, since we can encode each
> > > child type individually instead of encoding it in the parent's format
> > > string (same applies for metadata and individual flags).
> > >
> >
> > I was saying something different here. I was making an argument about
> > why we use the flattened array-of-structs in the IPC protocol. One
> > reason is that it's a more compact representation. That is not very
> > important here because this protocol is only for *in-process* (for
> > languages that have a C FFI facility) rather than *inter-process*
> > communication.
> >
> > I agree also that the type encoding is simple, here, too, since we
> > aren't having to split the schema and record batch between different
> > serialized messages. There is some potential waste with having to
> > populate the type fields multiple times when communicating a sequence
> > of "chunks" from the same logical dataset.
> >
> > > > * The "formal" C protocol having the "assembled" shape means that many
> > > > minimal Arrow users won't have to implement any separate data
> > > > structures. They can just use the C struct directly or a slightly
> > > > wrapped version thereof with some convenience functions.
> > >
> > > Yes, but the same applies to the current proposal.
> > >
> > > > * I think that requiring building a Flatbuffer for minimal use cases
> > > > (e.g. communicating simple record batches with primitive types) passes
> > > > on implementation burden to minimal users.
> > >
> > > It certainly does.
> > >
> > > > I think the mantra of the C protocol should be the following:
> > > >
> > > > * Users of the protocol have to write little to no code to use it. For
> > > > example, populating an INT32 array should require only a few lines of
> > > > code
> > >
> > > Agreed.  As a sidenote, the spec should have an example of doing this in
> > > raw C.
> > >
> > > Regards
> > >
> > > Antoine.
> >

Re: [DISCUSS] C-level in-process array protocol

Posted by Wes McKinney <we...@gmail.com>.

On Tue, Oct 1, 2019 at 3:22 PM Jed Brown <je...@jedbrown.org> wrote:
>
> I'd just like to chime in with the use case of in-situ data analysis for
> simulations.  This domain tends to be cautious with dependencies and
> there is a lot of C and Fortran, but the in-situ analysis tools will
> preferably reside in separate processes while sharing memory via shared
> memory (/dev/shm or MPI_Win_allocate_shared).  An in-memory protocol
> that holds raw pointers would be problematic because they are typically
> in different virtual address spaces when shared between processes.  I
> think this is a potential application for a C interface with lean
> dependencies, but it wouldn't be useful if it can't be shared
> out-of-process.
>

hi Jed -- I will respond to Jacques's e-mail when I have some time to
compose, but we're looking at a pretty different use case which is C
libraries (or libraries with C FFI) exposing in-memory data structures
to each other in the same process using the same virtual address
space.

- Wes

> Jacques Nadeau <ja...@apache.org> writes:
>
> > I disagree with this statement:
> >
> > - the IPC format is meant for serialization while the C data protocol is
> > meant for in-memory communication, so different concerns apply
> >
> > If that is how a particular implementation presents it, that is a
> > weakness of the implementation, not the format. The primary use case I
> > was focused on when working on the initial format was communication within
> > the same process. It seems like this is being used as a basis for the
> > introduction of new things when the premise is inconsistent with the
> > intention of the creation. The specific reason we used flatbuffers in the
> > project was to collapse the separation of in-process and out-of-process
> > communication. It means the same thing it does with the Arrow data itself:
> > that a consumer doesn't have to use a particular library to interact with
> > and use the data.
> >
> > It seems like there are two ideas here:
> >
> > 1) How do we make it easier for people to use Arrow?
> > 2) Should we implement a new in-memory representation of Arrow that is
> > language-specific?
> >
> > I'm entirely in support of number one. If for a particular type of domain,
> > people want an easier way to interact with Arrow, let's make a new library
> > that helps with that. In each of our current libraries, we do many things
> > to make it easier to work with Arrow. None of those require a change to the
> > core format or are formalized as a new in-memory standard. The in-memory
> > representation of rust or javascript or java objects are implementation
> > details.
> >
> > I'm against number two as it creates a fragmentation problem. Arrow is
> > about having a single canonical format for memory for both metadata and
> > data. Having multiple in-memory formats (especially when some are not
> > language independent) is counter to the goals of the project.
> >
> > Two other, separate comments:
> > 1) I don't understand the idea that we need to change the way Arrow
> > fundamentally works so that people can avoid using a dependency. If the
> > dependency is small, open source and easy to build, people can fork it and
> > include directly if they want to. Let's not violate project principles
> > because DuckDB has a religious perspective on dependencies. If the problem
> > is people have to swallow too large of a pill to do basic things with Arrow
> > in C, let's focus on fixing that (to our definition of ease, not someone
> > else's). If FlatCC solves some of those things, great. If we need to build a
> > baby integration library that is more C centric, great. Neither of those
> > things require implementing something at the format level.
> >
> > 2) It seems like we should discuss the data structure problem separately
> > from the reference management concern.
> >
> >
> > On Tue, Oct 1, 2019 at 5:42 AM Wes McKinney <we...@gmail.com> wrote:
> >
> >> hi Antoine,
> >>
> >> On Tue, Oct 1, 2019 at 4:29 AM Antoine Pitrou <an...@python.org> wrote:
> >> >
> >> >
> >> > On 01/10/2019 at 00:39, Wes McKinney wrote:
> >> > > A couple things:
> >> > >
> >> > > * I think a C protocol / FFI for Arrow array/vectors would be better
> >> > > to have the same "shape" as an assembled array. Note that the C
> >> > > structs here have very nearly the same "shape" as the data structure
> >> > > representing a C++ Array object [1]. The disassembly and reassembly
> >> > > here is substantially simpler than the IPC protocol. A recursive
> >> > > structure in Flatbuffers would make RecordBatch messages much larger,
> >> > > so the flattened / disassembled representation we use for serialized
> >> > > record batches is the correct one
> >> >
> >> > I'm not sure I agree:
> >> >
> >> > - indeed, it's not a coincidence that the ArrowArray struct looks quite
> >> > closely like the C++ ArrayData object :-)  We have good experience with
> >> > that abstraction and it has proven to work quite well
> >> >
> >> > - the IPC format is meant for serialization while the C data protocol is
> >> > meant for in-memory communication, so different concerns apply
> >> >
> >> > - the fact that this makes the layout slightly larger doesn't seem
> >> > important at all; we're not talking about transferring data over the wire
> >> >
> >> > There's also another argument for having a recursive struct: it
> >> > simplifies how the data type is represented, since we can encode each
> >> > child type individually instead of encoding it in the parent's format
> >> > string (same applies for metadata and individual flags).
> >> >
> >>
> >> I was saying something different here. I was making an argument about
> >> why we use the flattened array-of-structs in the IPC protocol. One
> >> reason is that it's a more compact representation. That is not very
> >> important here because this protocol is only for *in-process* (for
> >> languages that have a C FFI facility) rather than *inter-process*
> >> communication.
> >>
> >> I agree also that the type encoding is simple, here, too, since we
> >> aren't having to split the schema and record batch between different
> >> serialized messages. There is some potential waste with having to
> >> populate the type fields multiple times when communicating a sequence
> >> of "chunks" from the same logical dataset.
> >>
> >> > > * The "formal" C protocol having the "assembled" shape means that many
> >> > > minimal Arrow users won't have to implement any separate data
> >> > > structures. They can just use the C struct directly or a slightly
> >> > > wrapped version thereof with some convenience functions.
> >> >
> >> > Yes, but the same applies to the current proposal.
> >> >
> >> > > * I think that requiring building a Flatbuffer for minimal use cases
> >> > > (e.g. communicating simple record batches with primitive types) passes
> >> > > on implementation burden to minimal users.
> >> >
> >> > It certainly does.
> >> >
> >> > > I think the mantra of the C protocol should be the following:
> >> > >
> >> > > * Users of the protocol have to write little to no code to use it. For
> >> > > example, populating an INT32 array should require only a few lines of
> >> > > code
> >> >
> >> > Agreed.  As a sidenote, the spec should have an example of doing this in
> >> > raw C.
> >> >
> >> > Regards
> >> >
> >> > Antoine.
> >>

Re: [DISCUSS] C-level in-process array protocol

Posted by Antoine Pitrou <an...@python.org>.

As currently designed, it's entirely in-process.  Shared memory with
buffer lifetime handling is taken care of by something like Plasma.

Regards

Antoine.


On 01/10/2019 at 22:22, Jed Brown wrote:
> I'd just like to chime in with the use case of in-situ data analysis for
> simulations.  This domain tends to be cautious with dependencies and
> there is a lot of C and Fortran, but the in-situ analysis tools will
> preferably reside in separate processes while sharing memory via shared
> memory (/dev/shm or MPI_Win_allocate_shared).  An in-memory protocol
> that holds raw pointers would be problematic because they are typically
> in different virtual address spaces when shared between processes.  I
> think this is a potential application for a C interface with lean
> dependencies, but it wouldn't be useful if it can't be shared
> out-of-process.

Re: [DISCUSS] C-level in-process array protocol

Posted by Jed Brown <je...@jedbrown.org>.

I'd just like to chime in with the use case of in-situ data analysis for
simulations.  This domain tends to be cautious with dependencies and
there is a lot of C and Fortran, but the in-situ analysis tools will
preferably reside in separate processes while sharing memory via shared
memory (/dev/shm or MPI_Win_allocate_shared).  An in-memory protocol
that holds raw pointers would be problematic because they are typically
in different virtual address spaces when shared between processes.  I
think this is a potential application for a C interface with lean
dependencies, but it wouldn't be useful if it can't be shared
out-of-process.
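
To make the address-space concern concrete, here is a minimal sketch of the
usual workaround: the shared segment stores byte offsets from its own base
instead of raw pointers, and each process resolves them against its own
mapping. This is not part of the proposal under discussion, and every name
below is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical header placed at the start of a shared segment. */
struct shm_int32_header {
  int64_t length;        /* number of int32 values */
  int64_t values_offset; /* byte offset of the values from the base */
};

/* Producer side: lay out the header and the values inside the segment.
   Returns 0 on success, -1 if the segment is too small. */
static int shm_write_int32(void* base, size_t size,
                           const int32_t* values, int64_t length) {
  struct shm_int32_header header;
  size_t n_bytes;
  if (length < 0) return -1;
  n_bytes = (size_t)length * sizeof(int32_t);
  if (sizeof header + n_bytes > size) return -1;
  header.length = length;
  header.values_offset = (int64_t)sizeof header;
  memcpy(base, &header, sizeof header);
  if (n_bytes > 0) {
    memcpy((char*)base + header.values_offset, values, n_bytes);
  }
  return 0;
}

/* Consumer side: resolve the stored offset against this process's own
   mapping of the segment, wherever that mapping happens to live. */
static int32_t shm_read_int32(const void* base, int64_t i) {
  struct shm_int32_header header;
  int32_t value;
  memcpy(&header, base, sizeof header);
  assert(i >= 0 && i < header.length);
  memcpy(&value,
         (const char*)base + header.values_offset +
             i * (int64_t)sizeof value,
         sizeof value);
  return value;
}
```

Two processes that map the same segment at different addresses would both
call shm_read_int32 with their own base pointer. A raw pointer stored by
the producer would only be meaningful in the producer's address space.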

Jacques Nadeau <ja...@apache.org> writes:

> I disagree with this statement:
>
> - the IPC format is meant for serialization while the C data protocol is
> meant for in-memory communication, so different concerns apply
>
> If that is how a particular implementation presents it, that is a
> weakness of the implementation, not the format. The primary use case I
> was focused on when working on the initial format was communication within
> the same process. It seems like this is being used as a basis for the
> introduction of new things when the premise is inconsistent with the
> intention of the creation. The specific reason we used flatbuffers in the
> project was to collapse the separation of in-process and out-of-process
> communication. It means the same thing it does with the Arrow data itself:
> that a consumer doesn't have to use a particular library to interact with
> and use the data.
>
> It seems like there are two ideas here:
>
> 1) How do we make it easier for people to use Arrow?
> 2) Should we implement a new in-memory representation of Arrow that is
> language-specific?
>
> I'm entirely in support of number one. If for a particular type of domain,
> people want an easier way to interact with Arrow, let's make a new library
> that helps with that. In each of our current libraries, we do many things
> to make it easier to work with Arrow. None of those require a change to the
> core format or are formalized as a new in-memory standard. The in-memory
> representation of rust or javascript or java objects are implementation
> details.
>
> I'm against number two as it creates a fragmentation problem. Arrow is
> about having a single canonical format for memory for both metadata and
> data. Having multiple in-memory formats (especially when some are not
> language independent) is counter to the goals of the project.
>
> Two other, separate comments:
> 1) I don't understand the idea that we need to change the way Arrow
> fundamentally works so that people can avoid using a dependency. If the
> dependency is small, open source and easy to build, people can fork it and
> include directly if they want to. Let's not violate project principles
> because DuckDB has a religious perspective on dependencies. If the problem
> is people have to swallow too large of a pill to do basic things with Arrow
> in C, let's focus on fixing that (to our definition of ease, not someone
> else's). If FlatCC solves some of those things, great. If we need to build a
> baby integration library that is more C centric, great. Neither of those
> things require implementing something at the format level.
>
> 2) It seems like we should discuss the data structure problem separately
> from the reference management concern.
>
>
> On Tue, Oct 1, 2019 at 5:42 AM Wes McKinney <we...@gmail.com> wrote:
>
>> hi Antoine,
>>
>> On Tue, Oct 1, 2019 at 4:29 AM Antoine Pitrou <an...@python.org> wrote:
>> >
>> >
>> > On 01/10/2019 at 00:39, Wes McKinney wrote:
>> > > A couple things:
>> > >
>> > > * I think a C protocol / FFI for Arrow array/vectors would be better
>> > > to have the same "shape" as an assembled array. Note that the C
>> > > structs here have very nearly the same "shape" as the data structure
>> > > representing a C++ Array object [1]. The disassembly and reassembly
>> > > here is substantially simpler than the IPC protocol. A recursive
>> > > structure in Flatbuffers would make RecordBatch messages much larger,
>> > > so the flattened / disassembled representation we use for serialized
>> > > record batches is the correct one
>> >
>> > I'm not sure I agree:
>> >
>> > - indeed, it's not a coincidence that the ArrowArray struct looks quite
>> > closely like the C++ ArrayData object :-)  We have good experience with
>> > that abstraction and it has proven to work quite well
>> >
>> > - the IPC format is meant for serialization while the C data protocol is
>> > meant for in-memory communication, so different concerns apply
>> >
>> > - the fact that this makes the layout slightly larger doesn't seem
>> > important at all; we're not talking about transferring data over the wire
>> >
>> > There's also another argument for having a recursive struct: it
>> > simplifies how the data type is represented, since we can encode each
>> > child type individually instead of encoding it in the parent's format
>> > string (same applies for metadata and individual flags).
>> >
>>
>> I was saying something different here. I was making an argument about
>> why we use the flattened array-of-structs in the IPC protocol. One
>> reason is that it's a more compact representation. That is not very
>> important here because this protocol is only for *in-process* (for
>> languages that have a C FFI facility) rather than *inter-process*
>> communication.
>>
>> I agree also that the type encoding is simple, here, too, since we
>> aren't having to split the schema and record batch between different
>> serialized messages. There is some potential waste with having to
>> populate the type fields multiple times when communicating a sequence
>> of "chunks" from the same logical dataset.
>>
>> > > * The "formal" C protocol having the "assembled" shape means that many
>> > > minimal Arrow users won't have to implement any separate data
>> > > structures. They can just use the C struct directly or a slightly
>> > > wrapped version thereof with some convenience functions.
>> >
>> > Yes, but the same applies to the current proposal.
>> >
>> > > * I think that requiring building a Flatbuffer for minimal use cases
>> > > (e.g. communicating simple record batches with primitive types) passes
>> > > on implementation burden to minimal users.
>> >
>> > It certainly does.
>> >
>> > > I think the mantra of the C protocol should be the following:
>> > >
>> > > * Users of the protocol have to write little to no code to use it. For
>> > > example, populating an INT32 array should require only a few lines of
>> > > code
>> >
>> > Agreed.  As a sidenote, the spec should have an example of doing this in
>> > raw C.
>> >
>> > Regards
>> >
>> > Antoine.
>>

Re: [DISCUSS] C-level in-process array protocol

Posted by Jacques Nadeau <ja...@apache.org>.

I disagree with this statement:

- the IPC format is meant for serialization while the C data protocol is
meant for in-memory communication, so different concerns apply

If that is how a particular implementation presents it, that is a
weakness of the implementation, not the format. The primary use case I
was focused on when working on the initial format was communication within
the same process. It seems like this is being used as a basis for the
introduction of new things when the premise is inconsistent with the
intention of the creation. The specific reason we used flatbuffers in the
project was to collapse the separation of in-process and out-of-process
communication. It means the same thing it does with the Arrow data itself:
that a consumer doesn't have to use a particular library to interact with
and use the data.

It seems like there are two ideas here:

1) How do we make it easier for people to use Arrow?
2) Should we implement a new in-memory representation of Arrow that is
language-specific?

I'm entirely in support of number one. If for a particular type of domain,
people want an easier way to interact with Arrow, let's make a new library
that helps with that. In each of our current libraries, we do many things
to make it easier to work with Arrow. None of those require a change to the
core format or are formalized as a new in-memory standard. The in-memory
representation of rust or javascript or java objects are implementation
details.

I'm against number two as it creates a fragmentation problem. Arrow is
about having a single canonical format for memory for both metadata and
data. Having multiple in-memory formats (especially when some are not
language independent) is counter to the goals of the project.

Two other, separate comments:
1) I don't understand the idea that we need to change the way Arrow
fundamentally works so that people can avoid using a dependency. If the
dependency is small, open source and easy to build, people can fork it and
include directly if they want to. Let's not violate project principles
because DuckDB has a religious perspective on dependencies. If the problem
is people have to swallow too large of a pill to do basic things with Arrow
in C, let's focus on fixing that (to our definition of ease, not someone
else's). If FlatCC solves some of those things, great. If we need to build a
baby integration library that is more C centric, great. Neither of those
things require implementing something at the format level.

2) It seems like we should discuss the data structure problem separately
from the reference management concern.

On Tue, Oct 1, 2019 at 5:42 AM Wes McKinney <we...@gmail.com> wrote:

> hi Antoine,
>
> On Tue, Oct 1, 2019 at 4:29 AM Antoine Pitrou <an...@python.org> wrote:
> >
> >
> > On 01/10/2019 at 00:39, Wes McKinney wrote:
> > > A couple things:
> > >
> > > * I think a C protocol / FFI for Arrow array/vectors would be better
> > > to have the same "shape" as an assembled array. Note that the C
> > > structs here have very nearly the same "shape" as the data structure
> > > representing a C++ Array object [1]. The disassembly and reassembly
> > > here is substantially simpler than the IPC protocol. A recursive
> > > structure in Flatbuffers would make RecordBatch messages much larger,
> > > so the flattened / disassembled representation we use for serialized
> > > record batches is the correct one
> >
> > I'm not sure I agree:
> >
> > - indeed, it's not a coincidence that the ArrowArray struct looks quite
> > closely like the C++ ArrayData object :-)  We have good experience with
> > that abstraction and it has proven to work quite well
> >
> > - the IPC format is meant for serialization while the C data protocol is
> > meant for in-memory communication, so different concerns apply
> >
> > - the fact that this makes the layout slightly larger doesn't seem
> > important at all; we're not talking about transferring data over the wire
> >
> > There's also another argument for having a recursive struct: it
> > simplifies how the data type is represented, since we can encode each
> > child type individually instead of encoding it in the parent's format
> > string (same applies for metadata and individual flags).
> >
>
> I was saying something different here. I was making an argument about
> why we use the flattened array-of-structs in the IPC protocol. One
> reason is that it's a more compact representation. That is not very
> important here because this protocol is only for *in-process* (for
> languages that have a C FFI facility) rather than *inter-process*
> communication.
>
> I agree also that the type encoding is simple, here, too, since we
> aren't having to split the schema and record batch between different
> serialized messages. There is some potential waste with having to
> populate the type fields multiple times when communicating a sequence
> of "chunks" from the same logical dataset.
>
> > > * The "formal" C protocol having the "assembled" shape means that many
> > > minimal Arrow users won't have to implement any separate data
> > > structures. They can just use the C struct directly or a slightly
> > > wrapped version thereof with some convenience functions.
> >
> > Yes, but the same applies to the current proposal.
> >
> > > * I think that requiring building a Flatbuffer for minimal use cases
> > > (e.g. communicating simple record batches with primitive types) passes
> > > on implementation burden to minimal users.
> >
> > It certainly does.
> >
> > > I think the mantra of the C protocol should be the following:
> > >
> > > * Users of the protocol have to write little to no code to use it. For
> > > example, populating an INT32 array should require only a few lines of
> > > code
> >
> > Agreed.  As a sidenote, the spec should have an example of doing this in
> > raw C.
> >
> > Regards
> >
> > Antoine.
>

Re: [DISCUSS] C-level in-process array protocol

Posted by Wes McKinney <we...@gmail.com>.

hi Antoine,

On Tue, Oct 1, 2019 at 4:29 AM Antoine Pitrou <an...@python.org> wrote:
>
>
> On 01/10/2019 at 00:39, Wes McKinney wrote:
> > A couple things:
> >
> > * I think a C protocol / FFI for Arrow array/vectors would be better
> > to have the same "shape" as an assembled array. Note that the C
> > structs here have very nearly the same "shape" as the data structure
> > representing a C++ Array object [1]. The disassembly and reassembly
> > here is substantially simpler than the IPC protocol. A recursive
> > structure in Flatbuffers would make RecordBatch messages much larger,
> > so the flattened / disassembled representation we use for serialized
> > record batches is the correct one
>
> I'm not sure I agree:
>
> - indeed, it's not a coincidence that the ArrowArray struct looks quite
> closely like the C++ ArrayData object :-)  We have good experience with
> that abstraction and it has proven to work quite well
>
> - the IPC format is meant for serialization while the C data protocol is
> meant for in-memory communication, so different concerns apply
>
> - the fact that this makes the layout slightly larger doesn't seem
> important at all; we're not talking about transferring data over the wire
>
> There's also another argument for having a recursive struct: it
> simplifies how the data type is represented, since we can encode each
> child type individually instead of encoding it in the parent's format
> string (same applies for metadata and individual flags).
>

I was saying something different here. I was making an argument about
why we use the flattened array-of-structs in the IPC protocol. One
reason is that it's a more compact representation. That is not very
important here because this protocol is only for *in-process* (for
languages that have a C FFI facility) rather than *inter-process*
communication.

I agree also that the type encoding is simple, here, too, since we
aren't having to split the schema and record batch between different
serialized messages. There is some potential waste with having to
populate the type fields multiple times when communicating a sequence
of "chunks" from the same logical dataset.
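
To make the two shapes concrete, here is a minimal sketch of a recursive
("assembled") struct and a depth-first, pre-order flattening of it into an
array of small node structs, similar in spirit to the flattened node list
used for serialized record batches. All type and function names are
hypothetical, and buffers are left out to keep it short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical recursive shape: each array points at its children. */
struct tree_array {
  int64_t length;
  int64_t null_count;
  int64_t n_children;
  struct tree_array** children;
};

/* Hypothetical flattened shape: one small entry per array. */
struct flat_node {
  int64_t length;
  int64_t null_count;
};

/* Writes `array` and all of its descendants into `out` in depth-first,
   pre-order sequence. Returns the number of nodes written, or -1 if
   `capacity` is too small. */
static int64_t flatten(const struct tree_array* array,
                       struct flat_node* out, int64_t capacity) {
  int64_t used;
  int64_t i;
  assert(array != NULL);
  if (capacity < 1) return -1;
  out[0].length = array->length;
  out[0].null_count = array->null_count;
  used = 1;
  for (i = 0; i < array->n_children; ++i) {
    int64_t n = flatten(array->children[i], out + used, capacity - used);
    if (n < 0) return -1;
    used += n;
  }
  return used;
}
```

The flattened form drops the per-array pointers and child counts, which is
where the size saving comes from. The recursive form can be walked directly
without any reassembly step.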

> > * The "formal" C protocol having the "assembled" shape means that many
> > minimal Arrow users won't have to implement any separate data
> > structures. They can just use the C struct directly or a slightly
> > wrapped version thereof with some convenience functions.
>
> Yes, but the same applies to the current proposal.
>
> > * I think that requiring building a Flatbuffer for minimal use cases
> > (e.g. communicating simple record batches with primitive types) passes
> > on implementation burden to minimal users.
>
> It certainly does.
>
> > I think the mantra of the C protocol should be the following:
> >
> > * Users of the protocol have to write little to no code to use it. For
> > example, populating an INT32 array should require only a few lines of
> > code
>
> Agreed.  As a sidenote, the spec should have an example of doing this in
> raw C.
>
> Regards
>
> Antoine.
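
As a rough illustration of the "few lines of code" goal quoted above, here
is a minimal sketch of populating an INT32 array in raw C. The struct is a
hypothetical stand-in, not the ArrowArray definition from the PR, and the
"i" type string and the two-buffer convention (validity bitmap, then
values) are assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the struct under discussion. */
struct example_array {
  const char* format;   /* assumed type string; "i" stands for int32 */
  int64_t length;
  int64_t null_count;
  int64_t n_buffers;
  const void** buffers; /* [0] validity bitmap or NULL, [1] values */
  void (*release)(struct example_array*);
};

/* Frees what example_make_int32 allocated and marks the array released. */
static void example_release(struct example_array* array) {
  free((void*)array->buffers[1]);
  free(array->buffers);
  array->buffers = NULL;
  array->release = NULL;
}

/* Copies `length` values into a new array with no nulls.
   Returns 0 on success, -1 on allocation failure. */
static int example_make_int32(struct example_array* out,
                              const int32_t* values, int64_t length) {
  size_t n_bytes = (size_t)length * sizeof(int32_t);
  int32_t* copy = malloc(n_bytes > 0 ? n_bytes : 1);
  const void** buffers = malloc(2 * sizeof *buffers);
  if (copy == NULL || buffers == NULL) {
    free(copy);
    free(buffers);
    return -1;
  }
  if (n_bytes > 0) memcpy(copy, values, n_bytes);
  buffers[0] = NULL; /* no validity bitmap because there are no nulls */
  buffers[1] = copy;
  out->format = "i";
  out->length = length;
  out->null_count = 0;
  out->n_buffers = 2;
  out->buffers = buffers;
  out->release = example_release;
  return 0;
}
```

The consumer reads buffers[1] as int32 values and calls release when it is
done, so the producer keeps control over how the memory is freed.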