Posted to dev@arrow.apache.org by Antoine Pitrou <so...@pitrou.net> on 2020/08/10 14:21:46 UTC

Re: [DISCUSS] Adding a pull-style iterator API to the C data interface

From the absence of response, it would seem there isn't much interest
in this.  Please speak up if you think this would be useful to you.

Regards

Antoine.


On Tue, 7 Jul 2020 07:49:17 -0500
Wes McKinney <we...@gmail.com> wrote:
> Any opinions about this? It seems the next steps would be a concrete
> API proposal and perhaps a reference implementation thereof.
> 
> On Sun, Jun 28, 2020 at 11:26 PM Wes McKinney <we...@gmail.com> wrote:
> >
> > In ARROW-8301 [1] and elsewhere we've been discussing how to
> > communicate what amounts to a sequence of arrays or a sequence of
> > RecordBatch objects using the C data interface.
> >
> > Example use cases:
> >
> > * Returning a sequence of record / row batches from a database driver
> > * Sending a C++ arrow::ChunkedArray or arrow::Table to a consumer
> > using only the C interface
> >
> > Applications could define their own custom iterator interfaces to
> > communicate what amounts to a sequence of the ArrowArray C interface
> > objects, but it is likely a common enough use case to have an
> > off-the-shelf solution so that we can support this solution in our
> > reference libraries (e.g. Arrow C++, pyarrow, Arrow R)
> >
> > I suggested a C structure as follows
> >
> > struct ArrowArrayStream {
> >   void (*get_schema)(struct ArrowSchema*);
> >   // Non-zero return value indicates an error?
> >   int (*get_next)(struct ArrowArray*);
> >   void (*get_error)(... ERROR HANDLING TODO );
> >   void (*release)(struct ArrowArrayStream*);
> >   void* private_data;
> > };
> >
> > The producer would populate this object with pointers to its
> > implementations of these functions.
> >
> > Thoughts about this?
> >
> > Thanks,
> > Wes
> >
> > [1]: https://issues.apache.org/jira/browse/ARROW-8301  
>
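
As a rough illustration of the pull-style iteration proposed in the quoted
message, here is a minimal C sketch of a producer and a consumer. It departs
from the struct as posted in a few ways, all of which are assumptions made
for the example: each callback takes the stream pointer as its first argument
(so it can reach private_data), get_schema returns an int status, the
get_error member is left out, and end of stream is signalled by get_next
succeeding with a released array (release == NULL). The names CounterState,
make_counter_stream and sum_stream are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Layouts as published in the Arrow C data interface. */
struct ArrowSchema {
  const char* format;
  const char* name;
  const char* metadata;
  int64_t flags;
  int64_t n_children;
  struct ArrowSchema** children;
  struct ArrowSchema* dictionary;
  void (*release)(struct ArrowSchema*);
  void* private_data;
};
struct ArrowArray {
  int64_t length;
  int64_t null_count;
  int64_t offset;
  int64_t n_buffers;
  int64_t n_children;
  const void** buffers;
  struct ArrowArray** children;
  struct ArrowArray* dictionary;
  void (*release)(struct ArrowArray*);
  void* private_data;
};

/* ASSUMPTION: every callback receives the stream, so it can reach
   private_data. The struct as posted in the thread has no such parameter. */
struct ArrowArrayStream {
  int (*get_schema)(struct ArrowArrayStream*, struct ArrowSchema* out);
  int (*get_next)(struct ArrowArrayStream*, struct ArrowArray* out);
  void (*release)(struct ArrowArrayStream*);
  void* private_data;
};

/* Hypothetical producer: emits `remaining` int32 arrays of 4 values each. */
struct CounterState { int remaining; int32_t next_value; };

static void counter_schema_release(struct ArrowSchema* s) { s->release = NULL; }
static void counter_array_release(struct ArrowArray* a) {
  free((void*)a->buffers[1]); /* values buffer */
  free((void*)a->buffers);    /* table of buffer pointers */
  a->release = NULL;          /* marks the array as released */
}

static int counter_get_schema(struct ArrowArrayStream* st, struct ArrowSchema* out) {
  *out = (struct ArrowSchema){.format = "i", .name = "", /* "i" = int32 */
                              .release = counter_schema_release};
  return 0;
}

static int counter_get_next(struct ArrowArrayStream* st, struct ArrowArray* out) {
  struct CounterState* state = st->private_data;
  if (state->remaining == 0) { out->release = NULL; return 0; } /* end of stream */
  const void** buffers = malloc(2 * sizeof *buffers);
  int32_t* values = malloc(4 * sizeof *values);
  if (!buffers || !values) { free(buffers); free(values); return ENOMEM; }
  for (int i = 0; i < 4; ++i) values[i] = state->next_value++;
  buffers[0] = NULL; /* no validity bitmap: the array has no nulls */
  buffers[1] = values;
  *out = (struct ArrowArray){.length = 4, .n_buffers = 2, .buffers = buffers,
                             .release = counter_array_release};
  state->remaining--;
  return 0;
}

static void counter_stream_release(struct ArrowArrayStream* st) {
  free(st->private_data);
  st->release = NULL;
}

int make_counter_stream(struct ArrowArrayStream* out, int n_batches) {
  struct CounterState* state = malloc(sizeof *state);
  if (!state) return ENOMEM;
  *state = (struct CounterState){.remaining = n_batches, .next_value = 0};
  *out = (struct ArrowArrayStream){counter_get_schema, counter_get_next,
                                   counter_stream_release, state};
  return 0;
}

/* Consumer: pull arrays until end of stream; returns their sum, -1 on error. */
int64_t sum_stream(struct ArrowArrayStream* st) {
  assert(st->release != NULL); /* stream must still be live */
  int64_t total = 0;
  for (;;) {
    struct ArrowArray batch;
    if (st->get_next(st, &batch) != 0) return -1;
    if (batch.release == NULL) break; /* end of stream */
    const int32_t* values = batch.buffers[1];
    for (int64_t i = 0; i < batch.length; ++i) total += values[i];
    batch.release(&batch);
  }
  return total;
}
```

The consumer only ever touches the function pointers and the ArrowArray
fields, so it does not need to link against the producer's library. Each
array is released by the consumer once it is done with it, and the stream
itself is released separately by whoever owns it.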

Re: [DISCUSS] Adding a pull-style iterator API to the C data interface

Posted by Antoine Pitrou <an...@python.org>.

I proposed an API here:
https://github.com/apache/arrow/pull/8052

It is not much different from what Wes proposed earlier in the thread,
except in error reporting.  Comments welcome (here or on the PR).

Regards

Antoine.
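
The PR itself is not reproduced here. Purely as an illustration of one way
error reporting could work in such a stream, the C sketch below assumes that
callbacks return 0 on success or an errno-style code, and that an extra
get_last_error callback returns an optional message owned by the stream.
These conventions, and the names FailingState, make_failing_stream and
try_get_next, are assumptions made for the example and should not be read as
a transcription of the proposal.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

struct ArrowSchema; /* defined by the C data interface; opaque in this sketch */
struct ArrowArray;

/* ASSUMPTION: callbacks return 0 on success or an errno-style code, and
   get_last_error gives an optional message owned by the stream. */
struct ArrowArrayStream {
  int (*get_schema)(struct ArrowArrayStream*, struct ArrowSchema* out);
  int (*get_next)(struct ArrowArrayStream*, struct ArrowArray* out);
  const char* (*get_last_error)(struct ArrowArrayStream*);
  void (*release)(struct ArrowArrayStream*);
  void* private_data;
};

/* Hypothetical producer whose data source is always unreachable. */
struct FailingState { const char* last_error; };

static int failing_get_schema(struct ArrowArrayStream* st, struct ArrowSchema* out) {
  (void)out;
  ((struct FailingState*)st->private_data)->last_error = "cannot read schema";
  return EIO;
}

static int failing_get_next(struct ArrowArrayStream* st, struct ArrowArray* out) {
  (void)out;
  ((struct FailingState*)st->private_data)->last_error = "connection lost";
  return EIO;
}

static const char* failing_get_last_error(struct ArrowArrayStream* st) {
  return ((struct FailingState*)st->private_data)->last_error;
}

/* The caller owns the state, so release only marks the stream as released. */
static void failing_release(struct ArrowArrayStream* st) { st->release = NULL; }

void make_failing_stream(struct ArrowArrayStream* out, struct FailingState* state) {
  state->last_error = NULL;
  *out = (struct ArrowArrayStream){failing_get_schema, failing_get_next,
                                   failing_get_last_error, failing_release, state};
}

/* Consumer helper: NULL on success, otherwise a human-readable message. */
const char* try_get_next(struct ArrowArrayStream* st, struct ArrowArray* out) {
  assert(st->release != NULL); /* stream must still be live */
  int code = st->get_next(st, out);
  if (code == 0) return NULL;
  const char* msg = st->get_last_error(st);
  return msg != NULL ? msg : strerror(code);
}
```

Returning a plain int keeps the fast path cheap, while the separate message
callback lets a producer attach detail without the consumer having to know
anything about the producer's own error types.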



On 16/08/2020 at 21:28, Wes McKinney wrote:
> I opened https://issues.apache.org/jira/browse/ARROW-9761 about adding
> a preliminary C++ (and Python) implementation to help stir the pot. My
> understanding is that DuckDB is working on using the C interface right
> now [1] and the absence of an iterator interface makes such
> integration require more work than would be ideal
> 
> [1]: https://github.com/cwida/duckdb/issues/151#issuecomment-674120291
> 
> On Fri, Aug 14, 2020 at 6:57 PM Jacques Nadeau <ja...@apache.org> wrote:
>>
>> I think this unlocks a bunch of use cases. I think people are generally
>> using Arrow in simpler, non-streaming ways right now and thus the quiet.
>> Producing an iterator pattern is logical as you move to streams of smaller
>> chunks (common in distributed and multi-tenant systems).
>>
>> On Mon, Aug 10, 2020 at 11:56 AM Wes McKinney <we...@gmail.com> wrote:
>>
>>> I'm still in need of it. I'd be interested in developing a solution
>>> that can be used in some database APIs, e.g. using it for the result
>>> interface for an embedded SQL database like SQLite or DuckDB would be
>>> an interesting motivating use case.
>>>
>>> One approach would be to create something unofficial and used only in
>>> the C++ library's implementation of the C API so that it can make
>>> breaking changes for a time and then propose to formalize it in the
>>> ABI later.
>>>
>>> On Mon, Aug 10, 2020 at 9:22 AM Antoine Pitrou <so...@pitrou.net>
>>> wrote:
>>>>
>>>>
>>>> From the absence of response, it would seem there isn't much interest
>>>> in this.  Please speak up if you think this would be useful to you.
>>>>
>>>> Regards
>>>>
>>>> Antoine.
>>>>
>>>>
>>>> On Tue, 7 Jul 2020 07:49:17 -0500
>>>> Wes McKinney <we...@gmail.com> wrote:
>>>>> Any opinions about this? It seems the next steps would be a concrete
>>>>> API proposal and perhaps a reference implementation thereof.
>>>>>
>>>>> On Sun, Jun 28, 2020 at 11:26 PM Wes McKinney <we...@gmail.com>
>>> wrote:
>>>>>>
>>>>>> In ARROW-8301 [1] and elsewhere we've been discussing how to
>>>>>> communicate what amounts to a sequence of arrays or a sequence of
>>>>>> RecordBatch objects using the C data interface.
>>>>>>
>>>>>> Example use cases:
>>>>>>
>>>>>> * Returning a sequence of record / row batches from a database driver
>>>>>> * Sending a C++ arrow::ChunkedArray or arrow::Table to a consumer
>>>>>> using only the C interface
>>>>>>
>>>>>> Applications could define their own custom iterator interfaces to
>>>>>> communicate what amounts to a sequence of the ArrowArray C interface
>>>>>> objects, but it is likely a common enough use case to have an
>>>>>> off-the-shelf solution so that we can support this solution in our
>>>>>> reference libraries (e.g. Arrow C++, pyarrow, Arrow R)
>>>>>>
>>>>>> I suggested a C structure as follows
>>>>>>
>>>>>> struct ArrowArrayStream {
>>>>>>   void (*get_schema)(struct ArrowSchema*);
>>>>>>   // Non-zero return value indicates an error?
>>>>>>   int (*get_next)(struct ArrowArray*);
>>>>>>   void (*get_error)(... ERROR HANDLING TODO );
>>>>>>   void (*release)(struct ArrowArrayStream*);
>>>>>>   void* private_data;
>>>>>> };
>>>>>>
>>>>>> The producer would populate this object with pointers to its
>>>>>> implementations of these functions.
>>>>>>
>>>>>> Thoughts about this?
>>>>>>
>>>>>> Thanks,
>>>>>> Wes
>>>>>>
>>>>>> [1]: https://issues.apache.org/jira/browse/ARROW-8301
>>>>>
>>>>
>>>>
>>>>
>>>

Re: [DISCUSS] Adding a pull-style iterator API to the C data interface

Posted by Wes McKinney <we...@gmail.com>.

I opened https://issues.apache.org/jira/browse/ARROW-9761 about adding
a preliminary C++ (and Python) implementation to help stir the pot. My
understanding is that DuckDB is working on using the C interface right
now [1] and the absence of an iterator interface makes such
integration require more work than would be ideal.

[1]: https://github.com/cwida/duckdb/issues/151#issuecomment-674120291

On Fri, Aug 14, 2020 at 6:57 PM Jacques Nadeau <ja...@apache.org> wrote:
>
> I think this unlocks a bunch of use cases. I think people are generally
> using Arrow in simpler, non-streaming ways right now and thus the quiet.
> Producing an iterator pattern is logical as you move to streams of smaller
> chunks (common in distributed and multi-tenant systems).
>
> On Mon, Aug 10, 2020 at 11:56 AM Wes McKinney <we...@gmail.com> wrote:
>
> > I'm still in need of it. I'd be interested in developing a solution
> > that can be used in some database APIs, e.g. using it for the result
> > interface for an embedded SQL database like SQLite or DuckDB would be
> > an interesting motivating use case.
> >
> > One approach would be to create something unofficial and used only in
> > the C++ library's implementation of the C API so that it can make
> > breaking changes for a time and then propose to formalize it in the
> > ABI later.
> >
> > On Mon, Aug 10, 2020 at 9:22 AM Antoine Pitrou <so...@pitrou.net>
> > wrote:
> > >
> > >
> > > From the absence of response, it would seem there isn't much interest
> > > in this.  Please speak up if you think this would be useful to you.
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> > >
> > > On Tue, 7 Jul 2020 07:49:17 -0500
> > > Wes McKinney <we...@gmail.com> wrote:
> > > > Any opinions about this? It seems the next steps would be a concrete
> > > > API proposal and perhaps a reference implementation thereof.
> > > >
> > > > On Sun, Jun 28, 2020 at 11:26 PM Wes McKinney <we...@gmail.com>
> > wrote:
> > > > >
> > > > > In ARROW-8301 [1] and elsewhere we've been discussing how to
> > > > > communicate what amounts to a sequence of arrays or a sequence of
> > > > > RecordBatch objects using the C data interface.
> > > > >
> > > > > Example use cases:
> > > > >
> > > > > * Returning a sequence of record / row batches from a database driver
> > > > > * Sending a C++ arrow::ChunkedArray or arrow::Table to a consumer
> > > > > using only the C interface
> > > > >
> > > > > Applications could define their own custom iterator interfaces to
> > > > > communicate what amounts to a sequence of the ArrowArray C interface
> > > > > objects, but it is likely a common enough use case to have an
> > > > > off-the-shelf solution so that we can support this solution in our
> > > > > reference libraries (e.g. Arrow C++, pyarrow, Arrow R)
> > > > >
> > > > > I suggested a C structure as follows
> > > > >
> > > > > struct ArrowArrayStream {
> > > > >   void (*get_schema)(struct ArrowSchema*);
> > > > >   // Non-zero return value indicates an error?
> > > > >   int (*get_next)(struct ArrowArray*);
> > > > >   void (*get_error)(... ERROR HANDLING TODO );
> > > > >   void (*release)(struct ArrowArrayStream*);
> > > > >   void* private_data;
> > > > > };
> > > > >
> > > > > The producer would populate this object with pointers to its
> > > > > implementations of these functions.
> > > > >
> > > > > Thoughts about this?
> > > > >
> > > > > Thanks,
> > > > > Wes
> > > > >
> > > > > [1]: https://issues.apache.org/jira/browse/ARROW-8301
> > > >
> > >
> > >
> > >
> >

Re: [DISCUSS] Adding a pull-style iterator API to the C data interface

Posted by Jacques Nadeau <ja...@apache.org>.

I think this unlocks a bunch of use cases. I think people are generally
using Arrow in simpler, non-streaming ways right now and thus the quiet.
Producing an iterator pattern is logical as you move to streams of smaller
chunks (common in distributed and multi-tenant systems).

On Mon, Aug 10, 2020 at 11:56 AM Wes McKinney <we...@gmail.com> wrote:

> I'm still in need of it. I'd be interested in developing a solution
> that can be used in some database APIs, e.g. using it for the result
> interface for an embedded SQL database like SQLite or DuckDB would be
> an interesting motivating use case.
>
> One approach would be to create something unofficial and used only in
> the C++ library's implementation of the C API so that it can make
> breaking changes for a time and then propose to formalize it in the
> ABI later.
>
> On Mon, Aug 10, 2020 at 9:22 AM Antoine Pitrou <so...@pitrou.net>
> wrote:
> >
> >
> > From the absence of response, it would seem there isn't much interest
> > in this.  Please speak up if you think this would be useful to you.
> >
> > Regards
> >
> > Antoine.
> >
> >
> > On Tue, 7 Jul 2020 07:49:17 -0500
> > Wes McKinney <we...@gmail.com> wrote:
> > > Any opinions about this? It seems the next steps would be a concrete
> > > API proposal and perhaps a reference implementation thereof.
> > >
> > > On Sun, Jun 28, 2020 at 11:26 PM Wes McKinney <we...@gmail.com>
> wrote:
> > > >
> > > > In ARROW-8301 [1] and elsewhere we've been discussing how to
> > > > communicate what amounts to a sequence of arrays or a sequence of
> > > > RecordBatch objects using the C data interface.
> > > >
> > > > Example use cases:
> > > >
> > > > * Returning a sequence of record / row batches from a database driver
> > > > * Sending a C++ arrow::ChunkedArray or arrow::Table to a consumer
> > > > using only the C interface
> > > >
> > > > Applications could define their own custom iterator interfaces to
> > > > communicate what amounts to a sequence of the ArrowArray C interface
> > > > objects, but it is likely a common enough use case to have an
> > > > off-the-shelf solution so that we can support this solution in our
> > > > reference libraries (e.g. Arrow C++, pyarrow, Arrow R)
> > > >
> > > > I suggested a C structure as follows
> > > >
> > > > struct ArrowArrayStream {
> > > >   void (*get_schema)(struct ArrowSchema*);
> > > >   // Non-zero return value indicates an error?
> > > >   int (*get_next)(struct ArrowArray*);
> > > >   void (*get_error)(... ERROR HANDLING TODO );
> > > >   void (*release)(struct ArrowArrayStream*);
> > > >   void* private_data;
> > > > };
> > > >
> > > > The producer would populate this object with pointers to its
> > > > implementations of these functions.
> > > >
> > > > Thoughts about this?
> > > >
> > > > Thanks,
> > > > Wes
> > > >
> > > > [1]: https://issues.apache.org/jira/browse/ARROW-8301
> > >
> >
> >
> >
>

Re: [DISCUSS] Adding a pull-style iterator API to the C data interface

Posted by Wes McKinney <we...@gmail.com>.

I'm still in need of it. I'd be interested in developing a solution
that can be used in some database APIs, e.g. using it for the result
interface for an embedded SQL database like SQLite or DuckDB would be
an interesting motivating use case.

One approach would be to create something unofficial and used only in
the C++ library's implementation of the C API so that it can make
breaking changes for a time and then propose to formalize it in the
ABI later.

On Mon, Aug 10, 2020 at 9:22 AM Antoine Pitrou <so...@pitrou.net> wrote:
>
>
> From the absence of response, it would seem there isn't much interest
> in this.  Please speak up if you think this would be useful to you.
>
> Regards
>
> Antoine.
>
>
> On Tue, 7 Jul 2020 07:49:17 -0500
> Wes McKinney <we...@gmail.com> wrote:
> > Any opinions about this? It seems the next steps would be a concrete
> > API proposal and perhaps a reference implementation thereof.
> >
> > On Sun, Jun 28, 2020 at 11:26 PM Wes McKinney <we...@gmail.com> wrote:
> > >
> > > In ARROW-8301 [1] and elsewhere we've been discussing how to
> > > communicate what amounts to a sequence of arrays or a sequence of
> > > RecordBatch objects using the C data interface.
> > >
> > > Example use cases:
> > >
> > > * Returning a sequence of record / row batches from a database driver
> > > * Sending a C++ arrow::ChunkedArray or arrow::Table to a consumer
> > > using only the C interface
> > >
> > > Applications could define their own custom iterator interfaces to
> > > communicate what amounts to a sequence of the ArrowArray C interface
> > > objects, but it is likely a common enough use case to have an
> > > off-the-shelf solution so that we can support this solution in our
> > > reference libraries (e.g. Arrow C++, pyarrow, Arrow R)
> > >
> > > I suggested a C structure as follows
> > >
> > > struct ArrowArrayStream {
> > >   void (*get_schema)(struct ArrowSchema*);
> > >   // Non-zero return value indicates an error?
> > >   int (*get_next)(struct ArrowArray*);
> > >   void (*get_error)(... ERROR HANDLING TODO );
> > >   void (*release)(struct ArrowArrayStream*);
> > >   void* private_data;
> > > };
> > >
> > > The producer would populate this object with pointers to its
> > > implementations of these functions.
> > >
> > > Thoughts about this?
> > >
> > > Thanks,
> > > Wes
> > >
> > > [1]: https://issues.apache.org/jira/browse/ARROW-8301
> >
>
>
>