Posted to dev@arrow.apache.org by Jonathan Keane <jk...@gmail.com> on 2022/06/03 13:34:42 UTC

Re: Existence/name/scope for minimal C/C++ Arrow C Data interface helpers

cc Hannes Mühleisen from DuckDB Labs

-Jon


On Tue, May 31, 2022 at 5:03 PM Wes McKinney <we...@gmail.com> wrote:

> I'm also supportive of having a small vendorable C/C++ "Arrow
> middleware" that provides:
>
> * Schemas and types
> * Columnar data structures and minimal APIs to build them and iterate over
> them
> * C data interface
> * Minimal validation (at the level of Validate but not ValidateFull)
>
> I don't think it's going to be practical to try to refactor parts of
> the existing Arrow C++ core to be vendorable since there are many
> features / requirements (e.g. an extensible buffer and device API)
> that these C++ classes include that aren't needed in this
> limited-feature middleware library.
>
> This also relates to the "Improving Arrow's database support" project
> that David Li raised some time ago [1]. If we want to encourage
> database driver libraries to add new APIs that emit the Arrow C
> interface, we need to make it easier to generate the C interface
> without requiring a new library dependency.
>
> [1]: https://lists.apache.org/thread/gnz1kz2rj3rb8rh8qz7l0mv8lvzq254w
>
> On Mon, May 30, 2022 at 11:31 AM Jonathan Keane <jk...@gmail.com> wrote:
> >
> > Thanks for working on this. I've heard people asking about something
> > like this from a number of different fronts on top of the obvious use
> > case in geoarrow | other geospatial libraries. I think a minimal piece
> > of Arrow that other packages could depend on without needing to bring
> > in all of arrow would be super valuable in building the bridges we
> > want across other systems.
> >
> > Do you have any (design) documentation that describes the scope of
> > what you're thinking? I know there have been others floating around
> > [1] [2] that were in a similar spirit.
> >
> > A few more questions I hope will spark more conversation: How do the
> > header files you linked in [3] overlap with these other efforts? Are
> > those headers something we could|should "just" PR into apache/arrow
> > and write up how to use them? If not, what work would it take to get
> > them there? (The answer could, of course, be to design something else
> > entirely and PR that!)
> >
> > [1] https://github.com/paleolimbot/narrow
> > [2] https://paleolimbot.github.io/narrow/articles/why-narrow.html
> > [3] https://github.com/paleolimbot/geoarrow-cpp/tree/main/src/geoarrow/internal/arrow-hpp
> >
> > -Jon
> >
> >
> > On Wed, May 25, 2022 at 9:29 AM Dewey Dunnington <de...@voltrondata.com> wrote:
> > >
> > > I'm writing to gauge interest in a set of helpers in C and/or C++ for
> > > reading/exporting Arrow C Data interface structures. My use-case is
> > > building Arrow geospatial support in R [1], and while the set of helpers
> > > I've been using [2] has served the purpose of me writing about the
> > > opportunities for Arrow + geospatial [3], I would like to rewrite the
> > > prototype based on something developed by/with the Arrow community.
> > >
> > > Does a set of C/C++ helpers for Arrow C Data interface structures already
> > > exist? *Should* it exist?
> > >
> > > If it doesn't, what should the name/scope of that library be? The names
> > > 'nanoarrow', 'narrow', 'sparrow', and 'arrow-hpp' have all surfaced in my
> > > limited discussion of this so far. For the purpose of starting the
> > > discussion, I'll posit that the library should include helpers to
> > > allocate/destroy C Data interface structures, a schema metadata
> > > encoder/decoder, validation of a schema/array pair, and something like the
> > > ArrayBuilder C++ class.
> > >
> > > [1] https://lists.apache.org/thread/yb7p9wpg3k128njskhwj9j788opb67g7
> > > [2] https://github.com/paleolimbot/geoarrow-cpp/tree/main/src/geoarrow/internal/arrow-hpp
> > > [3] https://docs.google.com/document/d/1A6e3XCerjhXVFHBDaoAlBBNFb2HG4RB9SVRpuBru7E4/edit?usp=sharing
>
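
The proposal quoted above asks for helpers to allocate and destroy C Data
interface structures. A minimal sketch of what such a helper might look like
follows. The struct ArrowSchema definition is the one given in the Arrow C
Data Interface specification; the names example_schema_init and
example_schema_release are hypothetical and are not taken from any library
mentioned in this thread.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Struct layout as defined by the Arrow C Data Interface specification. */
struct ArrowSchema {
  const char* format;
  const char* name;
  const char* metadata;
  int64_t flags;
  int64_t n_children;
  struct ArrowSchema** children;
  struct ArrowSchema* dictionary;
  void (*release)(struct ArrowSchema*);
  void* private_data;
};

/* Hypothetical release callback: frees what example_schema_init() allocated
 * and marks the struct as released by setting release to NULL. */
static void example_schema_release(struct ArrowSchema* schema) {
  free((void*)schema->format);
  schema->format = NULL;
  schema->release = NULL;
}

/* Hypothetical helper: initialize a childless schema that owns a copy of
 * `format`. Returns 0 on success, -1 on allocation failure. */
static int example_schema_init(struct ArrowSchema* schema, const char* format) {
  assert(schema != NULL && format != NULL);
  size_t n = strlen(format) + 1;
  char* copy = (char*)malloc(n);
  if (copy == NULL) return -1;
  memcpy(copy, format, n);

  memset(schema, 0, sizeof(*schema));
  schema->format = copy;
  schema->release = &example_schema_release;
  return 0;
}
```

A consumer that is done with the schema calls schema.release(&schema); after
that call schema.release is NULL, which is how the specification says a
released structure is recognized.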

Re: Existence/name/scope for minimal C/C++ Arrow C Data interface helpers

Posted by Dewey Dunnington <de...@voltrondata.com>.
> Can we name it miniarrow or nanoarrow?

I'm happy to call it something else! Probably nanoarrow if I get to pick
because of the parallel with nanopb/nanodbc.

On Thu, Jun 16, 2022 at 6:26 AM Antoine Pitrou <an...@python.org> wrote:

>
> Can we name it miniarrow or nanoarrow? We don't want to convey the
> message that there is a parallel C API for Arrow.
>
>
> Le 15/06/2022 à 05:18, Dewey Dunnington a écrit :
> > Hi all,
> >
> > I opened a second PR [1] drafting a design for storing parsed information
> > obtained from a struct ArrowSchema (i.e., parsing the format string into
> > usable C structures). There are some unsolved problems that could use a
> > fresh perspective...all comments welcome!
> >
> > [1] https://github.com/paleolimbot/arrow-c/pull/5
> >
> > On Fri, Jun 10, 2022 at 12:27 PM Dewey Dunnington <dewey@voltrondata.com> wrote:
> >
> >> Hi all,
> >>
> >> As promised, I converted the design document [1] into an initial PR [2].
> >> Rather than draft the whole header, I started with README + implementations
> >> + testing for error handling and schema allocation (depending on feedback,
> >> next week I will draft another reviewable chunk).
> >>
> >> Also feel free to suggest another place to put this if one exists (the
> >> choice to put it in its own repo was based on informal feedback that
> >> perhaps that might be the best way to go).
> >>
> >> [1] https://docs.google.com/document/d/11n7ICVZO8exZ-z3GRlI26VLzKPXlYlEz5xjLl1y0ujU/edit?usp=sharing
> >> [2] https://github.com/paleolimbot/arrow-c/pull/1/files
> >>
> >> On Fri, Jun 3, 2022 at 12:41 PM Dewey Dunnington <dewey@voltrondata.com> wrote:
> >>
> >>> Hi all,
> >>>
> >>> Based on the points raised above and a few adventures implementing some
> >>> of this in related projects, I put together a brief design document
> >>> proposing a scope and structure to perhaps solidify a few of these
> >>> discussions:
> >>> https://docs.google.com/document/d/11n7ICVZO8exZ-z3GRlI26VLzKPXlYlEz5xjLl1y0ujU/edit?usp=sharing
> >>> .
> >>>
> >>> Any and all should feel free to add, rewrite, or propose a new
> >>> structure...I wrote many of the pieces for argument's sake or because
> >>> that's how I'd implemented them before.
> >>>
> >>> Next week I will phrase it as a skeleton header (like the one in the
> >>> excellent ADBC design discussions) depending on feedback to keep the
> >>> discussion going!
> >>>
> >>> Cheers,
> >>>
> >>> -dewey
> >>>
> >>> On Fri, Jun 3, 2022 at 9:57 AM Hannes Mühleisen <hannes@duckdblabs.com> wrote:
> >>>
> >>>> Hello List,
> >>>>
> >>>> We at DuckDB are happy users of the Arrow C Data Interface: we use it
> >>>> to feed SQL queries and to provide query results in Arrow format
> >>>> again. It is particularly appealing to us that the interface is merely
> >>>> a (C) header file that we just ship with our source code [1].
> >>>> Internally, our implementation then constructs DuckDB internal vectors
> >>>> from the Arrow format [2] or vice-versa [3].
> >>>>
> >>>> As you can see from [2, 3] there is some complexity in getting the
> >>>> conversion right, especially for more complex data types like nested
> >>>> types (list, strings). A lightweight, dependency-free library to help
> >>>> construct those would certainly be appreciated. What would also help a
> >>>> lot is validation code: Arrow structures are very delicate and one wrong
> >>>> pointer can lead to disaster (which is then blamed on us), so a way to
> >>>> verify the structures in said lightweight library would be very helpful.
> >>>>
> >>>> Best from Amsterdam, and Quack
> >>>>
> >>>> Hannes
> >>>>
> >>>> [1] https://github.com/duckdb/duckdb/blob/master/src/include/duckdb/common/arrow.hpp
> >>>> [2] https://github.com/duckdb/duckdb/blob/master/src/function/table/arrow.cpp
> >>>> [3] https://github.com/duckdb/duckdb/blob/master/src/common/types/data_chunk.cpp
> >>>>
> >>>>
> >
>
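
The validation Hannes asks for in the quoted message, at the "minimal" level
Wes describes, could look roughly like the sketch below. It only checks that
the fields of a schema/array pair are structurally consistent and never reads
buffer contents. The struct definitions follow the Arrow C Data Interface
specification; example_validate_minimal, its return codes, and the two
stand-in release callbacks are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Struct layouts as defined by the Arrow C Data Interface specification. */
struct ArrowSchema {
  const char* format;
  const char* name;
  const char* metadata;
  int64_t flags;
  int64_t n_children;
  struct ArrowSchema** children;
  struct ArrowSchema* dictionary;
  void (*release)(struct ArrowSchema*);
  void* private_data;
};

struct ArrowArray {
  int64_t length;
  int64_t null_count;
  int64_t offset;
  int64_t n_buffers;
  int64_t n_children;
  const void** buffers;
  struct ArrowArray** children;
  struct ArrowArray* dictionary;
  void (*release)(struct ArrowArray*);
  void* private_data;
};

/* Hypothetical stand-in callbacks for structs that own no heap memory:
 * they only mark the struct as released. */
static void example_static_schema_release(struct ArrowSchema* schema) {
  assert(schema != NULL);
  schema->release = NULL;
}

static void example_static_array_release(struct ArrowArray* array) {
  assert(array != NULL);
  array->release = NULL;
}

/* Hypothetical validator: returns 0 when the pair is structurally
 * consistent, otherwise a small positive code naming the failed check. */
static int example_validate_minimal(const struct ArrowSchema* schema,
                                    const struct ArrowArray* array) {
  if (schema == NULL || array == NULL) return 1;
  if (schema->release == NULL || array->release == NULL) return 2; /* released */
  if (schema->format == NULL) return 3;
  if (array->length < 0 || array->offset < 0) return 4;
  /* -1 means "null count not computed"; anything above length is invalid. */
  if (array->null_count < -1 || array->null_count > array->length) return 5;
  if (array->n_buffers < 0) return 6;
  if (array->n_buffers > 0 && array->buffers == NULL) return 6;
  if (array->n_children != schema->n_children) return 7;
  if (array->n_children > 0 &&
      (array->children == NULL || schema->children == NULL)) return 8;

  /* Apply the same checks to every child pair. */
  for (int64_t i = 0; i < array->n_children; i++) {
    int rc = example_validate_minimal(schema->children[i], array->children[i]);
    if (rc != 0) return rc;
  }
  return 0;
}
```

A fuller validator would also compare n_buffers against the count implied by
the format string and check offsets against buffer sizes; that is left out
here to keep the sketch short.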

Re: Existence/name/scope for minimal C/C++ Arrow C Data interface helpers

Posted by Antoine Pitrou <an...@python.org>.
Can we name it miniarrow or nanoarrow? We don't want to convey the 
message that there is a parallel C API for Arrow.


Le 15/06/2022 à 05:18, Dewey Dunnington a écrit :
> Hi all,
> 
> I opened a second PR [1] drafting a design for storing parsed information
> obtained from a struct ArrowSchema (i.e., parsing the format string into
> usable C structures). There are some unsolved problems that could use a
> fresh perspective...all comments welcome!
> 
> [1] https://github.com/paleolimbot/arrow-c/pull/5
> 
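
The quoted message describes parsing the ArrowSchema format string into
usable C structures. A minimal sketch of that idea, covering only a handful
of the format strings listed in the Arrow C Data Interface specification
("i", "l", "g", "u", "w:N", "+l", "+s"), might look like this. The type
ExampleSchemaView, the enum, and example_parse_format are hypothetical names
and are not the design proposed in the linked PR.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical type identifiers for the subset of formats handled here. */
enum ExampleTypeId {
  EXAMPLE_TYPE_INT32,
  EXAMPLE_TYPE_INT64,
  EXAMPLE_TYPE_DOUBLE,
  EXAMPLE_TYPE_STRING,
  EXAMPLE_TYPE_FIXED_SIZE_BINARY,
  EXAMPLE_TYPE_LIST,
  EXAMPLE_TYPE_STRUCT
};

/* Hypothetical parsed view of a schema's format string. */
struct ExampleSchemaView {
  enum ExampleTypeId type_id;
  int32_t fixed_size; /* byte width for "w:N", otherwise 0 */
};

/* Parse `format` into `out`. Returns 0 on success and -1 when the string is
 * malformed or not one of the formats this sketch understands. */
static int example_parse_format(const char* format,
                                struct ExampleSchemaView* out) {
  assert(out != NULL);
  if (format == NULL || format[0] == '\0') return -1;
  out->fixed_size = 0;

  switch (format[0]) {
    case 'i': out->type_id = EXAMPLE_TYPE_INT32; break;
    case 'l': out->type_id = EXAMPLE_TYPE_INT64; break;
    case 'g': out->type_id = EXAMPLE_TYPE_DOUBLE; break;
    case 'u': out->type_id = EXAMPLE_TYPE_STRING; break;
    case 'w': {
      /* Fixed-size binary: "w:" followed by a decimal byte width. */
      if (format[1] != ':' || format[2] < '0' || format[2] > '9') return -1;
      char* end = NULL;
      long width = strtol(format + 2, &end, 10);
      if (*end != '\0' || width > INT32_MAX) return -1;
      out->type_id = EXAMPLE_TYPE_FIXED_SIZE_BINARY;
      out->fixed_size = (int32_t)width;
      return 0;
    }
    case '+':
      /* Nested types: "+l" is list, "+s" is struct. */
      if (format[1] == 'l') out->type_id = EXAMPLE_TYPE_LIST;
      else if (format[1] == 's') out->type_id = EXAMPLE_TYPE_STRUCT;
      else return -1;
      return format[2] == '\0' ? 0 : -1;
    default:
      return -1;
  }
  /* Single-character formats must not have trailing characters. */
  return format[1] == '\0' ? 0 : -1;
}
```

Keeping the parsed result in a small plain struct means a caller can switch
on type_id instead of re-reading the format string at every use.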

Re: Existence/name/scope for minimal C/C++ Arrow C Data interface helpers

Posted by David Li <li...@apache.org>.
Now at https://github.com/apache/arrow-nanoarrow

Dewey: you can use .asf.yaml to enable issues and such: https://cwiki.apache.org/confluence/display/INFRA/Git+-+.asf.yaml+features#Git.asf.yamlfeatures-GitHubsettings

On Thu, Jul 7, 2022, at 09:06, David Li wrote:
> I'll go ahead and set up arrow-nanoarrow for convenience.
>
> In the medium term we should think about whether arrow-adbc and 
> arrow-nanoarrow should be folded back into the arrow monorepo, in order 
> to potentially reduce the release/CI maintenance burden, or document 
> why we've chosen to split those off (while other languages like Go and 
> JS remain). 
>
> On Wed, Jul 6, 2022, at 15:18, Dewey Dunnington wrote:
>> I'm happy to develop anywhere anytime! My personal vote would be
>> apache/arrow-nanoarrow because it highlights the minimal-ness of it but am
>> happy to move forward however the community sees fit.
>>
>> Cheers,
>>
>> -dewey
>>
>> On Wed, Jul 6, 2022 at 12:46 PM Wes McKinney <we...@gmail.com> wrote:
>>
>>> hi all,
>>>
>>> Is there a path to doing this development work in project-owned
>>> repositories so the IP is "blessed" from an ASF governance / IP
>>> lineage standpoint? I see two potential routes:
>>>
>>> * Working in a subdirectory of apache/arrow
>>> * Creating a new repository like apache/arrow-c (or some other
>>> arrow-$SOMETHING)
>>>
>>> Otherwise we could be looking at having to do an IP clearance /
>>> software grant at a later time.
>>>
>>> Thanks,
>>> Wes
>>>
>>> On Sat, Jun 25, 2022 at 8:52 PM Dewey Dunnington <de...@voltrondata.com>
>>> wrote:
>>> >
>>> > Hi all,
>>> >
>>> > Thanks for all the feedback so far! I've opened up two more draft PRs
>>> > implementing [1] an API for owning buffers (precursor to creating struct
>>> > ArrowArrays) and [2] an API for creating ArrowSchema objects for all Arrow
>>> > types. All comments welcome!
>>> >
>>> > -dewey
>>> >
>>> > [1] https://github.com/paleolimbot/nanoarrow/pull/9
>>> > [2] https://github.com/paleolimbot/nanoarrow/pull/10
>>> >
>>>

Re: Existence/name/scope for minimal C/C++ Arrow C Data interface helpers

Posted by David Li <li...@apache.org>.
I'll go ahead and set up arrow-nanoarrow for convenience.

In the medium term we should think about whether arrow-adbc and arrow-nanoarrow should be folded back into the arrow monorepo, in order to potentially reduce the release/CI maintenance burden, or document why we've chosen to split those off (while other languages like Go and JS remain). 

On Wed, Jul 6, 2022, at 15:18, Dewey Dunnington wrote:
> I'm happy to develop anywhere anytime! My personal vote would be
> apache/arrow-nanoarrow because it highlights the minimal-ness of it but am
> happy to move forward however the community sees fit.
>
> Cheers,
>
> -dewey

Re: Existence/name/scope for minimal C/C++ Arrow C Data interface helpers

Posted by Dewey Dunnington <de...@voltrondata.com>.
I'm happy to develop anywhere, anytime! My personal vote would be
apache/arrow-nanoarrow because it highlights the minimal-ness of it, but I am
happy to move forward however the community sees fit.

Cheers,

-dewey

On Wed, Jul 6, 2022 at 12:46 PM Wes McKinney <we...@gmail.com> wrote:

> hi all,
>
> Is there a path to doing this development work in project-owned
> repositories so the IP is "blessed" from an ASF governance / IP
> lineage standpoint? I see two potential routes:
>
> * Working in a subdirectory of apache/arrow
> * Creating a new repository like apache/arrow-c (or some other
> arrow-$SOMETHING)
>
> Otherwise we could be looking at having to do an IP clearance /
> software grant at a later time.
>
> Thanks,
> Wes

Re: Existence/name/scope for minimal C/C++ Arrow C Data interface helpers

Posted by Wes McKinney <we...@gmail.com>.
hi all,

Is there a path to doing this development work in project-owned
repositories so the IP is "blessed" from an ASF governance / IP
lineage standpoint? I see two potential routes:

* Working in a subdirectory of apache/arrow
* Creating a new repository like apache/arrow-c (or some other arrow-$SOMETHING)

Otherwise we could be looking at having to do an IP clearance /
software grant at a later time.

Thanks,
Wes

On Sat, Jun 25, 2022 at 8:52 PM Dewey Dunnington <de...@voltrondata.com> wrote:
>
> Hi all,
>
> Thanks for all the feedback so far! I've opened up two more draft PRs
> implementing [1] an API for owning buffers (precursor to creating struct
> ArrowArrays) and [2] an API for creating ArrowSchema objects for all Arrow
> types. All comments welcome!
>
> -dewey
>
> [1] https://github.com/paleolimbot/nanoarrow/pull/9
> [2] https://github.com/paleolimbot/nanoarrow/pull/10
>
> On Wed, Jun 15, 2022 at 12:18 AM Dewey Dunnington <de...@voltrondata.com>
> wrote:
>
> > Hi all,
> >
> > I drafted a second PR [1] drafting a design for storing parsed information
> > obtained from a struct ArrowSchema (i.e., parsing the format string into
> > usable C structures). There are some unsolved problems that could use a
> > fresh perspective...all comments welcome!
> >
> > [1] https://github.com/paleolimbot/arrow-c/pull/5
> >
> > On Fri, Jun 10, 2022 at 12:27 PM Dewey Dunnington <de...@voltrondata.com>
> > wrote:
> >
> >> Hi all,
> >>
> >> As promised, I converted the design document [1] into an initial PR [2].
> >> Rather than draft the whole header, I started with README + implementations
> >> + testing for error handling and schema allocation (depending on feedback,
> >> next week I will draft another reviewable chunk).
> >>
> >> Also feel free to suggest another place to put this if one exists (the
> >> choice to put it in its own repo was based on informal feedback that
> >> perhaps that might be the best way to go).
> >>
> >> [1]
> >> https://docs.google.com/document/d/11n7ICVZO8exZ-z3GRlI26VLzKPXlYlEz5xjLl1y0ujU/edit?usp=sharing
> >> [2] https://github.com/paleolimbot/arrow-c/pull/1/files
> >>
> >> On Fri, Jun 3, 2022 at 12:41 PM Dewey Dunnington <de...@voltrondata.com>
> >> wrote:
> >>
> >>> Hi all,
> >>>
> >>> Based on the points raised above and a few adventures implementing some
> >>> of this in related projects, I put together a brief design document
> >>> proposing a scope and structure to perhaps solidify a few of these
> >>> discussions:
> >>> https://docs.google.com/document/d/11n7ICVZO8exZ-z3GRlI26VLzKPXlYlEz5xjLl1y0ujU/edit?usp=sharing
> >>> .
> >>>
> >>> Any and all should feel free to add, rewrite, or propose a new
> >>> structure...I wrote many of the pieces for argument's sake or because
> >>> that's how I'd implemented them before.
> >>>
> >>> Next week I will phrase it as a skeleton header (like the one in the
> >>> excellent ADBC design discussions) depending on feedback to keep the
> >>> discussion going!
> >>>
> >>> Cheers,
> >>>
> >>> -dewey
> >>>
> >>> On Fri, Jun 3, 2022 at 9:57 AM Hannes Mühleisen <ha...@duckdblabs.com>
> >>> wrote:
> >>>
> >>>> Hello List,
> >>>>
> >>>> we at DuckDB are happy users of the Arrow C Data Interface and use it to
> >>>> feed SQL queries and also use it to provide query results in Arrow format
> >>>> again. It is particularly appealing to us that the interface is merely a
> >>>> (C) header file that we just ship with our source code [1]. Internally, our
> >>>> implementation then constructs DuckDB internal vectors from the Arrow
> >>>> format [2] or vice-versa [3].
> >>>>
> >>>> As you can see from [2, 3] there is some complexity in getting the
> >>>> conversion right, especially for more complex data types like nested types
> >>>> (list, strings). A lightweight, dependency-free library to help
> >>>> constructing those would certainly be appreciated. What would also help a
> >>>> lot is validation code, Arrow structures are very delicate and one wrong
> >>>> pointer can lead to disaster (which is then blamed on us), so a way to
> >>>> verify the structures in said lightweight library would be very helpful.
> >>>>
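
The kind of structural validation described above can be sketched in a few lines of C. This is only an illustration, not code from DuckDB, Arrow C++, or any of the linked PRs: the struct layout follows the Arrow C Data interface specification, while sketch_array_check and sketch_noop_release are hypothetical names. It checks only cheap invariants (no buffer contents are inspected), and it assumes a null_count of -1 means "not yet computed", as the specification allows.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Struct layout as given in the Arrow C Data interface specification. */
struct ArrowArray {
  int64_t length;
  int64_t null_count;
  int64_t offset;
  int64_t n_buffers;
  int64_t n_children;
  const void** buffers;
  struct ArrowArray** children;
  struct ArrowArray* dictionary;
  void (*release)(struct ArrowArray*);
  void* private_data;
};

/* Hypothetical stand-in for a producer's release callback (demo only). */
void sketch_noop_release(struct ArrowArray* array) {
  assert(array != NULL);
  array->release = NULL; /* a released struct is marked by release == NULL */
}

/* Checks the fields of one array and, recursively, of its children.
 * Returns NULL when nothing looks wrong, otherwise a static message. */
static const char* sketch_check_fields(const struct ArrowArray* array) {
  int64_t i;
  if (array->length < 0) return "length is negative";
  if (array->offset < 0) return "offset is negative";
  if (array->null_count < -1 || array->null_count > array->length) {
    return "null_count is out of range";
  }
  if (array->n_buffers < 0) return "n_buffers is negative";
  if (array->n_children < 0) return "n_children is negative";
  if (array->n_buffers > 0 && array->buffers == NULL) {
    return "buffers is NULL but n_buffers > 0";
  }
  if (array->n_children > 0 && array->children == NULL) {
    return "children is NULL but n_children > 0";
  }
  for (i = 0; i < array->n_children; i++) {
    const char* child_error;
    if (array->children[i] == NULL) return "a child pointer is NULL";
    child_error = sketch_check_fields(array->children[i]);
    if (child_error != NULL) return child_error;
  }
  return NULL;
}

/* Hypothetical entry point: also rejects NULL and already-released arrays. */
const char* sketch_array_check(const struct ArrowArray* array) {
  if (array == NULL) return "array is NULL";
  if (array->release == NULL) return "array was already released";
  return sketch_check_fields(array);
}
```

A consumer could call sketch_array_check() right after importing an array and refuse to touch it when the result is not NULL. Checks that need the schema (for example, the expected number of buffers per type) are deliberately left out of this sketch.
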
> >>>> Best from Amsterdam, and Quack
> >>>>
> >>>> Hannes
> >>>>
> >>>> [1] https://github.com/duckdb/duckdb/blob/master/src/include/duckdb/common/arrow.hpp
> >>>> [2] https://github.com/duckdb/duckdb/blob/master/src/function/table/arrow.cpp
> >>>> [3] https://github.com/duckdb/duckdb/blob/master/src/common/types/data_chunk.cpp
> >>>>
> >>>>
> >>>> On Fri, Jun 03, 2022 at 15:34:42, Jonathan Keane <jk...@gmail.com>
> >>>> wrote:
> >>>>
> >>>> > cc Hannes Mühleisen from DuckDB Labs
> >>>> >
> >>>> > -Jon
> >>>> >
> >>>> >
> >>>> > On Tue, May 31, 2022 at 5:03 PM Wes McKinney <we...@gmail.com>
> >>>> wrote:
> >>>> >
> >>>> > I'm also supportive of having a small vendorable C/C++ "Arrow
> >>>> > middleware" that provides:
> >>>> >
> >>>> > * Schemas and types
> >>>> > * Columnar data structures and minimal APIs to build them and iterate over
> >>>> > them
> >>>> > * C data interface
> >>>> > * Minimal validation (at the level of Validate but not ValidateFull)
> >>>> >
> >>>> > I don't think it's going to be practical to try to refactor parts of
> >>>> > the existing Arrow C++ core to be vendorable since there are many
> >>>> > features / requirements (e.g. an extensible buffer and device API)
> >>>> > that these C++ classes include that aren't needed in this
> >>>> > limited-feature middleware library.
> >>>> >
> >>>> > This also relates to the "Improving Arrow's database support" project
> >>>> > that David Li raised some time ago [1]. If we want to encourage
> >>>> > database driver libraries to add new APIs that emit the Arrow C
> >>>> > interface, we need to make it easier to generate the C interface
> >>>> > without requiring a new library dependency.
> >>>> >
> >>>> > [1]: https://lists.apache.org/thread/gnz1kz2rj3rb8rh8qz7l0mv8lvzq254w
> >>>> >
> >>>> > On Mon, May 30, 2022 at 11:31 AM Jonathan Keane <jk...@gmail.com>
> >>>> wrote:
> >>>> > >
> >>>> > > Thanks for working on this. I've heard people asking about something
> >>>> > > like this from a number of different fronts on top of the obvious use
> >>>> > > case in geoarrow | other geospatial libraries. I think a minimal piece
> >>>> > > of Arrow that other packages could depend on without needing to bring
> >>>> > > in all of arrow would be super valuable in building the bridges we
> >>>> > > want across other systems.
> >>>> > >
> >>>> > > Do you have any (design) documentation that describes the scope of
> >>>> > > what you're thinking? I know there have been others floating around
> >>>> > > [1] [2] that were in a similar spirit.
> >>>> > >
> >>>> > > A few more questions I hope will spark more conversation: How do the
> >>>> > > header files you linked in [3] overlap with these other efforts? Are
> >>>> > > those headers something we could|should "just" PR into apache/arrow
> >>>> > > and write up how to use them? If not what is the work to make them so
> >>>> > > that they could be (the answer of course could be design something
> >>>> > > else entirely and PR that!)?
> >>>> > >
> >>>> > > [1] https://github.com/paleolimbot/narrow
> >>>> > > [2] https://paleolimbot.github.io/narrow/articles/why-narrow.html
> >>>> > > [3] https://github.com/paleolimbot/geoarrow-cpp/tree/main/src/geoarrow/internal/arrow-hpp
> >>>> > >
> >>>> > > -Jon
> >>>> > >
> >>>> > >
> >>>> > > On Wed, May 25, 2022 at 9:29 AM Dewey Dunnington <dewey@voltrondata.com> wrote:
> >>>> > > >
> >>>> > > > I'm writing to gauge interest in a set of helpers in C and/or C++ for
> >>>> > > > reading/exporting Arrow C Data interface structures. My use-case is
> >>>> > > > building Arrow geospatial support in R [1], and while the set of helpers
> >>>> > > > I've been using [2] has served the purpose of me writing about the
> >>>> > > > opportunities for Arrow + geospatial [3], I would like to rewrite the
> >>>> > > > prototype based on something developed by/with the Arrow community.
> >>>> > > >
> >>>> > > > Does a set of C/C++ helpers for Arrow C Data interface structures already
> >>>> > > > exist? *Should* it exist?
> >>>> > > >
> >>>> > > > If it doesn't, what should the name/scope of that library be? The names
> >>>> > > > 'nanoarrow', 'narrow', 'sparrow', and 'arrow-hpp' have all surfaced in my
> >>>> > > > limited discussion of this so far. For the purpose of starting the
> >>>> > > > discussion, I'll posit that the library should include helpers to
> >>>> > > > allocate/destroy C Data interface structures, a schema metadata
> >>>> > > > encoder/decoder, validation of a schema/array pair, and something like the
> >>>> > > > ArrayBuilder C++ class.
> >>>> > > >
> >>>> > > > [1] https://lists.apache.org/thread/yb7p9wpg3k128njskhwj9j788opb67g7
> >>>> > > > [2] https://github.com/paleolimbot/geoarrow-cpp/tree/main/src/geoarrow/internal/arrow-hpp
> >>>> > > > [3] https://docs.google.com/document/d/1A6e3XCerjhXVFHBDaoAlBBNFb2HG4RB9SVRpuBru7E4/edit?usp=sharing
> >>>> >
> >>>> >
> >>>>
> >>>

Re: Existence/name/scope for minimal C/C++ Arrow C Data interface helpers

Posted by Dewey Dunnington <de...@voltrondata.com>.
Hi all,

Thanks for all the feedback so far! I've opened up two more draft PRs
implementing [1] an API for owning buffers (precursor to creating struct
ArrowArrays) and [2] an API for creating ArrowSchema objects for all Arrow
types. All comments welcome!

-dewey

[1] https://github.com/paleolimbot/nanoarrow/pull/9
[2] https://github.com/paleolimbot/nanoarrow/pull/10
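
For readers who have not worked with the C Data interface directly, here is a rough sketch of what "creating an ArrowSchema object" involves. It is not the API from the PRs above: the struct layout and the flag value come from the Arrow C Data interface specification, but sketch_strdup, sketch_schema_release and sketch_schema_init are hypothetical names, and the sketch only handles a schema with no children, dictionary, or metadata.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Struct layout as given in the Arrow C Data interface specification. */
struct ArrowSchema {
  const char* format;
  const char* name;
  const char* metadata;
  int64_t flags;
  int64_t n_children;
  struct ArrowSchema** children;
  struct ArrowSchema* dictionary;
  void (*release)(struct ArrowSchema*);
  void* private_data;
};

/* Hypothetical helper: malloc-backed copy of a C string (NULL stays NULL). */
char* sketch_strdup(const char* s) {
  size_t n;
  char* out;
  if (s == NULL) return NULL;
  n = strlen(s) + 1;
  out = (char*)malloc(n);
  if (out != NULL) memcpy(out, s, n);
  return out;
}

/* Release callback: frees the strings this sketch owns, then marks the
 * struct as released by setting release to NULL. A complete version would
 * also release children and the dictionary; this sketch never creates them. */
void sketch_schema_release(struct ArrowSchema* schema) {
  assert(schema != NULL);
  free((void*)schema->format);
  free((void*)schema->name);
  schema->format = NULL;
  schema->name = NULL;
  schema->release = NULL;
}

/* Hypothetical helper: fill in a childless, nullable schema that owns copies
 * of format and name. Returns 0 on success and -1 if allocation fails. */
int sketch_schema_init(struct ArrowSchema* schema, const char* format,
                       const char* name) {
  assert(schema != NULL && format != NULL);
  memset(schema, 0, sizeof(*schema));
  schema->format = sketch_strdup(format);
  schema->name = sketch_strdup(name);
  if (schema->format == NULL || (name != NULL && schema->name == NULL)) {
    free((void*)schema->format);
    free((void*)schema->name);
    memset(schema, 0, sizeof(*schema));
    return -1;
  }
  schema->flags = 2; /* ARROW_FLAG_NULLABLE in the specification */
  schema->release = &sketch_schema_release;
  return 0;
}
```

Calling sketch_schema_init(&schema, "i", "my_int32") would describe a nullable int32 field; whoever ends up owning the struct calls schema.release(&schema) exactly once when done with it.
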

On Wed, Jun 15, 2022 at 12:18 AM Dewey Dunnington <de...@voltrondata.com>
wrote:

> Hi all,
>
> I drafted a second PR [1] with a design for storing the parsed information
> obtained from a struct ArrowSchema (i.e., parsing the format string into
> usable C structures). There are some unsolved problems that could use a
> fresh perspective... all comments welcome!
>
> [1] https://github.com/paleolimbot/arrow-c/pull/5

Re: Existence/name/scope for minimal C/C++ Arrow C Data interface helpers

Posted by Dewey Dunnington <de...@voltrondata.com>.
Hi all,

I drafted a second PR [1] with a design for storing the parsed information
obtained from a struct ArrowSchema (i.e., parsing the format string into
usable C structures). There are some unsolved problems that could use a
fresh perspective... all comments welcome!

[1] https://github.com/paleolimbot/arrow-c/pull/5
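
To make "parsing the format string into usable C structures" concrete, here
is a minimal sketch covering a handful of format strings from the Arrow C
Data Interface specification ("i", "g", "u", "+l", "+s", "w:<size>", and
"d:<precision>,<scale>"). It is not the design from the PR: SketchType,
SketchSchemaView, SketchParseInt, and SketchParseFormat are hypothetical
names, and every other format is simply rejected.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical type ids for the few formats this sketch understands. */
enum SketchType {
  SKETCH_TYPE_UNKNOWN = 0,
  SKETCH_TYPE_INT32,
  SKETCH_TYPE_DOUBLE,
  SKETCH_TYPE_STRING,
  SKETCH_TYPE_FIXED_SIZE_BINARY,
  SKETCH_TYPE_DECIMAL128,
  SKETCH_TYPE_LIST,
  SKETCH_TYPE_STRUCT
};

/* Hypothetical parsed form of an ArrowSchema format string. */
struct SketchSchemaView {
  enum SketchType type;
  int32_t fixed_size;        /* set for "w:<size>" */
  int32_t decimal_precision; /* set for "d:<precision>,<scale>" */
  int32_t decimal_scale;
};

/* Parse a base-10 integer (optional leading '-') at *cursor and advance
 * *cursor past it. Returns 0 on success, -1 on bad input or overflow. */
static int SketchParseInt(const char** cursor, int32_t* out) {
  const char* p = *cursor;
  int negative = (*p == '-');
  if (negative) p++;
  if (*p < '0' || *p > '9') return -1;
  int64_t value = 0;
  while (*p >= '0' && *p <= '9') {
    value = value * 10 + (*p - '0');
    if (value > INT32_MAX) return -1;
    p++;
  }
  *out = (int32_t)(negative ? -value : value);
  *cursor = p;
  return 0;
}

/* Returns 0 and fills *view on success, -1 for anything not handled here. */
int SketchParseFormat(const char* format, struct SketchSchemaView* view) {
  assert(view != NULL);
  memset(view, 0, sizeof(*view));
  if (format == NULL) return -1;

  if (strcmp(format, "i") == 0) { view->type = SKETCH_TYPE_INT32; return 0; }
  if (strcmp(format, "g") == 0) { view->type = SKETCH_TYPE_DOUBLE; return 0; }
  if (strcmp(format, "u") == 0) { view->type = SKETCH_TYPE_STRING; return 0; }
  if (strcmp(format, "+l") == 0) { view->type = SKETCH_TYPE_LIST; return 0; }
  if (strcmp(format, "+s") == 0) { view->type = SKETCH_TYPE_STRUCT; return 0; }

  if (strncmp(format, "w:", 2) == 0) {
    const char* cursor = format + 2;
    if (SketchParseInt(&cursor, &view->fixed_size) != 0) return -1;
    if (*cursor != '\0' || view->fixed_size < 0) return -1;
    view->type = SKETCH_TYPE_FIXED_SIZE_BINARY;
    return 0;
  }

  if (strncmp(format, "d:", 2) == 0) {
    /* Only the two-field form is handled; an explicit bit width is rejected. */
    const char* cursor = format + 2;
    if (SketchParseInt(&cursor, &view->decimal_precision) != 0) return -1;
    if (*cursor != ',') return -1;
    cursor++;
    if (SketchParseInt(&cursor, &view->decimal_scale) != 0) return -1;
    if (*cursor != '\0') return -1;
    view->type = SKETCH_TYPE_DECIMAL128;
    return 0;
  }

  return -1;
}
```

Parsing once into a small struct lets later code switch on an enum instead of
re-inspecting the string each time a schema is consulted.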

On Fri, Jun 10, 2022 at 12:27 PM Dewey Dunnington <de...@voltrondata.com>
wrote:

> Hi all,
>
> As promised, I converted the design document [1] into an initial PR [2].
> Rather than draft the whole header, I started with README + implementations
> + testing for error handling and schema allocation (depending on feedback,
> next week I will draft another reviewable chunk).
>
> Also feel free to suggest another place to put this if one exists (the
> choice to put it in its own repo was based on informal feedback that
> perhaps that might be the best way to go).
>
> [1]
> https://docs.google.com/document/d/11n7ICVZO8exZ-z3GRlI26VLzKPXlYlEz5xjLl1y0ujU/edit?usp=sharing
> [2] https://github.com/paleolimbot/arrow-c/pull/1/files

Re: Existence/name/scope for minimal C/C++ Arrow C Data interface helpers

Posted by Dewey Dunnington <de...@voltrondata.com>.
Hi all,

As promised, I converted the design document [1] into an initial PR [2].
Rather than draft the whole header, I started with README + implementations
+ testing for error handling and schema allocation (depending on feedback,
next week I will draft another reviewable chunk).

Also feel free to suggest another place to put this if one exists (the
choice to put it in its own repo was based on informal feedback that
perhaps that might be the best way to go).

[1]
https://docs.google.com/document/d/11n7ICVZO8exZ-z3GRlI26VLzKPXlYlEz5xjLl1y0ujU/edit?usp=sharing
[2] https://github.com/paleolimbot/arrow-c/pull/1/files
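
As an illustration of what "error handling and schema allocation" can involve,
the sketch below initializes a childless struct ArrowSchema and installs a
release callback. The struct definition is the one from the Arrow C Data
Interface specification; everything prefixed with Sketch is hypothetical and
not taken from the PR, and returning errno-style codes is an assumption of
this example.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Definition from the Arrow C Data Interface specification. */
struct ArrowSchema {
  const char* format;
  const char* name;
  const char* metadata;
  int64_t flags;
  int64_t n_children;
  struct ArrowSchema** children;
  struct ArrowSchema* dictionary;
  void (*release)(struct ArrowSchema*);
  void* private_data;
};

#define ARROW_FLAG_NULLABLE 2

/* Portable stand-in for strdup(); returns NULL if allocation fails. */
static char* SketchStrDup(const char* s) {
  size_t n = strlen(s) + 1;
  char* copy = (char*)malloc(n);
  if (copy != NULL) memcpy(copy, s, n);
  return copy;
}

/* Release callback: frees the strings allocated by SketchSchemaInit and marks
 * the schema as released. This sketch never creates children or a dictionary,
 * so there is nothing else to release. */
static void SketchSchemaRelease(struct ArrowSchema* schema) {
  if (schema == NULL || schema->release == NULL) return;
  free((void*)schema->format);
  free((void*)schema->name);
  schema->release = NULL;
}

/* Initialize a childless, nullable schema. Returns 0 on success, EINVAL for a
 * missing format, or ENOMEM if a copy cannot be allocated. */
int SketchSchemaInit(struct ArrowSchema* schema, const char* format,
                     const char* name) {
  assert(schema != NULL);
  if (format == NULL) return EINVAL;
  memset(schema, 0, sizeof(*schema));

  char* format_copy = SketchStrDup(format);
  if (format_copy == NULL) return ENOMEM;

  char* name_copy = NULL;
  if (name != NULL) {
    name_copy = SketchStrDup(name);
    if (name_copy == NULL) {
      free(format_copy);
      return ENOMEM;
    }
  }

  schema->format = format_copy;
  schema->name = name_copy;
  schema->flags = ARROW_FLAG_NULLABLE;
  schema->release = &SketchSchemaRelease;
  return 0;
}
```

A consumer calls schema.release(&schema) exactly once when it is done; after
that call the release member is NULL, which is how the specification marks a
released structure.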

On Fri, Jun 3, 2022 at 12:41 PM Dewey Dunnington <de...@voltrondata.com>
wrote:

> Hi all,
>
> Based on the points raised above and a few adventures implementing some of
> this in related projects, I put together a brief design document proposing
> a scope and structure to perhaps solidify a few of these discussions:
> https://docs.google.com/document/d/11n7ICVZO8exZ-z3GRlI26VLzKPXlYlEz5xjLl1y0ujU/edit?usp=sharing
> .
>
> Any and all should feel free to add, rewrite, or propose a new
> structure...I wrote many of the pieces for argument's sake or because
> that's how I'd implemented them before.
>
> Next week I will phrase it as a skeleton header (like the one in the
> excellent ADBC design discussions) depending on feedback to keep the
> discussion going!
>
> Cheers,
>
> -dewey

Re: Existence/name/scope for minimal C/C++ Arrow C Data interface helpers

Posted by Dewey Dunnington <de...@voltrondata.com>.
Hi all,

Based on the points raised above and a few adventures implementing some of
this in related projects, I put together a brief design document proposing
a scope and structure to perhaps solidify a few of these discussions:
https://docs.google.com/document/d/11n7ICVZO8exZ-z3GRlI26VLzKPXlYlEz5xjLl1y0ujU/edit?usp=sharing
.

Any and all should feel free to add, rewrite, or propose a new
structure...I wrote many of the pieces for argument's sake or because
that's how I'd implemented them before.

Next week I will phrase it as a skeleton header (like the one in the
excellent ADBC design discussions) depending on feedback to keep the
discussion going!

Cheers,

-dewey

On Fri, Jun 3, 2022 at 9:57 AM Hannes Mühleisen <ha...@duckdblabs.com>
wrote:

> Hello List,
>
> We at DuckDB are happy users of the Arrow C Data Interface: we use it to
> feed data into SQL queries and to return query results in Arrow format.
> It is particularly appealing to us that the interface is merely a
> (C) header file that we just ship with our source code [1]. Internally, our
> implementation then constructs DuckDB internal vectors from the Arrow
> format [2] or vice versa [3].
>
> As you can see from [2, 3], there is some complexity in getting the
> conversion right, especially for more complex data types such as strings
> and nested types (lists). A lightweight, dependency-free library to help
> construct those would certainly be appreciated. Validation code would also
> help a lot: Arrow structures are very delicate, and one wrong
> pointer can lead to disaster (which is then blamed on us), so a way to
> verify the structures in said lightweight library would be very helpful.
>
> Best from Amsterdam, and Quack
>
> Hannes
>
> [1]
> https://github.com/duckdb/duckdb/blob/master/src/include/duckdb/common/arrow.hpp
> [2]
> https://github.com/duckdb/duckdb/blob/master/src/function/table/arrow.cpp
> [3]
> https://github.com/duckdb/duckdb/blob/master/src/common/types/data_chunk.cpp

Re: Existence/name/scope for minimal C/C++ Arrow C Data interface helpers

Posted by Hannes Mühleisen <ha...@duckdblabs.com>.
Hello List,

We at DuckDB are happy users of the Arrow C Data Interface: we use it to
feed data into SQL queries and to return query results in Arrow format.
It is particularly appealing to us that the interface is merely a
(C) header file that we just ship with our source code [1]. Internally, our
implementation then constructs DuckDB internal vectors from the Arrow
format [2] or vice versa [3].

As you can see from [2, 3], there is some complexity in getting the
conversion right, especially for more complex data types such as strings
and nested types (lists). A lightweight, dependency-free library to help
construct those would certainly be appreciated. Validation code would also
help a lot: Arrow structures are very delicate, and one wrong
pointer can lead to disaster (which is then blamed on us), so a way to
verify the structures in said lightweight library would be very helpful.

Best from Amsterdam, and Quack

Hannes

[1]
https://github.com/duckdb/duckdb/blob/master/src/include/duckdb/common/arrow.hpp
[2]
https://github.com/duckdb/duckdb/blob/master/src/function/table/arrow.cpp
[3]
https://github.com/duckdb/duckdb/blob/master/src/common/types/data_chunk.cpp
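
The kind of lightweight validation asked for above could start from checks
like the ones sketched here. The two struct definitions come from the Arrow C
Data Interface specification; SketchValidate and the two no-op release
helpers are hypothetical, and only the "i" (int32) and "u" (utf8) formats are
handled. Because the C Data Interface does not carry buffer sizes, a check
like this cannot prove that offsets stay within the data buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Definitions from the Arrow C Data Interface specification. */
struct ArrowSchema {
  const char* format;
  const char* name;
  const char* metadata;
  int64_t flags;
  int64_t n_children;
  struct ArrowSchema** children;
  struct ArrowSchema* dictionary;
  void (*release)(struct ArrowSchema*);
  void* private_data;
};

struct ArrowArray {
  int64_t length;
  int64_t null_count;
  int64_t offset;
  int64_t n_buffers;
  int64_t n_children;
  const void** buffers;
  struct ArrowArray** children;
  struct ArrowArray* dictionary;
  void (*release)(struct ArrowArray*);
  void* private_data;
};

/* No-op release callbacks for stack-allocated demo data (hypothetical). */
void SketchNoopSchemaRelease(struct ArrowSchema* schema) { schema->release = NULL; }
void SketchNoopArrayRelease(struct ArrowArray* array) { array->release = NULL; }

/* Returns NULL if the cheap checks pass, otherwise a static error message. */
const char* SketchValidate(const struct ArrowSchema* schema,
                           const struct ArrowArray* array) {
  if (schema == NULL || schema->release == NULL) return "schema is released";
  if (array == NULL || array->release == NULL) return "array is released";
  if (schema->format == NULL) return "schema has no format";
  if (array->length < 0 || array->offset < 0) return "negative length or offset";
  /* A null_count of -1 means "not yet computed" in the specification. */
  if (array->null_count < -1 || array->null_count > array->length)
    return "null_count out of range";
  if (array->n_children != schema->n_children) return "child count mismatch";

  int is_string = strcmp(schema->format, "u") == 0;
  if (!is_string && strcmp(schema->format, "i") != 0)
    return "format not handled by this sketch";

  if (array->n_buffers != (is_string ? 3 : 2)) return "unexpected buffer count";
  if (array->buffers == NULL) return "buffers is NULL";
  if (array->null_count > 0 && array->buffers[0] == NULL)
    return "nulls present but no validity buffer";
  if (array->length > 0 && array->buffers[1] == NULL)
    return "missing values/offsets buffer";

  if (is_string && array->length > 0) {
    const int32_t* offsets = (const int32_t*)array->buffers[1] + array->offset;
    if (offsets[0] < 0) return "first offset is negative";
    for (int64_t i = 0; i < array->length; i++) {
      if (offsets[i + 1] < offsets[i]) return "offsets are not monotonic";
    }
    if (offsets[array->length] > 0 && array->buffers[2] == NULL)
      return "missing data buffer";
  }
  return NULL;
}
```

Checks of this sort are cheap because they touch only the struct members and,
for strings, the offsets buffer, so a consumer could run them on every array
it imports before dereferencing anything else.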


On Fri, Jun 03, 2022 at 15:34:42, Jonathan Keane <jk...@gmail.com> wrote:

> cc Hannes Mühleisen from DuckDB Labs
>
> -Jon
>
>
> On Tue, May 31, 2022 at 5:03 PM Wes McKinney <we...@gmail.com> wrote:
>
> I'm also supportive of having a small vendorable C/C++ "Arrow
> middleware" that provides:
>
> * Schemas and types
> * Columnar data structures and minimal APIs to build them and iterate over
> them
> * C data interface
> * Minimal validation (at the level of Validate but not ValidateFull)
>
> I don't think it's going to be practical to try to refactor parts of
> the existing Arrow C++ core to be vendorable since there are many
> features / requirements (e.g. an extensible buffer and device API)
> that these C++ classes include that aren't needed in this
> limited-feature middleware library.
>
> This also relates to the "Improving Arrow's database support" project
> that David Li raised some time ago [1]. If we want to encourage
> database driver libraries to add new APIs that emit the Arrow C
> interface, we need to make it easier to generate the C interface
> without requiring a new library dependency.
>
> [1]: https://lists.apache.org/thread/gnz1kz2rj3rb8rh8qz7l0mv8lvzq254w
>
> On Mon, May 30, 2022 at 11:31 AM Jonathan Keane <jk...@gmail.com> wrote:
> >
> > Thanks for working on this. I've heard people asking about something
> > like this from a number of different fronts on top of the obvious use
> > case in geoarrow | other geospatial libraries. I think a minimal piece
> > of Arrow that other packages could depend on without needing to bring
> > in all of arrow would be super valuable in building the bridges we
> > want across other systems.
> >
> > Do you have any (design) documentation that describes the scope of
> > what you're thinking? I know there have been others floating around
> > [1] [2] that were in a similar spirit.
> >
> > A few more questions I hope will spark more conversation: How do the
> > header files you linked in [3] overlap with these other efforts? Are
> > those headers something we could|should "just" PR into apache/arrow
> > and write up how to use them? If not what is the work to make them so
> > that they could be (the answer of course could be design something
> > else entirely and PR that!)?
> >
> > [1] https://github.com/paleolimbot/narrow
> > [2] https://paleolimbot.github.io/narrow/articles/why-narrow.html
> > [3] https://github.com/paleolimbot/geoarrow-cpp/tree/main/src/geoarrow/internal/arrow-hpp
> >
> > -Jon
> >
> >
> > On Wed, May 25, 2022 at 9:29 AM Dewey Dunnington <de...@voltrondata.com> wrote:
> > >
> > > I'm writing to gauge interest in a set of helpers in C and/or C++ for
> > > reading/exporting Arrow C Data interface structures. My use-case is
> > > building Arrow geospatial support in R [1], and while the set of helpers
> > > I've been using [2] has served the purpose of me writing about the
> > > opportunities for Arrow + geospatial [3], I would like to rewrite the
> > > prototype based on something developed by/with the Arrow community.
> > >
> > > Does a set of C/C++ helpers for Arrow C Data interface structures already
> > > exist? *Should* it exist?
> > >
> > > If it doesn't, what should the name/scope of that library be? The names
> > > 'nanoarrow', 'narrow', 'sparrow', and 'arrow-hpp' have all surfaced in my
> > > limited discussion of this so far. For the purpose of starting the
> > > discussion, I'll posit that the library should include helpers to
> > > allocate/destroy C Data interface structures, a schema metadata
> > > encoder/decoder, validation of a schema/array pair, and something like the
> > > ArrayBuilder C++ class.
> > >
> > > [1] https://lists.apache.org/thread/yb7p9wpg3k128njskhwj9j788opb67g7
> > > [2] https://github.com/paleolimbot/geoarrow-cpp/tree/main/src/geoarrow/internal/arrow-hpp
> > > [3] https://docs.google.com/document/d/1A6e3XCerjhXVFHBDaoAlBBNFb2HG4RB9SVRpuBru7E4/edit?usp=sharing
>
>