Posted to dev@arrow.apache.org by David Li <li...@apache.org> on 2022/04/26 16:29:47 UTC

[DISC] Improving Arrow's database support

Hello,

In light of recent efforts around Flight SQL, projects like pgeon [1], and long-standing tickets/discussions about database support in Arrow [2], it seems there's an opportunity to define standard database interfaces for Arrow that could unify these efforts. So we've put together a proposal for "ADBC", a common Arrow-based database client API:

https://docs.google.com/document/d/1t7NrC76SyxL_OffATmjzZs2xcj1owdUsIF2WKL_Zw1U/edit#heading=h.r6o6j2navi4c

A common API and implementations could help combine/simplify client-side projects like pgeon, or what DBI is considering [3], and help them take advantage of developments like Flight SQL and existing columnar APIs.

We'd appreciate any feedback. (Comments should be open, please let me know if not.)

[1]: https://github.com/0x0L/pgeon
[2]: https://issues.apache.org/jira/browse/ARROW-11670
[3]: https://github.com/r-dbi/dbi3/issues/48

Thanks,
David

Re: [DISC] Improving Arrow's database support

Posted by David Li <li...@apache.org>.
I put up [1] as the PR to apache/arrow to vote on. There is a bit of a circular dependency here: my thought is that we will vote on this, then tag the 1.0.0 API standard on apache/arrow-adbc, and finally update the PR before merging. But actual releases of the packages may be a later commit/tag as we set up all the necessary infrastructure. 

I'll start a vote thread soon unless there are comments/concerns.

Also, I plan to file an INFRA ticket for apache/arrow-adbc to switch the default squash-merge commit message to "PR title + description" [2], to go along with the conventional commit suggestion, unless anyone has other ideas.
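
To make the conventional commit suggestion concrete, here is a minimal sketch of what squashed PR titles might look like and the semver bump each would imply (the component scopes are purely illustrative, not a proposal for actual scope names):

    fix(c/driver_manager): handle a null error pointer     -> patch release
    feat(python): add an Arrow-native fetch to the DBAPI   -> minor release
    feat(c/adbc.h)!: rework the statement lifecycle        -> major release

Per the conventional commits spec, the trailing "!" (or a "BREAKING CHANGE:" footer) is what marks a backward incompatible change, which is what automated versioning tools would key off.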

Separately, I'm now setting up the Flight SQL driver [3], which will give us actual Python bindings (this adds an optional runtime dependency from PyArrow to ADBC). After that, I would like to get back to the libpq driver [4], set up benchmarks, and start comparing it to other alternatives (pgeon, psycopg, etc.).

[1]: https://github.com/apache/arrow/pull/14079
[2]: https://docs.github.com/en/repositories/configuring-branches-and-merges-in-your-repository/configuring-pull-request-merges/configuring-commit-squashing-for-pull-requests
[3]: https://github.com/apache/arrow/pull/14082
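
As a rough sketch of what those Python bindings could enable, here is what a DBAPI-style session against the Flight SQL driver might look like; the module path, URI scheme, and Arrow-native fetch method are illustrative assumptions, not the finalized API:

    # Hypothetical module path; the eventual packages may be organized differently.
    from adbc_driver_flightsql import dbapi

    # PEP 249-style usage on top of ADBC, pointed at a Flight SQL server.
    conn = dbapi.connect("grpc://localhost:31337")
    cur = conn.cursor()
    cur.execute("SELECT 1 AS x")
    # Arrow-native fetch, in the spirit of the DuckDB/Turbodbc extension methods.
    table = cur.fetch_arrow_table()
    print(table)
    cur.close()
    conn.close()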

On Tue, Sep 13, 2022, at 15:12, David Li wrote:
> Ah, thanks for the clarification Neal!
>
> Jacob/Matt: I put up https://github.com/apache/arrow-adbc/pull/124 to 
> describe the convention but I wonder if we should partition components 
> more granularly than we have so far.
>
> On Mon, Sep 12, 2022, at 12:57, Neal Richardson wrote:
>> On Mon, Sep 12, 2022 at 12:44 PM David Li <li...@apache.org> wrote:
>>
>>> I like this idea. I would also like to set up some sort of automated ABI
>>> checker as well (the options I found were GPL/LGPL so I need to figure out
>>> how to proceed).
>>>
>>
>> You should be able to use GPL software in CI, that's no problem. You can
>> even depend on GPL software as long as it is "optional":
>> https://www.apache.org/legal/resolved.html#optional But this would not even
>> count as that since the ABI checker wouldn't be required to use the
>> software.
>>
>> Neal
>>
>>
>>>
>>> I can put up a PR later that formalizes these guidelines in
>>> CONTRIBUTING.md. It looks like there's a pre-commit hook for this sort of
>>> thing too, which'll let us enforce it in CI!
>>>
>>> On Mon, Sep 12, 2022, at 10:18, Matthew Topol wrote:
>>> > Automated semver would be ideal if we can do it.....
>>> >
>>> > There's quite a lot of utilities that exist which would automatically
>>> > handle the versioning if we're using conventional commits.
>>> >
>>> > On Mon, Sep 12 2022 at 02:26:15 PM +0200, Jacob Wujciak
>>> > <ja...@voltrondata.com.INVALID> wrote:
>>> >> + 1 to independent, semver versioning for adbc.
>>> >> I would propose we use conventional commit style [1] commit messages
>>> >> for
>>> >> the pr commits (I assume squash + merge) so we can automate the
>>> >> versioning|double check manual versioning.
>>> >>
>>> >> [1]: <https://www.conventionalcommits.org/>
>>> >>
>>> >> On Thu, Sep 8, 2022 at 6:05 PM David Li <lidavidm@apache.org
>>> >> <ma...@apache.org>> wrote:
>>> >>
>>> >>>  Thanks all, I've updated the header with the proposed versioning
>>> >>> scheme.
>>> >>>
>>> >>>  At this point I believe the core definitions are ready. (Note that
>>> >>> I'm
>>> >>>  explicitly punting on [1][2][3] here.) Absent further comments, I'd
>>> >>> like to
>>> >>>  do the following:
>>> >>>
>>> >>>  - Start a vote on mirroring adbc.h to arrow/format, as well as adding
>>> >>>  docs/source/format/ADBC.rst that describes the header, the Java
>>> >>> interface,
>>> >>>  the Go interface, and the versioning scheme (I will put up a PR
>>> >>> beforehand)
>>> >>>  - Begin work on CI/packaging, with a release hopefully coinciding
>>> >>> with
>>> >>>  Arrow 10.0.0
>>> >>>  - Begin work on changes to the main repository, also hopefully in
>>> >>> time for
>>> >>>  10.0.0 (moving the Flight SQL driver to be part of apache/arrow;
>>> >>> exposing
>>> >>>  it in PyArrow; possibly also exposing Acero via ADBC)
>>> >>>
>>> >>>  [1]: <https://github.com/apache/arrow-adbc/issues/46>
>>> >>>  [2]: <https://github.com/apache/arrow-adbc/issues/55>
>>> >>>  [3]: <https://github.com/apache/arrow-adbc/issues/59>
>>> >>>
>>> >>>  On Sat, Sep 3, 2022, at 18:36, Matthew Topol wrote:
>>> >>>  > +1 from me on the strategy proposed by Kou.
>>> >>>  >
>>> >>>  > That would be my preference also. I agree it is preferable to be
>>> >>>  versioned
>>> >>>  > independently.
>>> >>>  >
>>> >>>  > --Matt
>>> >>>  >
>>> >>>  > On Sat, Sep 3, 2022, 6:24 PM Sutou Kouhei <kou@clear-code.com
>>> >>> <ma...@clear-code.com>> wrote:
>>> >>>  >
>>> >>>  >> Hi,
>>> >>>  >>
>>> >>>  >> > Do we have a preference for versioning strategy? Should we
>>> >>>  >> > proceed in lockstep with the Arrow C++ library et al. and
>>> >>>  >> > release "ADBC 1.0.0" (the API standard) with "drivers
>>> >>>  >> > version 10.0.0", or use an independent versioning scheme?
>>> >>>  >> > (For example, release API standard and components at
>>> >>>  >> > "1.0.0". Then further releases of components that do not
>>> >>>  >> > change the spec would be "1.1", "1.2", ...; if/when we
>>> >>>  >> > change the spec, start over with "2.0", "2.1", ...)
>>> >>>  >>
>>> >>>  >> I like an independent versioning schema. I assume that ADBC
>>> >>>  >> doesn't need backward incompatible changes frequently. How
>>> >>>  >> about incrementing major version only when ADBC needs
>>> >>>  >> any backward incompatible changes?
>>> >>>  >>
>>> >>>  >> e.g.:
>>> >>>  >>
>>> >>>  >>   1.  Release ADBC (the API standard) 1.0.0
>>> >>>  >>   2.  Release adbc_driver_manager 1.0.0
>>> >>>  >>   3.  Release adbc_driver_postgres 1.0.0
>>> >>>  >>   4.  Add a new feature to adbc_driver_postgres without
>>> >>>  >>       any backward incompatible changes
>>> >>>  >>   5.  Release adbc_driver_postgres 1.1.0
>>> >>>  >>   6.  Fix a bug in adbc_driver_manager without
>>> >>>  >>       any backward incompatible changes
>>> >>>  >>   7.  Release adbc_driver_manager 1.0.1
>>> >>>  >>   8.  Add a backward incompatible change to adbc_driver_manager
>>> >>>  >>   9.  Release adbc_driver_manager 2.0.0
>>> >>>  >>   10. Add a new feature to ADBC without any
>>> >>>  >>       backward incompatible changes
>>> >>>  >>   11. Release ADBC (the API standard) 1.1.0
>>> >>>  >>
>>> >>>  >>
>>> >>>  >> Thanks,
>>> >>>  >> --
>>> >>>  >> kou
>>> >>>  >>
>>> >>>  >> In <7b20d730-b85e-4818-b99e-3335c40c2f08@www.fastmail.com
>>> >>> <ma...@www.fastmail.com>>
>>> >>>  >>   "Re: [DISC] Improving Arrow's database support" on Thu, 01 Sep
>>> >>> 2022
>>> >>>  >> 16:36:43 -0400,
>>> >>>  >>   "David Li" <lidavidm@apache.org <ma...@apache.org>>
>>> >>> wrote:
>>> >>>  >>
>>> >>>  >> > Following up here with some specific questions:
>>> >>>  >> >
>>> >>>  >> > Matt Topol added some Go definitions [1] (thanks!) I'd assume
>>> >>> we want
>>> >>>  to
>>> >>>  >> vote on those as well?
>>> >>>  >> >
>>> >>>  >> > How should the process work for Java/Go? For C/C++, I assume
>>> >>> we'd
>>> >>>  treat
>>> >>>  >> it like the C Data Interface and copy adbc.h to format/ after a
>>> >>> vote,
>>> >>>  and
>>> >>>  >> then vote on releases of components. Or do we really only
>>> >>> consider the C
>>> >>>  >> header as the 'format', with the others being language-specific
>>> >>>  affordances?
>>> >>>  >> >
>>> >>>  >> > What about for Java and for Go? We could vote on and tag a
>>> >>> release for
>>> >>>  >> Go, and add a documentation page that links to the Java/Go
>>> >>> definitions
>>> >>>  at a
>>> >>>  >> specific revision (as the equivalent 'format' definition for
>>> >>> Java/Go)?
>>> >>>  Or
>>> >>>  >> would we vendor the entire Java module/Go package as the
>>> >>> 'format'?
>>> >>>  >> >
>>> >>>  >> > Do we have a preference for versioning strategy? Should we
>>> >>> proceed in
>>> >>>  >> lockstep with the Arrow C++ library et al. and release "ADBC
>>> >>> 1.0.0"
>>> >>>  (the
>>> >>>  >> API standard) with "drivers version 10.0.0", or use an
>>> >>> independent
>>> >>>  >> versioning scheme? (For example, release API standard and
>>> >>> components at
>>> >>>  >> "1.0.0". Then further releases of components that do not change
>>> >>> the spec
>>> >>>  >> would be "1.1", "1.2", ...; if/when we change the spec, start
>>> >>> over with
>>> >>>  >> "2.0", "2.1", ...)
>>> >>>  >> >
>>> >>>  >> > [1]:
>>> >>> <https://github.com/apache/arrow-adbc/blob/main/go/adbc/adbc.go>
>>> >>>  >> >
>>> >>>  >> > -David
>>> >>>  >> >
>>> >>>  >> > On Sun, Aug 28, 2022, at 10:56, Sutou Kouhei wrote:
>>> >>>  >> >> Hi,
>>> >>>  >> >>
>>> >>>  >> >> OK. I'll send pull requests for GLib and Ruby soon.
>>> >>>  >> >>
>>> >>>  >> >>> I'm curious if you have a particular use case in mind.
>>> >>>  >> >>
>>> >>>  >> >> I don't have any production-ready use case yet but I want to
>>> >>>  >> >> implement an Active Record adapter for ADBC. Active Record
>>> >>>  >> >> is the O/R mapper for Ruby on Rails. Implementing web
>>> >>>  >> >> applications with Ruby on Rails is one of the major Ruby use
>>> >>>  >> >> cases. So providing an Active Record interface for ADBC will
>>> >>>  >> >> bring more Apache Arrow users into the Ruby community.
>>> >>>  >> >>
>>> >>>  >> >> NOTE: Generally, Ruby on Rails users don't process large
>>> >>>  >> >> data but they sometimes need to process large (medium?) data
>>> >>>  >> >> in a batch process. Active Record adapter for ADBC may be
>>> >>>  >> >> useful for such use case.
>>> >>>  >> >>
>>> >>>  >> >>> There's a little bit more API cleanup to do [1]. If you
>>> >>>  >> >>> have comments on that or anything else, I'd appreciate
>>> >>>  >> >>> them. Otherwise, pull requests would also be appreciated.
>>> >>>  >> >>
>>> >>>  >> >> OK. I'll open issues/pull requests when I find
>>> >>>  >> >> something. For now, I think that "MODULE" type library
>>> >>>  >> >> instead of "SHARED" type library in CMake terminology
>>> >>>  >> >> [cmake] is better for driver modules. (I'll open an issue
>>> >>>  >> >> for this later.)
>>> >>>  >> >>
>>> >>>  >> >> [cmake]:
>>> >>>  <https://cmake.org/cmake/help/latest/command/add_library.html>
>>> >>>  >> >>
>>> >>>  >> >>
>>> >>>  >> >> Thanks,
>>> >>>  >> >> --
>>> >>>  >> >> kou
>>> >>>  >> >>
>>> >>>  >> >> In <e6380315-94aa-4dd1-8685-268edd597821@www.fastmail.com
>>> >>> <ma...@www.fastmail.com>>
>>> >>>  >> >>   "Re: [DISC] Improving Arrow's database support" on Sat, 27
>>> >>> Aug 2022
>>> >>>  >> >> 15:28:56 -0400,
>>> >>>  >> >>   "David Li" <lidavidm@apache.org
>>> >>> <ma...@apache.org>> wrote:
>>> >>>  >> >>
>>> >>>  >> >>> I would be very happy to see GLib/Ruby bindings! I'm curious
>>> >>> if you
>>> >>>  >> have a particular use case in mind.
>>> >>>  >> >>>
>>> >>>  >> >>> There's a little bit more API cleanup to do [1]. If you have
>>> >>>  comments
>>> >>>  >> on that or anything else, I'd appreciate them. Otherwise, pull
>>> >>> requests
>>> >>>  >> would also be appreciated.
>>> >>>  >> >>>
>>> >>>  >> >>> [1]: <https://github.com/apache/arrow-adbc/issues/79>
>>> >>>  >> >>>
>>> >>>  >> >>> On Fri, Aug 26, 2022, at 21:53, Sutou Kouhei wrote:
>>> >>>  >> >>>> Hi,
>>> >>>  >> >>>>
>>> >>>  >> >>>> Thanks for sharing the current status!
>>> >>>  >> >>>> I understand.
>>> >>>  >> >>>>
>>> >>>  >> >>>> BTW, can I add GLib/Ruby bindings to apache/arrow-adbc
>>> >>>  >> >>>> before we release the first version? (I want to use ADBC
>>> >>>  >> >>>> from Ruby.) Or should I wait for the first release? If I can
>>> >>>  >> >>>> work on it now, I'll open pull requests for it.
>>> >>>  >> >>>>
>>> >>>  >> >>>> Thanks,
>>> >>>  >> >>>> --
>>> >>>  >> >>>> kou
>>> >>>  >> >>>>
>>> >>>  >> >>>> In <8703efd9-51bd-4f91-b550-73830667d591@www.fastmail.com
>>> >>> <ma...@www.fastmail.com>>
>>> >>>  >> >>>>   "Re: [DISC] Improving Arrow's database support" on Fri,
>>> >>> 26 Aug
>>> >>>  2022
>>> >>>  >> >>>> 11:03:26 -0400,
>>> >>>  >> >>>>   "David Li" <lidavidm@apache.org
>>> >>> <ma...@apache.org>> wrote:
>>> >>>  >> >>>>
>>> >>>  >> >>>>> Thank you Kou!
>>> >>>  >> >>>>>
>>> >>>  >> >>>>> At least initially, I don't think I'll be able to complete
>>> >>> the
>>> >>>  >> Dataset integration in time. So 10.0.0 probably won't ship with
>>> >>> a hard
>>> >>>  >> dependency. That said I am hoping to have PyArrow take an
>>> >>> optional
>>> >>>  >> dependency (so Flight SQL can finally be available from Python).
>>> >>>  >> >>>>>
>>> >>>  >> >>>>> On Fri, Aug 26, 2022, at 01:01, Sutou Kouhei wrote:
>>> >>>  >> >>>>>> Hi,
>>> >>>  >> >>>>>>
>>> >>>  >> >>>>>> As a maintainer of Linux packages, I want
>>> >>> apache/arrow-adbc
>>> >>>  >> >>>>>> to be released before apache/arrow is released so that
>>> >>>  >> >>>>>> apache/arrow's .deb/.rpm can depend on apache/arrow-adbc's
>>> >>>  >> >>>>>> .deb/.rpm.
>>> >>>  >> >>>>>>
>>> >>>  >> >>>>>> (If Apache Arrow Dataset uses apache/arrow-adbc,
>>> >>>  >> >>>>>> apache/arrow's .deb/.rpm needs to depend on
>>> >>>  >> >>>>>> apache/arrow-adbc's .deb/.rpm.)
>>> >>>  >> >>>>>>
>>> >>>  >> >>>>>> We can add .deb/.rpm related files
>>> >>>  >> >>>>>> (dev/tasks/linux-packages/ in apache/arrow) to
>>> >>>  >> >>>>>> apache/arrow-adbc to build .deb/.rpm for
>>> >>> apache/arrow-adbc.
>>> >>>  >> >>>>>>
>>> >>>  >> >>>>>> FYI: I did it for datafusion-contrib/datafusion-c:
>>> >>>  >> >>>>>>
>>> >>>  >> >>>>>> *
>>> >>>  >>
>>> >>> <https://github.com/datafusion-contrib/datafusion-c/tree/main/package>
>>> >>>  >> >>>>>> *
>>> >>>  >> >>>>>>
>>> >>>  >>
>>> >>>
>>> >>> <
>>> https://github.com/datafusion-contrib/datafusion-c/blob/main/.github/workflows/package.yaml
>>> >
>>> >>>  >> >>>>>>
>>> >>>  >> >>>>>> I can work on it in apache/arrow-adbc.
>>> >>>  >> >>>>>>
>>> >>>  >> >>>>>>
>>> >>>  >> >>>>>> Thanks,
>>> >>>  >> >>>>>> --
>>> >>>  >> >>>>>> kou
>>> >>>  >> >>>>>>
>>> >>>  >> >>>>>> In <5cbf2923-4fb4-4c5e-b11d-007209fdd8d7@www.fastmail.com
>>> >>> <ma...@www.fastmail.com>>
>>> >>>  >> >>>>>>   "Re: [DISC] Improving Arrow's database support" on Thu,
>>> >>> 25 Aug
>>> >>>  >> 2022
>>> >>>  >> >>>>>> 11:51:08 -0400,
>>> >>>  >> >>>>>>   "David Li" <lidavidm@apache.org
>>> >>> <ma...@apache.org>> wrote:
>>> >>>  >> >>>>>>
>>> >>>  >> >>>>>>> Fair enough, thank you. I'll try to expand a bit. (Sorry
>>> >>> for the
>>> >>>  >> wall of text that follows…)
>>> >>>  >> >>>>>>>
>>> >>>  >> >>>>>>> These are the components:
>>> >>>  >> >>>>>>>
>>> >>>  >> >>>>>>> - Core adbc.h header
>>> >>>  >> >>>>>>> - Driver manager for C/C++
>>> >>>  >> >>>>>>> - Flight SQL-based driver
>>> >>>  >> >>>>>>> - Postgres-based driver (WIP)
>>> >>>  >> >>>>>>> - SQLite-based driver (more of a testbed for me than an
>>> >>> actual
>>> >>>  >> component - I don't think we'd actually distribute this)
>>> >>>  >> >>>>>>> - Java core interfaces
>>> >>>  >> >>>>>>> - Java driver manager
>>> >>>  >> >>>>>>> - Java JDBC-based driver
>>> >>>  >> >>>>>>> - Java Flight SQL-based driver
>>> >>>  >> >>>>>>> - Python driver manager
>>> >>>  >> >>>>>>>
>>> >>>  >> >>>>>>> I think: adbc.h gets mirrored into the Arrow repo. The
>>> >>> Flight
>>> >>>  SQL
>>> >>>  >> drivers get moved to the main Arrow repo and distributed as part
>>> >>> of the
>>> >>>  >> regular Arrow releases.
>>> >>>  >> >>>>>>>
>>> >>>  >> >>>>>>> For the rest of the components: they could be packaged
>>> >>>  >> individually, but versioned and released together. Also, each
>>> >>> C/C++
>>> >>>  driver
>>> >>>  >> probably needs a corresponding Python package so Python users do
>>> >>> not
>>> >>>  have
>>> >>>  >> to futz with shared library configurations. (See [1].) So for
>>> >>> instance,
>>> >>>  >> installing PyArrow would also give you the Flight SQL driver,
>>> >>> and `pip
>>> >>>  >> install adbc_postgres` would get you the Postgres-based driver.
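
To illustrate the packaging split being described here, a sketch of the driver-manager pattern from Python follows; the package names, driver name, and connection options below are assumptions for illustration only:

    # pip install adbc_driver_manager adbc_postgres   (hypothetical package names)
    from adbc_driver_manager import dbapi

    # The driver manager resolves and loads the named driver's shared library,
    # so users never have to configure library paths themselves.
    conn = dbapi.connect(driver="adbc_postgres",
                         db_kwargs={"uri": "postgresql://localhost:5432/postgres"})
    cur = conn.cursor()
    cur.execute("SELECT 1")
    print(cur.fetch_arrow_table())
    cur.close()
    conn.close()
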
>>> >>>  >> >>>>>>>
>>> >>>  >> >>>>>>> That would mean setting up separate CI, release, etc.
>>> >>> (and
>>> >>>  >> eventually linking Crossbow & Conbench as well?). That does mean
>>> >>>  >> duplication of effort, but the trade off is avoiding bloating
>>> >>> the main
>>> >>>  >> release process even further. However, I'd like to hear from
>>> >>> those
>>> >>>  closer
>>> >>>  >> to the release process on this subject - if it would make
>>> >>> people's lives
>>> >>>  >> easier, we could merge everything into one repo/process.
>>> >>>  >> >>>>>>>
>>> >>>  >> >>>>>>> Integrations would be distributed as part of their
>>> >>> respective
>>> >>>  >> packages (e.g. Arrow Dataset would optionally link to the driver
>>> >>>  manager).
>>> >>>  >> So the "part of Arrow 10.0.0" aspect means having a stable
>>> >>> interface for
>>> >>>  >> adbc.h, and getting the Flight SQL drivers into the main repo.
>>> >>>  >> >>>>>>>
>>> >>>  >> >>>>>>> [1]: <https://github.com/apache/arrow-adbc/issues/53>
>>> >>>  >> >>>>>>>
>>> >>>  >> >>>>>>> On Thu, Aug 25, 2022, at 11:34, Antoine Pitrou wrote:
>>> >>>  >> >>>>>>>> On Fri, 19 Aug 2022 14:09:44 -0400
>>> >>>  >> >>>>>>>> "David Li" <lidavidm@apache.org
>>> >>> <ma...@apache.org>> wrote:
>>> >>>  >> >>>>>>>>> Since it's been a while, I'd like to give an update.
>>> >>> There are
>>> >>>  >> also a few questions I have around distribution.
>>> >>>  >> >>>>>>>>>
>>> >>>  >> >>>>>>>>> Currently:
>>> >>>  >> >>>>>>>>> - Supported in C, Java, and Python.
>>> >>>  >> >>>>>>>>> - For C/Python, there are basic drivers wrapping
>>> >>> Flight SQL
>>> >>>  and
>>> >>>  >> SQLite, with a draft of a libpq (Postgres) driver (using
>>> >>> nanoarrow).
>>> >>>  >> >>>>>>>>> - For Java, there are drivers wrapping JDBC and Flight
>>> >>> SQL.
>>> >>>  >> >>>>>>>>> - For Python, there's low-level bindings to the C API,
>>> >>> and the
>>> >>>  >> DBAPI interface on top of that (+a few extension methods
>>> >>> resembling
>>> >>>  >> DuckDB/Turbodbc).
>>> >>>  >> >>>>>>>>>
>>> >>>  >> >>>>>>>>> There's drafts of integration with Ibis [1], DBI (R),
>>> >>> and
>>> >>>  >> DuckDB. (I'd like to thank Hannes and Kirill for their comments,
>>> >>> as
>>> >>>  well as
>>> >>>  >> Antoine, Dewey, and Matt here.)
>>> >>>  >> >>>>>>>>>
>>> >>>  >> >>>>>>>>> I'd like to have this as part of 10.0.0 in some
>>> >>> fashion.
>>> >>>  >> However, I'm not sure how we would like to handle packaging and
>>> >>>  >> distribution. In particular, there are several sub-components
>>> >>> for each
>>> >>>  >> language (the driver manager + the drivers), increasing the
>>> >>> work. Any
>>> >>>  >> thoughts here?
>>> >>>  >> >>>>>>>>
>>> >>>  >> >>>>>>>> Sorry, forgot to answer here. But I think your question
>>> >>> is too
>>> >>>  >> broadly
>>> >>>  >> >>>>>>>> formulated. It probably deserves a case-by-case
>>> >>> discussion,
>>> >>>  IMHO.
>>> >>>  >> >>>>>>>>
>>> >>>  >> >>>>>>>>> I'm also wondering how we want to handle this in terms
>>> >>> of
>>> >>>  >> specification - I assume we'd consider the core header file/Java
>>> >>>  interfaces
>>> >>>  >> a spec like the C Data Interface/Flight RPC, and vote on
>>> >>> them/mirror
>>> >>>  them
>>> >>>  >> into the format/ directory?
>>> >>>  >> >>>>>>>>
>>> >>>  >> >>>>>>>> That sounds like the right way to me indeed.
>>> >>>  >> >>>>>>>>
>>> >>>  >> >>>>>>>> Regards
>>> >>>  >> >>>>>>>>
>>> >>>  >> >>>>>>>> Antoine.
>>> >>>  >>
>>> >>>
>>>

Re: [DISC] Improving Arrow's database support

Posted by David Li <li...@apache.org>.
Ah, thanks for the clarification Neal!

Jacob/Matt: I put up https://github.com/apache/arrow-adbc/pull/124 to describe the convention but I wonder if we should partition components more granularly than we have so far.

On Mon, Sep 12, 2022, at 12:57, Neal Richardson wrote:
> On Mon, Sep 12, 2022 at 12:44 PM David Li <li...@apache.org> wrote:
>
>> I like this idea. I would also like to set up some sort of automated ABI
>> checker as well (the options I found were GPL/LGPL so I need to figure out
>> how to proceed).
>>
>
> You should be able to use GPL software in CI, that's no problem. You can
> even depend on GPL software as long as it is "optional":
> https://www.apache.org/legal/resolved.html#optional But this would not even
> count as that since the ABI checker wouldn't be required to use the
> software.
>
> Neal
>
>
>>
>> I can put up a PR later that formalizes these guidelines in
>> CONTRIBUTING.md. It looks like there's a pre-commit hook for this sort of
>> thing too, which'll let us enforce it in CI!
>>
>> On Mon, Sep 12, 2022, at 10:18, Matthew Topol wrote:
>> > Automated semver would be ideal if we can do it.....
>> >
>> > There's quite a lot of utilities that exist which would automatically
>> > handle the versioning if we're using conventional commits.
>> >
>> > On Mon, Sep 12 2022 at 02:26:15 PM +0200, Jacob Wujciak
>> > <ja...@voltrondata.com.INVALID> wrote:
>> >> + 1 to independent, semver versioning for adbc.
>> >> I would propose we use conventional commit style [1] commit messages
>> >> for
>> >> the pr commits (I assume squash + merge) so we can automate the
>> >> versioning|double check manual versioning.
>> >>
>> >> [1]: <https://www.conventionalcommits.org/>
>> >>
>> >> On Thu, Sep 8, 2022 at 6:05 PM David Li <lidavidm@apache.org
>> >> <ma...@apache.org>> wrote:
>> >>
>> >>>  Thanks all, I've updated the header with the proposed versioning
>> >>> scheme.
>> >>>
>> >>>  At this point I believe the core definitions are ready. (Note that
>> >>> I'm
>> >>>  explicitly punting on [1][2][3] here.) Absent further comments, I'd
>> >>> like to
>> >>>  do the following:
>> >>>
>> >>>  - Start a vote on mirroring adbc.h to arrow/format, as well as adding
>> >>>  docs/source/format/ADBC.rst that describes the header, the Java
>> >>> interface,
>> >>>  the Go interface, and the versioning scheme (I will put up a PR
>> >>> beforehand)
>> >>>  - Begin work on CI/packaging, with a release hopefully coinciding
>> >>> with
>> >>>  Arrow 10.0.0
>> >>>  - Begin work on changes to the main repository, also hopefully in
>> >>> time for
>> >>>  10.0.0 (moving the Flight SQL driver to be part of apache/arrow;
>> >>> exposing
>> >>>  it in PyArrow; possibly also exposing Acero via ADBC)
>> >>>
>> >>>  [1]: <https://github.com/apache/arrow-adbc/issues/46>
>> >>>  [2]: <https://github.com/apache/arrow-adbc/issues/55>
>> >>>  [3]: <https://github.com/apache/arrow-adbc/issues/59>
>> >>>
>> >>>  On Sat, Sep 3, 2022, at 18:36, Matthew Topol wrote:
>> >>>  > +1 from me on the strategy proposed by Kou.
>> >>>  >
>> >>>  > That would be my preference also. I agree it is preferable to be
>> >>>  versioned
>> >>>  > independently.
>> >>>  >
>> >>>  > --Matt
>> >>>  >
>> >>>  > On Sat, Sep 3, 2022, 6:24 PM Sutou Kouhei <kou@clear-code.com
>> >>> <ma...@clear-code.com>> wrote:
>> >>>  >
>> >>>  >> Hi,
>> >>>  >>
>> >>>  >> > Do we have a preference for versioning strategy? Should we
>> >>>  >> > proceed in lockstep with the Arrow C++ library et al. and
>> >>>  >> > release "ADBC 1.0.0" (the API standard) with "drivers
>> >>>  >> > version 10.0.0", or use an independent versioning scheme?
>> >>>  >> > (For example, release API standard and components at
>> >>>  >> > "1.0.0". Then further releases of components that do not
>> >>>  >> > change the spec would be "1.1", "1.2", ...; if/when we
>> >>>  >> > change the spec, start over with "2.0", "2.1", ...)
>> >>>  >>
>> >>>  >> I like an independent versioning schema. I assume that ADBC
>> >>>  >> doesn't need backward incompatible changes frequently. How
>> >>>  >> about incrementing major version only when ADBC needs
>> >>>  >> any backward incompatible changes?
>> >>>  >>
>> >>>  >> e.g.:
>> >>>  >>
>> >>>  >>   1.  Release ADBC (the API standard) 1.0.0
>> >>>  >>   2.  Release adbc_driver_manager 1.0.0
>> >>>  >>   3.  Release adbc_driver_postgres 1.0.0
>> >>>  >>   4.  Add a new feature to adbc_driver_postgres without
>> >>>  >>       any backward incompatible changes
>> >>>  >>   5.  Release adbc_driver_postgres 1.1.0
>> >>>  >>   6.  Fix a bug in adbc_driver_manager without
>> >>>  >>       any backward incompatible changes
>> >>>  >>   7.  Release adbc_driver_manager 1.0.1
>> >>>  >>   8.  Add a backward incompatible change to adbc_driver_manager
>> >>>  >>   9.  Release adbc_driver_manager 2.0.0
>> >>>  >>   10. Add a new feature to ADBC without any
>> >>>  >>       backward incompatible changes
>> >>>  >>   11. Release ADBC (the API standard) 1.1.0
>> >>>  >>
>> >>>  >>
>> >>>  >> Thanks,
>> >>>  >> --
>> >>>  >> kou
>> >>>  >>
>> >>>  >> In <7b20d730-b85e-4818-b99e-3335c40c2f08@www.fastmail.com
>> >>> <ma...@www.fastmail.com>>
>> >>>  >>   "Re: [DISC] Improving Arrow's database support" on Thu, 01 Sep
>> >>> 2022
>> >>>  >> 16:36:43 -0400,
>> >>>  >>   "David Li" <lidavidm@apache.org <ma...@apache.org>>
>> >>> wrote:
>> >>>  >>
>> >>>  >> > Following up here with some specific questions:
>> >>>  >> >
>> >>>  >> > Matt Topol added some Go definitions [1] (thanks!) I'd assume
>> >>> we want
>> >>>  to
>> >>>  >> vote on those as well?
>> >>>  >> >
>> >>>  >> > How should the process work for Java/Go? For C/C++, I assume
>> >>> we'd
>> >>>  treat
>> >>>  >> it like the C Data Interface and copy adbc.h to format/ after a
>> >>> vote,
>> >>>  and
>> >>>  >> then vote on releases of components. Or do we really only
>> >>> consider the C
>> >>>  >> header as the 'format', with the others being language-specific
>> >>>  affordances?
>> >>>  >> >
>> >>>  >> > What about for Java and for Go? We could vote on and tag a
>> >>> release for
>> >>>  >> Go, and add a documentation page that links to the Java/Go
>> >>> definitions
>> >>>  at a
>> >>>  >> specific revision (as the equivalent 'format' definition for
>> >>> Java/Go)?
>> >>>  Or
>> >>>  >> would we vendor the entire Java module/Go package as the
>> >>> 'format'?
>> >>>  >> >
>> >>>  >> > Do we have a preference for versioning strategy? Should we
>> >>> proceed in
>> >>>  >> lockstep with the Arrow C++ library et al. and release "ADBC
>> >>> 1.0.0"
>> >>>  (the
>> >>>  >> API standard) with "drivers version 10.0.0", or use an
>> >>> independent
>> >>>  >> versioning scheme? (For example, release API standard and
>> >>> components at
>> >>>  >> "1.0.0". Then further releases of components that do not change
>> >>> the spec
>> >>>  >> would be "1.1", "1.2", ...; if/when we change the spec, start
>> >>> over with
>> >>>  >> "2.0", "2.1", ...)
>> >>>  >> >
>> >>>  >> > [1]:
>> >>> <https://github.com/apache/arrow-adbc/blob/main/go/adbc/adbc.go>
>> >>>  >> >
>> >>>  >> > -David
>> >>>  >> >
>> >>>  >> > On Sun, Aug 28, 2022, at 10:56, Sutou Kouhei wrote:
>> >>>  >> >> Hi,
>> >>>  >> >>
>> >>>  >> >> OK. I'll send pull requests for GLib and Ruby soon.
>> >>>  >> >>
>> >>>  >> >>> I'm curious if you have a particular use case in mind.
>> >>>  >> >>
>> >>>  >> >> I don't have any production-ready use case yet but I want to
>> >>>  >> >> implement an Active Record adapter for ADBC. Active Record
>> >>>  >> >> is the O/R mapper for Ruby on Rails. Implementing web
>> >>>  >> >> applications with Ruby on Rails is one of the major Ruby use
>> >>>  >> >> cases. So providing an Active Record interface for ADBC will
>> >>>  >> >> bring more Apache Arrow users into the Ruby community.
>> >>>  >> >>
>> >>>  >> >> NOTE: Generally, Ruby on Rails users don't process large
>> >>>  >> >> data but they sometimes need to process large (medium?) data
>> >>>  >> >> in a batch process. Active Record adapter for ADBC may be
>> >>>  >> >> useful for such use case.
>> >>>  >> >>
>> >>>  >> >>> There's a little bit more API cleanup to do [1]. If you
>> >>>  >> >>> have comments on that or anything else, I'd appreciate
>> >>>  >> >>> them. Otherwise, pull requests would also be appreciated.
>> >>>  >> >>
>> >>>  >> >> OK. I'll open issues/pull requests when I find
>> >>>  >> >> something. For now, I think that "MODULE" type library
>> >>>  >> >> instead of "SHARED" type library in CMake terminology
>> >>>  >> >> [cmake] is better for driver modules. (I'll open an issue
>> >>>  >> >> for this later.)
>> >>>  >> >>
>> >>>  >> >> [cmake]:
>> >>>  <https://cmake.org/cmake/help/latest/command/add_library.html>
>> >>>  >> >>
>> >>>  >> >>
>> >>>  >> >> Thanks,
>> >>>  >> >> --
>> >>>  >> >> kou
>> >>>  >> >>
>> >>>  >> >> In <e6380315-94aa-4dd1-8685-268edd597821@www.fastmail.com
>> >>> <ma...@www.fastmail.com>>
>> >>>  >> >>   "Re: [DISC] Improving Arrow's database support" on Sat, 27
>> >>> Aug 2022
>> >>>  >> >> 15:28:56 -0400,
>> >>>  >> >>   "David Li" <lidavidm@apache.org
>> >>> <ma...@apache.org>> wrote:
>> >>>  >> >>
>> >>>  >> >>> I would be very happy to see GLib/Ruby bindings! I'm curious
>> >>> if you
>> >>>  >> have a particular use case in mind.
>> >>>  >> >>>
>> >>>  >> >>> There's a little bit more API cleanup to do [1]. If you have
>> >>>  comments
>> >>>  >> on that or anything else, I'd appreciate them. Otherwise, pull
>> >>> requests
>> >>>  >> would also be appreciated.
>> >>>  >> >>>
>> >>>  >> >>> [1]: <https://github.com/apache/arrow-adbc/issues/79>
>> >>>  >> >>>
>> >>>  >> >>> On Fri, Aug 26, 2022, at 21:53, Sutou Kouhei wrote:
>> >>>  >> >>>> Hi,
>> >>>  >> >>>>
>> >>>  >> >>>> Thanks for sharing the current status!
>> >>>  >> >>>> I understand.
>> >>>  >> >>>>
>> >>>  >> >>>> BTW, can I add GLib/Ruby bindings to apache/arrow-adbc
>> >>>  >> >>>> before we release the first version? (I want to use ADBC
>> >>>  >> >>>> from Ruby.) Or should I wait for the first release? If I can
>> >>>  >> >>>> work on it now, I'll open pull requests for it.
>> >>>  >> >>>>
>> >>>  >> >>>> Thanks,
>> >>>  >> >>>> --
>> >>>  >> >>>> kou
>> >>>  >> >>>>
>> >>>  >> >>>> In <8703efd9-51bd-4f91-b550-73830667d591@www.fastmail.com
>> >>> <ma...@www.fastmail.com>>
>> >>>  >> >>>>   "Re: [DISC] Improving Arrow's database support" on Fri,
>> >>> 26 Aug
>> >>>  2022
>> >>>  >> >>>> 11:03:26 -0400,
>> >>>  >> >>>>   "David Li" <lidavidm@apache.org
>> >>> <ma...@apache.org>> wrote:
>> >>>  >> >>>>
>> >>>  >> >>>>> Thank you Kou!
>> >>>  >> >>>>>
>> >>>  >> >>>>> At least initially, I don't think I'll be able to complete
>> >>> the
>> >>>  >> Dataset integration in time. So 10.0.0 probably won't ship with
>> >>> a hard
>> >>>  >> dependency. That said I am hoping to have PyArrow take an
>> >>> optional
>> >>>  >> dependency (so Flight SQL can finally be available from Python).
>> >>>  >> >>>>>
>> >>>  >> >>>>> On Fri, Aug 26, 2022, at 01:01, Sutou Kouhei wrote:
>> >>>  >> >>>>>> Hi,
>> >>>  >> >>>>>>
>> >>>  >> >>>>>> As a maintainer of Linux packages, I want
>> >>> apache/arrow-adbc
>> >>>  >> >>>>>> to be released before apache/arrow is released so that
>> >>>  >> >>>>>> apache/arrow's .deb/.rpm can depend on apache/arrow-adbc's
>> >>>  >> >>>>>> .deb/.rpm.
>> >>>  >> >>>>>>
>> >>>  >> >>>>>> (If Apache Arrow Dataset uses apache/arrow-adbc,
>> >>>  >> >>>>>> apache/arrow's .deb/.rpm needs to depend on
>> >>>  >> >>>>>> apache/arrow-adbc's .deb/.rpm.)
>> >>>  >> >>>>>>
>> >>>  >> >>>>>> We can add .deb/.rpm related files
>> >>>  >> >>>>>> (dev/tasks/linux-packages/ in apache/arrow) to
>> >>>  >> >>>>>> apache/arrow-adbc to build .deb/.rpm for
>> >>> apache/arrow-adbc.
>> >>>  >> >>>>>>
>> >>>  >> >>>>>> FYI: I did it for datafusion-contrib/datafusion-c:
>> >>>  >> >>>>>>
>> >>>  >> >>>>>> *
>> >>>  >>
>> >>> <https://github.com/datafusion-contrib/datafusion-c/tree/main/package>
>> >>>  >> >>>>>> *
>> >>>  >> >>>>>>
>> >>>  >>
>> >>>
>> >>> <
>> https://github.com/datafusion-contrib/datafusion-c/blob/main/.github/workflows/package.yaml
>> >
>> >>>  >> >>>>>>
>> >>>  >> >>>>>> I can work on it in apache/arrow-adbc.
>> >>>  >> >>>>>>
>> >>>  >> >>>>>>
>> >>>  >> >>>>>> Thanks,
>> >>>  >> >>>>>> --
>> >>>  >> >>>>>> kou
>> >>>  >> >>>>>>
>> >>>  >> >>>>>> In <5cbf2923-4fb4-4c5e-b11d-007209fdd8d7@www.fastmail.com
>> >>> <ma...@www.fastmail.com>>
>> >>>  >> >>>>>>   "Re: [DISC] Improving Arrow's database support" on Thu,
>> >>> 25 Aug
>> >>>  >> 2022
>> >>>  >> >>>>>> 11:51:08 -0400,
>> >>>  >> >>>>>>   "David Li" <lidavidm@apache.org
>> >>> <ma...@apache.org>> wrote:
>> >>>  >> >>>>>>
>> >>>  >> >>>>>>> Fair enough, thank you. I'll try to expand a bit. (Sorry
>> >>> for the
>> >>>  >> wall of text that follows…)
>> >>>  >> >>>>>>>
>> >>>  >> >>>>>>> These are the components:
>> >>>  >> >>>>>>>
>> >>>  >> >>>>>>> - Core adbc.h header
>> >>>  >> >>>>>>> - Driver manager for C/C++
>> >>>  >> >>>>>>> - Flight SQL-based driver
>> >>>  >> >>>>>>> - Postgres-based driver (WIP)
>> >>>  >> >>>>>>> - SQLite-based driver (more of a testbed for me than an
>> >>> actual
>> >>>  >> component - I don't think we'd actually distribute this)
>> >>>  >> >>>>>>> - Java core interfaces
>> >>>  >> >>>>>>> - Java driver manager
>> >>>  >> >>>>>>> - Java JDBC-based driver
>> >>>  >> >>>>>>> - Java Flight SQL-based driver
>> >>>  >> >>>>>>> - Python driver manager
>> >>>  >> >>>>>>>
>> >>>  >> >>>>>>> I think: adbc.h gets mirrored into the Arrow repo. The
>> >>> Flight
>> >>>  SQL
>> >>>  >> drivers get moved to the main Arrow repo and distributed as part
>> >>> of the
>> >>>  >> regular Arrow releases.
>> >>>  >> >>>>>>>
>> >>>  >> >>>>>>> For the rest of the components: they could be packaged
>> >>>  >> individually, but versioned and released together. Also, each
>> >>> C/C++
>> >>>  driver
>> >>>  >> probably needs a corresponding Python package so Python users do
>> >>> not
>> >>>  have
>> >>>  >> to futz with shared library configurations. (See [1].) So for
>> >>> instance,
>> >>>  >> installing PyArrow would also give you the Flight SQL driver,
>> >>> and `pip
>> >>>  >> install adbc_postgres` would get you the Postgres-based driver.
>> >>>  >> >>>>>>>
>> >>>  >> >>>>>>> That would mean setting up separate CI, release, etc.
>> >>> (and
>> >>>  >> eventually linking Crossbow & Conbench as well?). That does mean
>> >>>  >> duplication of effort, but the trade off is avoiding bloating
>> >>> the main
>> >>>  >> release process even further. However, I'd like to hear from
>> >>> those
>> >>>  closer
>> >>>  >> to the release process on this subject - if it would make
>> >>> people's lives
>> >>>  >> easier, we could merge everything into one repo/process.
>> >>>  >> >>>>>>>
>> >>>  >> >>>>>>> Integrations would be distributed as part of their
>> >>> respective
>> >>>  >> packages (e.g. Arrow Dataset would optionally link to the driver
>> >>>  manager).
>> >>>  >> So the "part of Arrow 10.0.0" aspect means having a stable
>> >>> interface for
>> >>>  >> adbc.h, and getting the Flight SQL drivers into the main repo.
>> >>>  >> >>>>>>>
>> >>>  >> >>>>>>> [1]: <https://github.com/apache/arrow-adbc/issues/53>
>> >>>  >> >>>>>>>
>> >>>  >> >>>>>>> On Thu, Aug 25, 2022, at 11:34, Antoine Pitrou wrote:
>> >>>  >> >>>>>>>> On Fri, 19 Aug 2022 14:09:44 -0400
>> >>>  >> >>>>>>>> "David Li" <lidavidm@apache.org
>> >>> <ma...@apache.org>> wrote:
>> >>>  >> >>>>>>>>> Since it's been a while, I'd like to give an update.
>> >>> There are
>> >>>  >> also a few questions I have around distribution.
>> >>>  >> >>>>>>>>>
>> >>>  >> >>>>>>>>> Currently:
>> >>>  >> >>>>>>>>> - Supported in C, Java, and Python.
>> >>>  >> >>>>>>>>> - For C/Python, there are basic drivers wrapping
>> >>> Flight SQL
>> >>>  and
>> >>>  >> SQLite, with a draft of a libpq (Postgres) driver (using
>> >>> nanoarrow).
>> >>>  >> >>>>>>>>> - For Java, there are drivers wrapping JDBC and Flight
>> >>> SQL.
>> >>>  >> >>>>>>>>> - For Python, there's low-level bindings to the C API,
>> >>> and the
>> >>>  >> DBAPI interface on top of that (+a few extension methods
>> >>> resembling
>> >>>  >> DuckDB/Turbodbc).
>> >>>  >> >>>>>>>>>
>> >>>  >> >>>>>>>>> There's drafts of integration with Ibis [1], DBI (R),
>> >>> and
>> >>>  >> DuckDB. (I'd like to thank Hannes and Kirill for their comments,
>> >>> as
>> >>>  well as
>> >>>  >> Antoine, Dewey, and Matt here.)
>> >>>  >> >>>>>>>>>
>> >>>  >> >>>>>>>>> I'd like to have this as part of 10.0.0 in some
>> >>> fashion.
>> >>>  >> However, I'm not sure how we would like to handle packaging and
>> >>>  >> distribution. In particular, there are several sub-components
>> >>> for each
>> >>>  >> language (the driver manager + the drivers), increasing the
>> >>> work. Any
>> >>>  >> thoughts here?
>> >>>  >> >>>>>>>>
>> >>>  >> >>>>>>>> Sorry, forgot to answer here. But I think your question
>> >>> is too
>> >>>  >> broadly
>> >>>  >> >>>>>>>> formulated. It probably deserves a case-by-case
>> >>> discussion,
>> >>>  IMHO.
>> >>>  >> >>>>>>>>
>> >>>  >> >>>>>>>>> I'm also wondering how we want to handle this in terms
>> >>> of
>> >>>  >> specification - I assume we'd consider the core header file/Java
>> >>>  interfaces
>> >>>  >> a spec like the C Data Interface/Flight RPC, and vote on
>> >>> them/mirror
>> >>>  them
>> >>>  >> into the format/ directory?
>> >>>  >> >>>>>>>>
>> >>>  >> >>>>>>>> That sounds like the right way to me indeed.
>> >>>  >> >>>>>>>>
>> >>>  >> >>>>>>>> Regards
>> >>>  >> >>>>>>>>
>> >>>  >> >>>>>>>> Antoine.
>> >>>  >>
>> >>>
>>

Re: [DISC] Improving Arrow's database support

Posted by Neal Richardson <ne...@gmail.com>.
On Mon, Sep 12, 2022 at 12:44 PM David Li <li...@apache.org> wrote:

> I like this idea. I would also like to set up some sort of automated ABI
> checker as well (the options I found were GPL/LGPL so I need to figure out
> how to proceed).
>

You should be able to use GPL software in CI, that's no problem. You can
even depend on GPL software as long as it is "optional":
https://www.apache.org/legal/resolved.html#optional But this would not even
count as that since the ABI checker wouldn't be required to use the
software.

Neal


>
> I can put up a PR later that formalizes these guidelines in
> CONTRIBUTING.md. It looks like there's a pre-commit hook for this sort of
> thing too, which'll let us enforce it in CI!
>
> On Mon, Sep 12, 2022, at 10:18, Matthew Topol wrote:
> > Automated semver would be ideal if we can do it.....
> >
> > There's quite a lot of utilities that exist which would automatically
> > handle the versioning if we're using conventional commits.
> >
> > On Mon, Sep 12 2022 at 02:26:15 PM +0200, Jacob Wujciak
> > <ja...@voltrondata.com.INVALID> wrote:
> >> + 1 to independent, semver versioning for adbc.
> >> I would propose we use conventional commit style [1] commit messages
> >> for
> >> the pr commits (I assume squash + merge) so we can automate the
> >> versioning|double check manual versioning.
> >>
> >> [1]: <https://www.conventionalcommits.org/>
> >>
> >> On Thu, Sep 8, 2022 at 6:05 PM David Li <lidavidm@apache.org
> >> <ma...@apache.org>> wrote:
> >>
> >>>  Thanks all, I've updated the header with the proposed versioning
> >>> scheme.
> >>>
> >>>  At this point I believe the core definitions are ready. (Note that
> >>> I'm
> >>>  explicitly punting on [1][2][3] here.) Absent further comments, I'd
> >>> like to
> >>>  do the following:
> >>>
> >>>  - Start a vote on mirroring adbc.h to arrow/format, as well as adding
> >>>  docs/source/format/ADBC.rst that describes the header, the Java
> >>> interface,
> >>>  the Go interface, and the versioning scheme (I will put up a PR
> >>> beforehand)
> >>>  - Begin work on CI/packaging, with a release hopefully coinciding
> >>> with
> >>>  Arrow 10.0.0
> >>>  - Begin work on changes to the main repository, also hopefully in
> >>> time for
> >>>  10.0.0 (moving the Flight SQL driver to be part of apache/arrow;
> >>> exposing
> >>>  it in PyArrow; possibly also exposing Acero via ADBC)
> >>>
> >>>  [1]: <https://github.com/apache/arrow-adbc/issues/46>
> >>>  [2]: <https://github.com/apache/arrow-adbc/issues/55>
> >>>  [3]: <https://github.com/apache/arrow-adbc/issues/59>
> >>>
> >>>  On Sat, Sep 3, 2022, at 18:36, Matthew Topol wrote:
> >>>  > +1 from me on the strategy proposed by Kou.
> >>>  >
> >>>  > That would be my preference also. I agree it is preferable to be
> >>>  versioned
> >>>  > independently.
> >>>  >
> >>>  > --Matt
> >>>  >
> >>>  > On Sat, Sep 3, 2022, 6:24 PM Sutou Kouhei <kou@clear-code.com
> >>> <ma...@clear-code.com>> wrote:
> >>>  >
> >>>  >> Hi,
> >>>  >>
> >>>  >> > Do we have a preference for versioning strategy? Should we
> >>>  >> > proceed in lockstep with the Arrow C++ library et al. and
> >>>  >> > release "ADBC 1.0.0" (the API standard) with "drivers
> >>>  >> > version 10.0.0", or use an independent versioning scheme?
> >>>  >> > (For example, release API standard and components at
> >>>  >> > "1.0.0". Then further releases of components that do not
> >>>  >> > change the spec would be "1.1", "1.2", ...; if/when we
> >>>  >> > change the spec, start over with "2.0", "2.1", ...)
> >>>  >>
> >>>  >> I like an independent versioning schema. I assume that ADBC
> >>>  >> doesn't need backward incompatible changes frequently. How
> >>>  >> about incrementing major version only when ADBC needs
> >>>  >> any backward incompatible changes?
> >>>  >>
> >>>  >> e.g.:
> >>>  >>
> >>>  >>   1.  Release ADBC (the API standard) 1.0.0
> >>>  >>   2.  Release adbc_driver_manager 1.0.0
> >>>  >>   3.  Release adbc_driver_postgres 1.0.0
> >>>  >>   4.  Add a new feature to adbc_driver_postgres without
> >>>  >>       any backward incompatible changes
> >>>  >>   5.  Release adbc_driver_postgres 1.1.0
> >>>  >>   6.  Fix a bug in adbc_driver_manager without
> >>>  >>       any backward incompatible changes
> >>>  >>   7.  Release adbc_driver_manager 1.0.1
> >>>  >>   8.  Add a backward incompatible change to adbc_driver_manager
> >>>  >>   9.  Release adbc_driver_manager 2.0.0
> >>>  >>   10. Add a new feature to ADBC without any
> >>>  >>       backward incompatible changes
> >>>  >>   11. Release ADBC (the API standard) 1.1.0
> >>>  >>
> >>>  >>
> >>>  >> Thanks,
> >>>  >> --
> >>>  >> kou
> >>>  >>
> >>>  >> In <7b20d730-b85e-4818-b99e-3335c40c2f08@www.fastmail.com
> >>> <ma...@www.fastmail.com>>
> >>>  >>   "Re: [DISC] Improving Arrow's database support" on Thu, 01 Sep
> >>> 2022
> >>>  >> 16:36:43 -0400,
> >>>  >>   "David Li" <lidavidm@apache.org <ma...@apache.org>>
> >>> wrote:
> >>>  >>
> >>>  >> > Following up here with some specific questions:
> >>>  >> >
> >>>  >> > Matt Topol added some Go definitions [1] (thanks!) I'd assume
> >>> we want
> >>>  to
> >>>  >> vote on those as well?
> >>>  >> >
> >>>  >> > How should the process work for Java/Go? For C/C++, I assume
> >>> we'd
> >>>  treat
> >>>  >> it like the C Data Interface and copy adbc.h to format/ after a
> >>> vote,
> >>>  and
> >>>  >> then vote on releases of components. Or do we really only
> >>> consider the C
> >>>  >> header as the 'format', with the others being language-specific
> >>>  affordances?
> >>>  >> >
> >>>  >> > What about for Java and for Go? We could vote on and tag a
> >>> release for
> >>>  >> Go, and add a documentation page that links to the Java/Go
> >>> definitions
> >>>  at a
> >>>  >> specific revision (as the equivalent 'format' definition for
> >>> Java/Go)?
> >>>  Or
> >>>  >> would we vendor the entire Java module/Go package as the
> >>> 'format'?
> >>>  >> >
> >>>  >> > Do we have a preference for versioning strategy? Should we
> >>> proceed in
> >>>  >> lockstep with the Arrow C++ library et al. and release "ADBC
> >>> 1.0.0"
> >>>  (the
> >>>  >> API standard) with "drivers version 10.0.0", or use an
> >>> independent
> >>>  >> versioning scheme? (For example, release API standard and
> >>> components at
> >>>  >> "1.0.0". Then further releases of components that do not change
> >>> the spec
> >>>  >> would be "1.1", "1.2", ...; if/when we change the spec, start
> >>> over with
> >>>  >> "2.0", "2.1", ...)
> >>>  >> >
> >>>  >> > [1]:
> >>> <https://github.com/apache/arrow-adbc/blob/main/go/adbc/adbc.go>
> >>>  >> >
> >>>  >> > -David
> >>>  >> >
> >>>  >> > On Sun, Aug 28, 2022, at 10:56, Sutou Kouhei wrote:
> >>>  >> >> Hi,
> >>>  >> >>
> >>>  >> >> OK. I'll send pull requests for GLib and Ruby soon.
> >>>  >> >>
> >>>  >> >>> I'm curious if you have a particular use case in mind.
> >>>  >> >>
> >>>  >> >> I don't have any production-ready use case yet but I want to
> >>>  >> >> implement an Active Record adapter for ADBC. Active Record
> >>>  >> >> is the O/R mapper for Ruby on Rails. Implementing web
> >>>  >> >> applications with Ruby on Rails is one of the major Ruby use
> >>>  >> >> cases. So providing an Active Record interface for ADBC will
> >>>  >> >> bring more Apache Arrow users into the Ruby community.
> >>>  >> >>
> >>>  >> >> NOTE: Generally, Ruby on Rails users don't process large
> >>>  >> >> data but they sometimes need to process large (medium?) data
> >>>  >> >> in a batch process. Active Record adapter for ADBC may be
> >>>  >> >> useful for such use case.
> >>>  >> >>
> >>>  >> >>> There's a little bit more API cleanup to do [1]. If you
> >>>  >> >>> have comments on that or anything else, I'd appreciate
> >>>  >> >>> them. Otherwise, pull requests would also be appreciated.
> >>>  >> >>
> >>>  >> >> OK. I'll open issues/pull requests when I find
> >>>  >> >> something. For now, I think that "MODULE" type library
> >>>  >> >> instead of "SHARED" type library in CMake terminology
> >>>  >> >> [cmake] is better for driver modules. (I'll open an issue
> >>>  >> >> for this later.)
> >>>  >> >>
> >>>  >> >> [cmake]:
> >>>  <https://cmake.org/cmake/help/latest/command/add_library.html>
> >>>  >> >>
> >>>  >> >>
> >>>  >> >> Thanks,
> >>>  >> >> --
> >>>  >> >> kou
> >>>  >> >>
> >>>  >> >> In <e6380315-94aa-4dd1-8685-268edd597821@www.fastmail.com
> >>> <ma...@www.fastmail.com>>
> >>>  >> >>   "Re: [DISC] Improving Arrow's database support" on Sat, 27
> >>> Aug 2022
> >>>  >> >> 15:28:56 -0400,
> >>>  >> >>   "David Li" <lidavidm@apache.org
> >>> <ma...@apache.org>> wrote:
> >>>  >> >>
> >>>  >> >>> I would be very happy to see GLib/Ruby bindings! I'm curious
> >>> if you
> >>>  >> have a particular use case in mind.
> >>>  >> >>>
> >>>  >> >>> There's a little bit more API cleanup to do [1]. If you have
> >>>  comments
> >>>  >> on that or anything else, I'd appreciate them. Otherwise, pull
> >>> requests
> >>>  >> would also be appreciated.
> >>>  >> >>>
> >>>  >> >>> [1]: <https://github.com/apache/arrow-adbc/issues/79>
> >>>  >> >>>
> >>>  >> >>> On Fri, Aug 26, 2022, at 21:53, Sutou Kouhei wrote:
> >>>  >> >>>> Hi,
> >>>  >> >>>>
> >>>  >> >>>> Thanks for sharing the current status!
> >>>  >> >>>> I understand.
> >>>  >> >>>>
> >>>  >> >>>> BTW, can I add GLib/Ruby bindings to apache/arrow-adbc
> >>>  >> >>>> before we release the first version? (I want to use ADBC
> >>>  >> >>>> from Ruby.) Or should I wait for the first release? If I can
> >>>  >> >>>> work on it now, I'll open pull requests for it.
> >>>  >> >>>>
> >>>  >> >>>> Thanks,
> >>>  >> >>>> --
> >>>  >> >>>> kou
> >>>  >> >>>>
> >>>  >> >>>> In <8703efd9-51bd-4f91-b550-73830667d591@www.fastmail.com
> >>> <ma...@www.fastmail.com>>
> >>>  >> >>>>   "Re: [DISC] Improving Arrow's database support" on Fri,
> >>> 26 Aug
> >>>  2022
> >>>  >> >>>> 11:03:26 -0400,
> >>>  >> >>>>   "David Li" <lidavidm@apache.org
> >>> <ma...@apache.org>> wrote:
> >>>  >> >>>>
> >>>  >> >>>>> Thank you Kou!
> >>>  >> >>>>>
> >>>  >> >>>>> At least initially, I don't think I'll be able to complete
> >>> the
> >>>  >> Dataset integration in time. So 10.0.0 probably won't ship with
> >>> a hard
> >>>  >> dependency. That said I am hoping to have PyArrow take an
> >>> optional
> >>>  >> dependency (so Flight SQL can finally be available from Python).
> >>>  >> >>>>>
> >>>  >> >>>>> On Fri, Aug 26, 2022, at 01:01, Sutou Kouhei wrote:
> >>>  >> >>>>>> Hi,
> >>>  >> >>>>>>
> >>>  >> >>>>>> As a maintainer of Linux packages, I want
> >>> apache/arrow-adbc
> >>>  >> >>>>>> to be released before apache/arrow is released so that
> >>>  >> >>>>>> apache/arrow's .deb/.rpm can depend on apache/arrow-adbc's
> >>>  >> >>>>>> .deb/.rpm.
> >>>  >> >>>>>>
> >>>  >> >>>>>> (If Apache Arrow Dataset uses apache/arrow-adbc,
> >>>  >> >>>>>> apache/arrow's .deb/.rpm needs to depend on
> >>>  >> >>>>>> apache/arrow-adbc's .deb/.rpm.)
> >>>  >> >>>>>>
> >>>  >> >>>>>> We can add .deb/.rpm related files
> >>>  >> >>>>>> (dev/tasks/linux-packages/ in apache/arrow) to
> >>>  >> >>>>>> apache/arrow-adbc to build .deb/.rpm for
> >>> apache/arrow-adbc.
> >>>  >> >>>>>>
> >>>  >> >>>>>> FYI: I did it for datafusion-contrib/datafusion-c:
> >>>  >> >>>>>>
> >>>  >> >>>>>> *
> >>>  >>
> >>> <https://github.com/datafusion-contrib/datafusion-c/tree/main/package>
> >>>  >> >>>>>> *
> >>>  >> >>>>>>
> >>>  >>
> >>>
> >>> <
> https://github.com/datafusion-contrib/datafusion-c/blob/main/.github/workflows/package.yaml
> >
> >>>  >> >>>>>>
> >>>  >> >>>>>> I can work on it in apache/arrow-adbc.
> >>>  >> >>>>>>
> >>>  >> >>>>>>
> >>>  >> >>>>>> Thanks,
> >>>  >> >>>>>> --
> >>>  >> >>>>>> kou
> >>>  >> >>>>>>
> >>>  >> >>>>>> In <5cbf2923-4fb4-4c5e-b11d-007209fdd8d7@www.fastmail.com
> >>> <ma...@www.fastmail.com>>
> >>>  >> >>>>>>   "Re: [DISC] Improving Arrow's database support" on Thu,
> >>> 25 Aug
> >>>  >> 2022
> >>>  >> >>>>>> 11:51:08 -0400,
> >>>  >> >>>>>>   "David Li" <lidavidm@apache.org
> >>> <ma...@apache.org>> wrote:
> >>>  >> >>>>>>
> >>>  >> >>>>>>> Fair enough, thank you. I'll try to expand a bit. (Sorry
> >>> for the
> >>>  >> wall of text that follows…)
> >>>  >> >>>>>>>
> >>>  >> >>>>>>> These are the components:
> >>>  >> >>>>>>>
> >>>  >> >>>>>>> - Core adbc.h header
> >>>  >> >>>>>>> - Driver manager for C/C++
> >>>  >> >>>>>>> - Flight SQL-based driver
> >>>  >> >>>>>>> - Postgres-based driver (WIP)
> >>>  >> >>>>>>> - SQLite-based driver (more of a testbed for me than an
> >>> actual
> >>>  >> component - I don't think we'd actually distribute this)
> >>>  >> >>>>>>> - Java core interfaces
> >>>  >> >>>>>>> - Java driver manager
> >>>  >> >>>>>>> - Java JDBC-based driver
> >>>  >> >>>>>>> - Java Flight SQL-based driver
> >>>  >> >>>>>>> - Python driver manager
> >>>  >> >>>>>>>
> >>>  >> >>>>>>> I think: adbc.h gets mirrored into the Arrow repo. The
> >>> Flight
> >>>  SQL
> >>>  >> drivers get moved to the main Arrow repo and distributed as part
> >>> of the
> >>>  >> regular Arrow releases.
> >>>  >> >>>>>>>
> >>>  >> >>>>>>> For the rest of the components: they could be packaged
> >>>  >> individually, but versioned and released together. Also, each
> >>> C/C++
> >>>  driver
> >>>  >> probably needs a corresponding Python package so Python users do
> >>> not
> >>>  have
> >>>  >> to futz with shared library configurations. (See [1].) So for
> >>> instance,
> >>>  >> installing PyArrow would also give you the Flight SQL driver,
> >>> and `pip
> >>>  >> install adbc_postgres` would get you the Postgres-based driver.
> >>>  >> >>>>>>>
> >>>  >> >>>>>>> That would mean setting up separate CI, release, etc.
> >>> (and
> >>>  >> eventually linking Crossbow & Conbench as well?). That does mean
> >>>  >> duplication of effort, but the trade off is avoiding bloating
> >>> the main
> >>>  >> release process even further. However, I'd like to hear from
> >>> those
> >>>  closer
> >>>  >> to the release process on this subject - if it would make
> >>> people's lives
> >>>  >> easier, we could merge everything into one repo/process.
> >>>  >> >>>>>>>
> >>>  >> >>>>>>> Integrations would be distributed as part of their
> >>> respective
> >>>  >> packages (e.g. Arrow Dataset would optionally link to the driver
> >>>  manager).
> >>>  >> So the "part of Arrow 10.0.0" aspect means having a stable
> >>> interface for
> >>>  >> adbc.h, and getting the Flight SQL drivers into the main repo.
> >>>  >> >>>>>>>
> >>>  >> >>>>>>> [1]: <https://github.com/apache/arrow-adbc/issues/53>
> >>>  >> >>>>>>>
> >>>  >> >>>>>>> On Thu, Aug 25, 2022, at 11:34, Antoine Pitrou wrote:
> >>>  >> >>>>>>>> On Fri, 19 Aug 2022 14:09:44 -0400
> >>>  >> >>>>>>>> "David Li" <lidavidm@apache.org
> >>> <ma...@apache.org>> wrote:
> >>>  >> >>>>>>>>> Since it's been a while, I'd like to give an update.
> >>> There are
> >>>  >> also a few questions I have around distribution.
> >>>  >> >>>>>>>>>
> >>>  >> >>>>>>>>> Currently:
> >>>  >> >>>>>>>>> - Supported in C, Java, and Python.
> >>>  >> >>>>>>>>> - For C/Python, there are basic drivers wrapping
> >>> Flight SQL
> >>>  and
> >>>  >> SQLite, with a draft of a libpq (Postgres) driver (using
> >>> nanoarrow).
> >>>  >> >>>>>>>>> - For Java, there are drivers wrapping JDBC and Flight
> >>> SQL.
> >>>  >> >>>>>>>>> - For Python, there's low-level bindings to the C API,
> >>> and the
> >>>  >> DBAPI interface on top of that (+a few extension methods
> >>> resembling
> >>>  >> DuckDB/Turbodbc).
> >>>  >> >>>>>>>>>
> >>>  >> >>>>>>>>> There's drafts of integration with Ibis [1], DBI (R),
> >>> and
> >>>  >> DuckDB. (I'd like to thank Hannes and Kirill for their comments,
> >>> as
> >>>  well as
> >>>  >> Antoine, Dewey, and Matt here.)
> >>>  >> >>>>>>>>>
> >>>  >> >>>>>>>>> I'd like to have this as part of 10.0.0 in some
> >>> fashion.
> >>>  >> However, I'm not sure how we would like to handle packaging and
> >>>  >> distribution. In particular, there are several sub-components
> >>> for each
> >>>  >> language (the driver manager + the drivers), increasing the
> >>> work. Any
> >>>  >> thoughts here?
> >>>  >> >>>>>>>>
> >>>  >> >>>>>>>> Sorry, forgot to answer here. But I think your question
> >>> is too
> >>>  >> broadly
> >>>  >> >>>>>>>> formulated. It probably deserves a case-by-case
> >>> discussion,
> >>>  IMHO.
> >>>  >> >>>>>>>>
> >>>  >> >>>>>>>>> I'm also wondering how we want to handle this in terms
> >>> of
> >>>  >> specification - I assume we'd consider the core header file/Java
> >>>  interfaces
> >>>  >> a spec like the C Data Interface/Flight RPC, and vote on
> >>> them/mirror
> >>>  them
> >>>  >> into the format/ directory?
> >>>  >> >>>>>>>>
> >>>  >> >>>>>>>> That sounds like the right way to me indeed.
> >>>  >> >>>>>>>>
> >>>  >> >>>>>>>> Regards
> >>>  >> >>>>>>>>
> >>>  >> >>>>>>>> Antoine.
> >>>  >>
> >>>
>

Re: [DISC] Improving Arrow's database support

Posted by David Li <li...@apache.org>.
I like this idea. I would also like to set up some sort of automated ABI checker as well (the options I found were GPL/LGPL so I need to figure out how to proceed). 

I can put up a PR later that formalizes these guidelines in CONTRIBUTING.md. It looks like there's a pre-commit hook for this sort of thing too, which'll let us enforce it in CI!

On Mon, Sep 12, 2022, at 10:18, Matthew Topol wrote:
> Automated semver would be ideal if we can do it.....
>
> There's quite a lot of utilities that exist which would automatically 
> handle the versioning if we're using conventional commits.
>
> On Mon, Sep 12 2022 at 02:26:15 PM +0200, Jacob Wujciak 
> <ja...@voltrondata.com.INVALID> wrote:
>> + 1 to independent, semver versioning for adbc.
>> I would propose we use conventional commit style [1] commit messages 
>> for
>> the pr commits (I assume squash + merge) so we can automate the
>> versioning|double check manual versioning.
>> 
>> [1]: <https://www.conventionalcommits.org/>
>> 
>> On Thu, Sep 8, 2022 at 6:05 PM David Li <lidavidm@apache.org 
>> <ma...@apache.org>> wrote:
>> 
>>>  Thanks all, I've updated the header with the proposed versioning 
>>> scheme.
>>> 
>>>  At this point I believe the core definitions are ready. (Note that 
>>> I'm
>>>  explicitly punting on [1][2][3] here.) Absent further comments, I'd 
>>> like to
>>>  do the following:
>>> 
>>>  - Start a vote on mirroring adbc.h to arrow/format, as well as adding
>>>  docs/source/format/ADBC.rst that describes the header, the Java 
>>> interface,
>>>  the Go interface, and the versioning scheme (I will put up a PR 
>>> beforehand)
>>>  - Begin work on CI/packaging, with a release hopefully coinciding 
>>> with
>>>  Arrow 10.0.0
>>>  - Begin work on changes to the main repository, also hopefully in 
>>> time for
>>>  10.0.0 (moving the Flight SQL driver to be part of apache/arrow; 
>>> exposing
>>>  it in PyArrow; possibly also exposing Acero via ADBC)
>>> 
>>>  [1]: <https://github.com/apache/arrow-adbc/issues/46>
>>>  [2]: <https://github.com/apache/arrow-adbc/issues/55>
>>>  [3]: <https://github.com/apache/arrow-adbc/issues/59>
>>> 
>>>  On Sat, Sep 3, 2022, at 18:36, Matthew Topol wrote:
>>>  > +1 from me on the strategy proposed by Kou.
>>>  >
>>>  > That would be my preference also. I agree it is preferable to be
>>>  versioned
>>>  > independently.
>>>  >
>>>  > --Matt
>>>  >
>>>  > On Sat, Sep 3, 2022, 6:24 PM Sutou Kouhei <kou@clear-code.com 
>>> <ma...@clear-code.com>> wrote:
>>>  >
>>>  >> Hi,
>>>  >>
>>>  >> > Do we have a preference for versioning strategy? Should we
>>>  >> > proceed in lockstep with the Arrow C++ library et. al. and
>>>  >> > release "ADBC 1.0.0" (the API standard) with "drivers
>>>  >> > version 10.0.0", or use an independent versioning scheme?
>>>  >> > (For example, release API standard and components at
>>>  >> > "1.0.0". Then further releases of components that do not
>>>  >> > change the spec would be "1.1", "1.2", ...; if/when we
>>>  >> > change the spec, start over with "2.0", "2.1", ...)
>>>  >>
>>>  >> I like an independent versioning schema. I assume that ADBC
>>>  >> doesn't need backward incompatible changes frequently. How
>>>  >> about incrementing major version only when ADBC needs
>>>  >> any backward incompatible changes?
>>>  >>
>>>  >> e.g.:
>>>  >>
>>>  >>   1.  Release ADBC (the API standard) 1.0.0
>>>  >>   2.  Release adbc_driver_manager 1.0.0
>>>  >>   3.  Release adbc_driver_postgres 1.0.0
>>>  >>   4.  Add a new feature to adbc_driver_postgres without
>>>  >>       any backward incompatible changes
>>>  >>   5.  Release adbc_driver_postgres 1.1.0
>>>  >>   6.  Fix a bug in adbc_driver_manager without
>>>  >>       any backward incompatible changes
>>>  >>   7.  Release adbc_driver_manager 1.0.1
>>>  >>   8.  Add a backward incompatible change to adbc_driver_manager
>>>  >>   9.  Release adbc_driver_manager 2.0.0
>>>  >>   10. Add a new feature to ADBC without any
>>>  >>       backward incompatible changes
>>>  >>   11. Release ADBC (the API standard) 1.1.0
>>>  >>
>>>  >>
>>>  >> Thanks,
>>>  >> --
>>>  >> kou
>>>  >>
>>>  >> In <7b20d730-b85e-4818-b99e-3335c40c2f08@www.fastmail.com 
>>> <ma...@www.fastmail.com>>
>>>  >>   "Re: [DISC] Improving Arrow's database support" on Thu, 01 Sep 
>>> 2022
>>>  >> 16:36:43 -0400,
>>>  >>   "David Li" <lidavidm@apache.org <ma...@apache.org>> 
>>> wrote:
>>>  >>
>>>  >> > Following up here with some specific questions:
>>>  >> >
>>>  >> > Matt Topol added some Go definitions [1] (thanks!) I'd assume 
>>> we want
>>>  to
>>>  >> vote on those as well?
>>>  >> >
>>>  >> > How should the process work for Java/Go? For C/C++, I assume 
>>> we'd
>>>  treat
>>>  >> it like the C Data Interface and copy adbc.h to format/ after a 
>>> vote,
>>>  and
>>>  >> then vote on releases of components. Or do we really only 
>>> consider the C
>>>  >> header as the 'format', with the others being language-specific
>>>  affordances?
>>>  >> >
>>>  >> > What about for Java and for Go? We could vote on and tag a 
>>> release for
>>>  >> Go, and add a documentation page that links to the Java/Go 
>>> definitions
>>>  at a
>>>  >> specific revision (as the equivalent 'format' definition for 
>>> Java/Go)?
>>>  Or
>>>  >> would we vendor the entire Java module/Go package as the 
>>> 'format'?
>>>  >> >
>>>  >> > Do we have a preference for versioning strategy? Should we 
>>> proceed in
>>>  >> lockstep with the Arrow C++ library et. al. and release "ADBC 
>>> 1.0.0"
>>>  (the
>>>  >> API standard) with "drivers version 10.0.0", or use an 
>>> independent
>>>  >> versioning scheme? (For example, release API standard and 
>>> components at
>>>  >> "1.0.0". Then further releases of components that do not change 
>>> the spec
>>>  >> would be "1.1", "1.2", ...; if/when we change the spec, start 
>>> over with
>>>  >> "2.0", "2.1", ...)
>>>  >> >
>>>  >> > [1]: 
>>> <https://github.com/apache/arrow-adbc/blob/main/go/adbc/adbc.go>
>>>  >> >
>>>  >> > -David
>>>  >> >
>>>  >> > On Sun, Aug 28, 2022, at 10:56, Sutou Kouhei wrote:
>>>  >> >> Hi,
>>>  >> >>
>>>  >> >> OK. I'll send pull requests for GLib and Ruby soon.
>>>  >> >>
>>>  >> >>> I'm curious if you have a particular use case in mind.
>>>  >> >>
>>>  >> >> I don't have any production-ready use case yet but I want to
>>>  >> >> implement an Active Record adapter for ADBC. Active Record
>>>  >> >> is the O/R mapper for Ruby on Rails. Implementing Web
>>>  >> >> application by Ruby on Rails is one of major Ruby use
>>>  >> >> cases. So providing Active Record interface for ADBC will
>>>  >> >> increase Apache Arrow users in Ruby community.
>>>  >> >>
>>>  >> >> NOTE: Generally, Ruby on Rails users don't process large
>>>  >> >> data but they sometimes need to process large (medium?) data
>>>  >> >> in a batch process. Active Record adapter for ADBC may be
>>>  >> >> useful for such use case.
>>>  >> >>
>>>  >> >>> There's a little bit more API cleanup to do [1]. If you
>>>  >> >>> have comments on that or anything else, I'd appreciate
>>>  >> >>> them. Otherwise, pull requests would also be appreciated.
>>>  >> >>
>>>  >> >> OK. I'll open issues/pull requests when I find
>>>  >> >> something. For now, I think that "MODULE" type library
>>>  >> >> instead of "SHARED" type library in CMake terminology
>>>  >> >> [cmake] is better for driver modules. (I'll open an issue
>>>  >> >> for this later.)
>>>  >> >>
>>>  >> >> [cmake]:
>>>  <https://cmake.org/cmake/help/latest/command/add_library.html>
>>>  >> >>
>>>  >> >>
>>>  >> >> Thanks,
>>>  >> >> --
>>>  >> >> kou
>>>  >> >>
>>>  >> >> In <e6380315-94aa-4dd1-8685-268edd597821@www.fastmail.com 
>>> <ma...@www.fastmail.com>>
>>>  >> >>   "Re: [DISC] Improving Arrow's database support" on Sat, 27 
>>> Aug 2022
>>>  >> >> 15:28:56 -0400,
>>>  >> >>   "David Li" <lidavidm@apache.org 
>>> <ma...@apache.org>> wrote:
>>>  >> >>
>>>  >> >>> I would be very happy to see GLib/Ruby bindings! I'm curious 
>>> if you
>>>  >> have a particular use case in mind.
>>>  >> >>>
>>>  >> >>> There's a little bit more API cleanup to do [1]. If you have
>>>  comments
>>>  >> on that or anything else, I'd appreciate them. Otherwise, pull 
>>> requests
>>>  >> would also be appreciated.
>>>  >> >>>
>>>  >> >>> [1]: <https://github.com/apache/arrow-adbc/issues/79>
>>>  >> >>>
>>>  >> >>> On Fri, Aug 26, 2022, at 21:53, Sutou Kouhei wrote:
>>>  >> >>>> Hi,
>>>  >> >>>>
>>>  >> >>>> Thanks for sharing the current status!
>>>  >> >>>> I understand.
>>>  >> >>>>
>>>  >> >>>> BTW, can I add GLib/Ruby bindings to apache/arrow-adbc
>>>  >> >>>> before we release the first version? (I want to use ADBC
>>>  >> >>>> from Ruby.) Or should I wait for the first release? If I can
>>>  >> >>>> work on it now, I'll open pull requests for it.
>>>  >> >>>>
>>>  >> >>>> Thanks,
>>>  >> >>>> --
>>>  >> >>>> kou
>>>  >> >>>>
>>>  >> >>>> In <8703efd9-51bd-4f91-b550-73830667d591@www.fastmail.com 
>>> <ma...@www.fastmail.com>>
>>>  >> >>>>   "Re: [DISC] Improving Arrow's database support" on Fri, 
>>> 26 Aug
>>>  2022
>>>  >> >>>> 11:03:26 -0400,
>>>  >> >>>>   "David Li" <lidavidm@apache.org 
>>> <ma...@apache.org>> wrote:
>>>  >> >>>>
>>>  >> >>>>> Thank you Kou!
>>>  >> >>>>>
>>>  >> >>>>> At least initially, I don't think I'll be able to complete 
>>> the
>>>  >> Dataset integration in time. So 10.0.0 probably won't ship with 
>>> a hard
>>>  >> dependency. That said I am hoping to have PyArrow take an 
>>> optional
>>>  >> dependency (so Flight SQL can finally be available from Python).
>>>  >> >>>>>
>>>  >> >>>>> On Fri, Aug 26, 2022, at 01:01, Sutou Kouhei wrote:
>>>  >> >>>>>> Hi,
>>>  >> >>>>>>
>>>  >> >>>>>> As a maintainer of Linux packages, I want 
>>> apache/arrow-adbc
>>>  >> >>>>>> to be released before apache/arrow is released so that
>>>  >> >>>>>> apache/arrow's .deb/.rpm can depend on apache/arrow-adbc's
>>>  >> >>>>>> .deb/.rpm.
>>>  >> >>>>>>
>>>  >> >>>>>> (If Apache Arrow Dataset uses apache/arrow-adbc,
>>>  >> >>>>>> apache/arrow's .deb/.rpm needs to depend on
>>>  >> >>>>>> apache/arrow-adbc's .deb/.rpm.)
>>>  >> >>>>>>
>>>  >> >>>>>> We can add .deb/.rpm related files
>>>  >> >>>>>> (dev/tasks/linux-packages/ in apache/arrow) to
>>>  >> >>>>>> apache/arrow-adbc to build .deb/.rpm for 
>>> apache/arrow-adbc.
>>>  >> >>>>>>
>>>  >> >>>>>> FYI: I did it for datafusion-contrib/datafusion-c:
>>>  >> >>>>>>
>>>  >> >>>>>> *
>>>  >> 
>>> <https://github.com/datafusion-contrib/datafusion-c/tree/main/package>
>>>  >> >>>>>> *
>>>  >> >>>>>>
>>>  >>
>>>  
>>> <https://github.com/datafusion-contrib/datafusion-c/blob/main/.github/workflows/package.yaml>
>>>  >> >>>>>>
>>>  >> >>>>>> I can work on it in apache/arrow-adbc.
>>>  >> >>>>>>
>>>  >> >>>>>>
>>>  >> >>>>>> Thanks,
>>>  >> >>>>>> --
>>>  >> >>>>>> kou
>>>  >> >>>>>>
>>>  >> >>>>>> In <5cbf2923-4fb4-4c5e-b11d-007209fdd8d7@www.fastmail.com 
>>> <ma...@www.fastmail.com>>
>>>  >> >>>>>>   "Re: [DISC] Improving Arrow's database support" on Thu, 
>>> 25 Aug
>>>  >> 2022
>>>  >> >>>>>> 11:51:08 -0400,
>>>  >> >>>>>>   "David Li" <lidavidm@apache.org 
>>> <ma...@apache.org>> wrote:
>>>  >> >>>>>>
>>>  >> >>>>>>> Fair enough, thank you. I'll try to expand a bit. (Sorry 
>>> for the
>>>  >> wall of text that follows…)
>>>  >> >>>>>>>
>>>  >> >>>>>>> These are the components:
>>>  >> >>>>>>>
>>>  >> >>>>>>> - Core adbc.h header
>>>  >> >>>>>>> - Driver manager for C/C++
>>>  >> >>>>>>> - Flight SQL-based driver
>>>  >> >>>>>>> - Postgres-based driver (WIP)
>>>  >> >>>>>>> - SQLite-based driver (more of a testbed for me than an 
>>> actual
>>>  >> component - I don't think we'd actually distribute this)
>>>  >> >>>>>>> - Java core interfaces
>>>  >> >>>>>>> - Java driver manager
>>>  >> >>>>>>> - Java JDBC-based driver
>>>  >> >>>>>>> - Java Flight SQL-based driver
>>>  >> >>>>>>> - Python driver manager
>>>  >> >>>>>>>
>>>  >> >>>>>>> I think: adbc.h gets mirrored into the Arrow repo. The 
>>> Flight
>>>  SQL
>>>  >> drivers get moved to the main Arrow repo and distributed as part 
>>> of the
>>>  >> regular Arrow releases.
>>>  >> >>>>>>>
>>>  >> >>>>>>> For the rest of the components: they could be packaged
>>>  >> individually, but versioned and released together. Also, each 
>>> C/C++
>>>  driver
>>>  >> probably needs a corresponding Python package so Python users do 
>>> not
>>>  have
>>>  >> to futz with shared library configurations. (See [1].) So for 
>>> instance,
>>>  >> installing PyArrow would also give you the Flight SQL driver, 
>>> and `pip
>>>  >> install adbc_postgres` would get you the Postgres-based driver.
>>>  >> >>>>>>>
>>>  >> >>>>>>> That would mean setting up separate CI, release, etc. 
>>> (and
>>>  >> eventually linking Crossbow & Conbench as well?). That does mean
>>>  >> duplication of effort, but the trade off is avoiding bloating 
>>> the main
>>>  >> release process even further. However, I'd like to hear from 
>>> those
>>>  closer
>>>  >> to the release process on this subject - if it would make 
>>> people's lives
>>>  >> easier, we could merge everything into one repo/process.
>>>  >> >>>>>>>
>>>  >> >>>>>>> Integrations would be distributed as part of their 
>>> respective
>>>  >> packages (e.g. Arrow Dataset would optionally link to the driver
>>>  manager).
>>>  >> So the "part of Arrow 10.0.0" aspect means having a stable 
>>> interface for
>>>  >> adbc.h, and getting the Flight SQL drivers into the main repo.
>>>  >> >>>>>>>
>>>  >> >>>>>>> [1]: <https://github.com/apache/arrow-adbc/issues/53>
>>>  >> >>>>>>>
>>>  >> >>>>>>> On Thu, Aug 25, 2022, at 11:34, Antoine Pitrou wrote:
>>>  >> >>>>>>>> On Fri, 19 Aug 2022 14:09:44 -0400
>>>  >> >>>>>>>> "David Li" <lidavidm@apache.org 
>>> <ma...@apache.org>> wrote:
>>>  >> >>>>>>>>> Since it's been a while, I'd like to give an update. 
>>> There are
>>>  >> also a few questions I have around distribution.
>>>  >> >>>>>>>>>
>>>  >> >>>>>>>>> Currently:
>>>  >> >>>>>>>>> - Supported in C, Java, and Python.
>>>  >> >>>>>>>>> - For C/Python, there are basic drivers wrapping 
>>> Flight SQL
>>>  and
>>>  >> SQLite, with a draft of a libpq (Postgres) driver (using 
>>> nanoarrow).
>>>  >> >>>>>>>>> - For Java, there are drivers wrapping JDBC and Flight 
>>> SQL.
>>>  >> >>>>>>>>> - For Python, there's low-level bindings to the C API, 
>>> and the
>>>  >> DBAPI interface on top of that (+a few extension methods 
>>> resembling
>>>  >> DuckDB/Turbodbc).
>>>  >> >>>>>>>>>
>>>  >> >>>>>>>>> There's drafts of integration with Ibis [1], DBI (R), 
>>> and
>>>  >> DuckDB. (I'd like to thank Hannes and Kirill for their comments, 
>>> as
>>>  well as
>>>  >> Antoine, Dewey, and Matt here.)
>>>  >> >>>>>>>>>
>>>  >> >>>>>>>>> I'd like to have this as part of 10.0.0 in some 
>>> fashion.
>>>  >> However, I'm not sure how we would like to handle packaging and
>>>  >> distribution. In particular, there are several sub-components 
>>> for each
>>>  >> language (the driver manager + the drivers), increasing the 
>>> work. Any
>>>  >> thoughts here?
>>>  >> >>>>>>>>
>>>  >> >>>>>>>> Sorry, forgot to answer here. But I think your question 
>>> is too
>>>  >> broadly
>>>  >> >>>>>>>> formulated. It probably deserves a case-by-case 
>>> discussion,
>>>  IMHO.
>>>  >> >>>>>>>>
>>>  >> >>>>>>>>> I'm also wondering how we want to handle this in terms 
>>> of
>>>  >> specification - I assume we'd consider the core header file/Java
>>>  interfaces
>>>  >> a spec like the C Data Interface/Flight RPC, and vote on 
>>> them/mirror
>>>  them
>>>  >> into the format/ directory?
>>>  >> >>>>>>>>
>>>  >> >>>>>>>> That sounds like the right way to me indeed.
>>>  >> >>>>>>>>
>>>  >> >>>>>>>> Regards
>>>  >> >>>>>>>>
>>>  >> >>>>>>>> Antoine.
>>>  >>
>>>

Re: [DISC] Improving Arrow's database support

Posted by Matthew Topol <ma...@voltrondata.com.INVALID>.
Automated semver would be ideal if we can do it...

There are quite a few existing utilities that would automatically handle the versioning for us if we're using conventional commits.
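
As a rough sketch of what those utilities do under the hood (assuming squash-merged, conventional-commit-style subjects; the mapping below is the common convention, not something we've decided, and the scopes are made up):

    # Sketch: derive the required semver bump from commit subjects since the
    # last release. Real tools also look at "BREAKING CHANGE:" footers; this
    # only inspects subject lines, for illustration.
    import re

    def required_bump(subjects: list[str]) -> str:
        """Return 'major', 'minor', or 'patch' for conventional-commit subjects."""
        bump = "patch"
        for subject in subjects:
            prefix = subject.split(":", 1)[0]
            if prefix.endswith("!"):  # e.g. "feat!: ..." marks a breaking change
                return "major"
            if re.match(r"^feat(\(.+\))?$", prefix):
                bump = "minor"
        return bump

    print(required_bump([
        "fix(driver-manager): handle a missing driver shared library",
        "feat(python): add DBAPI cursor helpers",
    ]))  # -> "minor"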

On Mon, Sep 12 2022 at 02:26:15 PM +0200, Jacob Wujciak 
<ja...@voltrondata.com.INVALID> wrote:
> + 1 to independent, semver versioning for adbc.
> I would propose we use conventional commit style [1] commit messages 
> for
> the pr commits (I assume squash + merge) so we can automate the
> versioning|double check manual versioning.
> 
> [1]: <https://www.conventionalcommits.org/>
> 
> On Thu, Sep 8, 2022 at 6:05 PM David Li <lidavidm@apache.org 
> <ma...@apache.org>> wrote:
> 
>>  Thanks all, I've updated the header with the proposed versioning 
>> scheme.
>> 
>>  At this point I believe the core definitions are ready. (Note that 
>> I'm
>>  explicitly punting on [1][2][3] here.) Absent further comments, I'd 
>> like to
>>  do the following:
>> 
>>  - Start a vote on mirroring adbc.h to arrow/format, as well adding
>>  docs/source/format/ADBC.rst that describes the header, the Java 
>> interface,
>>  the Go interface, and the versioning scheme (I will put up a PR 
>> beforehand)
>>  - Begin work on CI/packaging, with a release hopefully coinciding 
>> with
>>  Arrow 10.0.0
>>  - Begin work on changes to the main repository, also hopefully in 
>> time for
>>  10.0.0 (moving the Flight SQL driver to be part of apache/arrow; 
>> exposing
>>  it in PyArrow; possibly also exposing Acero via ADBC)
>> 
>>  [1]: <https://github.com/apache/arrow-adbc/issues/46>
>>  [2]: <https://github.com/apache/arrow-adbc/issues/55>
>>  [3]: <https://github.com/apache/arrow-adbc/issues/59>
>> 
>>  On Sat, Sep 3, 2022, at 18:36, Matthew Topol wrote:
>>  > +1 from me on the strategy proposed by Kou.
>>  >
>>  > That would be my preference also. I agree it is preferable to be
>>  versioned
>>  > independently.
>>  >
>>  > --Matt
>>  >
>>  > On Sat, Sep 3, 2022, 6:24 PM Sutou Kouhei <kou@clear-code.com 
>> <ma...@clear-code.com>> wrote:
>>  >
>>  >> Hi,
>>  >>
>>  >> > Do we have a preference for versioning strategy? Should we
>>  >> > proceed in lockstep with the Arrow C++ library et. al. and
>>  >> > release "ADBC 1.0.0" (the API standard) with "drivers
>>  >> > version 10.0.0", or use an independent versioning scheme?
>>  >> > (For example, release API standard and components at
>>  >> > "1.0.0". Then further releases of components that do not
>>  >> > change the spec would be "1.1", "1.2", ...; if/when we
>>  >> > change the spec, start over with "2.0", "2.1", ...)
>>  >>
>>  >> I like an independent versioning schema. I assume that ADBC
>>  >> doesn't need backward incompatible changes frequently. How
>>  >> about incrementing major version only when ADBC needs
>>  >> any backward incompatible changes?
>>  >>
>>  >> e.g.:
>>  >>
>>  >>   1.  Release ADBC (the API standard) 1.0.0
>>  >>   2.  Release adbc_driver_manager 1.0.0
>>  >>   3.  Release adbc_driver_postgres 1.0.0
>>  >>   4.  Add a new feature to adbc_driver_postgres without
>>  >>       any backward incompatible changes
>>  >>   5.  Release adbc_driver_postgres 1.1.0
>>  >>   6.  Fix a bug in adbc_driver_manager without
>>  >>       any backward incompatible changes
>>  >>   7.  Release adbc_driver_manager 1.0.1
>>  >>   8.  Add a backward incompatible change to adbc_driver_manager
>>  >>   9.  Release adbc_driver_manager 2.0.0
>>  >>   10. Add a new feature to ADBC without any
>>  >>       backward incompatible changes
>>  >>   11. Release ADBC (the API standard) 1.1.0
>>  >>
>>  >>
>>  >> Thanks,
>>  >> --
>>  >> kou
>>  >>
>>  >> In <7b20d730-b85e-4818-b99e-3335c40c2f08@www.fastmail.com 
>> <ma...@www.fastmail.com>>
>>  >>   "Re: [DISC] Improving Arrow's database support" on Thu, 01 Sep 
>> 2022
>>  >> 16:36:43 -0400,
>>  >>   "David Li" <lidavidm@apache.org <ma...@apache.org>> 
>> wrote:
>>  >>
>>  >> > Following up here with some specific questions:
>>  >> >
>>  >> > Matt Topol added some Go definitions [1] (thanks!) I'd assume 
>> we want
>>  to
>>  >> vote on those as well?
>>  >> >
>>  >> > How should the process work for Java/Go? For C/C++, I assume 
>> we'd
>>  treat
>>  >> it like the C Data Interface and copy adbc.h to format/ after a 
>> vote,
>>  and
>>  >> then vote on releases of components. Or do we really only 
>> consider the C
>>  >> header as the 'format', with the others being language-specific
>>  affordances?
>>  >> >
>>  >> > What about for Java and for Go? We could vote on and tag a 
>> release for
>>  >> Go, and add a documentation page that links to the Java/Go 
>> definitions
>>  at a
>>  >> specific revision (as the equivalent 'format' definition for 
>> Java/Go)?
>>  Or
>>  >> would we vendor the entire Java module/Go package as the 
>> 'format'?
>>  >> >
>>  >> > Do we have a preference for versioning strategy? Should we 
>> proceed in
>>  >> lockstep with the Arrow C++ library et. al. and release "ADBC 
>> 1.0.0"
>>  (the
>>  >> API standard) with "drivers version 10.0.0", or use an 
>> independent
>>  >> versioning scheme? (For example, release API standard and 
>> components at
>>  >> "1.0.0". Then further releases of components that do not change 
>> the spec
>>  >> would be "1.1", "1.2", ...; if/when we change the spec, start 
>> over with
>>  >> "2.0", "2.1", ...)
>>  >> >
>>  >> > [1]: 
>> <https://github.com/apache/arrow-adbc/blob/main/go/adbc/adbc.go>
>>  >> >
>>  >> > -David
>>  >> >
>>  >> > On Sun, Aug 28, 2022, at 10:56, Sutou Kouhei wrote:
>>  >> >> Hi,
>>  >> >>
>>  >> >> OK. I'll send pull requests for GLib and Ruby soon.
>>  >> >>
>>  >> >>> I'm curious if you have a particular use case in mind.
>>  >> >>
>>  >> >> I don't have any production-ready use case yet but I want to
>>  >> >> implement an Active Record adapter for ADBC. Active Record
>>  >> >> is the O/R mapper for Ruby on Rails. Implementing Web
>>  >> >> application by Ruby on Rails is one of major Ruby use
>>  >> >> cases. So providing Active Record interface for ADBC will
>>  >> >> increase Apache Arrow users in Ruby community.
>>  >> >>
>>  >> >> NOTE: Generally, Ruby on Rails users don't process large
>>  >> >> data but they sometimes need to process large (medium?) data
>>  >> >> in a batch process. Active Record adapter for ADBC may be
>>  >> >> useful for such use case.
>>  >> >>
>>  >> >>> There's a little bit more API cleanup to do [1]. If you
>>  >> >>> have comments on that or anything else, I'd appreciate
>>  >> >>> them. Otherwise, pull requests would also be appreciated.
>>  >> >>
>>  >> >> OK. I'll open issues/pull requests when I find
>>  >> >> something. For now, I think that "MODULE" type library
>>  >> >> instead of "SHARED" type library in CMake terminology
>>  >> >> [cmake] is better for driver modules. (I'll open an issue
>>  >> >> for this later.)
>>  >> >>
>>  >> >> [cmake]:
>>  <https://cmake.org/cmake/help/latest/command/add_library.html>
>>  >> >>
>>  >> >>
>>  >> >> Thanks,
>>  >> >> --
>>  >> >> kou
>>  >> >>
>>  >> >> In <e6380315-94aa-4dd1-8685-268edd597821@www.fastmail.com 
>> <ma...@www.fastmail.com>>
>>  >> >>   "Re: [DISC] Improving Arrow's database support" on Sat, 27 
>> Aug 2022
>>  >> >> 15:28:56 -0400,
>>  >> >>   "David Li" <lidavidm@apache.org 
>> <ma...@apache.org>> wrote:
>>  >> >>
>>  >> >>> I would be very happy to see GLib/Ruby bindings! I'm curious 
>> if you
>>  >> have a particular use case in mind.
>>  >> >>>
>>  >> >>> There's a little bit more API cleanup to do [1]. If you have
>>  comments
>>  >> on that or anything else, I'd appreciate them. Otherwise, pull 
>> requests
>>  >> would also be appreciated.
>>  >> >>>
>>  >> >>> [1]: <https://github.com/apache/arrow-adbc/issues/79>
>>  >> >>>
>>  >> >>> On Fri, Aug 26, 2022, at 21:53, Sutou Kouhei wrote:
>>  >> >>>> Hi,
>>  >> >>>>
>>  >> >>>> Thanks for sharing the current status!
>>  >> >>>> I understand.
>>  >> >>>>
>>  >> >>>> BTW, can I add GLib/Ruby bindings to apache/arrow-adbc
>>  >> >>>> before we release the first version? (I want to use ADBC
>>  >> >>>> from Ruby.) Or should I wait for the first release? If I can
>>  >> >>>> work on it now, I'll open pull requests for it.
>>  >> >>>>
>>  >> >>>> Thanks,
>>  >> >>>> --
>>  >> >>>> kou
>>  >> >>>>
>>  >> >>>> In <8703efd9-51bd-4f91-b550-73830667d591@www.fastmail.com 
>> <ma...@www.fastmail.com>>
>>  >> >>>>   "Re: [DISC] Improving Arrow's database support" on Fri, 
>> 26 Aug
>>  2022
>>  >> >>>> 11:03:26 -0400,
>>  >> >>>>   "David Li" <lidavidm@apache.org 
>> <ma...@apache.org>> wrote:
>>  >> >>>>
>>  >> >>>>> Thank you Kou!
>>  >> >>>>>
>>  >> >>>>> At least initially, I don't think I'll be able to complete 
>> the
>>  >> Dataset integration in time. So 10.0.0 probably won't ship with 
>> a hard
>>  >> dependency. That said I am hoping to have PyArrow take an 
>> optional
>>  >> dependency (so Flight SQL can finally be available from Python).
>>  >> >>>>>
>>  >> >>>>> On Fri, Aug 26, 2022, at 01:01, Sutou Kouhei wrote:
>>  >> >>>>>> Hi,
>>  >> >>>>>>
>>  >> >>>>>> As a maintainer of Linux packages, I want 
>> apache/arrow-adbc
>>  >> >>>>>> to be released before apache/arrow is released so that
>>  >> >>>>>> apache/arrow's .deb/.rpm can depend on apache/arrow-adbc's
>>  >> >>>>>> .deb/.rpm.
>>  >> >>>>>>
>>  >> >>>>>> (If Apache Arrow Dataset uses apache/arrow-adbc,
>>  >> >>>>>> apache/arrow's .deb/.rpm needs to depend on
>>  >> >>>>>> apache/arrow-adbc's .deb/.rpm.)
>>  >> >>>>>>
>>  >> >>>>>> We can add .deb/.rpm related files
>>  >> >>>>>> (dev/tasks/linux-packages/ in apache/arrow) to
>>  >> >>>>>> apache/arrow-adbc to build .deb/.rpm for 
>> apache/arrow-adbc.
>>  >> >>>>>>
>>  >> >>>>>> FYI: I did it for datafusion-contrib/datafusion-c:
>>  >> >>>>>>
>>  >> >>>>>> *
>>  >> 
>> <https://github.com/datafusion-contrib/datafusion-c/tree/main/package>
>>  >> >>>>>> *
>>  >> >>>>>>
>>  >>
>>  
>> <https://github.com/datafusion-contrib/datafusion-c/blob/main/.github/workflows/package.yaml>
>>  >> >>>>>>
>>  >> >>>>>> I can work on it in apache/arrow-adbc.
>>  >> >>>>>>
>>  >> >>>>>>
>>  >> >>>>>> Thanks,
>>  >> >>>>>> --
>>  >> >>>>>> kou
>>  >> >>>>>>
>>  >> >>>>>> In <5cbf2923-4fb4-4c5e-b11d-007209fdd8d7@www.fastmail.com 
>> <ma...@www.fastmail.com>>
>>  >> >>>>>>   "Re: [DISC] Improving Arrow's database support" on Thu, 
>> 25 Aug
>>  >> 2022
>>  >> >>>>>> 11:51:08 -0400,
>>  >> >>>>>>   "David Li" <lidavidm@apache.org 
>> <ma...@apache.org>> wrote:
>>  >> >>>>>>
>>  >> >>>>>>> Fair enough, thank you. I'll try to expand a bit. (Sorry 
>> for the
>>  >> wall of text that follows…)
>>  >> >>>>>>>
>>  >> >>>>>>> These are the components:
>>  >> >>>>>>>
>>  >> >>>>>>> - Core adbc.h header
>>  >> >>>>>>> - Driver manager for C/C++
>>  >> >>>>>>> - Flight SQL-based driver
>>  >> >>>>>>> - Postgres-based driver (WIP)
>>  >> >>>>>>> - SQLite-based driver (more of a testbed for me than an 
>> actual
>>  >> component - I don't think we'd actually distribute this)
>>  >> >>>>>>> - Java core interfaces
>>  >> >>>>>>> - Java driver manager
>>  >> >>>>>>> - Java JDBC-based driver
>>  >> >>>>>>> - Java Flight SQL-based driver
>>  >> >>>>>>> - Python driver manager
>>  >> >>>>>>>
>>  >> >>>>>>> I think: adbc.h gets mirrored into the Arrow repo. The 
>> Flight
>>  SQL
>>  >> drivers get moved to the main Arrow repo and distributed as part 
>> of the
>>  >> regular Arrow releases.
>>  >> >>>>>>>
>>  >> >>>>>>> For the rest of the components: they could be packaged
>>  >> individually, but versioned and released together. Also, each 
>> C/C++
>>  driver
>>  >> probably needs a corresponding Python package so Python users do 
>> not
>>  have
>>  >> to futz with shared library configurations. (See [1].) So for 
>> instance,
>>  >> installing PyArrow would also give you the Flight SQL driver, 
>> and `pip
>>  >> install adbc_postgres` would get you the Postgres-based driver.
>>  >> >>>>>>>
>>  >> >>>>>>> That would mean setting up separate CI, release, etc. 
>> (and
>>  >> eventually linking Crossbow & Conbench as well?). That does mean
>>  >> duplication of effort, but the trade off is avoiding bloating 
>> the main
>>  >> release process even further. However, I'd like to hear from 
>> those
>>  closer
>>  >> to the release process on this subject - if it would make 
>> people's lives
>>  >> easier, we could merge everything into one repo/process.
>>  >> >>>>>>>
>>  >> >>>>>>> Integrations would be distributed as part of their 
>> respective
>>  >> packages (e.g. Arrow Dataset would optionally link to the driver
>>  manager).
>>  >> So the "part of Arrow 10.0.0" aspect means having a stable 
>> interface for
>>  >> adbc.h, and getting the Flight SQL drivers into the main repo.
>>  >> >>>>>>>
>>  >> >>>>>>> [1]: <https://github.com/apache/arrow-adbc/issues/53>
>>  >> >>>>>>>
>>  >> >>>>>>> On Thu, Aug 25, 2022, at 11:34, Antoine Pitrou wrote:
>>  >> >>>>>>>> On Fri, 19 Aug 2022 14:09:44 -0400
>>  >> >>>>>>>> "David Li" <lidavidm@apache.org 
>> <ma...@apache.org>> wrote:
>>  >> >>>>>>>>> Since it's been a while, I'd like to give an update. 
>> There are
>>  >> also a few questions I have around distribution.
>>  >> >>>>>>>>>
>>  >> >>>>>>>>> Currently:
>>  >> >>>>>>>>> - Supported in C, Java, and Python.
>>  >> >>>>>>>>> - For C/Python, there are basic drivers wrapping 
>> Flight SQL
>>  and
>>  >> SQLite, with a draft of a libpq (Postgres) driver (using 
>> nanoarrow).
>>  >> >>>>>>>>> - For Java, there are drivers wrapping JDBC and Flight 
>> SQL.
>>  >> >>>>>>>>> - For Python, there's low-level bindings to the C API, 
>> and the
>>  >> DBAPI interface on top of that (+a few extension methods 
>> resembling
>>  >> DuckDB/Turbodbc).
>>  >> >>>>>>>>>
>>  >> >>>>>>>>> There's drafts of integration with Ibis [1], DBI (R), 
>> and
>>  >> DuckDB. (I'd like to thank Hannes and Kirill for their comments, 
>> as
>>  well as
>>  >> Antoine, Dewey, and Matt here.)
>>  >> >>>>>>>>>
>>  >> >>>>>>>>> I'd like to have this as part of 10.0.0 in some 
>> fashion.
>>  >> However, I'm not sure how we would like to handle packaging and
>>  >> distribution. In particular, there are several sub-components 
>> for each
>>  >> language (the driver manager + the drivers), increasing the 
>> work. Any
>>  >> thoughts here?
>>  >> >>>>>>>>
>>  >> >>>>>>>> Sorry, forgot to answer here. But I think your question 
>> is too
>>  >> broadly
>>  >> >>>>>>>> formulated. It probably deserves a case-by-case 
>> discussion,
>>  IMHO.
>>  >> >>>>>>>>
>>  >> >>>>>>>>> I'm also wondering how we want to handle this in terms 
>> of
>>  >> specification - I assume we'd consider the core header file/Java
>>  interfaces
>>  >> a spec like the C Data Interface/Flight RPC, and vote on 
>> them/mirror
>>  them
>>  >> into the format/ directory?
>>  >> >>>>>>>>
>>  >> >>>>>>>> That sounds like the right way to me indeed.
>>  >> >>>>>>>>
>>  >> >>>>>>>> Regards
>>  >> >>>>>>>>
>>  >> >>>>>>>> Antoine.
>>  >>
>> 


Re: [DISC] Improving Arrow's database support

Posted by Jacob Wujciak <ja...@voltrondata.com.INVALID>.
+1 to independent semver versioning for ADBC.
I would propose we use conventional-commit-style [1] commit messages for the PR commits (I assume squash + merge), so we can automate the versioning, or at least double-check manual versioning.

[1]: https://www.conventionalcommits.org/
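
For concreteness, squash-merged subjects would then look something like the following (the component scopes are made up for illustration; the type prefix is what a tool would key the version bump off of):

    feat(c/driver/postgres): add an option to control batch size
    fix(python): close the underlying statement when the cursor is closed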

On Thu, Sep 8, 2022 at 6:05 PM David Li <li...@apache.org> wrote:

> Thanks all, I've updated the header with the proposed versioning scheme.
>
> At this point I believe the core definitions are ready. (Note that I'm
> explicitly punting on [1][2][3] here.) Absent further comments, I'd like to
> do the following:
>
> - Start a vote on mirroring adbc.h to arrow/format, as well adding
> docs/source/format/ADBC.rst that describes the header, the Java interface,
> the Go interface, and the versioning scheme (I will put up a PR beforehand)
> - Begin work on CI/packaging, with a release hopefully coinciding with
> Arrow 10.0.0
> - Begin work on changes to the main repository, also hopefully in time for
> 10.0.0 (moving the Flight SQL driver to be part of apache/arrow; exposing
> it in PyArrow; possibly also exposing Acero via ADBC)
>
> [1]: https://github.com/apache/arrow-adbc/issues/46
> [2]: https://github.com/apache/arrow-adbc/issues/55
> [3]: https://github.com/apache/arrow-adbc/issues/59
>
> On Sat, Sep 3, 2022, at 18:36, Matthew Topol wrote:
> > +1 from me on the strategy proposed by Kou.
> >
> > That would be my preference also. I agree it is preferable to be
> versioned
> > independently.
> >
> > --Matt
> >
> > On Sat, Sep 3, 2022, 6:24 PM Sutou Kouhei <ko...@clear-code.com> wrote:
> >
> >> Hi,
> >>
> >> > Do we have a preference for versioning strategy? Should we
> >> > proceed in lockstep with the Arrow C++ library et. al. and
> >> > release "ADBC 1.0.0" (the API standard) with "drivers
> >> > version 10.0.0", or use an independent versioning scheme?
> >> > (For example, release API standard and components at
> >> > "1.0.0". Then further releases of components that do not
> >> > change the spec would be "1.1", "1.2", ...; if/when we
> >> > change the spec, start over with "2.0", "2.1", ...)
> >>
> >> I like an independent versioning schema. I assume that ADBC
> >> doesn't need backward incompatible changes frequently. How
> >> about incrementing major version only when ADBC needs
> >> any backward incompatible changes?
> >>
> >> e.g.:
> >>
> >>   1.  Release ADBC (the API standard) 1.0.0
> >>   2.  Release adbc_driver_manager 1.0.0
> >>   3.  Release adbc_driver_postgres 1.0.0
> >>   4.  Add a new feature to adbc_driver_postgres without
> >>       any backward incompatible changes
> >>   5.  Release adbc_driver_postgres 1.1.0
> >>   6.  Fix a bug in adbc_driver_manager without
> >>       any backward incompatible changes
> >>   7.  Release adbc_driver_manager 1.0.1
> >>   8.  Add a backward incompatible change to adbc_driver_manager
> >>   9.  Release adbc_driver_manager 2.0.0
> >>   10. Add a new feature to ADBC without any
> >>       backward incompatible changes
> >>   11. Release ADBC (the API standard) 1.1.0
> >>
> >>
> >> Thanks,
> >> --
> >> kou
> >>
> >> In <7b...@www.fastmail.com>
> >>   "Re: [DISC] Improving Arrow's database support" on Thu, 01 Sep 2022
> >> 16:36:43 -0400,
> >>   "David Li" <li...@apache.org> wrote:
> >>
> >> > Following up here with some specific questions:
> >> >
> >> > Matt Topol added some Go definitions [1] (thanks!) I'd assume we want
> to
> >> vote on those as well?
> >> >
> >> > How should the process work for Java/Go? For C/C++, I assume we'd
> treat
> >> it like the C Data Interface and copy adbc.h to format/ after a vote,
> and
> >> then vote on releases of components. Or do we really only consider the C
> >> header as the 'format', with the others being language-specific
> affordances?
> >> >
> >> > What about for Java and for Go? We could vote on and tag a release for
> >> Go, and add a documentation page that links to the Java/Go definitions
> at a
> >> specific revision (as the equivalent 'format' definition for Java/Go)?
> Or
> >> would we vendor the entire Java module/Go package as the 'format'?
> >> >
> >> > Do we have a preference for versioning strategy? Should we proceed in
> >> lockstep with the Arrow C++ library et. al. and release "ADBC 1.0.0"
> (the
> >> API standard) with "drivers version 10.0.0", or use an independent
> >> versioning scheme? (For example, release API standard and components at
> >> "1.0.0". Then further releases of components that do not change the spec
> >> would be "1.1", "1.2", ...; if/when we change the spec, start over with
> >> "2.0", "2.1", ...)
> >> >
> >> > [1]: https://github.com/apache/arrow-adbc/blob/main/go/adbc/adbc.go
> >> >
> >> > -David
> >> >
> >> > On Sun, Aug 28, 2022, at 10:56, Sutou Kouhei wrote:
> >> >> Hi,
> >> >>
> >> >> OK. I'll send pull requests for GLib and Ruby soon.
> >> >>
> >> >>> I'm curious if you have a particular use case in mind.
> >> >>
> >> >> I don't have any production-ready use case yet but I want to
> >> >> implement an Active Record adapter for ADBC. Active Record
> >> >> is the O/R mapper for Ruby on Rails. Implementing Web
> >> >> application by Ruby on Rails is one of major Ruby use
> >> >> cases. So providing Active Record interface for ADBC will
> >> >> increase Apache Arrow users in Ruby community.
> >> >>
> >> >> NOTE: Generally, Ruby on Rails users don't process large
> >> >> data but they sometimes need to process large (medium?) data
> >> >> in a batch process. Active Record adapter for ADBC may be
> >> >> useful for such use case.
> >> >>
> >> >>> There's a little bit more API cleanup to do [1]. If you
> >> >>> have comments on that or anything else, I'd appreciate
> >> >>> them. Otherwise, pull requests would also be appreciated.
> >> >>
> >> >> OK. I'll open issues/pull requests when I find
> >> >> something. For now, I think that "MODULE" type library
> >> >> instead of "SHARED" type library in CMake terminology
> >> >> [cmake] is better for driver modules. (I'll open an issue
> >> >> for this later.)
> >> >>
> >> >> [cmake]:
> https://cmake.org/cmake/help/latest/command/add_library.html
> >> >>
> >> >>
> >> >> Thanks,
> >> >> --
> >> >> kou
> >> >>
> >> >> In <e6...@www.fastmail.com>
> >> >>   "Re: [DISC] Improving Arrow's database support" on Sat, 27 Aug 2022
> >> >> 15:28:56 -0400,
> >> >>   "David Li" <li...@apache.org> wrote:
> >> >>
> >> >>> I would be very happy to see GLib/Ruby bindings! I'm curious if you
> >> have a particular use case in mind.
> >> >>>
> >> >>> There's a little bit more API cleanup to do [1]. If you have
> comments
> >> on that or anything else, I'd appreciate them. Otherwise, pull requests
> >> would also be appreciated.
> >> >>>
> >> >>> [1]: https://github.com/apache/arrow-adbc/issues/79
> >> >>>
> >> >>> On Fri, Aug 26, 2022, at 21:53, Sutou Kouhei wrote:
> >> >>>> Hi,
> >> >>>>
> >> >>>> Thanks for sharing the current status!
> >> >>>> I understand.
> >> >>>>
> >> >>>> BTW, can I add GLib/Ruby bindings to apache/arrow-adbc
> >> >>>> before we release the first version? (I want to use ADBC
> >> >>>> from Ruby.) Or should I wait for the first release? If I can
> >> >>>> work on it now, I'll open pull requests for it.
> >> >>>>
> >> >>>> Thanks,
> >> >>>> --
> >> >>>> kou
> >> >>>>
> >> >>>> In <87...@www.fastmail.com>
> >> >>>>   "Re: [DISC] Improving Arrow's database support" on Fri, 26 Aug
> 2022
> >> >>>> 11:03:26 -0400,
> >> >>>>   "David Li" <li...@apache.org> wrote:
> >> >>>>
> >> >>>>> Thank you Kou!
> >> >>>>>
> >> >>>>> At least initially, I don't think I'll be able to complete the
> >> Dataset integration in time. So 10.0.0 probably won't ship with a hard
> >> dependency. That said I am hoping to have PyArrow take an optional
> >> dependency (so Flight SQL can finally be available from Python).
> >> >>>>>
> >> >>>>> On Fri, Aug 26, 2022, at 01:01, Sutou Kouhei wrote:
> >> >>>>>> Hi,
> >> >>>>>>
> >> >>>>>> As a maintainer of Linux packages, I want apache/arrow-adbc
> >> >>>>>> to be released before apache/arrow is released so that
> >> >>>>>> apache/arrow's .deb/.rpm can depend on apache/arrow-adbc's
> >> >>>>>> .deb/.rpm.
> >> >>>>>>
> >> >>>>>> (If Apache Arrow Dataset uses apache/arrow-adbc,
> >> >>>>>> apache/arrow's .deb/.rpm needs to depend on
> >> >>>>>> apache/arrow-adbc's .deb/.rpm.)
> >> >>>>>>
> >> >>>>>> We can add .deb/.rpm related files
> >> >>>>>> (dev/tasks/linux-packages/ in apache/arrow) to
> >> >>>>>> apache/arrow-adbc to build .deb/.rpm for apache/arrow-adbc.
> >> >>>>>>
> >> >>>>>> FYI: I did it for datafusion-contrib/datafusion-c:
> >> >>>>>>
> >> >>>>>> *
> >> https://github.com/datafusion-contrib/datafusion-c/tree/main/package
> >> >>>>>> *
> >> >>>>>>
> >>
> https://github.com/datafusion-contrib/datafusion-c/blob/main/.github/workflows/package.yaml
> >> >>>>>>
> >> >>>>>> I can work on it in apache/arrow-adbc.
> >> >>>>>>
> >> >>>>>>
> >> >>>>>> Thanks,
> >> >>>>>> --
> >> >>>>>> kou
> >> >>>>>>
> >> >>>>>> In <5c...@www.fastmail.com>
> >> >>>>>>   "Re: [DISC] Improving Arrow's database support" on Thu, 25 Aug
> >> 2022
> >> >>>>>> 11:51:08 -0400,
> >> >>>>>>   "David Li" <li...@apache.org> wrote:
> >> >>>>>>
> >> >>>>>>> Fair enough, thank you. I'll try to expand a bit. (Sorry for the
> >> wall of text that follows…)
> >> >>>>>>>
> >> >>>>>>> These are the components:
> >> >>>>>>>
> >> >>>>>>> - Core adbc.h header
> >> >>>>>>> - Driver manager for C/C++
> >> >>>>>>> - Flight SQL-based driver
> >> >>>>>>> - Postgres-based driver (WIP)
> >> >>>>>>> - SQLite-based driver (more of a testbed for me than an actual
> >> component - I don't think we'd actually distribute this)
> >> >>>>>>> - Java core interfaces
> >> >>>>>>> - Java driver manager
> >> >>>>>>> - Java JDBC-based driver
> >> >>>>>>> - Java Flight SQL-based driver
> >> >>>>>>> - Python driver manager
> >> >>>>>>>
> >> >>>>>>> I think: adbc.h gets mirrored into the Arrow repo. The Flight
> SQL
> >> drivers get moved to the main Arrow repo and distributed as part of the
> >> regular Arrow releases.
> >> >>>>>>>
> >> >>>>>>> For the rest of the components: they could be packaged
> >> individually, but versioned and released together. Also, each C/C++
> driver
> >> probably needs a corresponding Python package so Python users do not
> have
> >> to futz with shared library configurations. (See [1].) So for instance,
> >> installing PyArrow would also give you the Flight SQL driver, and `pip
> >> install adbc_postgres` would get you the Postgres-based driver.
> >> >>>>>>>
> >> >>>>>>> That would mean setting up separate CI, release, etc. (and
> >> eventually linking Crossbow & Conbench as well?). That does mean
> >> duplication of effort, but the trade off is avoiding bloating the main
> >> release process even further. However, I'd like to hear from those
> closer
> >> to the release process on this subject - if it would make people's lives
> >> easier, we could merge everything into one repo/process.
> >> >>>>>>>
> >> >>>>>>> Integrations would be distributed as part of their respective
> >> packages (e.g. Arrow Dataset would optionally link to the driver
> manager).
> >> So the "part of Arrow 10.0.0" aspect means having a stable interface for
> >> adbc.h, and getting the Flight SQL drivers into the main repo.
> >> >>>>>>>
> >> >>>>>>> [1]: https://github.com/apache/arrow-adbc/issues/53
> >> >>>>>>>
> >> >>>>>>> On Thu, Aug 25, 2022, at 11:34, Antoine Pitrou wrote:
> >> >>>>>>>> On Fri, 19 Aug 2022 14:09:44 -0400
> >> >>>>>>>> "David Li" <li...@apache.org> wrote:
> >> >>>>>>>>> Since it's been a while, I'd like to give an update. There are
> >> also a few questions I have around distribution.
> >> >>>>>>>>>
> >> >>>>>>>>> Currently:
> >> >>>>>>>>> - Supported in C, Java, and Python.
> >> >>>>>>>>> - For C/Python, there are basic drivers wrapping Flight SQL
> and
> >> SQLite, with a draft of a libpq (Postgres) driver (using nanoarrow).
> >> >>>>>>>>> - For Java, there are drivers wrapping JDBC and Flight SQL.
> >> >>>>>>>>> - For Python, there's low-level bindings to the C API, and the
> >> DBAPI interface on top of that (+a few extension methods resembling
> >> DuckDB/Turbodbc).
> >> >>>>>>>>>
> >> >>>>>>>>> There's drafts of integration with Ibis [1], DBI (R), and
> >> DuckDB. (I'd like to thank Hannes and Kirill for their comments, as
> well as
> >> Antoine, Dewey, and Matt here.)
> >> >>>>>>>>>
> >> >>>>>>>>> I'd like to have this as part of 10.0.0 in some fashion.
> >> However, I'm not sure how we would like to handle packaging and
> >> distribution. In particular, there are several sub-components for each
> >> language (the driver manager + the drivers), increasing the work. Any
> >> thoughts here?
> >> >>>>>>>>
> >> >>>>>>>> Sorry, forgot to answer here. But I think your question is too
> >> broadly
> >> >>>>>>>> formulated. It probably deserves a case-by-case discussion,
> IMHO.
> >> >>>>>>>>
> >> >>>>>>>>> I'm also wondering how we want to handle this in terms of
> >> specification - I assume we'd consider the core header file/Java
> interfaces
> >> a spec like the C Data Interface/Flight RPC, and vote on them/mirror
> them
> >> into the format/ directory?
> >> >>>>>>>>
> >> >>>>>>>> That sounds like the right way to me indeed.
> >> >>>>>>>>
> >> >>>>>>>> Regards
> >> >>>>>>>>
> >> >>>>>>>> Antoine.
> >>
>

Re: [DISC] Improving Arrow's database support

Posted by David Li <li...@apache.org>.
Thanks all, I've updated the header with the proposed versioning scheme.

At this point I believe the core definitions are ready. (Note that I'm explicitly punting on [1][2][3] here.) Absent further comments, I'd like to do the following:

- Start a vote on mirroring adbc.h to arrow/format, as well as adding docs/source/format/ADBC.rst that describes the header, the Java interface, the Go interface, and the versioning scheme (I will put up a PR beforehand)
- Begin work on CI/packaging, with a release hopefully coinciding with Arrow 10.0.0
- Begin work on changes to the main repository, also hopefully in time for 10.0.0 (moving the Flight SQL driver to be part of apache/arrow; exposing it in PyArrow, as sketched after the links below; possibly also exposing Acero via ADBC)

[1]: https://github.com/apache/arrow-adbc/issues/46
[2]: https://github.com/apache/arrow-adbc/issues/55
[3]: https://github.com/apache/arrow-adbc/issues/59
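
For the Python angle, here is a rough sketch of what using the Flight SQL driver through the DBAPI layer could look like once it's exposed; all package, module, and method names here are assumptions for illustration, not a finalized or released API:

    # Sketch only: names are assumptions based on this thread, not a released API.
    import adbc_driver_flightsql.dbapi as flightsql

    # Connect to a hypothetical Flight SQL endpoint.
    with flightsql.connect("grpc://localhost:31337") as conn:
        with conn.cursor() as cur:
            cur.execute("SELECT 1 AS answer")
            print(cur.fetchall())  # plain DBAPI access
            # An Arrow-native extension method in the DuckDB/Turbodbc spirit
            # might look like: table = cur.fetch_arrow_table()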

On Sat, Sep 3, 2022, at 18:36, Matthew Topol wrote:
> +1 from me on the strategy proposed by Kou.
>
> That would be my preference also. I agree it is preferable to be versioned
> independently.
>
> --Matt
>
> On Sat, Sep 3, 2022, 6:24 PM Sutou Kouhei <ko...@clear-code.com> wrote:
>
>> Hi,
>>
>> > Do we have a preference for versioning strategy? Should we
>> > proceed in lockstep with the Arrow C++ library et. al. and
>> > release "ADBC 1.0.0" (the API standard) with "drivers
>> > version 10.0.0", or use an independent versioning scheme?
>> > (For example, release API standard and components at
>> > "1.0.0". Then further releases of components that do not
>> > change the spec would be "1.1", "1.2", ...; if/when we
>> > change the spec, start over with "2.0", "2.1", ...)
>>
>> I like an independent versioning schema. I assume that ADBC
>> doesn't need backward incompatible changes frequently. How
>> about incrementing major version only when ADBC needs
>> any backward incompatible changes?
>>
>> e.g.:
>>
>>   1.  Release ADBC (the API standard) 1.0.0
>>   2.  Release adbc_driver_manager 1.0.0
>>   3.  Release adbc_driver_postgres 1.0.0
>>   4.  Add a new feature to adbc_driver_postgres without
>>       any backward incompatible changes
>>   5.  Release adbc_driver_postgres 1.1.0
>>   6.  Fix a bug in adbc_driver_manager without
>>       any backward incompatible changes
>>   7.  Release adbc_driver_manager 1.0.1
>>   8.  Add a backward incompatible change to adbc_driver_manager
>>   9.  Release adbc_driver_manager 2.0.0
>>   10. Add a new feature to ADBC without any
>>       backward incompatible changes
>>   11. Release ADBC (the API standard) 1.1.0
>>
>>
>> Thanks,
>> --
>> kou
>>
>> In <7b...@www.fastmail.com>
>>   "Re: [DISC] Improving Arrow's database support" on Thu, 01 Sep 2022
>> 16:36:43 -0400,
>>   "David Li" <li...@apache.org> wrote:
>>
>> > Following up here with some specific questions:
>> >
>> > Matt Topol added some Go definitions [1] (thanks!) I'd assume we want to
>> vote on those as well?
>> >
>> > How should the process work for Java/Go? For C/C++, I assume we'd treat
>> it like the C Data Interface and copy adbc.h to format/ after a vote, and
>> then vote on releases of components. Or do we really only consider the C
>> header as the 'format', with the others being language-specific affordances?
>> >
>> > What about for Java and for Go? We could vote on and tag a release for
>> Go, and add a documentation page that links to the Java/Go definitions at a
>> specific revision (as the equivalent 'format' definition for Java/Go)? Or
>> would we vendor the entire Java module/Go package as the 'format'?
>> >
>> > Do we have a preference for versioning strategy? Should we proceed in
>> lockstep with the Arrow C++ library et. al. and release "ADBC 1.0.0" (the
>> API standard) with "drivers version 10.0.0", or use an independent
>> versioning scheme? (For example, release API standard and components at
>> "1.0.0". Then further releases of components that do not change the spec
>> would be "1.1", "1.2", ...; if/when we change the spec, start over with
>> "2.0", "2.1", ...)
>> >
>> > [1]: https://github.com/apache/arrow-adbc/blob/main/go/adbc/adbc.go
>> >
>> > -David
>> >
>> > On Sun, Aug 28, 2022, at 10:56, Sutou Kouhei wrote:
>> >> Hi,
>> >>
>> >> OK. I'll send pull requests for GLib and Ruby soon.
>> >>
>> >>> I'm curious if you have a particular use case in mind.
>> >>
>> >> I don't have any production-ready use case yet but I want to
>> >> implement an Active Record adapter for ADBC. Active Record
>> >> is the O/R mapper for Ruby on Rails. Implementing Web
>> >> application by Ruby on Rails is one of major Ruby use
>> >> cases. So providing Active Record interface for ADBC will
>> >> increase Apache Arrow users in Ruby community.
>> >>
>> >> NOTE: Generally, Ruby on Rails users don't process large
>> >> data but they sometimes need to process large (medium?) data
>> >> in a batch process. Active Record adapter for ADBC may be
>> >> useful for such use case.
>> >>
>> >>> There's a little bit more API cleanup to do [1]. If you
>> >>> have comments on that or anything else, I'd appreciate
>> >>> them. Otherwise, pull requests would also be appreciated.
>> >>
>> >> OK. I'll open issues/pull requests when I find
>> >> something. For now, I think that "MODULE" type library
>> >> instead of "SHARED" type library in CMake terminology
>> >> [cmake] is better for driver modules. (I'll open an issue
>> >> for this later.)
>> >>
>> >> [cmake]: https://cmake.org/cmake/help/latest/command/add_library.html
>> >>
>> >>
>> >> Thanks,
>> >> --
>> >> kou
>> >>
>> >> In <e6...@www.fastmail.com>
>> >>   "Re: [DISC] Improving Arrow's database support" on Sat, 27 Aug 2022
>> >> 15:28:56 -0400,
>> >>   "David Li" <li...@apache.org> wrote:
>> >>
>> >>> I would be very happy to see GLib/Ruby bindings! I'm curious if you
>> have a particular use case in mind.
>> >>>
>> >>> There's a little bit more API cleanup to do [1]. If you have comments
>> on that or anything else, I'd appreciate them. Otherwise, pull requests
>> would also be appreciated.
>> >>>
>> >>> [1]: https://github.com/apache/arrow-adbc/issues/79
>> >>>
>> >>> On Fri, Aug 26, 2022, at 21:53, Sutou Kouhei wrote:
>> >>>> Hi,
>> >>>>
>> >>>> Thanks for sharing the current status!
>> >>>> I understand.
>> >>>>
>> >>>> BTW, can I add GLib/Ruby bindings to apache/arrow-adbc
>> >>>> before we release the first version? (I want to use ADBC
>> >>>> from Ruby.) Or should I wait for the first release? If I can
>> >>>> work on it now, I'll open pull requests for it.
>> >>>>
>> >>>> Thanks,
>> >>>> --
>> >>>> kou
>> >>>>
>> >>>> In <87...@www.fastmail.com>
>> >>>>   "Re: [DISC] Improving Arrow's database support" on Fri, 26 Aug 2022
>> >>>> 11:03:26 -0400,
>> >>>>   "David Li" <li...@apache.org> wrote:
>> >>>>
>> >>>>> Thank you Kou!
>> >>>>>
>> >>>>> At least initially, I don't think I'll be able to complete the
>> Dataset integration in time. So 10.0.0 probably won't ship with a hard
>> dependency. That said I am hoping to have PyArrow take an optional
>> dependency (so Flight SQL can finally be available from Python).
>> >>>>>
>> >>>>> On Fri, Aug 26, 2022, at 01:01, Sutou Kouhei wrote:
>> >>>>>> Hi,
>> >>>>>>
>> >>>>>> As a maintainer of Linux packages, I want apache/arrow-adbc
>> >>>>>> to be released before apache/arrow is released so that
>> >>>>>> apache/arrow's .deb/.rpm can depend on apache/arrow-adbc's
>> >>>>>> .deb/.rpm.
>> >>>>>>
>> >>>>>> (If Apache Arrow Dataset uses apache/arrow-adbc,
>> >>>>>> apache/arrow's .deb/.rpm needs to depend on
>> >>>>>> apache/arrow-adbc's .deb/.rpm.)
>> >>>>>>
>> >>>>>> We can add .deb/.rpm related files
>> >>>>>> (dev/tasks/linux-packages/ in apache/arrow) to
>> >>>>>> apache/arrow-adbc to build .deb/.rpm for apache/arrow-adbc.
>> >>>>>>
>> >>>>>> FYI: I did it for datafusion-contrib/datafusion-c:
>> >>>>>>
>> >>>>>> *
>> https://github.com/datafusion-contrib/datafusion-c/tree/main/package
>> >>>>>> *
>> >>>>>>
>> https://github.com/datafusion-contrib/datafusion-c/blob/main/.github/workflows/package.yaml
>> >>>>>>
>> >>>>>> I can work on it in apache/arrow-adbc.
>> >>>>>>
>> >>>>>>
>> >>>>>> Thanks,
>> >>>>>> --
>> >>>>>> kou
>> >>>>>>
>> >>>>>> In <5c...@www.fastmail.com>
>> >>>>>>   "Re: [DISC] Improving Arrow's database support" on Thu, 25 Aug
>> 2022
>> >>>>>> 11:51:08 -0400,
>> >>>>>>   "David Li" <li...@apache.org> wrote:
>> >>>>>>
>> >>>>>>> Fair enough, thank you. I'll try to expand a bit. (Sorry for the
>> wall of text that follows…)
>> >>>>>>>
>> >>>>>>> These are the components:
>> >>>>>>>
>> >>>>>>> - Core adbc.h header
>> >>>>>>> - Driver manager for C/C++
>> >>>>>>> - Flight SQL-based driver
>> >>>>>>> - Postgres-based driver (WIP)
>> >>>>>>> - SQLite-based driver (more of a testbed for me than an actual
>> component - I don't think we'd actually distribute this)
>> >>>>>>> - Java core interfaces
>> >>>>>>> - Java driver manager
>> >>>>>>> - Java JDBC-based driver
>> >>>>>>> - Java Flight SQL-based driver
>> >>>>>>> - Python driver manager
>> >>>>>>>
>> >>>>>>> I think: adbc.h gets mirrored into the Arrow repo. The Flight SQL
>> drivers get moved to the main Arrow repo and distributed as part of the
>> regular Arrow releases.
>> >>>>>>>
>> >>>>>>> For the rest of the components: they could be packaged
>> individually, but versioned and released together. Also, each C/C++ driver
>> probably needs a corresponding Python package so Python users do not have
>> to futz with shared library configurations. (See [1].) So for instance,
>> installing PyArrow would also give you the Flight SQL driver, and `pip
>> install adbc_postgres` would get you the Postgres-based driver.
>> >>>>>>>
>> >>>>>>> That would mean setting up separate CI, release, etc. (and
>> eventually linking Crossbow & Conbench as well?). That does mean
>> duplication of effort, but the trade off is avoiding bloating the main
>> release process even further. However, I'd like to hear from those closer
>> to the release process on this subject - if it would make people's lives
>> easier, we could merge everything into one repo/process.
>> >>>>>>>
>> >>>>>>> Integrations would be distributed as part of their respective
>> packages (e.g. Arrow Dataset would optionally link to the driver manager).
>> So the "part of Arrow 10.0.0" aspect means having a stable interface for
>> adbc.h, and getting the Flight SQL drivers into the main repo.
>> >>>>>>>
>> >>>>>>> [1]: https://github.com/apache/arrow-adbc/issues/53
>> >>>>>>>
>> >>>>>>> On Thu, Aug 25, 2022, at 11:34, Antoine Pitrou wrote:
>> >>>>>>>> On Fri, 19 Aug 2022 14:09:44 -0400
>> >>>>>>>> "David Li" <li...@apache.org> wrote:
>> >>>>>>>>> Since it's been a while, I'd like to give an update. There are
>> also a few questions I have around distribution.
>> >>>>>>>>>
>> >>>>>>>>> Currently:
>> >>>>>>>>> - Supported in C, Java, and Python.
>> >>>>>>>>> - For C/Python, there are basic drivers wrapping Flight SQL and
>> SQLite, with a draft of a libpq (Postgres) driver (using nanoarrow).
>> >>>>>>>>> - For Java, there are drivers wrapping JDBC and Flight SQL.
>> >>>>>>>>> - For Python, there's low-level bindings to the C API, and the
>> DBAPI interface on top of that (+a few extension methods resembling
>> DuckDB/Turbodbc).
>> >>>>>>>>>
>> >>>>>>>>> There's drafts of integration with Ibis [1], DBI (R), and
>> DuckDB. (I'd like to thank Hannes and Kirill for their comments, as well as
>> Antoine, Dewey, and Matt here.)
>> >>>>>>>>>
>> >>>>>>>>> I'd like to have this as part of 10.0.0 in some fashion.
>> However, I'm not sure how we would like to handle packaging and
>> distribution. In particular, there are several sub-components for each
>> language (the driver manager + the drivers), increasing the work. Any
>> thoughts here?
>> >>>>>>>>
>> >>>>>>>> Sorry, forgot to answer here. But I think your question is too
>> broadly
>> >>>>>>>> formulated. It probably deserves a case-by-case discussion, IMHO.
>> >>>>>>>>
>> >>>>>>>>> I'm also wondering how we want to handle this in terms of
>> specification - I assume we'd consider the core header file/Java interfaces
>> a spec like the C Data Interface/Flight RPC, and vote on them/mirror them
>> into the format/ directory?
>> >>>>>>>>
>> >>>>>>>> That sounds like the right way to me indeed.
>> >>>>>>>>
>> >>>>>>>> Regards
>> >>>>>>>>
>> >>>>>>>> Antoine.
>>

Re: [DISC] Improving Arrow's database support

Posted by Matthew Topol <ma...@voltrondata.com.INVALID>.
+1 from me on the strategy proposed by Kou.

That would be my preference also. I agree it is preferable to be versioned
independently.

--Matt

On Sat, Sep 3, 2022, 6:24 PM Sutou Kouhei <ko...@clear-code.com> wrote:

> Hi,
>
> > Do we have a preference for versioning strategy? Should we
> > proceed in lockstep with the Arrow C++ library et. al. and
> > release "ADBC 1.0.0" (the API standard) with "drivers
> > version 10.0.0", or use an independent versioning scheme?
> > (For example, release API standard and components at
> > "1.0.0". Then further releases of components that do not
> > change the spec would be "1.1", "1.2", ...; if/when we
> > change the spec, start over with "2.0", "2.1", ...)
>
> I like an independent versioning schema. I assume that ADBC
> doesn't need backward incompatible changes frequently. How
> about incrementing major version only when ADBC needs
> any backward incompatible changes?
>
> e.g.:
>
>   1.  Release ADBC (the API standard) 1.0.0
>   2.  Release adbc_driver_manager 1.0.0
>   3.  Release adbc_driver_postgres 1.0.0
>   4.  Add a new feature to adbc_driver_postgres without
>       any backward incompatible changes
>   5.  Release adbc_driver_postgres 1.1.0
>   6.  Fix a bug in adbc_driver_manager without
>       any backward incompatible changes
>   7.  Release adbc_driver_manager 1.0.1
>   8.  Add a backward incompatible change to adbc_driver_manager
>   9.  Release adbc_driver_manager 2.0.0
>   10. Add a new feature to ADBC without any
>       backward incompatible changes
>   11. Release ADBC (the API standard) 1.1.0
>
>
> Thanks,
> --
> kou
>
> In <7b...@www.fastmail.com>
>   "Re: [DISC] Improving Arrow's database support" on Thu, 01 Sep 2022
> 16:36:43 -0400,
>   "David Li" <li...@apache.org> wrote:
>
> > Following up here with some specific questions:
> >
> > Matt Topol added some Go definitions [1] (thanks!) I'd assume we want to
> vote on those as well?
> >
> > How should the process work for Java/Go? For C/C++, I assume we'd treat
> it like the C Data Interface and copy adbc.h to format/ after a vote, and
> then vote on releases of components. Or do we really only consider the C
> header as the 'format', with the others being language-specific affordances?
> >
> > What about for Java and for Go? We could vote on and tag a release for
> Go, and add a documentation page that links to the Java/Go definitions at a
> specific revision (as the equivalent 'format' definition for Java/Go)? Or
> would we vendor the entire Java module/Go package as the 'format'?
> >
> > Do we have a preference for versioning strategy? Should we proceed in
> lockstep with the Arrow C++ library et. al. and release "ADBC 1.0.0" (the
> API standard) with "drivers version 10.0.0", or use an independent
> versioning scheme? (For example, release API standard and components at
> "1.0.0". Then further releases of components that do not change the spec
> would be "1.1", "1.2", ...; if/when we change the spec, start over with
> "2.0", "2.1", ...)
> >
> > [1]: https://github.com/apache/arrow-adbc/blob/main/go/adbc/adbc.go
> >
> > -David
> >
> > On Sun, Aug 28, 2022, at 10:56, Sutou Kouhei wrote:
> >> Hi,
> >>
> >> OK. I'll send pull requests for GLib and Ruby soon.
> >>
> >>> I'm curious if you have a particular use case in mind.
> >>
> >> I don't have any production-ready use case yet but I want to
> >> implement an Active Record adapter for ADBC. Active Record
> >> is the O/R mapper for Ruby on Rails. Implementing Web
> >> application by Ruby on Rails is one of major Ruby use
> >> cases. So providing Active Record interface for ADBC will
> >> increase Apache Arrow users in Ruby community.
> >>
> >> NOTE: Generally, Ruby on Rails users don't process large
> >> data but they sometimes need to process large (medium?) data
> >> in a batch process. Active Record adapter for ADBC may be
> >> useful for such use case.
> >>
> >>> There's a little bit more API cleanup to do [1]. If you
> >>> have comments on that or anything else, I'd appreciate
> >>> them. Otherwise, pull requests would also be appreciated.
> >>
> >> OK. I'll open issues/pull requests when I find
> >> something. For now, I think that "MODULE" type library
> >> instead of "SHARED" type library in CMake terminology
> >> [cmake] is better for driver modules. (I'll open an issue
> >> for this later.)
> >>
> >> [cmake]: https://cmake.org/cmake/help/latest/command/add_library.html
> >>
> >>
> >> Thanks,
> >> --
> >> kou
> >>
> >> In <e6...@www.fastmail.com>
> >>   "Re: [DISC] Improving Arrow's database support" on Sat, 27 Aug 2022
> >> 15:28:56 -0400,
> >>   "David Li" <li...@apache.org> wrote:
> >>
> >>> I would be very happy to see GLib/Ruby bindings! I'm curious if you
> have a particular use case in mind.
> >>>
> >>> There's a little bit more API cleanup to do [1]. If you have comments
> on that or anything else, I'd appreciate them. Otherwise, pull requests
> would also be appreciated.
> >>>
> >>> [1]: https://github.com/apache/arrow-adbc/issues/79
> >>>
> >>> On Fri, Aug 26, 2022, at 21:53, Sutou Kouhei wrote:
> >>>> Hi,
> >>>>
> >>>> Thanks for sharing the current status!
> >>>> I understand.
> >>>>
> >>>> BTW, can I add GLib/Ruby bindings to apache/arrow-adbc
> >>>> before we release the first version? (I want to use ADBC
> >>>> from Ruby.) Or should I wait for the first release? If I can
> >>>> work on it now, I'll open pull requests for it.
> >>>>
> >>>> Thanks,
> >>>> --
> >>>> kou
> >>>>
> >>>> In <87...@www.fastmail.com>
> >>>>   "Re: [DISC] Improving Arrow's database support" on Fri, 26 Aug 2022
> >>>> 11:03:26 -0400,
> >>>>   "David Li" <li...@apache.org> wrote:
> >>>>
> >>>>> Thank you Kou!
> >>>>>
> >>>>> At least initially, I don't think I'll be able to complete the
> Dataset integration in time. So 10.0.0 probably won't ship with a hard
> dependency. That said I am hoping to have PyArrow take an optional
> dependency (so Flight SQL can finally be available from Python).
> >>>>>
> >>>>> On Fri, Aug 26, 2022, at 01:01, Sutou Kouhei wrote:
> >>>>>> Hi,
> >>>>>>
> >>>>>> As a maintainer of Linux packages, I want apache/arrow-adbc
> >>>>>> to be released before apache/arrow is released so that
> >>>>>> apache/arrow's .deb/.rpm can depend on apache/arrow-adbc's
> >>>>>> .deb/.rpm.
> >>>>>>
> >>>>>> (If Apache Arrow Dataset uses apache/arrow-adbc,
> >>>>>> apache/arrow's .deb/.rpm needs to depend on
> >>>>>> apache/arrow-adbc's .deb/.rpm.)
> >>>>>>
> >>>>>> We can add .deb/.rpm related files
> >>>>>> (dev/tasks/linux-packages/ in apache/arrow) to
> >>>>>> apache/arrow-adbc to build .deb/.rpm for apache/arrow-adbc.
> >>>>>>
> >>>>>> FYI: I did it for datafusion-contrib/datafusion-c:
> >>>>>>
> >>>>>> *
> https://github.com/datafusion-contrib/datafusion-c/tree/main/package
> >>>>>> *
> >>>>>>
> https://github.com/datafusion-contrib/datafusion-c/blob/main/.github/workflows/package.yaml
> >>>>>>
> >>>>>> I can work on it in apache/arrow-adbc.
> >>>>>>
> >>>>>>
> >>>>>> Thanks,
> >>>>>> --
> >>>>>> kou
> >>>>>>
> >>>>>> In <5c...@www.fastmail.com>
> >>>>>>   "Re: [DISC] Improving Arrow's database support" on Thu, 25 Aug
> 2022
> >>>>>> 11:51:08 -0400,
> >>>>>>   "David Li" <li...@apache.org> wrote:
> >>>>>>
> >>>>>>> Fair enough, thank you. I'll try to expand a bit. (Sorry for the
> wall of text that follows…)
> >>>>>>>
> >>>>>>> These are the components:
> >>>>>>>
> >>>>>>> - Core adbc.h header
> >>>>>>> - Driver manager for C/C++
> >>>>>>> - Flight SQL-based driver
> >>>>>>> - Postgres-based driver (WIP)
> >>>>>>> - SQLite-based driver (more of a testbed for me than an actual
> component - I don't think we'd actually distribute this)
> >>>>>>> - Java core interfaces
> >>>>>>> - Java driver manager
> >>>>>>> - Java JDBC-based driver
> >>>>>>> - Java Flight SQL-based driver
> >>>>>>> - Python driver manager
> >>>>>>>
> >>>>>>> I think: adbc.h gets mirrored into the Arrow repo. The Flight SQL
> drivers get moved to the main Arrow repo and distributed as part of the
> regular Arrow releases.
> >>>>>>>
> >>>>>>> For the rest of the components: they could be packaged
> individually, but versioned and released together. Also, each C/C++ driver
> probably needs a corresponding Python package so Python users do not have
> to futz with shared library configurations. (See [1].) So for instance,
> installing PyArrow would also give you the Flight SQL driver, and `pip
> install adbc_postgres` would get you the Postgres-based driver.
> >>>>>>>
> >>>>>>> That would mean setting up separate CI, release, etc. (and
> eventually linking Crossbow & Conbench as well?). That does mean
> duplication of effort, but the trade off is avoiding bloating the main
> release process even further. However, I'd like to hear from those closer
> to the release process on this subject - if it would make people's lives
> easier, we could merge everything into one repo/process.
> >>>>>>>
> >>>>>>> Integrations would be distributed as part of their respective
> packages (e.g. Arrow Dataset would optionally link to the driver manager).
> So the "part of Arrow 10.0.0" aspect means having a stable interface for
> adbc.h, and getting the Flight SQL drivers into the main repo.
> >>>>>>>
> >>>>>>> [1]: https://github.com/apache/arrow-adbc/issues/53
> >>>>>>>
> >>>>>>> On Thu, Aug 25, 2022, at 11:34, Antoine Pitrou wrote:
> >>>>>>>> On Fri, 19 Aug 2022 14:09:44 -0400
> >>>>>>>> "David Li" <li...@apache.org> wrote:
> >>>>>>>>> Since it's been a while, I'd like to give an update. There are
> also a few questions I have around distribution.
> >>>>>>>>>
> >>>>>>>>> Currently:
> >>>>>>>>> - Supported in C, Java, and Python.
> >>>>>>>>> - For C/Python, there are basic drivers wrapping Flight SQL and
> SQLite, with a draft of a libpq (Postgres) driver (using nanoarrow).
> >>>>>>>>> - For Java, there are drivers wrapping JDBC and Flight SQL.
> >>>>>>>>> - For Python, there's low-level bindings to the C API, and the
> DBAPI interface on top of that (+a few extension methods resembling
> DuckDB/Turbodbc).
> >>>>>>>>>
> >>>>>>>>> There's drafts of integration with Ibis [1], DBI (R), and
> DuckDB. (I'd like to thank Hannes and Kirill for their comments, as well as
> Antoine, Dewey, and Matt here.)
> >>>>>>>>>
> >>>>>>>>> I'd like to have this as part of 10.0.0 in some fashion.
> However, I'm not sure how we would like to handle packaging and
> distribution. In particular, there are several sub-components for each
> language (the driver manager + the drivers), increasing the work. Any
> thoughts here?
> >>>>>>>>
> >>>>>>>> Sorry, forgot to answer here. But I think your question is too
> broadly
> >>>>>>>> formulated. It probably deserves a case-by-case discussion, IMHO.
> >>>>>>>>
> >>>>>>>>> I'm also wondering how we want to handle this in terms of
> specification - I assume we'd consider the core header file/Java interfaces
> a spec like the C Data Interface/Flight RPC, and vote on them/mirror them
> into the format/ directory?
> >>>>>>>>
> >>>>>>>> That sounds like the right way to me indeed.
> >>>>>>>>
> >>>>>>>> Regards
> >>>>>>>>
> >>>>>>>> Antoine.
>

Re: [DISC] Improving Arrow's database support

Posted by Sutou Kouhei <ko...@clear-code.com>.
Hi,

> Do we have a preference for versioning strategy? Should we
> proceed in lockstep with the Arrow C++ library et. al. and
> release "ADBC 1.0.0" (the API standard) with "drivers
> version 10.0.0", or use an independent versioning scheme?
> (For example, release API standard and components at
> "1.0.0". Then further releases of components that do not
> change the spec would be "1.1", "1.2", ...; if/when we
> change the spec, start over with "2.0", "2.1", ...)

I like an independent versioning scheme. I assume that ADBC
won't need backward-incompatible changes frequently. How
about incrementing the major version only when ADBC needs
a backward-incompatible change?

e.g.:

  1.  Release ADBC (the API standard) 1.0.0
  2.  Release adbc_driver_manager 1.0.0
  3.  Release adbc_driver_postgres 1.0.0
  4.  Add a new feature to adbc_driver_postgres without
      any backward incompatible changes
  5.  Release adbc_driver_postgres 1.1.0
  6.  Fix a bug in adbc_driver_manager without
      any backward incompatible changes
  7.  Release adbc_driver_manager 1.0.1
  8.  Add a backward incompatible change to adbc_driver_manager
  9.  Release adbc_driver_manager 2.0.0
  10. Add a new feature to ADBC without any
      backward incompatible changes
  11. Release ADBC (the API standard) 1.1.0
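
To illustrate the compatibility rule implied by this example, here is a
minimal sketch in Python (the function name is illustrative only, not part
of any ADBC API): minor/patch releases are additive, while a major bump
signals a backward-incompatible change that consumers should reject.

  def is_compatible(required_major: int, driver_version: str) -> bool:
      """Accept any release whose major version matches what we built against."""
      driver_major = int(driver_version.split(".")[0])
      return driver_major == required_major

  assert is_compatible(1, "1.0.0")      # initial release
  assert is_compatible(1, "1.1.0")      # new feature, still compatible
  assert is_compatible(1, "1.0.1")      # bug fix, still compatible
  assert not is_compatible(1, "2.0.0")  # backward-incompatible change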


Thanks,
-- 
kou

In <7b...@www.fastmail.com>
  "Re: [DISC] Improving Arrow's database support" on Thu, 01 Sep 2022 16:36:43 -0400,
  "David Li" <li...@apache.org> wrote:

> Following up here with some specific questions:
> 
> Matt Topol added some Go definitions [1] (thanks!) I'd assume we want to vote on those as well?
> 
> How should the process work for Java/Go? For C/C++, I assume we'd treat it like the C Data Interface and copy adbc.h to format/ after a vote, and then vote on releases of components. Or do we really only consider the C header as the 'format', with the others being language-specific affordances?
> 
> What about for Java and for Go? We could vote on and tag a release for Go, and add a documentation page that links to the Java/Go definitions at a specific revision (as the equivalent 'format' definition for Java/Go)? Or would we vendor the entire Java module/Go package as the 'format'?
> 
> Do we have a preference for versioning strategy? Should we proceed in lockstep with the Arrow C++ library et. al. and release "ADBC 1.0.0" (the API standard) with "drivers version 10.0.0", or use an independent versioning scheme? (For example, release API standard and components at "1.0.0". Then further releases of components that do not change the spec would be "1.1", "1.2", ...; if/when we change the spec, start over with "2.0", "2.1", ...)
> 
> [1]: https://github.com/apache/arrow-adbc/blob/main/go/adbc/adbc.go
> 
> -David
> 
> On Sun, Aug 28, 2022, at 10:56, Sutou Kouhei wrote:
>> Hi,
>>
>> OK. I'll send pull requests for GLib and Ruby soon.
>>
>>> I'm curious if you have a particular use case in mind.
>>
>> I don't have any production-ready use case yet but I want to
>> implement an Active Record adapter for ADBC. Active Record
>> is the O/R mapper for Ruby on Rails. Implementing Web
>> application by Ruby on Rails is one of major Ruby use
>> cases. So providing Active Record interface for ADBC will
>> increase Apache Arrow users in Ruby community.
>>
>> NOTE: Generally, Ruby on Rails users don't process large
>> data but they sometimes need to process large (medium?) data
>> in a batch process. Active Record adapter for ADBC may be
>> useful for such use case.
>>
>>> There's a little bit more API cleanup to do [1]. If you
>>> have comments on that or anything else, I'd appreciate
>>> them. Otherwise, pull requests would also be appreciated.
>>
>> OK. I'll open issues/pull requests when I find
>> something. For now, I think that "MODULE" type library
>> instead of "SHARED" type library in CMake terminology
>> [cmake] is better for driver modules. (I'll open an issue
>> for this later.)
>>
>> [cmake]: https://cmake.org/cmake/help/latest/command/add_library.html
>>
>>
>> Thanks,
>> -- 
>> kou
>>
>> In <e6...@www.fastmail.com>
>>   "Re: [DISC] Improving Arrow's database support" on Sat, 27 Aug 2022 
>> 15:28:56 -0400,
>>   "David Li" <li...@apache.org> wrote:
>>
>>> I would be very happy to see GLib/Ruby bindings! I'm curious if you have a particular use case in mind. 
>>> 
>>> There's a little bit more API cleanup to do [1]. If you have comments on that or anything else, I'd appreciate them. Otherwise, pull requests would also be appreciated.
>>> 
>>> [1]: https://github.com/apache/arrow-adbc/issues/79
>>> 
>>> On Fri, Aug 26, 2022, at 21:53, Sutou Kouhei wrote:
>>>> Hi,
>>>>
>>>> Thanks for sharing the current status!
>>>> I understand.
>>>>
>>>> BTW, can I add GLib/Ruby bindings to apache/arrow-adbc
>>>> before we release the first version? (I want to use ADBC
>>>> from Ruby.) Or should I wait for the first release? If I can
>>>> work on it now, I'll open pull requests for it.
>>>>
>>>> Thanks,
>>>> -- 
>>>> kou
>>>>
>>>> In <87...@www.fastmail.com>
>>>>   "Re: [DISC] Improving Arrow's database support" on Fri, 26 Aug 2022 
>>>> 11:03:26 -0400,
>>>>   "David Li" <li...@apache.org> wrote:
>>>>
>>>>> Thank you Kou!
>>>>> 
>>>>> At least initially, I don't think I'll be able to complete the Dataset integration in time. So 10.0.0 probably won't ship with a hard dependency. That said I am hoping to have PyArrow take an optional dependency (so Flight SQL can finally be available from Python).
>>>>> 
>>>>> On Fri, Aug 26, 2022, at 01:01, Sutou Kouhei wrote:
>>>>>> Hi,
>>>>>>
>>>>>> As a maintainer of Linux packages, I want apache/arrow-adbc
>>>>>> to be released before apache/arrow is released so that
>>>>>> apache/arrow's .deb/.rpm can depend on apache/arrow-adbc's
>>>>>> .deb/.rpm.
>>>>>>
>>>>>> (If Apache Arrow Dataset uses apache/arrow-adbc,
>>>>>> apache/arrow's .deb/.rpm needs to depend on
>>>>>> apache/arrow-adbc's .deb/.rpm.)
>>>>>>
>>>>>> We can add .deb/.rpm related files
>>>>>> (dev/tasks/linux-packages/ in apache/arrow) to
>>>>>> apache/arrow-adbc to build .deb/.rpm for apache/arrow-adbc.
>>>>>>
>>>>>> FYI: I did it for datafusion-contrib/datafusion-c:
>>>>>>
>>>>>> * https://github.com/datafusion-contrib/datafusion-c/tree/main/package
>>>>>> * 
>>>>>> https://github.com/datafusion-contrib/datafusion-c/blob/main/.github/workflows/package.yaml
>>>>>>
>>>>>> I can work on it in apache/arrow-adbc.
>>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>> -- 
>>>>>> kou
>>>>>>
>>>>>> In <5c...@www.fastmail.com>
>>>>>>   "Re: [DISC] Improving Arrow's database support" on Thu, 25 Aug 2022 
>>>>>> 11:51:08 -0400,
>>>>>>   "David Li" <li...@apache.org> wrote:
>>>>>>
>>>>>>> Fair enough, thank you. I'll try to expand a bit. (Sorry for the wall of text that follows…)
>>>>>>> 
>>>>>>> These are the components:
>>>>>>> 
>>>>>>> - Core adbc.h header
>>>>>>> - Driver manager for C/C++
>>>>>>> - Flight SQL-based driver
>>>>>>> - Postgres-based driver (WIP)
>>>>>>> - SQLite-based driver (more of a testbed for me than an actual component - I don't think we'd actually distribute this)
>>>>>>> - Java core interfaces
>>>>>>> - Java driver manager
>>>>>>> - Java JDBC-based driver
>>>>>>> - Java Flight SQL-based driver
>>>>>>> - Python driver manager
>>>>>>> 
>>>>>>> I think: adbc.h gets mirrored into the Arrow repo. The Flight SQL drivers get moved to the main Arrow repo and distributed as part of the regular Arrow releases.
>>>>>>> 
>>>>>>> For the rest of the components: they could be packaged individually, but versioned and released together. Also, each C/C++ driver probably needs a corresponding Python package so Python users do not have to futz with shared library configurations. (See [1].) So for instance, installing PyArrow would also give you the Flight SQL driver, and `pip install adbc_postgres` would get you the Postgres-based driver.
>>>>>>> 
>>>>>>> That would mean setting up separate CI, release, etc. (and eventually linking Crossbow & Conbench as well?). That does mean duplication of effort, but the trade off is avoiding bloating the main release process even further. However, I'd like to hear from those closer to the release process on this subject - if it would make people's lives easier, we could merge everything into one repo/process.
>>>>>>> 
>>>>>>> Integrations would be distributed as part of their respective packages (e.g. Arrow Dataset would optionally link to the driver manager). So the "part of Arrow 10.0.0" aspect means having a stable interface for adbc.h, and getting the Flight SQL drivers into the main repo.
>>>>>>> 
>>>>>>> [1]: https://github.com/apache/arrow-adbc/issues/53
>>>>>>> 
>>>>>>> On Thu, Aug 25, 2022, at 11:34, Antoine Pitrou wrote:
>>>>>>>> On Fri, 19 Aug 2022 14:09:44 -0400
>>>>>>>> "David Li" <li...@apache.org> wrote:
>>>>>>>>> Since it's been a while, I'd like to give an update. There are also a few questions I have around distribution.
>>>>>>>>> 
>>>>>>>>> Currently:
>>>>>>>>> - Supported in C, Java, and Python.
>>>>>>>>> - For C/Python, there are basic drivers wrapping Flight SQL and SQLite, with a draft of a libpq (Postgres) driver (using nanoarrow).
>>>>>>>>> - For Java, there are drivers wrapping JDBC and Flight SQL.
>>>>>>>>> - For Python, there's low-level bindings to the C API, and the DBAPI interface on top of that (+a few extension methods resembling DuckDB/Turbodbc).
>>>>>>>>>  
>>>>>>>>> There's drafts of integration with Ibis [1], DBI (R), and DuckDB. (I'd like to thank Hannes and Kirill for their comments, as well as Antoine, Dewey, and Matt here.)
>>>>>>>>> 
>>>>>>>>> I'd like to have this as part of 10.0.0 in some fashion. However, I'm not sure how we would like to handle packaging and distribution. In particular, there are several sub-components for each language (the driver manager + the drivers), increasing the work. Any thoughts here?
>>>>>>>>
>>>>>>>> Sorry, forgot to answer here. But I think your question is too broadly
>>>>>>>> formulated. It probably deserves a case-by-case discussion, IMHO.
>>>>>>>>
>>>>>>>>> I'm also wondering how we want to handle this in terms of specification - I assume we'd consider the core header file/Java interfaces a spec like the C Data Interface/Flight RPC, and vote on them/mirror them into the format/ directory?
>>>>>>>>
>>>>>>>> That sounds like the right way to me indeed.
>>>>>>>>
>>>>>>>> Regards
>>>>>>>>
>>>>>>>> Antoine.

Re: [DISC] Improving Arrow's database support

Posted by David Li <li...@apache.org>.
Following up here with some specific questions:

Matt Topol added some Go definitions [1] (thanks!) I'd assume we want to vote on those as well?

How should the process work for Java/Go? For C/C++, I assume we'd treat it like the C Data Interface and copy adbc.h to format/ after a vote, and then vote on releases of components. Or do we really only consider the C header as the 'format', with the others being language-specific affordances?

What about for Java and for Go? We could vote on and tag a release for Go, and add a documentation page that links to the Java/Go definitions at a specific revision (as the equivalent 'format' definition for Java/Go)? Or would we vendor the entire Java module/Go package as the 'format'?

Do we have a preference for versioning strategy? Should we proceed in lockstep with the Arrow C++ library et al. and release "ADBC 1.0.0" (the API standard) with "drivers version 10.0.0", or use an independent versioning scheme? (For example, release the API standard and components at "1.0.0". Then further releases of components that do not change the spec would be "1.1", "1.2", ...; if/when we change the spec, start over with "2.0", "2.1", ...)

[1]: https://github.com/apache/arrow-adbc/blob/main/go/adbc/adbc.go

-David

On Sun, Aug 28, 2022, at 10:56, Sutou Kouhei wrote:
> Hi,
>
> OK. I'll send pull requests for GLib and Ruby soon.
>
>> I'm curious if you have a particular use case in mind.
>
> I don't have any production-ready use case yet but I want to
> implement an Active Record adapter for ADBC. Active Record
> is the O/R mapper for Ruby on Rails. Implementing Web
> application by Ruby on Rails is one of major Ruby use
> cases. So providing Active Record interface for ADBC will
> increase Apache Arrow users in Ruby community.
>
> NOTE: Generally, Ruby on Rails users don't process large
> data but they sometimes need to process large (medium?) data
> in a batch process. Active Record adapter for ADBC may be
> useful for such use case.
>
>> There's a little bit more API cleanup to do [1]. If you
>> have comments on that or anything else, I'd appreciate
>> them. Otherwise, pull requests would also be appreciated.
>
> OK. I'll open issues/pull requests when I find
> something. For now, I think that "MODULE" type library
> instead of "SHARED" type library in CMake terminology
> [cmake] is better for driver modules. (I'll open an issue
> for this later.)
>
> [cmake]: https://cmake.org/cmake/help/latest/command/add_library.html
>
>
> Thanks,
> -- 
> kou
>
> In <e6...@www.fastmail.com>
>   "Re: [DISC] Improving Arrow's database support" on Sat, 27 Aug 2022 
> 15:28:56 -0400,
>   "David Li" <li...@apache.org> wrote:
>
>> I would be very happy to see GLib/Ruby bindings! I'm curious if you have a particular use case in mind. 
>> 
>> There's a little bit more API cleanup to do [1]. If you have comments on that or anything else, I'd appreciate them. Otherwise, pull requests would also be appreciated.
>> 
>> [1]: https://github.com/apache/arrow-adbc/issues/79
>> 
>> On Fri, Aug 26, 2022, at 21:53, Sutou Kouhei wrote:
>>> Hi,
>>>
>>> Thanks for sharing the current status!
>>> I understand.
>>>
>>> BTW, can I add GLib/Ruby bindings to apache/arrow-adbc
>>> before we release the first version? (I want to use ADBC
>>> from Ruby.) Or should I wait for the first release? If I can
>>> work on it now, I'll open pull requests for it.
>>>
>>> Thanks,
>>> -- 
>>> kou
>>>
>>> In <87...@www.fastmail.com>
>>>   "Re: [DISC] Improving Arrow's database support" on Fri, 26 Aug 2022 
>>> 11:03:26 -0400,
>>>   "David Li" <li...@apache.org> wrote:
>>>
>>>> Thank you Kou!
>>>> 
>>>> At least initially, I don't think I'll be able to complete the Dataset integration in time. So 10.0.0 probably won't ship with a hard dependency. That said I am hoping to have PyArrow take an optional dependency (so Flight SQL can finally be available from Python).
>>>> 
>>>> On Fri, Aug 26, 2022, at 01:01, Sutou Kouhei wrote:
>>>>> Hi,
>>>>>
>>>>> As a maintainer of Linux packages, I want apache/arrow-adbc
>>>>> to be released before apache/arrow is released so that
>>>>> apache/arrow's .deb/.rpm can depend on apache/arrow-adbc's
>>>>> .deb/.rpm.
>>>>>
>>>>> (If Apache Arrow Dataset uses apache/arrow-adbc,
>>>>> apache/arrow's .deb/.rpm needs to depend on
>>>>> apache/arrow-adbc's .deb/.rpm.)
>>>>>
>>>>> We can add .deb/.rpm related files
>>>>> (dev/tasks/linux-packages/ in apache/arrow) to
>>>>> apache/arrow-adbc to build .deb/.rpm for apache/arrow-adbc.
>>>>>
>>>>> FYI: I did it for datafusion-contrib/datafusion-c:
>>>>>
>>>>> * https://github.com/datafusion-contrib/datafusion-c/tree/main/package
>>>>> * 
>>>>> https://github.com/datafusion-contrib/datafusion-c/blob/main/.github/workflows/package.yaml
>>>>>
>>>>> I can work on it in apache/arrow-adbc.
>>>>>
>>>>>
>>>>> Thanks,
>>>>> -- 
>>>>> kou
>>>>>
>>>>> In <5c...@www.fastmail.com>
>>>>>   "Re: [DISC] Improving Arrow's database support" on Thu, 25 Aug 2022 
>>>>> 11:51:08 -0400,
>>>>>   "David Li" <li...@apache.org> wrote:
>>>>>
>>>>>> Fair enough, thank you. I'll try to expand a bit. (Sorry for the wall of text that follows…)
>>>>>> 
>>>>>> These are the components:
>>>>>> 
>>>>>> - Core adbc.h header
>>>>>> - Driver manager for C/C++
>>>>>> - Flight SQL-based driver
>>>>>> - Postgres-based driver (WIP)
>>>>>> - SQLite-based driver (more of a testbed for me than an actual component - I don't think we'd actually distribute this)
>>>>>> - Java core interfaces
>>>>>> - Java driver manager
>>>>>> - Java JDBC-based driver
>>>>>> - Java Flight SQL-based driver
>>>>>> - Python driver manager
>>>>>> 
>>>>>> I think: adbc.h gets mirrored into the Arrow repo. The Flight SQL drivers get moved to the main Arrow repo and distributed as part of the regular Arrow releases.
>>>>>> 
>>>>>> For the rest of the components: they could be packaged individually, but versioned and released together. Also, each C/C++ driver probably needs a corresponding Python package so Python users do not have to futz with shared library configurations. (See [1].) So for instance, installing PyArrow would also give you the Flight SQL driver, and `pip install adbc_postgres` would get you the Postgres-based driver.
>>>>>> 
>>>>>> That would mean setting up separate CI, release, etc. (and eventually linking Crossbow & Conbench as well?). That does mean duplication of effort, but the trade off is avoiding bloating the main release process even further. However, I'd like to hear from those closer to the release process on this subject - if it would make people's lives easier, we could merge everything into one repo/process.
>>>>>> 
>>>>>> Integrations would be distributed as part of their respective packages (e.g. Arrow Dataset would optionally link to the driver manager). So the "part of Arrow 10.0.0" aspect means having a stable interface for adbc.h, and getting the Flight SQL drivers into the main repo.
>>>>>> 
>>>>>> [1]: https://github.com/apache/arrow-adbc/issues/53
>>>>>> 
>>>>>> On Thu, Aug 25, 2022, at 11:34, Antoine Pitrou wrote:
>>>>>>> On Fri, 19 Aug 2022 14:09:44 -0400
>>>>>>> "David Li" <li...@apache.org> wrote:
>>>>>>>> Since it's been a while, I'd like to give an update. There are also a few questions I have around distribution.
>>>>>>>> 
>>>>>>>> Currently:
>>>>>>>> - Supported in C, Java, and Python.
>>>>>>>> - For C/Python, there are basic drivers wrapping Flight SQL and SQLite, with a draft of a libpq (Postgres) driver (using nanoarrow).
>>>>>>>> - For Java, there are drivers wrapping JDBC and Flight SQL.
>>>>>>>> - For Python, there's low-level bindings to the C API, and the DBAPI interface on top of that (+a few extension methods resembling DuckDB/Turbodbc).
>>>>>>>>  
>>>>>>>> There's drafts of integration with Ibis [1], DBI (R), and DuckDB. (I'd like to thank Hannes and Kirill for their comments, as well as Antoine, Dewey, and Matt here.)
>>>>>>>> 
>>>>>>>> I'd like to have this as part of 10.0.0 in some fashion. However, I'm not sure how we would like to handle packaging and distribution. In particular, there are several sub-components for each language (the driver manager + the drivers), increasing the work. Any thoughts here?
>>>>>>>
>>>>>>> Sorry, forgot to answer here. But I think your question is too broadly
>>>>>>> formulated. It probably deserves a case-by-case discussion, IMHO.
>>>>>>>
>>>>>>>> I'm also wondering how we want to handle this in terms of specification - I assume we'd consider the core header file/Java interfaces a spec like the C Data Interface/Flight RPC, and vote on them/mirror them into the format/ directory?
>>>>>>>
>>>>>>> That sounds like the right way to me indeed.
>>>>>>>
>>>>>>> Regards
>>>>>>>
>>>>>>> Antoine.

Re: [DISC] Improving Arrow's database support

Posted by Sutou Kouhei <ko...@clear-code.com>.
Hi,

OK. I'll send pull requests for GLib and Ruby soon.

> I'm curious if you have a particular use case in mind.

I don't have any production-ready use case yet, but I want to
implement an Active Record adapter for ADBC. Active Record
is the O/R mapper for Ruby on Rails, and building web
applications with Ruby on Rails is one of the major Ruby use
cases. So providing an Active Record interface for ADBC will
increase the number of Apache Arrow users in the Ruby community.

NOTE: Generally, Ruby on Rails users don't process large
data, but they sometimes need to process large (medium?) data
in batch jobs. An Active Record adapter for ADBC may be
useful for such cases.

> There's a little bit more API cleanup to do [1]. If you
> have comments on that or anything else, I'd appreciate
> them. Otherwise, pull requests would also be appreciated.

OK. I'll open issues/pull requests when I find
something. For now, I think that a "MODULE" library
(rather than a "SHARED" library, in CMake terminology
[cmake]) is a better fit for driver modules, since MODULE
libraries are meant to be loaded at runtime rather than
linked against. (I'll open an issue for this later.)

[cmake]: https://cmake.org/cmake/help/latest/command/add_library.html


Thanks,
-- 
kou

In <e6...@www.fastmail.com>
  "Re: [DISC] Improving Arrow's database support" on Sat, 27 Aug 2022 15:28:56 -0400,
  "David Li" <li...@apache.org> wrote:

> I would be very happy to see GLib/Ruby bindings! I'm curious if you have a particular use case in mind. 
> 
> There's a little bit more API cleanup to do [1]. If you have comments on that or anything else, I'd appreciate them. Otherwise, pull requests would also be appreciated.
> 
> [1]: https://github.com/apache/arrow-adbc/issues/79
> 
> On Fri, Aug 26, 2022, at 21:53, Sutou Kouhei wrote:
>> Hi,
>>
>> Thanks for sharing the current status!
>> I understand.
>>
>> BTW, can I add GLib/Ruby bindings to apache/arrow-adbc
>> before we release the first version? (I want to use ADBC
>> from Ruby.) Or should I wait for the first release? If I can
>> work on it now, I'll open pull requests for it.
>>
>> Thanks,
>> -- 
>> kou
>>
>> In <87...@www.fastmail.com>
>>   "Re: [DISC] Improving Arrow's database support" on Fri, 26 Aug 2022 
>> 11:03:26 -0400,
>>   "David Li" <li...@apache.org> wrote:
>>
>>> Thank you Kou!
>>> 
>>> At least initially, I don't think I'll be able to complete the Dataset integration in time. So 10.0.0 probably won't ship with a hard dependency. That said I am hoping to have PyArrow take an optional dependency (so Flight SQL can finally be available from Python).
>>> 
>>> On Fri, Aug 26, 2022, at 01:01, Sutou Kouhei wrote:
>>>> Hi,
>>>>
>>>> As a maintainer of Linux packages, I want apache/arrow-adbc
>>>> to be released before apache/arrow is released so that
>>>> apache/arrow's .deb/.rpm can depend on apache/arrow-adbc's
>>>> .deb/.rpm.
>>>>
>>>> (If Apache Arrow Dataset uses apache/arrow-adbc,
>>>> apache/arrow's .deb/.rpm needs to depend on
>>>> apache/arrow-adbc's .deb/.rpm.)
>>>>
>>>> We can add .deb/.rpm related files
>>>> (dev/tasks/linux-packages/ in apache/arrow) to
>>>> apache/arrow-adbc to build .deb/.rpm for apache/arrow-adbc.
>>>>
>>>> FYI: I did it for datafusion-contrib/datafusion-c:
>>>>
>>>> * https://github.com/datafusion-contrib/datafusion-c/tree/main/package
>>>> * 
>>>> https://github.com/datafusion-contrib/datafusion-c/blob/main/.github/workflows/package.yaml
>>>>
>>>> I can work on it in apache/arrow-adbc.
>>>>
>>>>
>>>> Thanks,
>>>> -- 
>>>> kou
>>>>
>>>> In <5c...@www.fastmail.com>
>>>>   "Re: [DISC] Improving Arrow's database support" on Thu, 25 Aug 2022 
>>>> 11:51:08 -0400,
>>>>   "David Li" <li...@apache.org> wrote:
>>>>
>>>>> Fair enough, thank you. I'll try to expand a bit. (Sorry for the wall of text that follows…)
>>>>> 
>>>>> These are the components:
>>>>> 
>>>>> - Core adbc.h header
>>>>> - Driver manager for C/C++
>>>>> - Flight SQL-based driver
>>>>> - Postgres-based driver (WIP)
>>>>> - SQLite-based driver (more of a testbed for me than an actual component - I don't think we'd actually distribute this)
>>>>> - Java core interfaces
>>>>> - Java driver manager
>>>>> - Java JDBC-based driver
>>>>> - Java Flight SQL-based driver
>>>>> - Python driver manager
>>>>> 
>>>>> I think: adbc.h gets mirrored into the Arrow repo. The Flight SQL drivers get moved to the main Arrow repo and distributed as part of the regular Arrow releases.
>>>>> 
>>>>> For the rest of the components: they could be packaged individually, but versioned and released together. Also, each C/C++ driver probably needs a corresponding Python package so Python users do not have to futz with shared library configurations. (See [1].) So for instance, installing PyArrow would also give you the Flight SQL driver, and `pip install adbc_postgres` would get you the Postgres-based driver.
>>>>> 
>>>>> That would mean setting up separate CI, release, etc. (and eventually linking Crossbow & Conbench as well?). That does mean duplication of effort, but the trade off is avoiding bloating the main release process even further. However, I'd like to hear from those closer to the release process on this subject - if it would make people's lives easier, we could merge everything into one repo/process.
>>>>> 
>>>>> Integrations would be distributed as part of their respective packages (e.g. Arrow Dataset would optionally link to the driver manager). So the "part of Arrow 10.0.0" aspect means having a stable interface for adbc.h, and getting the Flight SQL drivers into the main repo.
>>>>> 
>>>>> [1]: https://github.com/apache/arrow-adbc/issues/53
>>>>> 
>>>>> On Thu, Aug 25, 2022, at 11:34, Antoine Pitrou wrote:
>>>>>> On Fri, 19 Aug 2022 14:09:44 -0400
>>>>>> "David Li" <li...@apache.org> wrote:
>>>>>>> Since it's been a while, I'd like to give an update. There are also a few questions I have around distribution.
>>>>>>> 
>>>>>>> Currently:
>>>>>>> - Supported in C, Java, and Python.
>>>>>>> - For C/Python, there are basic drivers wrapping Flight SQL and SQLite, with a draft of a libpq (Postgres) driver (using nanoarrow).
>>>>>>> - For Java, there are drivers wrapping JDBC and Flight SQL.
>>>>>>> - For Python, there's low-level bindings to the C API, and the DBAPI interface on top of that (+a few extension methods resembling DuckDB/Turbodbc).
>>>>>>>  
>>>>>>> There's drafts of integration with Ibis [1], DBI (R), and DuckDB. (I'd like to thank Hannes and Kirill for their comments, as well as Antoine, Dewey, and Matt here.)
>>>>>>> 
>>>>>>> I'd like to have this as part of 10.0.0 in some fashion. However, I'm not sure how we would like to handle packaging and distribution. In particular, there are several sub-components for each language (the driver manager + the drivers), increasing the work. Any thoughts here?
>>>>>>
>>>>>> Sorry, forgot to answer here. But I think your question is too broadly
>>>>>> formulated. It probably deserves a case-by-case discussion, IMHO.
>>>>>>
>>>>>>> I'm also wondering how we want to handle this in terms of specification - I assume we'd consider the core header file/Java interfaces a spec like the C Data Interface/Flight RPC, and vote on them/mirror them into the format/ directory?
>>>>>>
>>>>>> That sounds like the right way to me indeed.
>>>>>>
>>>>>> Regards
>>>>>>
>>>>>> Antoine.

Re: [DISC] Improving Arrow's database support

Posted by David Li <li...@apache.org>.
I would be very happy to see GLib/Ruby bindings! I'm curious if you have a particular use case in mind. 

There's a little bit more API cleanup to do [1]. If you have comments on that or anything else, I'd appreciate them. Otherwise, pull requests would also be appreciated.

[1]: https://github.com/apache/arrow-adbc/issues/79

On Fri, Aug 26, 2022, at 21:53, Sutou Kouhei wrote:
> Hi,
>
> Thanks for sharing the current status!
> I understand.
>
> BTW, can I add GLib/Ruby bindings to apache/arrow-adbc
> before we release the first version? (I want to use ADBC
> from Ruby.) Or should I wait for the first release? If I can
> work on it now, I'll open pull requests for it.
>
> Thanks,
> -- 
> kou
>
> In <87...@www.fastmail.com>
>   "Re: [DISC] Improving Arrow's database support" on Fri, 26 Aug 2022 
> 11:03:26 -0400,
>   "David Li" <li...@apache.org> wrote:
>
>> Thank you Kou!
>> 
>> At least initially, I don't think I'll be able to complete the Dataset integration in time. So 10.0.0 probably won't ship with a hard dependency. That said I am hoping to have PyArrow take an optional dependency (so Flight SQL can finally be available from Python).
>> 
>> On Fri, Aug 26, 2022, at 01:01, Sutou Kouhei wrote:
>>> Hi,
>>>
>>> As a maintainer of Linux packages, I want apache/arrow-adbc
>>> to be released before apache/arrow is released so that
>>> apache/arrow's .deb/.rpm can depend on apache/arrow-adbc's
>>> .deb/.rpm.
>>>
>>> (If Apache Arrow Dataset uses apache/arrow-adbc,
>>> apache/arrow's .deb/.rpm needs to depend on
>>> apache/arrow-adbc's .deb/.rpm.)
>>>
>>> We can add .deb/.rpm related files
>>> (dev/tasks/linux-packages/ in apache/arrow) to
>>> apache/arrow-adbc to build .deb/.rpm for apache/arrow-adbc.
>>>
>>> FYI: I did it for datafusion-contrib/datafusion-c:
>>>
>>> * https://github.com/datafusion-contrib/datafusion-c/tree/main/package
>>> * 
>>> https://github.com/datafusion-contrib/datafusion-c/blob/main/.github/workflows/package.yaml
>>>
>>> I can work on it in apache/arrow-adbc.
>>>
>>>
>>> Thanks,
>>> -- 
>>> kou
>>>
>>> In <5c...@www.fastmail.com>
>>>   "Re: [DISC] Improving Arrow's database support" on Thu, 25 Aug 2022 
>>> 11:51:08 -0400,
>>>   "David Li" <li...@apache.org> wrote:
>>>
>>>> Fair enough, thank you. I'll try to expand a bit. (Sorry for the wall of text that follows…)
>>>> 
>>>> These are the components:
>>>> 
>>>> - Core adbc.h header
>>>> - Driver manager for C/C++
>>>> - Flight SQL-based driver
>>>> - Postgres-based driver (WIP)
>>>> - SQLite-based driver (more of a testbed for me than an actual component - I don't think we'd actually distribute this)
>>>> - Java core interfaces
>>>> - Java driver manager
>>>> - Java JDBC-based driver
>>>> - Java Flight SQL-based driver
>>>> - Python driver manager
>>>> 
>>>> I think: adbc.h gets mirrored into the Arrow repo. The Flight SQL drivers get moved to the main Arrow repo and distributed as part of the regular Arrow releases.
>>>> 
>>>> For the rest of the components: they could be packaged individually, but versioned and released together. Also, each C/C++ driver probably needs a corresponding Python package so Python users do not have to futz with shared library configurations. (See [1].) So for instance, installing PyArrow would also give you the Flight SQL driver, and `pip install adbc_postgres` would get you the Postgres-based driver.
>>>> 
>>>> That would mean setting up separate CI, release, etc. (and eventually linking Crossbow & Conbench as well?). That does mean duplication of effort, but the trade off is avoiding bloating the main release process even further. However, I'd like to hear from those closer to the release process on this subject - if it would make people's lives easier, we could merge everything into one repo/process.
>>>> 
>>>> Integrations would be distributed as part of their respective packages (e.g. Arrow Dataset would optionally link to the driver manager). So the "part of Arrow 10.0.0" aspect means having a stable interface for adbc.h, and getting the Flight SQL drivers into the main repo.
>>>> 
>>>> [1]: https://github.com/apache/arrow-adbc/issues/53
>>>> 
>>>> On Thu, Aug 25, 2022, at 11:34, Antoine Pitrou wrote:
>>>>> On Fri, 19 Aug 2022 14:09:44 -0400
>>>>> "David Li" <li...@apache.org> wrote:
>>>>>> Since it's been a while, I'd like to give an update. There are also a few questions I have around distribution.
>>>>>> 
>>>>>> Currently:
>>>>>> - Supported in C, Java, and Python.
>>>>>> - For C/Python, there are basic drivers wrapping Flight SQL and SQLite, with a draft of a libpq (Postgres) driver (using nanoarrow).
>>>>>> - For Java, there are drivers wrapping JDBC and Flight SQL.
>>>>>> - For Python, there's low-level bindings to the C API, and the DBAPI interface on top of that (+a few extension methods resembling DuckDB/Turbodbc).
>>>>>>  
>>>>>> There's drafts of integration with Ibis [1], DBI (R), and DuckDB. (I'd like to thank Hannes and Kirill for their comments, as well as Antoine, Dewey, and Matt here.)
>>>>>> 
>>>>>> I'd like to have this as part of 10.0.0 in some fashion. However, I'm not sure how we would like to handle packaging and distribution. In particular, there are several sub-components for each language (the driver manager + the drivers), increasing the work. Any thoughts here?
>>>>>
>>>>> Sorry, forgot to answer here. But I think your question is too broadly
>>>>> formulated. It probably deserves a case-by-case discussion, IMHO.
>>>>>
>>>>>> I'm also wondering how we want to handle this in terms of specification - I assume we'd consider the core header file/Java interfaces a spec like the C Data Interface/Flight RPC, and vote on them/mirror them into the format/ directory?
>>>>>
>>>>> That sounds like the right way to me indeed.
>>>>>
>>>>> Regards
>>>>>
>>>>> Antoine.

Re: [DISC] Improving Arrow's database support

Posted by Sutou Kouhei <ko...@clear-code.com>.
Hi,

Thanks for sharing the current status!
I understand.

BTW, can I add GLib/Ruby bindings to apache/arrow-adbc
before we release the first version? (I want to use ADBC
from Ruby.) Or should I wait for the first release? If I can
work on it now, I'll open pull requests for it.

Thanks,
-- 
kou

In <87...@www.fastmail.com>
  "Re: [DISC] Improving Arrow's database support" on Fri, 26 Aug 2022 11:03:26 -0400,
  "David Li" <li...@apache.org> wrote:

> Thank you Kou!
> 
> At least initially, I don't think I'll be able to complete the Dataset integration in time. So 10.0.0 probably won't ship with a hard dependency. That said I am hoping to have PyArrow take an optional dependency (so Flight SQL can finally be available from Python).
> 
> On Fri, Aug 26, 2022, at 01:01, Sutou Kouhei wrote:
>> Hi,
>>
>> As a maintainer of Linux packages, I want apache/arrow-adbc
>> to be released before apache/arrow is released so that
>> apache/arrow's .deb/.rpm can depend on apache/arrow-adbc's
>> .deb/.rpm.
>>
>> (If Apache Arrow Dataset uses apache/arrow-adbc,
>> apache/arrow's .deb/.rpm needs to depend on
>> apache/arrow-adbc's .deb/.rpm.)
>>
>> We can add .deb/.rpm related files
>> (dev/tasks/linux-packages/ in apache/arrow) to
>> apache/arrow-adbc to build .deb/.rpm for apache/arrow-adbc.
>>
>> FYI: I did it for datafusion-contrib/datafusion-c:
>>
>> * https://github.com/datafusion-contrib/datafusion-c/tree/main/package
>> * 
>> https://github.com/datafusion-contrib/datafusion-c/blob/main/.github/workflows/package.yaml
>>
>> I can work on it in apache/arrow-adbc.
>>
>>
>> Thanks,
>> -- 
>> kou
>>
>> In <5c...@www.fastmail.com>
>>   "Re: [DISC] Improving Arrow's database support" on Thu, 25 Aug 2022 
>> 11:51:08 -0400,
>>   "David Li" <li...@apache.org> wrote:
>>
>>> Fair enough, thank you. I'll try to expand a bit. (Sorry for the wall of text that follows…)
>>> 
>>> These are the components:
>>> 
>>> - Core adbc.h header
>>> - Driver manager for C/C++
>>> - Flight SQL-based driver
>>> - Postgres-based driver (WIP)
>>> - SQLite-based driver (more of a testbed for me than an actual component - I don't think we'd actually distribute this)
>>> - Java core interfaces
>>> - Java driver manager
>>> - Java JDBC-based driver
>>> - Java Flight SQL-based driver
>>> - Python driver manager
>>> 
>>> I think: adbc.h gets mirrored into the Arrow repo. The Flight SQL drivers get moved to the main Arrow repo and distributed as part of the regular Arrow releases.
>>> 
>>> For the rest of the components: they could be packaged individually, but versioned and released together. Also, each C/C++ driver probably needs a corresponding Python package so Python users do not have to futz with shared library configurations. (See [1].) So for instance, installing PyArrow would also give you the Flight SQL driver, and `pip install adbc_postgres` would get you the Postgres-based driver.
>>> 
>>> That would mean setting up separate CI, release, etc. (and eventually linking Crossbow & Conbench as well?). That does mean duplication of effort, but the trade off is avoiding bloating the main release process even further. However, I'd like to hear from those closer to the release process on this subject - if it would make people's lives easier, we could merge everything into one repo/process.
>>> 
>>> Integrations would be distributed as part of their respective packages (e.g. Arrow Dataset would optionally link to the driver manager). So the "part of Arrow 10.0.0" aspect means having a stable interface for adbc.h, and getting the Flight SQL drivers into the main repo.
>>> 
>>> [1]: https://github.com/apache/arrow-adbc/issues/53
>>> 
>>> On Thu, Aug 25, 2022, at 11:34, Antoine Pitrou wrote:
>>>> On Fri, 19 Aug 2022 14:09:44 -0400
>>>> "David Li" <li...@apache.org> wrote:
>>>>> Since it's been a while, I'd like to give an update. There are also a few questions I have around distribution.
>>>>> 
>>>>> Currently:
>>>>> - Supported in C, Java, and Python.
>>>>> - For C/Python, there are basic drivers wrapping Flight SQL and SQLite, with a draft of a libpq (Postgres) driver (using nanoarrow).
>>>>> - For Java, there are drivers wrapping JDBC and Flight SQL.
>>>>> - For Python, there's low-level bindings to the C API, and the DBAPI interface on top of that (+a few extension methods resembling DuckDB/Turbodbc).
>>>>>  
>>>>> There's drafts of integration with Ibis [1], DBI (R), and DuckDB. (I'd like to thank Hannes and Kirill for their comments, as well as Antoine, Dewey, and Matt here.)
>>>>> 
>>>>> I'd like to have this as part of 10.0.0 in some fashion. However, I'm not sure how we would like to handle packaging and distribution. In particular, there are several sub-components for each language (the driver manager + the drivers), increasing the work. Any thoughts here?
>>>>
>>>> Sorry, forgot to answer here. But I think your question is too broadly
>>>> formulated. It probably deserves a case-by-case discussion, IMHO.
>>>>
>>>>> I'm also wondering how we want to handle this in terms of specification - I assume we'd consider the core header file/Java interfaces a spec like the C Data Interface/Flight RPC, and vote on them/mirror them into the format/ directory?
>>>>
>>>> That sounds like the right way to me indeed.
>>>>
>>>> Regards
>>>>
>>>> Antoine.

Re: [DISC] Improving Arrow's database support

Posted by David Li <li...@apache.org>.
Thank you Kou!

At least initially, I don't think I'll be able to complete the Dataset integration in time. So 10.0.0 probably won't ship with a hard dependency. That said, I am hoping to have PyArrow take an optional dependency (so Flight SQL can finally be available from Python).

On Fri, Aug 26, 2022, at 01:01, Sutou Kouhei wrote:
> Hi,
>
> As a maintainer of Linux packages, I want apache/arrow-adbc
> to be released before apache/arrow is released so that
> apache/arrow's .deb/.rpm can depend on apache/arrow-adbc's
> .deb/.rpm.
>
> (If Apache Arrow Dataset uses apache/arrow-adbc,
> apache/arrow's .deb/.rpm needs to depend on
> apache/arrow-adbc's .deb/.rpm.)
>
> We can add .deb/.rpm related files
> (dev/tasks/linux-packages/ in apache/arrow) to
> apache/arrow-adbc to build .deb/.rpm for apache/arrow-adbc.
>
> FYI: I did it for datafusion-contrib/datafusion-c:
>
> * https://github.com/datafusion-contrib/datafusion-c/tree/main/package
> * 
> https://github.com/datafusion-contrib/datafusion-c/blob/main/.github/workflows/package.yaml
>
> I can work on it in apache/arrow-adbc.
>
>
> Thanks,
> -- 
> kou
>
> In <5c...@www.fastmail.com>
>   "Re: [DISC] Improving Arrow's database support" on Thu, 25 Aug 2022 
> 11:51:08 -0400,
>   "David Li" <li...@apache.org> wrote:
>
>> Fair enough, thank you. I'll try to expand a bit. (Sorry for the wall of text that follows…)
>> 
>> These are the components:
>> 
>> - Core adbc.h header
>> - Driver manager for C/C++
>> - Flight SQL-based driver
>> - Postgres-based driver (WIP)
>> - SQLite-based driver (more of a testbed for me than an actual component - I don't think we'd actually distribute this)
>> - Java core interfaces
>> - Java driver manager
>> - Java JDBC-based driver
>> - Java Flight SQL-based driver
>> - Python driver manager
>> 
>> I think: adbc.h gets mirrored into the Arrow repo. The Flight SQL drivers get moved to the main Arrow repo and distributed as part of the regular Arrow releases.
>> 
>> For the rest of the components: they could be packaged individually, but versioned and released together. Also, each C/C++ driver probably needs a corresponding Python package so Python users do not have to futz with shared library configurations. (See [1].) So for instance, installing PyArrow would also give you the Flight SQL driver, and `pip install adbc_postgres` would get you the Postgres-based driver.
>> 
>> That would mean setting up separate CI, release, etc. (and eventually linking Crossbow & Conbench as well?). That does mean duplication of effort, but the trade off is avoiding bloating the main release process even further. However, I'd like to hear from those closer to the release process on this subject - if it would make people's lives easier, we could merge everything into one repo/process.
>> 
>> Integrations would be distributed as part of their respective packages (e.g. Arrow Dataset would optionally link to the driver manager). So the "part of Arrow 10.0.0" aspect means having a stable interface for adbc.h, and getting the Flight SQL drivers into the main repo.
>> 
>> [1]: https://github.com/apache/arrow-adbc/issues/53
>> 
>> On Thu, Aug 25, 2022, at 11:34, Antoine Pitrou wrote:
>>> On Fri, 19 Aug 2022 14:09:44 -0400
>>> "David Li" <li...@apache.org> wrote:
>>>> Since it's been a while, I'd like to give an update. There are also a few questions I have around distribution.
>>>> 
>>>> Currently:
>>>> - Supported in C, Java, and Python.
>>>> - For C/Python, there are basic drivers wrapping Flight SQL and SQLite, with a draft of a libpq (Postgres) driver (using nanoarrow).
>>>> - For Java, there are drivers wrapping JDBC and Flight SQL.
>>>> - For Python, there's low-level bindings to the C API, and the DBAPI interface on top of that (+a few extension methods resembling DuckDB/Turbodbc).
>>>>  
>>>> There's drafts of integration with Ibis [1], DBI (R), and DuckDB. (I'd like to thank Hannes and Kirill for their comments, as well as Antoine, Dewey, and Matt here.)
>>>> 
>>>> I'd like to have this as part of 10.0.0 in some fashion. However, I'm not sure how we would like to handle packaging and distribution. In particular, there are several sub-components for each language (the driver manager + the drivers), increasing the work. Any thoughts here?
>>>
>>> Sorry, forgot to answer here. But I think your question is too broadly
>>> formulated. It probably deserves a case-by-case discussion, IMHO.
>>>
>>>> I'm also wondering how we want to handle this in terms of specification - I assume we'd consider the core header file/Java interfaces a spec like the C Data Interface/Flight RPC, and vote on them/mirror them into the format/ directory?
>>>
>>> That sounds like the right way to me indeed.
>>>
>>> Regards
>>>
>>> Antoine.

Re: [DISC] Improving Arrow's database support

Posted by Sutou Kouhei <ko...@clear-code.com>.
Hi,

As a maintainer of Linux packages, I want apache/arrow-adbc
to be released before apache/arrow is released so that
apache/arrow's .deb/.rpm can depend on apache/arrow-adbc's
.deb/.rpm.

(If Apache Arrow Dataset uses apache/arrow-adbc,
apache/arrow's .deb/.rpm needs to depend on
apache/arrow-adbc's .deb/.rpm.)

We can add .deb/.rpm related files
(dev/tasks/linux-packages/ in apache/arrow) to
apache/arrow-adbc to build .deb/.rpm for apache/arrow-adbc.

FYI: I did it for datafusion-contrib/datafusion-c:

* https://github.com/datafusion-contrib/datafusion-c/tree/main/package
* https://github.com/datafusion-contrib/datafusion-c/blob/main/.github/workflows/package.yaml

I can work on it in apache/arrow-adbc.


Thanks,
-- 
kou

In <5c...@www.fastmail.com>
  "Re: [DISC] Improving Arrow's database support" on Thu, 25 Aug 2022 11:51:08 -0400,
  "David Li" <li...@apache.org> wrote:

> Fair enough, thank you. I'll try to expand a bit. (Sorry for the wall of text that follows…)
> 
> These are the components:
> 
> - Core adbc.h header
> - Driver manager for C/C++
> - Flight SQL-based driver
> - Postgres-based driver (WIP)
> - SQLite-based driver (more of a testbed for me than an actual component - I don't think we'd actually distribute this)
> - Java core interfaces
> - Java driver manager
> - Java JDBC-based driver
> - Java Flight SQL-based driver
> - Python driver manager
> 
> I think: adbc.h gets mirrored into the Arrow repo. The Flight SQL drivers get moved to the main Arrow repo and distributed as part of the regular Arrow releases.
> 
> For the rest of the components: they could be packaged individually, but versioned and released together. Also, each C/C++ driver probably needs a corresponding Python package so Python users do not have to futz with shared library configurations. (See [1].) So for instance, installing PyArrow would also give you the Flight SQL driver, and `pip install adbc_postgres` would get you the Postgres-based driver.
> 
> That would mean setting up separate CI, release, etc. (and eventually linking Crossbow & Conbench as well?). That does mean duplication of effort, but the trade off is avoiding bloating the main release process even further. However, I'd like to hear from those closer to the release process on this subject - if it would make people's lives easier, we could merge everything into one repo/process.
> 
> Integrations would be distributed as part of their respective packages (e.g. Arrow Dataset would optionally link to the driver manager). So the "part of Arrow 10.0.0" aspect means having a stable interface for adbc.h, and getting the Flight SQL drivers into the main repo.
> 
> [1]: https://github.com/apache/arrow-adbc/issues/53
> 
> On Thu, Aug 25, 2022, at 11:34, Antoine Pitrou wrote:
>> On Fri, 19 Aug 2022 14:09:44 -0400
>> "David Li" <li...@apache.org> wrote:
>>> Since it's been a while, I'd like to give an update. There are also a few questions I have around distribution.
>>> 
>>> Currently:
>>> - Supported in C, Java, and Python.
>>> - For C/Python, there are basic drivers wrapping Flight SQL and SQLite, with a draft of a libpq (Postgres) driver (using nanoarrow).
>>> - For Java, there are drivers wrapping JDBC and Flight SQL.
>>> - For Python, there's low-level bindings to the C API, and the DBAPI interface on top of that (+a few extension methods resembling DuckDB/Turbodbc).
>>>  
>>> There's drafts of integration with Ibis [1], DBI (R), and DuckDB. (I'd like to thank Hannes and Kirill for their comments, as well as Antoine, Dewey, and Matt here.)
>>> 
>>> I'd like to have this as part of 10.0.0 in some fashion. However, I'm not sure how we would like to handle packaging and distribution. In particular, there are several sub-components for each language (the driver manager + the drivers), increasing the work. Any thoughts here?
>>
>> Sorry, forgot to answer here. But I think your question is too broadly
>> formulated. It probably deserves a case-by-case discussion, IMHO.
>>
>>> I'm also wondering how we want to handle this in terms of specification - I assume we'd consider the core header file/Java interfaces a spec like the C Data Interface/Flight RPC, and vote on them/mirror them into the format/ directory?
>>
>> That sounds like the right way to me indeed.
>>
>> Regards
>>
>> Antoine.
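
As a side note on the per-driver Python packages mentioned above (e.g.
`pip install adbc_postgres`), here is a minimal sketch of what such a thin
wrapper package might contain; every name and path below is a hypothetical
illustration, not an actual ADBC package:

  import importlib.resources

  def driver_path() -> str:
      """Absolute path of the driver shared library bundled inside the wheel."""
      ref = importlib.resources.files("adbc_postgres") / "libadbc_driver_postgres.so"
      return str(ref)

  def entrypoint() -> str:
      """Name of the driver's C initialization function, for the driver manager."""
      return "AdbcDriverPostgresInit"

A driver manager could then load the returned path at runtime, so users never
have to touch LD_LIBRARY_PATH or other shared library configuration themselves.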

Re: [DISC] Improving Arrow's database support

Posted by David Li <li...@apache.org>.
It currently does do dlopen()/LoadLibrary, but based on how it's being used by Python, I'm going to refactor that out separately so that the main method of usage will be to pass it a pointer to the driver-specific initialization function. It does not have any notion of an internal registry. (And I'm assuming R, etc. will be similar to Python in how they want to use things.)

The Java one, I think, will keep the registry, but there are established interfaces for that sort of thing (ServiceProvider).
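
For concreteness, here is a rough Python/ctypes sketch of the kind of runtime
loading involved; the file names and symbols below are purely illustrative,
not the actual driver manager API:

  import ctypes

  def load_driver(shared_library: str, init_symbol: str):
      """Open a driver at runtime and resolve its initialization function."""
      lib = ctypes.CDLL(shared_library)   # dlopen()/LoadLibrary under the hood
      return getattr(lib, init_symbol)    # dlsym(): the driver-specific entrypoint

  # Illustrative usage only:
  # init = load_driver("libadbc_driver_postgres.so", "AdbcDriverPostgresInit")
  # The driver manager would then call `init` to fill in its table of driver
  # functions, instead of linking against any particular driver at build time.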

On Thu, Aug 25, 2022, at 12:31, Antoine Pitrou wrote:
> Le 25/08/2022 à 18:19, David Li a écrit :
>>> Hmm, what is a driver manager exactly? Does it actually manage drivers
>>> (how so)? Is it more of a core library?
>> 
>> It implements the ADBC API, but dynamically delegates to an actual implementation underneath, so that you do not have to directly link to the driver, or to help deal with using multiple drivers within one application. (It's analogous to the equivalents in JDBC/ODBC.)
>
> Ok, so just to make sure I understand correctly, it doesn't have any 
> notion of internal registry or automated dynamic loading through 
> dlopen() shenanigans? :-)
>
> Regards
>
> Antoine.
>
>
>
>>   I think it would be used more by language bindings; a C/C++ 
>> application could just link to the one driver it needs.
>> 
>> For Python, it implements the bindings to the C APIs by linking against the C++ driver manager. As you noted on [1], it's probably not something people would use directly.
>> 
>> [1]: https://github.com/apache/arrow-adbc/issues/53
>> 
>> On Thu, Aug 25, 2022, at 12:08, Antoine Pitrou wrote:
>>> Le 25/08/2022 à 17:51, David Li a écrit :
>>>> Fair enough, thank you. I'll try to expand a bit. (Sorry for the wall of text that follows…)
>>>>
>>>> These are the components:
>>>>
>>>> - Core adbc.h header
>>>> - Driver manager for C/C++
>>>> - Flight SQL-based driver
>>>> - Postgres-based driver (WIP)
>>>> - SQLite-based driver (more of a testbed for me than an actual component - I don't think we'd actually distribute this)
>>>> - Java core interfaces
>>>> - Java driver manager
>>>> - Java JDBC-based driver
>>>> - Java Flight SQL-based driver
>>>> - Python driver manager
>>>
>>> Hmm, what is a driver manager exactly? Does it actually manage drivers
>>> (how so)? Is it more of a core library?
>>>
>>>> I think: adbc.h gets mirrored into the Arrow repo. The Flight SQL drivers get moved to the main Arrow repo and distributed as part of the regular Arrow releases.
>>>
>>> That sounds reasonable to me.
>>>
>>>> So the "part of Arrow 10.0.0" aspect means having a stable interface
>>>> for adbc.h, and getting the Flight SQL drivers into the main repo.
>>>
>>> Sounds fine as well.
>>>
>>> Regards
>>>
>>> Antoine.

Re: [DISC] Improving Arrow's database support

Posted by Antoine Pitrou <an...@python.org>.
Le 25/08/2022 à 18:19, David Li a écrit :
>> Hmm, what is a driver manager exactly? Does it actually manage drivers
>> (how so)? Is it more of a core library?
> 
> It implements the ADBC API, but dynamically delegates to an actual implementation underneath, so that you do not have to directly link to the driver, or to help deal with using multiple drivers within one application. (It's analogous to the equivalents in JDBC/ODBC.)

Ok, so just to make sure I understand correctly, it doesn't have any 
notion of internal registry or automated dynamic loading through 
dlopen() shenanigans? :-)

Regards

Antoine.



> I think it would be used more by language bindings; a C/C++ 
> application could just link to the one driver it needs.
> 
> For Python, it implements the bindings to the C APIs by linking against the C++ driver manager. As you noted on [1], it's probably not something people would use directly.
> 
> [1]: https://github.com/apache/arrow-adbc/issues/53
> 
> On Thu, Aug 25, 2022, at 12:08, Antoine Pitrou wrote:
>> Le 25/08/2022 à 17:51, David Li a écrit :
>>> Fair enough, thank you. I'll try to expand a bit. (Sorry for the wall of text that follows…)
>>>
>>> These are the components:
>>>
>>> - Core adbc.h header
>>> - Driver manager for C/C++
>>> - Flight SQL-based driver
>>> - Postgres-based driver (WIP)
>>> - SQLite-based driver (more of a testbed for me than an actual component - I don't think we'd actually distribute this)
>>> - Java core interfaces
>>> - Java driver manager
>>> - Java JDBC-based driver
>>> - Java Flight SQL-based driver
>>> - Python driver manager
>>
>> Hmm, what is a driver manager exactly? Does it actually manage drivers
>> (how so)? Is it more of a core library?
>>
>>> I think: adbc.h gets mirrored into the Arrow repo. The Flight SQL drivers get moved to the main Arrow repo and distributed as part of the regular Arrow releases.
>>
>> That sounds reasonable to me.
>>
>>> So the "part of Arrow 10.0.0" aspect means having a stable interface
>>> for adbc.h, and getting the Flight SQL drivers into the main repo.
>>
>> Sounds fine as well.
>>
>> Regards
>>
>> Antoine.

Re: [DISC] Improving Arrow's database support

Posted by David Li <li...@apache.org>.
> Hmm, what is a driver manager exactly? Does it actually manage drivers 
> (how so)? Is it more of a core library?

It implements the ADBC API, but dynamically delegates to an actual implementation underneath, so that you do not have to directly link to the driver, or to help deal with using multiple drivers within one application. (It's analogous to the equivalents in JDBC/ODBC.) I think it would be used more by language bindings; a C/C++ application could just link to the one driver it needs.

For Python, it implements the bindings to the C APIs by linking against the C++ driver manager. As you noted on [1], it's probably not something people would use directly.

[1]: https://github.com/apache/arrow-adbc/issues/53
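
To sketch the delegation part in C (all names illustrative, not the actual header): the manager exposes the ADBC entry points, and each one simply forwards through the function-pointer table filled in by whichever driver was loaded, so an application never has to link a specific driver.

    #include <stdint.h>

    typedef uint8_t AdbcStatusCode;
    struct AdbcError;

    struct AdbcStatement {
      void* private_data;          /* driver-owned state */
      struct AdbcDriver* driver;   /* table this handle belongs to */
    };

    /* Per-driver table of function pointers; only one slot shown here. */
    struct AdbcDriver {
      AdbcStatusCode (*StatementExecute)(struct AdbcStatement*,
                                         struct AdbcError*);
    };

    /* The manager's public entry point just dispatches through the table. */
    AdbcStatusCode AdbcStatementExecute(struct AdbcStatement* stmt,
                                        struct AdbcError* error) {
      return stmt->driver->StatementExecute(stmt, error);
    }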

On Thu, Aug 25, 2022, at 12:08, Antoine Pitrou wrote:
> Le 25/08/2022 à 17:51, David Li a écrit :
>> Fair enough, thank you. I'll try to expand a bit. (Sorry for the wall of text that follows…)
>> 
>> These are the components:
>> 
>> - Core adbc.h header
>> - Driver manager for C/C++
>> - Flight SQL-based driver
>> - Postgres-based driver (WIP)
>> - SQLite-based driver (more of a testbed for me than an actual component - I don't think we'd actually distribute this)
>> - Java core interfaces
>> - Java driver manager
>> - Java JDBC-based driver
>> - Java Flight SQL-based driver
>> - Python driver manager
>
> Hmm, what is a driver manager exactly? Does it actually manage drivers 
> (how so)? Is it more of a core library?
>
>> I think: adbc.h gets mirrored into the Arrow repo. The Flight SQL drivers get moved to the main Arrow repo and distributed as part of the regular Arrow releases.
>
> That sounds reasonable to me.
>
>> So the "part of Arrow 10.0.0" aspect means having a stable interface 
>> for adbc.h, and getting the Flight SQL drivers into the main repo.
>
> Sounds fine as well.
>
> Regards
>
> Antoine.

Re: [DISC] Improving Arrow's database support

Posted by Antoine Pitrou <an...@python.org>.
Le 25/08/2022 à 17:51, David Li a écrit :
> Fair enough, thank you. I'll try to expand a bit. (Sorry for the wall of text that follows…)
> 
> These are the components:
> 
> - Core adbc.h header
> - Driver manager for C/C++
> - Flight SQL-based driver
> - Postgres-based driver (WIP)
> - SQLite-based driver (more of a testbed for me than an actual component - I don't think we'd actually distribute this)
> - Java core interfaces
> - Java driver manager
> - Java JDBC-based driver
> - Java Flight SQL-based driver
> - Python driver manager

Hmm, what is a driver manager exactly? Does it actually manage drivers 
(how so)? Is it more of a core library?

> I think: adbc.h gets mirrored into the Arrow repo. The Flight SQL drivers get moved to the main Arrow repo and distributed as part of the regular Arrow releases.

That sounds reasonable to me.

> So the "part of Arrow 10.0.0" aspect means having a stable interface 
> for adbc.h, and getting the Flight SQL drivers into the main repo.

Sounds fine as well.

Regards

Antoine.

Re: [DISC] Improving Arrow's database support

Posted by David Li <li...@apache.org>.
Fair enough, thank you. I'll try to expand a bit. (Sorry for the wall of text that follows…)

These are the components:

- Core adbc.h header
- Driver manager for C/C++
- Flight SQL-based driver
- Postgres-based driver (WIP)
- SQLite-based driver (more of a testbed for me than an actual component - I don't think we'd actually distribute this)
- Java core interfaces
- Java driver manager
- Java JDBC-based driver
- Java Flight SQL-based driver
- Python driver manager

I think: adbc.h gets mirrored into the Arrow repo. The Flight SQL drivers get moved to the main Arrow repo and distributed as part of the regular Arrow releases.

For the rest of the components: they could be packaged individually, but versioned and released together. Also, each C/C++ driver probably needs a corresponding Python package so Python users do not have to futz with shared library configurations. (See [1].) So for instance, installing PyArrow would also give you the Flight SQL driver, and `pip install adbc_postgres` would get you the Postgres-based driver.

That would mean setting up separate CI, release, etc. (and eventually linking Crossbow & Conbench as well?). That does mean duplication of effort, but the trade-off is that we avoid bloating the main release process even further. However, I'd like to hear from those closer to the release process on this subject - if it would make people's lives easier, we could merge everything into one repo/process.

Integrations would be distributed as part of their respective packages (e.g. Arrow Dataset would optionally link to the driver manager). So the "part of Arrow 10.0.0" aspect means having a stable interface for adbc.h, and getting the Flight SQL drivers into the main repo.

[1]: https://github.com/apache/arrow-adbc/issues/53

On Thu, Aug 25, 2022, at 11:34, Antoine Pitrou wrote:
> On Fri, 19 Aug 2022 14:09:44 -0400
> "David Li" <li...@apache.org> wrote:
>> Since it's been a while, I'd like to give an update. There are also a few questions I have around distribution.
>> 
>> Currently:
>> - Supported in C, Java, and Python.
>> - For C/Python, there are basic drivers wrapping Flight SQL and SQLite, with a draft of a libpq (Postgres) driver (using nanoarrow).
>> - For Java, there are drivers wrapping JDBC and Flight SQL.
>> - For Python, there's low-level bindings to the C API, and the DBAPI interface on top of that (+a few extension methods resembling DuckDB/Turbodbc).
>>  
>> There's drafts of integration with Ibis [1], DBI (R), and DuckDB. (I'd like to thank Hannes and Kirill for their comments, as well as Antoine, Dewey, and Matt here.)
>> 
>> I'd like to have this as part of 10.0.0 in some fashion. However, I'm not sure how we would like to handle packaging and distribution. In particular, there are several sub-components for each language (the driver manager + the drivers), increasing the work. Any thoughts here?
>
> Sorry, forgot to answer here. But I think your question is too broadly
> formulated. It probably deserves a case-by-case discussion, IMHO.
>
>> I'm also wondering how we want to handle this in terms of specification - I assume we'd consider the core header file/Java interfaces a spec like the C Data Interface/Flight RPC, and vote on them/mirror them into the format/ directory?
>
> That sounds like the right way to me indeed.
>
> Regards
>
> Antoine.

Re: [DISC] Improving Arrow's database support

Posted by Antoine Pitrou <an...@python.org>.
On Fri, 19 Aug 2022 14:09:44 -0400
"David Li" <li...@apache.org> wrote:
> Since it's been a while, I'd like to give an update. There are also a few questions I have around distribution.
> 
> Currently:
> - Supported in C, Java, and Python.
> - For C/Python, there are basic drivers wrapping Flight SQL and SQLite, with a draft of a libpq (Postgres) driver (using nanoarrow).
> - For Java, there are drivers wrapping JDBC and Flight SQL.
> - For Python, there's low-level bindings to the C API, and the DBAPI interface on top of that (+a few extension methods resembling DuckDB/Turbodbc).
>  
> There's drafts of integration with Ibis [1], DBI (R), and DuckDB. (I'd like to thank Hannes and Kirill for their comments, as well as Antoine, Dewey, and Matt here.)
> 
> I'd like to have this as part of 10.0.0 in some fashion. However, I'm not sure how we would like to handle packaging and distribution. In particular, there are several sub-components for each language (the driver manager + the drivers), increasing the work. Any thoughts here?

Sorry, forgot to answer here. But I think your question is too broadly
formulated. It probably deserves a case-by-case discussion, IMHO.

> I'm also wondering how we want to handle this in terms of specification - I assume we'd consider the core header file/Java interfaces a spec like the C Data Interface/Flight RPC, and vote on them/mirror them into the format/ directory?

That sounds like the right way to me indeed.

Regards

Antoine.



Re: [DISC] Improving Arrow's database support

Posted by Antoine Pitrou <an...@python.org>.
That definitely makes sense :-)


Le 20/08/2022 à 00:15, David Li a écrit :
> Flight SQL has no bindings for Python, R, etc. and this would give you bindings, albeit indirectly; also, instead of asking users to choose between different APIs, having Flight SQL under ADBC makes the decision easier - servers implement Flight SQL, clients consume ADBC.
> 
> On Fri, Aug 19, 2022, at 17:01, Antoine Pitrou wrote:
>> I see. What is the point of wrapping Flight SQL in ADBC then? Just for
>> consistency with other drivers?
>>
>>
>> Le 19/08/2022 à 23:00, David Li a écrit :
>>> No, sorry: I meant only the API definitions by "C"; everything so far is actually implemented in C++. There's no reason we couldn't port the SQLite driver to pure C with nanoarrow but I've mostly used it as a testbed and not tried to make it a 'real' driver.
>>>
>>> On Fri, Aug 19, 2022, at 16:19, Antoine Pitrou wrote:
>>>> Le 19/08/2022 à 20:09, David Li a écrit :
>>>>> Since it's been a while, I'd like to give an update. There are also a few questions I have around distribution.
>>>>>
>>>>> Currently:
>>>>> - Supported in C, Java, and Python.
>>>>> - For C/Python, there are basic drivers wrapping Flight SQL and SQLite, with a draft of a libpq (Postgres) driver (using nanoarrow).
>>>>> - For Java, there are drivers wrapping JDBC and Flight SQL.
>>>>> - For Python, there's low-level bindings to the C API, and the DBAPI interface on top of that (+a few extension methods resembling DuckDB/Turbodbc).
>>>>
>>>> Did you wrap Flight SQL in pure C?

Re: [DISC] Improving Arrow's database support

Posted by David Li <li...@apache.org>.
Flight SQL has no bindings for Python, R, etc., and this would give you bindings, albeit indirectly. Also, instead of asking users to choose between different APIs, having Flight SQL under ADBC makes the decision easier - servers implement Flight SQL, clients consume ADBC.
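
For example, a client written against the C API might look roughly like this; the function names and option keys are modeled on the draft adbc.h and may change, and error handling is omitted for brevity:

    #include "adbc.h"   /* assumed to provide the declarations used below */

    int main(void) {
      struct AdbcError error = {0};
      struct AdbcDatabase database = {0};
      struct AdbcConnection connection = {0};
      struct AdbcStatement statement = {0};

      AdbcDatabaseNew(&database, &error);
      /* Assumed option keys: pick the Flight SQL driver, point it at a server. */
      AdbcDatabaseSetOption(&database, "driver", "adbc_driver_flightsql", &error);
      AdbcDatabaseSetOption(&database, "uri", "grpc://localhost:31337", &error);
      AdbcDatabaseInit(&database, &error);

      AdbcConnectionNew(&connection, &error);
      AdbcConnectionInit(&connection, &database, &error);

      AdbcStatementNew(&connection, &statement, &error);
      AdbcStatementSetSqlQuery(&statement, "SELECT 1", &error);
      /* Results come back as an ArrowArrayStream, consumable by whatever
       * Arrow implementation the application already uses. */

      AdbcStatementRelease(&statement, &error);
      AdbcConnectionRelease(&connection, &error);
      AdbcDatabaseRelease(&database, &error);
      return 0;
    }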

On Fri, Aug 19, 2022, at 17:01, Antoine Pitrou wrote:
> I see. What is the point of wrapping Flight SQL in ADBC then? Just for 
> consistency with other drivers?
>
>
> Le 19/08/2022 à 23:00, David Li a écrit :
>> No, sorry: I meant only the API definitions by "C"; everything so far is actually implemented in C++. There's no reason we couldn't port the SQLite driver to pure C with nanoarrow but I've mostly used it as a testbed and not tried to make it a 'real' driver.
>> 
>> On Fri, Aug 19, 2022, at 16:19, Antoine Pitrou wrote:
>>> Le 19/08/2022 à 20:09, David Li a écrit :
>>>> Since it's been a while, I'd like to give an update. There are also a few questions I have around distribution.
>>>>
>>>> Currently:
>>>> - Supported in C, Java, and Python.
>>>> - For C/Python, there are basic drivers wrapping Flight SQL and SQLite, with a draft of a libpq (Postgres) driver (using nanoarrow).
>>>> - For Java, there are drivers wrapping JDBC and Flight SQL.
>>>> - For Python, there's low-level bindings to the C API, and the DBAPI interface on top of that (+a few extension methods resembling DuckDB/Turbodbc).
>>>
>>> Did you wrap Flight SQL in pure C?

Re: [DISC] Improving Arrow's database support

Posted by Antoine Pitrou <an...@python.org>.
I see. What is the point of wrapping Flight SQL in ADBC then? Just for 
consistency with other drivers?


Le 19/08/2022 à 23:00, David Li a écrit :
> No, sorry: I meant only the API definitions by "C"; everything so far is actually implemented in C++. There's no reason we couldn't port the SQLite driver to pure C with nanoarrow but I've mostly used it as a testbed and not tried to make it a 'real' driver.
> 
> On Fri, Aug 19, 2022, at 16:19, Antoine Pitrou wrote:
>> Le 19/08/2022 à 20:09, David Li a écrit :
>>> Since it's been a while, I'd like to give an update. There are also a few questions I have around distribution.
>>>
>>> Currently:
>>> - Supported in C, Java, and Python.
>>> - For C/Python, there are basic drivers wrapping Flight SQL and SQLite, with a draft of a libpq (Postgres) driver (using nanoarrow).
>>> - For Java, there are drivers wrapping JDBC and Flight SQL.
>>> - For Python, there's low-level bindings to the C API, and the DBAPI interface on top of that (+a few extension methods resembling DuckDB/Turbodbc).
>>
>> Did you wrap Flight SQL in pure C?

Re: [DISC] Improving Arrow's database support

Posted by David Li <li...@apache.org>.
No, sorry: I meant only the API definitions by "C"; everything so far is actually implemented in C++. There's no reason we couldn't port the SQLite driver to pure C with nanoarrow, but I've mostly used it as a testbed and not tried to make it a 'real' driver.
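
In practice that means the header stays plain C with extern "C" guards, so the implementation behind it can be C++ (or anything else with a C ABI). A trimmed-down, purely illustrative sketch:

    #ifndef ADBC_SKETCH_H
    #define ADBC_SKETCH_H

    #include <stdint.h>

    #ifdef __cplusplus
    extern "C" {
    #endif

    typedef uint8_t AdbcStatusCode;

    struct AdbcError {
      char* message;
      void (*release)(struct AdbcError* error);
    };

    /* A driver implemented in C++ would define this symbol in a .cc file,
     * wrapped in extern "C" so the exported ABI stays plain C. */
    AdbcStatusCode SqliteDriverInit(int version, void* driver,
                                    struct AdbcError* error);

    #ifdef __cplusplus
    }  /* extern "C" */
    #endif

    #endif  /* ADBC_SKETCH_H */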

On Fri, Aug 19, 2022, at 16:19, Antoine Pitrou wrote:
> Le 19/08/2022 à 20:09, David Li a écrit :
>> Since it's been a while, I'd like to give an update. There are also a few questions I have around distribution.
>> 
>> Currently:
>> - Supported in C, Java, and Python.
>> - For C/Python, there are basic drivers wrapping Flight SQL and SQLite, with a draft of a libpq (Postgres) driver (using nanoarrow).
>> - For Java, there are drivers wrapping JDBC and Flight SQL.
>> - For Python, there's low-level bindings to the C API, and the DBAPI interface on top of that (+a few extension methods resembling DuckDB/Turbodbc).
>
> Did you wrap Flight SQL in pure C?

Re: [DISC] Improving Arrow's database support

Posted by Antoine Pitrou <an...@python.org>.
Le 19/08/2022 à 20:09, David Li a écrit :
> Since it's been a while, I'd like to give an update. There are also a few questions I have around distribution.
> 
> Currently:
> - Supported in C, Java, and Python.
> - For C/Python, there are basic drivers wrapping Flight SQL and SQLite, with a draft of a libpq (Postgres) driver (using nanoarrow).
> - For Java, there are drivers wrapping JDBC and Flight SQL.
> - For Python, there's low-level bindings to the C API, and the DBAPI interface on top of that (+a few extension methods resembling DuckDB/Turbodbc).

Did you wrap Flight SQL in pure C?

Re: [DISC] Improving Arrow's database support

Posted by David Li <li...@apache.org>.
Since it's been a while, I'd like to give an update. There are also a few questions I have around distribution.

Currently:
- Supported in C, Java, and Python.
- For C/Python, there are basic drivers wrapping Flight SQL and SQLite, with a draft of a libpq (Postgres) driver (using nanoarrow).
- For Java, there are drivers wrapping JDBC and Flight SQL.
- For Python, there are low-level bindings to the C API, and the DBAPI interface on top of that (+a few extension methods resembling DuckDB/Turbodbc).
 
There are draft integrations with Ibis [1], DBI (R), and DuckDB. (I'd like to thank Hannes and Kirill for their comments, as well as Antoine, Dewey, and Matt here.)

I'd like to have this as part of 10.0.0 in some fashion. However, I'm not sure how we would like to handle packaging and distribution. In particular, there are several sub-components for each language (the driver manager + the drivers), increasing the work. Any thoughts here?

I'm also wondering how we want to handle this in terms of specification - I assume we'd consider the core header file/Java interfaces a spec like the C Data Interface/Flight RPC, and vote on them/mirror them into the format/ directory?

I'm hoping that longer term, most of the drivers would be maintained outside the community, and we would just distribute the driver managers and 'core' drivers (Flight SQL, probably Acero and JDBC/ODBC wrappers). There's also a lot of potential follow-up work, including integration into more systems (e.g. Arrow Dataset/Acero, pandas.read_sql, Spark DataSourceV2), more drivers (e.g. recycling pgeon/pg2arrow, Turbodbc, and/or the Arrow Hiveserver client; FreeTDS/SQL Server; BigQuery Storage; etc.); setting up benchmarks and integration tests, etc.

[1]: https://github.com/ibis-project/ibis/pull/4267

-David

On Wed, Jun 1, 2022, at 17:52, David Li wrote:
> I've set up the new repo and enabled issues. I still need to get things 
> building independently of Arrow, but now adbc.h is self-contained and 
> the "driver manager" being prototyped can also be built and used 
> independently of Arrow.
>
> On Wed, Jun 1, 2022, at 13:55, David Li wrote:
>> Wes: thanks! I'll move things over and update the list.
>>
>> Gavin: I mean more that ADBC won't support every little feature in 
>> JDBC/ODBC, or won't necessarily make it easy to support certain things 
>> (e.g. updating a single row in a ResultSet). But it's not that OLTP is 
>> taboo, it's just not what is being optimized for. 
>>
>> For instance it would be nice to eventually have JDBC/ODBC drivers that 
>> can wrap ADBC in much the same way that Dremio is working on a JDBC 
>> driver for Flight SQL. But especially in the near term, ADBC just won't 
>> have the feature set to make that possible.
>>
>> What sorts of use cases were you thinking about, though?
>>
>> On Wed, Jun 1, 2022, at 13:18, Gavin Ray wrote:
>>> This sounds great, but I had one question:
>>>
>>> Read the initial ADBC proposal and it mentioned that OLTP was not a
>>> targeted usecase
>>> If this work is intended to take on the role of a sort of standard ABI/SDK,
>>> does that mean that building OLTP-oriented drivers/tooling with it is off
>>> the table?
>>>
>>> On Wed, Jun 1, 2022 at 11:11 AM Wes McKinney <we...@gmail.com> wrote:
>>>
>>>> I went ahead and created
>>>>
>>>> https://github.com/apache/arrow-adbc
>>>>
>>>> I directed issue comments / PRs to issues@
>>>>
>>>> On Tue, May 31, 2022 at 8:49 PM Wes McKinney <we...@gmail.com> wrote:
>>>> >
>>>> > I think spinning up a new repository while this exploratory work
>>>> > progresses is a fine idea — perhaps apache/arrow-dbc / arrow-adbc or
>>>> > similar (the name can always be changed later). That would bubble up
>>>> > discussions in a way that's easier for people to follow (watching your
>>>> > fork isn't ideal!). If it makes sense to move code later, it can
>>>> > always be moved.
>>>> >
>>>> >
>>>> > On Tue, May 31, 2022 at 1:02 PM David Li <li...@apache.org> wrote:
>>>> > >
>>>> > > Some updates:
>>>> > >
>>>> > > The proposal is being updated based on feedback from contributors to
>>>> DuckDB and DBI. We've been using GitHub issues on the fork to discuss the
>>>> API design and how to implement data ingestion/bound parameters:
>>>> https://github.com/lidavidm/arrow/issues
>>>> > >
>>>> > > If anyone has suggestions/ideas/questions, or would like to jump in as
>>>> well, please feel free to chime in there too.
>>>> > >
>>>> > > I have also been wondering if we might want to plan to split off a new
>>>> repo for this work? In particular, some components might be easiest to
>>>> consume if they didn't also have a hard dependency on the Arrow C++
>>>> libraries. And we could use the repo to manage contributed drivers (some of
>>>> which may individually leverage the Arrow libraries). Of course,
>>>> maintaining a parallel build system, setting up releases, etc. is also a
>>>> lot of work.
>>>> > >
>>>> > > -David
>>>> > >
>>>> > > On Tue, Apr 26, 2022, at 15:01, Wes McKinney wrote:
>>>> > > > I don't have major new things to add on this topic except that I've
>>>> > > > long had the aspiration of creating something like Python's DBAPI 2.0
>>>> > > > [1] at the C or C++ level to enable a measure of API standardization
>>>> > > > for Arrow-native read/write interfaces with database drivers. It
>>>> seems
>>>> > > > like a natural complement to the wire-protocol standardization work
>>>> > > > with FlightSQL. I had previously brought in some code that I had
>>>> > > > worked on related to interfacing with the HiveServer2 wire protocol
>>>> > > > (for Hive and Impala, or other HS2-compatible query engines) with the
>>>> > > > intention of prototyping but never was able to find the time.
>>>> > > >
>>>> > > > From an external messaging standpoint, one thing that will be
>>>> > > > important is to assert that this is not intended to displace or
>>>> > > > deprecate ODBC or JDBC drivers. In fact, I would hope that the
>>>> > > > Arrow-native APIs could be added somehow to existing driver libraries
>>>> > > > where it made sense, so that if they are used in an application that
>>>> > > > uses Arrow, they can opt in to using the Arrow-based APIs for getting
>>>> > > > result sets, or doing bulk inserts, etc.
>>>> > > >
>>>> > > > [1]: https://peps.python.org/pep-0249/
>>>> > > >
>>>> > > > On Tue, Apr 26, 2022 at 12:36 PM Antoine Pitrou <an...@python.org>
>>>> wrote:
>>>> > > >>
>>>> > > >>
>>>> > > >> Do we want something more flexible than dlopen() and runtime symbol
>>>> > > >> lookup (a mechanism which constrains the way you can organize and
>>>> > > >> distribute drivers)?
>>>> > > >>
>>>> > > >> For example, perhaps we could expose an API struct of function
>>>> pointers
>>>> > > >> that could be obtained through driver-specific means.
>>>> > > >>
>>>> > > >>
>>>> > > >> Le 26/04/2022 à 18:29, David Li a écrit :
>>>> > > >> > Hello,
>>>> > > >> >
>>>> > > >> > In light of recent efforts around Flight SQL, projects like pgeon
>>>> [1], and long-standing tickets/discussions about database support in Arrow
>>>> [2], it seems there's an opportunity to define standard database interfaces
>>>> for Arrow that could unify these efforts. So we've put together a proposal
>>>> for "ADBC", a common Arrow-based database client API:
>>>> > > >> >
>>>> > > >> >
>>>> https://docs.google.com/document/d/1t7NrC76SyxL_OffATmjzZs2xcj1owdUsIF2WKL_Zw1U/edit#heading=h.r6o6j2navi4c
>>>> > > >> >
>>>> > > >> > A common API and implementations could help combine/simplify
>>>> client-side projects like pgeon, or what DBI is considering [3], and help
>>>> them take advantage of developments like Flight SQL and existing columnar
>>>> APIs.
>>>> > > >> >
>>>> > > >> > We'd appreciate any feedback. (Comments should be open, please
>>>> let me know if not.)
>>>> > > >> >
>>>> > > >> > [1]: https://github.com/0x0L/pgeon
>>>> > > >> > [2]: https://issues.apache.org/jira/browse/ARROW-11670
>>>> > > >> > [3]: https://github.com/r-dbi/dbi3/issues/48
>>>> > > >> >
>>>> > > >> > Thanks,
>>>> > > >> > David
>>>>

Re: [DISC] Improving Arrow's database support

Posted by David Li <li...@apache.org>.
I've set up the new repo and enabled issues. I still need to get things building independently of Arrow, but now adbc.h is self-contained and the "driver manager" being prototyped can also be built and used independently of Arrow.

On Wed, Jun 1, 2022, at 13:55, David Li wrote:
> Wes: thanks! I'll move things over and update the list.
>
> Gavin: I mean more that ADBC won't support every little feature in 
> JDBC/ODBC, or won't necessarily make it easy to support certain things 
> (e.g. updating a single row in a ResultSet). But it's not that OLTP is 
> taboo, it's just not what is being optimized for. 
>
> For instance it would be nice to eventually have JDBC/ODBC drivers that 
> can wrap ADBC in much the same way that Dremio is working on a JDBC 
> driver for Flight SQL. But especially in the near term, ADBC just won't 
> have the feature set to make that possible.
>
> What sorts of use cases were you thinking about, though?
>
> On Wed, Jun 1, 2022, at 13:18, Gavin Ray wrote:
>> This sounds great, but I had one question:
>>
>> Read the initial ADBC proposal and it mentioned that OLTP was not a
>> targeted usecase
>> If this work is intended to take on the role of a sort of standard ABI/SDK,
>> does that mean that building OLTP-oriented drivers/tooling with it is off
>> the table?
>>
>> On Wed, Jun 1, 2022 at 11:11 AM Wes McKinney <we...@gmail.com> wrote:
>>
>>> I went ahead and created
>>>
>>> https://github.com/apache/arrow-adbc
>>>
>>> I directed issue comments / PRs to issues@
>>>
>>> On Tue, May 31, 2022 at 8:49 PM Wes McKinney <we...@gmail.com> wrote:
>>> >
>>> > I think spinning up a new repository while this exploratory work
>>> > progresses is a fine idea — perhaps apache/arrow-dbc / arrow-adbc or
>>> > similar (the name can always be changed later). That would bubble up
>>> > discussions in a way that's easier for people to follow (watching your
>>> > fork isn't ideal!). If it makes sense to move code later, it can
>>> > always be moved.
>>> >
>>> >
>>> > On Tue, May 31, 2022 at 1:02 PM David Li <li...@apache.org> wrote:
>>> > >
>>> > > Some updates:
>>> > >
>>> > > The proposal is being updated based on feedback from contributors to
>>> DuckDB and DBI. We've been using GitHub issues on the fork to discuss the
>>> API design and how to implement data ingestion/bound parameters:
>>> https://github.com/lidavidm/arrow/issues
>>> > >
>>> > > If anyone has suggestions/ideas/questions, or would like to jump in as
>>> well, please feel free to chime in there too.
>>> > >
>>> > > I have also been wondering if we might want to plan to split off a new
>>> repo for this work? In particular, some components might be easiest to
>>> consume if they didn't also have a hard dependency on the Arrow C++
>>> libraries. And we could use the repo to manage contributed drivers (some of
>>> which may individually leverage the Arrow libraries). Of course,
>>> maintaining a parallel build system, setting up releases, etc. is also a
>>> lot of work.
>>> > >
>>> > > -David
>>> > >
>>> > > On Tue, Apr 26, 2022, at 15:01, Wes McKinney wrote:
>>> > > > I don't have major new things to add on this topic except that I've
>>> > > > long had the aspiration of creating something like Python's DBAPI 2.0
>>> > > > [1] at the C or C++ level to enable a measure of API standardization
>>> > > > for Arrow-native read/write interfaces with database drivers. It
>>> seems
>>> > > > like a natural complement to the wire-protocol standardization work
>>> > > > with FlightSQL. I had previously brought in some code that I had
>>> > > > worked on related to interfacing with the HiveServer2 wire protocol
>>> > > > (for Hive and Impala, or other HS2-compatible query engines) with the
>>> > > > intention of prototyping but never was able to find the time.
>>> > > >
>>> > > > From an external messaging standpoint, one thing that will be
>>> > > > important is to assert that this is not intended to displace or
>>> > > > deprecate ODBC or JDBC drivers. In fact, I would hope that the
>>> > > > Arrow-native APIs could be added somehow to existing driver libraries
>>> > > > where it made sense, so that if they are used in an application that
>>> > > > uses Arrow, they can opt in to using the Arrow-based APIs for getting
>>> > > > result sets, or doing bulk inserts, etc.
>>> > > >
>>> > > > [1]: https://peps.python.org/pep-0249/
>>> > > >
>>> > > > On Tue, Apr 26, 2022 at 12:36 PM Antoine Pitrou <an...@python.org>
>>> wrote:
>>> > > >>
>>> > > >>
>>> > > >> Do we want something more flexible than dlopen() and runtime symbol
>>> > > >> lookup (a mechanism which constrains the way you can organize and
>>> > > >> distribute drivers)?
>>> > > >>
>>> > > >> For example, perhaps we could expose an API struct of function
>>> pointers
>>> > > >> that could be obtained through driver-specific means.
>>> > > >>
>>> > > >>
>>> > > >> Le 26/04/2022 à 18:29, David Li a écrit :
>>> > > >> > Hello,
>>> > > >> >
>>> > > >> > In light of recent efforts around Flight SQL, projects like pgeon
>>> [1], and long-standing tickets/discussions about database support in Arrow
>>> [2], it seems there's an opportunity to define standard database interfaces
>>> for Arrow that could unify these efforts. So we've put together a proposal
>>> for "ADBC", a common Arrow-based database client API:
>>> > > >> >
>>> > > >> >
>>> https://docs.google.com/document/d/1t7NrC76SyxL_OffATmjzZs2xcj1owdUsIF2WKL_Zw1U/edit#heading=h.r6o6j2navi4c
>>> > > >> >
>>> > > >> > A common API and implementations could help combine/simplify
>>> client-side projects like pgeon, or what DBI is considering [3], and help
>>> them take advantage of developments like Flight SQL and existing columnar
>>> APIs.
>>> > > >> >
>>> > > >> > We'd appreciate any feedback. (Comments should be open, please
>>> let me know if not.)
>>> > > >> >
>>> > > >> > [1]: https://github.com/0x0L/pgeon
>>> > > >> > [2]: https://issues.apache.org/jira/browse/ARROW-11670
>>> > > >> > [3]: https://github.com/r-dbi/dbi3/issues/48
>>> > > >> >
>>> > > >> > Thanks,
>>> > > >> > David
>>>

Re: [DISC] Improving Arrow's database support

Posted by David Li <li...@apache.org>.
Wes: thanks! I'll move things over and update the list.

Gavin: I mean more that ADBC won't support every little feature in JDBC/ODBC, or won't necessarily make it easy to support certain things (e.g. updating a single row in a ResultSet). But it's not that OLTP is taboo; it's just not what is being optimized for.

For instance it would be nice to eventually have JDBC/ODBC drivers that can wrap ADBC in much the same way that Dremio is working on a JDBC driver for Flight SQL. But especially in the near term, ADBC just won't have the feature set to make that possible.

What sorts of use cases were you thinking about, though?

On Wed, Jun 1, 2022, at 13:18, Gavin Ray wrote:
> This sounds great, but I had one question:
>
> Read the initial ADBC proposal and it mentioned that OLTP was not a
> targeted usecase
> If this work is intended to take on the role of a sort of standard ABI/SDK,
> does that mean that building OLTP-oriented drivers/tooling with it is off
> the table?
>
> On Wed, Jun 1, 2022 at 11:11 AM Wes McKinney <we...@gmail.com> wrote:
>
>> I went ahead and created
>>
>> https://github.com/apache/arrow-adbc
>>
>> I directed issue comments / PRs to issues@
>>
>> On Tue, May 31, 2022 at 8:49 PM Wes McKinney <we...@gmail.com> wrote:
>> >
>> > I think spinning up a new repository while this exploratory work
>> > progresses is a fine idea — perhaps apache/arrow-dbc / arrow-adbc or
>> > similar (the name can always be changed later). That would bubble up
>> > discussions in a way that's easier for people to follow (watching your
>> > fork isn't ideal!). If it makes sense to move code later, it can
>> > always be moved.
>> >
>> >
>> > On Tue, May 31, 2022 at 1:02 PM David Li <li...@apache.org> wrote:
>> > >
>> > > Some updates:
>> > >
>> > > The proposal is being updated based on feedback from contributors to
>> DuckDB and DBI. We've been using GitHub issues on the fork to discuss the
>> API design and how to implement data ingestion/bound parameters:
>> https://github.com/lidavidm/arrow/issues
>> > >
>> > > If anyone has suggestions/ideas/questions, or would like to jump in as
>> well, please feel free to chime in there too.
>> > >
>> > > I have also been wondering if we might want to plan to split off a new
>> repo for this work? In particular, some components might be easiest to
>> consume if they didn't also have a hard dependency on the Arrow C++
>> libraries. And we could use the repo to manage contributed drivers (some of
>> which may individually leverage the Arrow libraries). Of course,
>> maintaining a parallel build system, setting up releases, etc. is also a
>> lot of work.
>> > >
>> > > -David
>> > >
>> > > On Tue, Apr 26, 2022, at 15:01, Wes McKinney wrote:
>> > > > I don't have major new things to add on this topic except that I've
>> > > > long had the aspiration of creating something like Python's DBAPI 2.0
>> > > > [1] at the C or C++ level to enable a measure of API standardization
>> > > > for Arrow-native read/write interfaces with database drivers. It
>> seems
>> > > > like a natural complement to the wire-protocol standardization work
>> > > > with FlightSQL. I had previously brought in some code that I had
>> > > > worked on related to interfacing with the HiveServer2 wire protocol
>> > > > (for Hive and Impala, or other HS2-compatible query engines) with the
>> > > > intention of prototyping but never was able to find the time.
>> > > >
>> > > > From an external messaging standpoint, one thing that will be
>> > > > important is to assert that this is not intended to displace or
>> > > > deprecate ODBC or JDBC drivers. In fact, I would hope that the
>> > > > Arrow-native APIs could be added somehow to existing driver libraries
>> > > > where it made sense, so that if they are used in an application that
>> > > > uses Arrow, they can opt in to using the Arrow-based APIs for getting
>> > > > result sets, or doing bulk inserts, etc.
>> > > >
>> > > > [1]: https://peps.python.org/pep-0249/
>> > > >
>> > > > On Tue, Apr 26, 2022 at 12:36 PM Antoine Pitrou <an...@python.org>
>> wrote:
>> > > >>
>> > > >>
>> > > >> Do we want something more flexible than dlopen() and runtime symbol
>> > > >> lookup (a mechanism which constrains the way you can organize and
>> > > >> distribute drivers)?
>> > > >>
>> > > >> For example, perhaps we could expose an API struct of function
>> pointers
>> > > >> that could be obtained through driver-specific means.
>> > > >>
>> > > >>
>> > > >> Le 26/04/2022 à 18:29, David Li a écrit :
>> > > >> > Hello,
>> > > >> >
>> > > >> > In light of recent efforts around Flight SQL, projects like pgeon
>> [1], and long-standing tickets/discussions about database support in Arrow
>> [2], it seems there's an opportunity to define standard database interfaces
>> for Arrow that could unify these efforts. So we've put together a proposal
>> for "ADBC", a common Arrow-based database client API:
>> > > >> >
>> > > >> >
>> https://docs.google.com/document/d/1t7NrC76SyxL_OffATmjzZs2xcj1owdUsIF2WKL_Zw1U/edit#heading=h.r6o6j2navi4c
>> > > >> >
>> > > >> > A common API and implementations could help combine/simplify
>> client-side projects like pgeon, or what DBI is considering [3], and help
>> them take advantage of developments like Flight SQL and existing columnar
>> APIs.
>> > > >> >
>> > > >> > We'd appreciate any feedback. (Comments should be open, please
>> let me know if not.)
>> > > >> >
>> > > >> > [1]: https://github.com/0x0L/pgeon
>> > > >> > [2]: https://issues.apache.org/jira/browse/ARROW-11670
>> > > >> > [3]: https://github.com/r-dbi/dbi3/issues/48
>> > > >> >
>> > > >> > Thanks,
>> > > >> > David
>>

Re: [DISC] Improving Arrow's database support

Posted by Gavin Ray <ra...@gmail.com>.
This sounds great, but I had one question:

I read the initial ADBC proposal, and it mentioned that OLTP was not a
targeted use case.
If this work is intended to take on the role of a sort of standard ABI/SDK,
does that mean that building OLTP-oriented drivers/tooling with it is off
the table?

On Wed, Jun 1, 2022 at 11:11 AM Wes McKinney <we...@gmail.com> wrote:

> I went ahead and created
>
> https://github.com/apache/arrow-adbc
>
> I directed issue comments / PRs to issues@
>
> On Tue, May 31, 2022 at 8:49 PM Wes McKinney <we...@gmail.com> wrote:
> >
> > I think spinning up a new repository while this exploratory work
> > progresses is a fine idea — perhaps apache/arrow-dbc / arrow-adbc or
> > similar (the name can always be changed later). That would bubble up
> > discussions in a way that's easier for people to follow (watching your
> > fork isn't ideal!). If it makes sense to move code later, it can
> > always be moved.
> >
> >
> > On Tue, May 31, 2022 at 1:02 PM David Li <li...@apache.org> wrote:
> > >
> > > Some updates:
> > >
> > > The proposal is being updated based on feedback from contributors to
> DuckDB and DBI. We've been using GitHub issues on the fork to discuss the
> API design and how to implement data ingestion/bound parameters:
> https://github.com/lidavidm/arrow/issues
> > >
> > > If anyone has suggestions/ideas/questions, or would like to jump in as
> well, please feel free to chime in there too.
> > >
> > > I have also been wondering if we might want to plan to split off a new
> repo for this work? In particular, some components might be easiest to
> consume if they didn't also have a hard dependency on the Arrow C++
> libraries. And we could use the repo to manage contributed drivers (some of
> which may individually leverage the Arrow libraries). Of course,
> maintaining a parallel build system, setting up releases, etc. is also a
> lot of work.
> > >
> > > -David
> > >
> > > On Tue, Apr 26, 2022, at 15:01, Wes McKinney wrote:
> > > > I don't have major new things to add on this topic except that I've
> > > > long had the aspiration of creating something like Python's DBAPI 2.0
> > > > [1] at the C or C++ level to enable a measure of API standardization
> > > > for Arrow-native read/write interfaces with database drivers. It
> seems
> > > > like a natural complement to the wire-protocol standardization work
> > > > with FlightSQL. I had previously brought in some code that I had
> > > > worked on related to interfacing with the HiveServer2 wire protocol
> > > > (for Hive and Impala, or other HS2-compatible query engines) with the
> > > > intention of prototyping but never was able to find the time.
> > > >
> > > > From an external messaging standpoint, one thing that will be
> > > > important is to assert that this is not intended to displace or
> > > > deprecate ODBC or JDBC drivers. In fact, I would hope that the
> > > > Arrow-native APIs could be added somehow to existing driver libraries
> > > > where it made sense, so that if they are used in an application that
> > > > uses Arrow, they can opt in to using the Arrow-based APIs for getting
> > > > result sets, or doing bulk inserts, etc.
> > > >
> > > > [1]: https://peps.python.org/pep-0249/
> > > >
> > > > On Tue, Apr 26, 2022 at 12:36 PM Antoine Pitrou <an...@python.org>
> wrote:
> > > >>
> > > >>
> > > >> Do we want something more flexible than dlopen() and runtime symbol
> > > >> lookup (a mechanism which constrains the way you can organize and
> > > >> distribute drivers)?
> > > >>
> > > >> For example, perhaps we could expose an API struct of function
> pointers
> > > >> that could be obtained through driver-specific means.
> > > >>
> > > >>
> > > >> Le 26/04/2022 à 18:29, David Li a écrit :
> > > >> > Hello,
> > > >> >
> > > >> > In light of recent efforts around Flight SQL, projects like pgeon
> [1], and long-standing tickets/discussions about database support in Arrow
> [2], it seems there's an opportunity to define standard database interfaces
> for Arrow that could unify these efforts. So we've put together a proposal
> for "ADBC", a common Arrow-based database client API:
> > > >> >
> > > >> >
> https://docs.google.com/document/d/1t7NrC76SyxL_OffATmjzZs2xcj1owdUsIF2WKL_Zw1U/edit#heading=h.r6o6j2navi4c
> > > >> >
> > > >> > A common API and implementations could help combine/simplify
> client-side projects like pgeon, or what DBI is considering [3], and help
> them take advantage of developments like Flight SQL and existing columnar
> APIs.
> > > >> >
> > > >> > We'd appreciate any feedback. (Comments should be open, please
> let me know if not.)
> > > >> >
> > > >> > [1]: https://github.com/0x0L/pgeon
> > > >> > [2]: https://issues.apache.org/jira/browse/ARROW-11670
> > > >> > [3]: https://github.com/r-dbi/dbi3/issues/48
> > > >> >
> > > >> > Thanks,
> > > >> > David
>

Re: [DISC] Improving Arrow's database support

Posted by Wes McKinney <we...@gmail.com>.
I went ahead and created

https://github.com/apache/arrow-adbc

I directed issue comments / PRs to issues@

On Tue, May 31, 2022 at 8:49 PM Wes McKinney <we...@gmail.com> wrote:
>
> I think spinning up a new repository while this exploratory work
> progresses is a fine idea — perhaps apache/arrow-dbc / arrow-adbc or
> similar (the name can always be changed later). That would bubble up
> discussions in a way that's easier for people to follow (watching your
> fork isn't ideal!). If it makes sense to move code later, it can
> always be moved.
>
>
> On Tue, May 31, 2022 at 1:02 PM David Li <li...@apache.org> wrote:
> >
> > Some updates:
> >
> > The proposal is being updated based on feedback from contributors to DuckDB and DBI. We've been using GitHub issues on the fork to discuss the API design and how to implement data ingestion/bound parameters: https://github.com/lidavidm/arrow/issues
> >
> > If anyone has suggestions/ideas/questions, or would like to jump in as well, please feel free to chime in there too.
> >
> > I have also been wondering if we might want to plan to split off a new repo for this work? In particular, some components might be easiest to consume if they didn't also have a hard dependency on the Arrow C++ libraries. And we could use the repo to manage contributed drivers (some of which may individually leverage the Arrow libraries). Of course, maintaining a parallel build system, setting up releases, etc. is also a lot of work.
> >
> > -David
> >
> > On Tue, Apr 26, 2022, at 15:01, Wes McKinney wrote:
> > > I don't have major new things to add on this topic except that I've
> > > long had the aspiration of creating something like Python's DBAPI 2.0
> > > [1] at the C or C++ level to enable a measure of API standardization
> > > for Arrow-native read/write interfaces with database drivers. It seems
> > > like a natural complement to the wire-protocol standardization work
> > > with FlightSQL. I had previously brought in some code that I had
> > > worked on related to interfacing with the HiveServer2 wire protocol
> > > (for Hive and Impala, or other HS2-compatible query engines) with the
> > > intention of prototyping but never was able to find the time.
> > >
> > > From an external messaging standpoint, one thing that will be
> > > important is to assert that this is not intended to displace or
> > > deprecate ODBC or JDBC drivers. In fact, I would hope that the
> > > Arrow-native APIs could be added somehow to existing driver libraries
> > > where it made sense, so that if they are used in an application that
> > > uses Arrow, they can opt in to using the Arrow-based APIs for getting
> > > result sets, or doing bulk inserts, etc.
> > >
> > > [1]: https://peps.python.org/pep-0249/
> > >
> > > On Tue, Apr 26, 2022 at 12:36 PM Antoine Pitrou <an...@python.org> wrote:
> > >>
> > >>
> > >> Do we want something more flexible than dlopen() and runtime symbol
> > >> lookup (a mechanism which constrains the way you can organize and
> > >> distribute drivers)?
> > >>
> > >> For example, perhaps we could expose an API struct of function pointers
> > >> that could be obtained through driver-specific means.
> > >>
> > >>
> > >> Le 26/04/2022 à 18:29, David Li a écrit :
> > >> > Hello,
> > >> >
> > >> > In light of recent efforts around Flight SQL, projects like pgeon [1], and long-standing tickets/discussions about database support in Arrow [2], it seems there's an opportunity to define standard database interfaces for Arrow that could unify these efforts. So we've put together a proposal for "ADBC", a common Arrow-based database client API:
> > >> >
> > >> > https://docs.google.com/document/d/1t7NrC76SyxL_OffATmjzZs2xcj1owdUsIF2WKL_Zw1U/edit#heading=h.r6o6j2navi4c
> > >> >
> > >> > A common API and implementations could help combine/simplify client-side projects like pgeon, or what DBI is considering [3], and help them take advantage of developments like Flight SQL and existing columnar APIs.
> > >> >
> > >> > We'd appreciate any feedback. (Comments should be open, please let me know if not.)
> > >> >
> > >> > [1]: https://github.com/0x0L/pgeon
> > >> > [2]: https://issues.apache.org/jira/browse/ARROW-11670
> > >> > [3]: https://github.com/r-dbi/dbi3/issues/48
> > >> >
> > >> > Thanks,
> > >> > David

Re: [DISC] Improving Arrow's database support

Posted by Wes McKinney <we...@gmail.com>.
I think spinning up a new repository while this exploratory work
progresses is a fine idea — perhaps apache/arrow-dbc / arrow-adbc or
similar (the name can always be changed later). That would bubble up
discussions in a way that's easier for people to follow (watching your
fork isn't ideal!). If it makes sense to move code later, it can
always be moved.


On Tue, May 31, 2022 at 1:02 PM David Li <li...@apache.org> wrote:
>
> Some updates:
>
> The proposal is being updated based on feedback from contributors to DuckDB and DBI. We've been using GitHub issues on the fork to discuss the API design and how to implement data ingestion/bound parameters: https://github.com/lidavidm/arrow/issues
>
> If anyone has suggestions/ideas/questions, or would like to jump in as well, please feel free to chime in there too.
>
> I have also been wondering if we might want to plan to split off a new repo for this work? In particular, some components might be easiest to consume if they didn't also have a hard dependency on the Arrow C++ libraries. And we could use the repo to manage contributed drivers (some of which may individually leverage the Arrow libraries). Of course, maintaining a parallel build system, setting up releases, etc. is also a lot of work.
>
> -David
>
> On Tue, Apr 26, 2022, at 15:01, Wes McKinney wrote:
> > I don't have major new things to add on this topic except that I've
> > long had the aspiration of creating something like Python's DBAPI 2.0
> > [1] at the C or C++ level to enable a measure of API standardization
> > for Arrow-native read/write interfaces with database drivers. It seems
> > like a natural complement to the wire-protocol standardization work
> > with FlightSQL. I had previously brought in some code that I had
> > worked on related to interfacing with the HiveServer2 wire protocol
> > (for Hive and Impala, or other HS2-compatible query engines) with the
> > intention of prototyping but never was able to find the time.
> >
> > From an external messaging standpoint, one thing that will be
> > important is to assert that this is not intended to displace or
> > deprecate ODBC or JDBC drivers. In fact, I would hope that the
> > Arrow-native APIs could be added somehow to existing driver libraries
> > where it made sense, so that if they are used in an application that
> > uses Arrow, they can opt in to using the Arrow-based APIs for getting
> > result sets, or doing bulk inserts, etc.
> >
> > [1]: https://peps.python.org/pep-0249/
> >
> > On Tue, Apr 26, 2022 at 12:36 PM Antoine Pitrou <an...@python.org> wrote:
> >>
> >>
> >> Do we want something more flexible than dlopen() and runtime symbol
> >> lookup (a mechanism which constrains the way you can organize and
> >> distribute drivers)?
> >>
> >> For example, perhaps we could expose an API struct of function pointers
> >> that could be obtained through driver-specific means.
> >>
> >>
> >> Le 26/04/2022 à 18:29, David Li a écrit :
> >> > Hello,
> >> >
> >> > In light of recent efforts around Flight SQL, projects like pgeon [1], and long-standing tickets/discussions about database support in Arrow [2], it seems there's an opportunity to define standard database interfaces for Arrow that could unify these efforts. So we've put together a proposal for "ADBC", a common Arrow-based database client API:
> >> >
> >> > https://docs.google.com/document/d/1t7NrC76SyxL_OffATmjzZs2xcj1owdUsIF2WKL_Zw1U/edit#heading=h.r6o6j2navi4c
> >> >
> >> > A common API and implementations could help combine/simplify client-side projects like pgeon, or what DBI is considering [3], and help them take advantage of developments like Flight SQL and existing columnar APIs.
> >> >
> >> > We'd appreciate any feedback. (Comments should be open, please let me know if not.)
> >> >
> >> > [1]: https://github.com/0x0L/pgeon
> >> > [2]: https://issues.apache.org/jira/browse/ARROW-11670
> >> > [3]: https://github.com/r-dbi/dbi3/issues/48
> >> >
> >> > Thanks,
> >> > David

Re: [DISC] Improving Arrow's database support

Posted by David Li <li...@apache.org>.
Some updates:

The proposal is being updated based on feedback from contributors to DuckDB and DBI. We've been using GitHub issues on the fork to discuss the API design and how to implement data ingestion/bound parameters: https://github.com/lidavidm/arrow/issues 

If anyone has suggestions/ideas/questions, or would like to jump in as well, please feel free to chime in there too.

I have also been wondering if we might want to plan to split off a new repo for this work? In particular, some components might be easiest to consume if they didn't also have a hard dependency on the Arrow C++ libraries. And we could use the repo to manage contributed drivers (some of which may individually leverage the Arrow libraries). Of course, maintaining a parallel build system, setting up releases, etc. is also a lot of work.

-David

On Tue, Apr 26, 2022, at 15:01, Wes McKinney wrote:
> I don't have major new things to add on this topic except that I've
> long had the aspiration of creating something like Python's DBAPI 2.0
> [1] at the C or C++ level to enable a measure of API standardization
> for Arrow-native read/write interfaces with database drivers. It seems
> like a natural complement to the wire-protocol standardization work
> with FlightSQL. I had previously brought in some code that I had
> worked on related to interfacing with the HiveServer2 wire protocol
> (for Hive and Impala, or other HS2-compatible query engines) with the
> intention of prototyping but never was able to find the time.
>
> From an external messaging standpoint, one thing that will be
> important is to assert that this is not intended to displace or
> deprecate ODBC or JDBC drivers. In fact, I would hope that the
> Arrow-native APIs could be added somehow to existing driver libraries
> where it made sense, so that if they are used in an application that
> uses Arrow, they can opt in to using the Arrow-based APIs for getting
> result sets, or doing bulk inserts, etc.
>
> [1]: https://peps.python.org/pep-0249/
>
> On Tue, Apr 26, 2022 at 12:36 PM Antoine Pitrou <an...@python.org> wrote:
>>
>>
>> Do we want something more flexible than dlopen() and runtime symbol
>> lookup (a mechanism which constrains the way you can organize and
>> distribute drivers)?
>>
>> For example, perhaps we could expose an API struct of function pointers
>> that could be obtained through driver-specific means.
>>
>>
>> Le 26/04/2022 à 18:29, David Li a écrit :
>> > Hello,
>> >
>> > In light of recent efforts around Flight SQL, projects like pgeon [1], and long-standing tickets/discussions about database support in Arrow [2], it seems there's an opportunity to define standard database interfaces for Arrow that could unify these efforts. So we've put together a proposal for "ADBC", a common Arrow-based database client API:
>> >
>> > https://docs.google.com/document/d/1t7NrC76SyxL_OffATmjzZs2xcj1owdUsIF2WKL_Zw1U/edit#heading=h.r6o6j2navi4c
>> >
>> > A common API and implementations could help combine/simplify client-side projects like pgeon, or what DBI is considering [3], and help them take advantage of developments like Flight SQL and existing columnar APIs.
>> >
>> > We'd appreciate any feedback. (Comments should be open, please let me know if not.)
>> >
>> > [1]: https://github.com/0x0L/pgeon
>> > [2]: https://issues.apache.org/jira/browse/ARROW-11670
>> > [3]: https://github.com/r-dbi/dbi3/issues/48
>> >
>> > Thanks,
>> > David

Re: [DISC] Improving Arrow's database support

Posted by Wes McKinney <we...@gmail.com>.
I don't have major new things to add on this topic except that I've
long had the aspiration of creating something like Python's DBAPI 2.0
[1] at the C or C++ level to enable a measure of API standardization
for Arrow-native read/write interfaces with database drivers. It seems
like a natural complement to the wire-protocol standardization work
with FlightSQL. I had previously brought in some code that I had
worked on related to interfacing with the HiveServer2 wire protocol
(for Hive and Impala, or other HS2-compatible query engines) with the
intention of prototyping but never was able to find the time.

From an external messaging standpoint, one thing that will be
important is to assert that this is not intended to displace or
deprecate ODBC or JDBC drivers. In fact, I would hope that the
Arrow-native APIs could be added somehow to existing driver libraries
where it made sense, so that if they are used in an application that
uses Arrow, they can opt in to using the Arrow-based APIs for getting
result sets, or doing bulk inserts, etc.

[1]: https://peps.python.org/pep-0249/

On Tue, Apr 26, 2022 at 12:36 PM Antoine Pitrou <an...@python.org> wrote:
>
>
> Do we want something more flexible than dlopen() and runtime symbol
> lookup (a mechanism which constrains the way you can organize and
> distribute drivers)?
>
> For example, perhaps we could expose an API struct of function pointers
> that could be obtained through driver-specific means.
>
>
> Le 26/04/2022 à 18:29, David Li a écrit :
> > Hello,
> >
> > In light of recent efforts around Flight SQL, projects like pgeon [1], and long-standing tickets/discussions about database support in Arrow [2], it seems there's an opportunity to define standard database interfaces for Arrow that could unify these efforts. So we've put together a proposal for "ADBC", a common Arrow-based database client API:
> >
> > https://docs.google.com/document/d/1t7NrC76SyxL_OffATmjzZs2xcj1owdUsIF2WKL_Zw1U/edit#heading=h.r6o6j2navi4c
> >
> > A common API and implementations could help combine/simplify client-side projects like pgeon, or what DBI is considering [3], and help them take advantage of developments like Flight SQL and existing columnar APIs.
> >
> > We'd appreciate any feedback. (Comments should be open, please let me know if not.)
> >
> > [1]: https://github.com/0x0L/pgeon
> > [2]: https://issues.apache.org/jira/browse/ARROW-11670
> > [3]: https://github.com/r-dbi/dbi3/issues/48
> >
> > Thanks,
> > David

Re: [DISC] Improving Arrow's database support

Posted by Antoine Pitrou <an...@python.org>.
Do we want something more flexible than dlopen() and runtime symbol 
lookup (a mechanism which constrains the way you can organize and 
distribute drivers)?

For example, perhaps we could expose an API struct of function pointers 
that could be obtained through driver-specific means.
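
Concretely, something along these lines (names purely illustrative): each driver exports one driver-specific entry point returning its table of function pointers, and how an application obtains that entry point (static linking, its own dlopen(), a registry) is then up to the application.

    #include <stdint.h>

    typedef uint8_t AdbcStatusCode;
    struct AdbcError;
    struct AdbcConnection;

    /* The API as a struct of function pointers; one slot per entry point. */
    struct AdbcApi {
      AdbcStatusCode (*ConnectionNew)(struct AdbcConnection*, struct AdbcError*);
      AdbcStatusCode (*ConnectionRelease)(struct AdbcConnection*,
                                          struct AdbcError*);
      /* ... */
    };

    /* Driver-specific means of obtaining the table: here, a hypothetical
     * Postgres driver simply exports a well-known getter. */
    const struct AdbcApi* AdbcPostgresGetApi(void);

    /* Example use, assuming the application linked that driver directly. */
    AdbcStatusCode OpenConnection(struct AdbcConnection* conn,
                                  struct AdbcError* error) {
      const struct AdbcApi* api = AdbcPostgresGetApi();
      return api->ConnectionNew(conn, error);
    }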


Le 26/04/2022 à 18:29, David Li a écrit :
> Hello,
> 
> In light of recent efforts around Flight SQL, projects like pgeon [1], and long-standing tickets/discussions about database support in Arrow [2], it seems there's an opportunity to define standard database interfaces for Arrow that could unify these efforts. So we've put together a proposal for "ADBC", a common Arrow-based database client API:
> 
> https://docs.google.com/document/d/1t7NrC76SyxL_OffATmjzZs2xcj1owdUsIF2WKL_Zw1U/edit#heading=h.r6o6j2navi4c
> 
> A common API and implementations could help combine/simplify client-side projects like pgeon, or what DBI is considering [3], and help them take advantage of developments like Flight SQL and existing columnar APIs.
> 
> We'd appreciate any feedback. (Comments should be open, please let me know if not.)
> 
> [1]: https://github.com/0x0L/pgeon
> [2]: https://issues.apache.org/jira/browse/ARROW-11670
> [3]: https://github.com/r-dbi/dbi3/issues/48
> 
> Thanks,
> David