Posted to dev@arrow.apache.org by "Shoumyo Chakravorti (BLOOMBERG/ 731 LEX)" <sc...@bloomberg.net> on 2023/01/25 18:02:00 UTC

[Discuss] C++ query builder/execution API

Hi Arrow developers!

This is my first time posting on this mailing list, so please let me know if this post belongs elsewhere.

A few colleagues and I plan to implement a C++ interface for building read-only queries and executing them against Arrow Datasets through Substrait consumers like Acero, DuckDB, and Velox. Since we hope to build this out in the open, I have outlined the kind of interface that we intend to build in this Google doc [0].
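
To make the idea concrete, here is a minimal sketch of what a fluent, read-only query builder with a pluggable execution backend could look like. It is not taken from the linked doc: every name in it (`Plan`, `Engine`, `ExplainEngine`, `QueryBuilder`, `Filter`, `Select`) is hypothetical, predicates are plain strings, and the stand-in engine only renders the plan as text. An actual backend would presumably serialize the plan to Substrait and hand it to a consumer such as Acero, DuckDB, or Velox.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical logical plan: a source dataset plus an ordered list of steps.
struct Plan {
    std::string source;
    std::vector<std::string> steps;
};

// Hypothetical engine interface. A real implementation would translate the
// plan to Substrait and pass it to a consumer (Acero, DuckDB, Velox, ...).
class Engine {
public:
    virtual ~Engine() = default;
    virtual std::string Execute(const Plan& plan) = 0;
};

// Stand-in engine for illustration only: it renders the plan as text
// instead of running it.
class ExplainEngine : public Engine {
public:
    std::string Execute(const Plan& plan) override {
        std::string out = "scan(" + plan.source + ")";
        for (const auto& step : plan.steps) {
            out += " -> " + step;
        }
        return out;
    }
};

// Hypothetical fluent builder. It exposes only read-only operations
// (filter and projection), so a built plan can never mutate the dataset.
class QueryBuilder {
public:
    explicit QueryBuilder(std::string dataset) {
        plan_.source = std::move(dataset);
    }

    QueryBuilder& Filter(const std::string& predicate) {
        plan_.steps.push_back("filter(" + predicate + ")");
        return *this;
    }

    QueryBuilder& Select(const std::vector<std::string>& columns) {
        std::string cols;
        for (std::size_t i = 0; i < columns.size(); ++i) {
            if (i > 0) cols += ", ";
            cols += columns[i];
        }
        plan_.steps.push_back("project(" + cols + ")");
        return *this;
    }

    const Plan& Build() const { return plan_; }

private:
    Plan plan_;
};
```

With this sketch, `QueryBuilder("trades").Filter("price > 100").Select({"symbol", "price"})` followed by `ExplainEngine().Execute(...)` on the built plan yields `scan(trades) -> filter(price > 100) -> project(symbol, price)`. Keeping the builder separate from the `Engine` interface is what would let the same query run on different Substrait consumers.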

I'm making this post for a few reasons:

    - To gauge whether the community feels like this work would be worth pursuing as an open-source project
    - To receive feedback on the proposed interface and ensure that we would be able to accommodate a wide variety of use-cases (please feel free to leave comments directly on the doc)
    - To connect with developers who might be interested in collaborating on this effort

Relatedly, I would like to get the Arrow developers' thoughts on whether it would make sense to pursue this work as an official Arrow project (e.g. in an experimental repo) or if it would be better as a standalone project. I understand that pursuing this as an Arrow project would have its downsides (like increased review/maintenance burden) and would risk confusing new users about what the official Arrow libraries aim to solve [1]. On the other hand, making such an interface readily available alongside `libarrow` could increase the adoption of Arrow among certain developers (e.g. in finance/fintech). Whatever your opinion, I'd love to hear your thoughts on which approach makes more sense.

Please feel free to reply here on the mailing list or leave comments on the linked Google Doc!

[0]: https://docs.google.com/document/d/1_ktKxtOFW1grD-VcbBNc0FaP4g5j7vSx9bO2ht59JFA
[1]: https://www.datawill.io/posts/apache-arrow-2022-reflection/#who-is-libarrows-and-aceros-audience

Re: [Discuss] C++ query builder/execution API

Posted by Weston Pace <we...@gmail.com>.

+1 to what Ian said.  I'll also add, as was brought up in the earlier
pandas API discussion, that a blog post (once you've got something
ready to use) would be a good idea.

> On the other hand, making such an interface readily available alongside `libarrow` could increase the adoption of Arrow among certain developers (e.g. in finance/fintech).

I don't fully understand this statement.  I suspect you are correct
but, since I don't understand the reasons, I can't really say what we
can do to help.  For example, if it's purely the fact that it is an
ASF project then we can make a new arrow-xyz repo (you'd need a
committer to dedicate time to reviews and you would need to convince
several PMC members to commit time for validating your releases).

On Fri, Jan 27, 2023 at 9:10 AM Ian Cook <ia...@ursacomputing.com> wrote:
>
> Hi Shoumyo,
>
> This is exciting—thank you for the thoughtfulness you have put into
> this proposal.
>
> This topic of a C++ dataframe API for Arrow-native engine(s) has come
> up in the past [3], but the bulk of the previous discussion about this
> predated Substrait. With the Substrait project now quickly gaining
> momentum, it seems an excellent time to revisit this topic and to
> incorporate Substrait into it, as you have done.
>
> I strongly believe that this work should happen in a repository that
> is outside of the Arrow project. Many of the exciting developments in
> Arrow-land these days are happening in the broader ecosystem around
> Arrow. The proposed API could be used independently of Arrow libraries
> (for example, it could be used with DuckDB). For projects like this, I
> think our hope as Arrow maintainers is to "let a hundred flowers
> bloom" around Arrow (all with excellent interoperability based on Arrow
> standards) rather than centralizing the work inside Arrow
> repositories. We can use resources including the "Powered by Arrow" [4]
> and "Powered by Substrait" [5] pages to highlight the project.
>
> Thank you,
> Ian
>
> [3] https://docs.google.com/document/d/1XHe_j87n2VHGzEbnLe786GHbbcbrzbjgG8D0IXWAeHg/
> [4] https://arrow.apache.org/powered_by/
> [5] https://substrait.io/community/powered_by/

Re: [Discuss] C++ query builder/execution API

Posted by Ian Cook <ia...@ursacomputing.com>.

Hi Shoumyo,

This is exciting—thank you for the thoughtfulness you have put into
this proposal.

This topic of a C++ dataframe API for Arrow-native engine(s) has come
up in the past [3], but the bulk of the previous discussion about this
predated Substrait. With the Substrait project now quickly gaining
momentum, it seems an excellent time to revisit this topic and to
incorporate Substrait into it, as you have done.

I strongly believe that this work should happen in a repository that
is outside of the Arrow project. Many of the exciting developments in
Arrow-land these days are happening in the broader ecosystem around
Arrow. The proposed API could be used independently of Arrow libraries
(for example, it could be used with DuckDB). For projects like this, I
think our hope as Arrow maintainers is to "let a hundred flowers
bloom" around Arrow (all with excellent interoperability based on Arrow
standards) rather than centralizing the work inside Arrow
repositories. We can use resources including the "Powered by Arrow" [4]
and "Powered by Substrait" [5] pages to highlight the project.

Thank you,
Ian

[3] https://docs.google.com/document/d/1XHe_j87n2VHGzEbnLe786GHbbcbrzbjgG8D0IXWAeHg/
[4] https://arrow.apache.org/powered_by/
[5] https://substrait.io/community/powered_by/
