You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@arrow.apache.org by Neal Richardson <ne...@gmail.com> on 2021/09/29 13:28:06 UTC

Update on C++ query engine work

Hi all,
I wanted to give an update on work that several of us have been doing in
recent months on query processing in C++. Back in March, we circulated [1]
a document [2] with a proposal on implementing the basic pieces of a query
execution engine. Recent patches have introduced some key aspects of this
functionality, including

* ExecPlan and ExecNode API
* Scalar and GroupBy aggregation nodes
* OrderBy node
* A faster, asynchronous Dataset scanner implementation
* R bindings

You may also have noticed a lot of activity around compute functions
(kernels), which these can consume. We roughly doubled the number of
compute functions between 4.0.0 and 5.0.0, and for 6.0.0, we’ve added the
most common aggregation functions and have improved the scalar kernels to
support a more complete set of types. See the compute functions page in the
Arrow C++ library user guide for the full list of compute functions
available in the current release [3].

This means we can now scan, project, filter, aggregate, and sort in-memory,
although the functionality is not easily accessible from bindings other
than R just yet.

In the coming weeks and months, we’ll continue to push ahead on this work,
focusing on

* More nodes, including joins [4] and top-k/bottom-k [5]
* Alternative scheduling approaches such as work stealing approaches or
different methods of applying back pressure
* The ability to spill to disk when necessary to cut down on memory pressure
* Connecting the query engine to the experimental Compute IR [6] and
exposing it in Python via ibis

Thank you to the many members of the Arrow developer community who have
contributed code, comments, and reviews. If you are interested in following
or contributing to this work, please see the “Query engine 6.0 release” [7]
and “Compute kernels 6.0 release” [8] Confluence pages.

Neal

[1]:
https://lists.apache.org/thread.html/rb06b4dc2c6e53fe01784e22e669710710be747faadd46b608c9a27f5%40%3Cdev.arrow.apache.org%3E
[2]:
https://docs.google.com/document/d/1AyTdLU-RxA-Gsb9EsYnrQrmqPMOYMfPlWwxRi1Is1tQ/edit#heading=h.t89hffc3t7si
[3]: https://arrow.apache.org/docs/cpp/compute.html
[4]: https://github.com/apache/arrow/pull/11150
[5]: https://issues.apache.org/jira/browse/ARROW-13973
[6]: https://github.com/apache/arrow/pull/10934
[7]:
https://cwiki.apache.org/confluence/display/ARROW/Query+engine+6.0+release
[8]:
https://cwiki.apache.org/confluence/display/ARROW/Compute+kernels+6.0+release

Re: Update on C++ query engine work

Posted by Micah Kornfield <em...@gmail.com>.

I don't have much to add, but it is really cool to see all the progress
being made here.

On Wed, Sep 29, 2021 at 6:29 AM Neal Richardson <ne...@gmail.com>
wrote:

> Hi all,
> I wanted to give an update on work that several of us have been doing in
> recent months on query processing in C++. Back in March, we circulated [1]
> a document [2] with a proposal on implementing the basic pieces of a query
> execution engine. Recent patches have introduced some key aspects of this
> functionality, including
>
> * ExecPlan and ExecNode API
> * Scalar and GroupBy aggregation nodes
> * OrderBy node
> * A faster, asynchronous Dataset scanner implementation
> * R bindings
>
> You may also have noticed a lot of activity around compute functions
> (kernels), which these can consume. We roughly doubled the number of
> compute functions between 4.0.0 and 5.0.0, and for 6.0.0, we’ve added the
> most common aggregation functions and have improved the scalar kernels to
> support a more complete set of types. See the compute functions page in the
> Arrow C++ library user guide for the full list of compute functions
> available in the current release [3].
>
> This means we can now scan, project, filter, aggregate, and sort in-memory,
> although the functionality is not easily accessible from bindings other
> than R just yet.
>
> In the coming weeks and months, we’ll continue to push ahead on this work,
> focusing on
>
> * More nodes, including joins [4] and top-k/bottom-k [5]
> * Alternative scheduling approaches such as work stealing approaches or
> different methods of applying back pressure
> * The ability to spill to disk when necessary to cut down on memory
> pressure
> * Connecting the query engine to the experimental Compute IR [6] and
> exposing it in Python via ibis
>
> Thank you to the many members of the Arrow developer community who have
> contributed code, comments, and reviews. If you are interested in following
> or contributing to this work, please see the “Query engine 6.0 release” [7]
> and “Compute kernels 6.0 release” [8] Confluence pages.
>
> Neal
>
> [1]:
>
> https://lists.apache.org/thread.html/rb06b4dc2c6e53fe01784e22e669710710be747faadd46b608c9a27f5%40%3Cdev.arrow.apache.org%3E
> [2]:
>
> https://docs.google.com/document/d/1AyTdLU-RxA-Gsb9EsYnrQrmqPMOYMfPlWwxRi1Is1tQ/edit#heading=h.t89hffc3t7si
> [3]: https://arrow.apache.org/docs/cpp/compute.html
> [4]: https://github.com/apache/arrow/pull/11150
> [5]: https://issues.apache.org/jira/browse/ARROW-13973
> [6]: https://github.com/apache/arrow/pull/10934
> [7]:
> https://cwiki.apache.org/confluence/display/ARROW/Query+engine+6.0+release
> [8]:
>
> https://cwiki.apache.org/confluence/display/ARROW/Compute+kernels+6.0+release
>