You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@arrow.apache.org by Micah Kornfield <em...@gmail.com> on 2021/10/08 04:57:51 UTC

Re: Update on C++ query engine work

I don't have much to add, but it is really cool to see all the progress
being made here.

On Wed, Sep 29, 2021 at 6:29 AM Neal Richardson <ne...@gmail.com>
wrote:

> Hi all,
> I wanted to give an update on work that several of us have been doing in
> recent months on query processing in C++. Back in March, we circulated [1]
> a document [2] with a proposal on implementing the basic pieces of a query
> execution engine. Recent patches have introduced some key aspects of this
> functionality, including
>
> * ExecPlan and ExecNode API
> * Scalar and GroupBy aggregation nodes
> * OrderBy node
> * A faster, asynchronous Dataset scanner implementation
> * R bindings
>
> You may also have noticed a lot of activity around compute functions
> (kernels), which these can consume. We roughly doubled the number of
> compute functions between 4.0.0 and 5.0.0, and for 6.0.0, we’ve added the
> most common aggregation functions and have improved the scalar kernels to
> support a more complete set of types. See the compute functions page in the
> Arrow C++ library user guide for the full list of compute functions
> available in the current release [3].
>
> This means we can now scan, project, filter, aggregate, and sort in-memory,
> although the functionality is not easily accessible from bindings other
> than R just yet.
>
> In the coming weeks and months, we’ll continue to push ahead on this work,
> focusing on
>
> * More nodes, including joins [4] and top-k/bottom-k [5]
> * Alternative scheduling approaches such as work stealing approaches or
> different methods of applying back pressure
> * The ability to spill to disk when necessary to cut down on memory
> pressure
> * Connecting the query engine to the experimental Compute IR [6] and
> exposing it in Python via ibis
>
> Thank you to the many members of the Arrow developer community who have
> contributed code, comments, and reviews. If you are interested in following
> or contributing to this work, please see the “Query engine 6.0 release” [7]
> and “Compute kernels 6.0 release” [8] Confluence pages.
>
> Neal
>
> [1]:
>
> https://lists.apache.org/thread.html/rb06b4dc2c6e53fe01784e22e669710710be747faadd46b608c9a27f5%40%3Cdev.arrow.apache.org%3E
> [2]:
>
> https://docs.google.com/document/d/1AyTdLU-RxA-Gsb9EsYnrQrmqPMOYMfPlWwxRi1Is1tQ/edit#heading=h.t89hffc3t7si
> [3]: https://arrow.apache.org/docs/cpp/compute.html
> [4]: https://github.com/apache/arrow/pull/11150
> [5]: https://issues.apache.org/jira/browse/ARROW-13973
> [6]: https://github.com/apache/arrow/pull/10934
> [7]:
> https://cwiki.apache.org/confluence/display/ARROW/Query+engine+6.0+release
> [8]:
>
> https://cwiki.apache.org/confluence/display/ARROW/Compute+kernels+6.0+release
>