You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "jorisvandenbossche (via GitHub)" <gi...@apache.org> on 2023/02/01 14:29:57 UTC

[GitHub] [arrow] jorisvandenbossche opened a new issue, #33976: [Python] Add low-level pyarrow bindings for Acero

jorisvandenbossche opened a new issue, #33976:
URL: https://github.com/apache/arrow/issues/33976

   The Acero query engine is one of the few parts of the Arrow C++ libraries that is not directly exposed in the pyarrow bindings. While you can already run queries using Acero through a substrait plan (`pyarrow.substrait.run_query`; but that requires first building that plan, and limits to what Acero supports of Substrait), and Acero is already used under the hood in some Dataset methods (`pyarrow.dataset.Dataset.join/sort_by`, but that doesn't currently allow building up a full query), there are some benefits of having direct, low-level bindings:
   
   * It would be useful for direct users (or downstream packages) of pyarrow that don't want / need to go through substrait (yet another thing to learn / implement)
   * It would make it easier for developers to test new Acero functionality from python
   * You can use features of Acero that are not (yet) available in substrait
   * It could (potentially) give an interface for people developing custom exec nodes
   * It would be useful for people that want to do things "right now" (e.g. 12.0.0) instead of waiting for Substrait (tooling) to mature
   
   The idea would be to add a new `pyarrow.acero` module, with low level bindings for the Acero execution engine. The goal would _not_ be to build a complete user-friendly query interface (like ibis or dplyr), but provide rather direct access to build up a query with the Declaration interface. Minimum abilities would be to construct a plan, inspect plans created through other means (eg from substrait), and execute a plan (to_table, to_reader, write). An initial goal would be to be able to use this interface ourselves in the places that currently have custom cython usage of the exec plan such as Dataset.join (i.e. refactoring `_exec_plan.pyx`).
   
   Non-exhaustive list of subtasks:
   
   - [ ] Basic API to create a query plan / declaration (including the options classes)
   - [ ] Replace the internal usage of execplan in Dataset join/sort_by methods
   - [ ] Improve usage of compute functions with (field) expressions
   - [ ] Ability to use registered UDFs in the query (specify the function registry)
   - [ ] Basic bindings for ExecPlan object? (not create it directly, but be able to inspect it when created from declaration or through substrait)
   - [ ] (enable using custom exec nodes?)
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org