Posted to issues@arrow.apache.org by "Andy Grove (Jira)" <ji...@apache.org> on 2019/09/26 12:56:00 UTC

[jira] [Updated] (ARROW-6659) [Rust] [DataFusion] Refactor of HashAggregateExec to support custom merge

     [ https://issues.apache.org/jira/browse/ARROW-6659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andy Grove updated ARROW-6659:
------------------------------
    Parent Issue: ARROW-6689  (was: ARROW-5227)

> [Rust] [DataFusion] Refactor of HashAggregateExec to support custom merge
> -------------------------------------------------------------------------
>
>                 Key: ARROW-6659
>                 URL: https://issues.apache.org/jira/browse/ARROW-6659
>             Project: Apache Arrow
>          Issue Type: Sub-task
>          Components: Rust, Rust - DataFusion
>            Reporter: Andy Grove
>            Priority: Major
>
> HashAggregateExec currently creates one HashPartition per input partition for the initial per-partition aggregate, then explicitly calls MergeExec, and then creates another HashPartition for the final reduce operation.
> This is fine for in-memory queries in DataFusion but is not extensible. For example, it is not possible to provide a different MergeExec implementation that would distribute queries to a cluster.
> A better design would be to move the logic into the query planner so that the physical plan contains explicit steps such as:
>  
> {code:java}
> - HashAggregate // final aggregate
>   - MergeExec
>     - HashAggregate // aggregate per partition
>  {code}
> This would then make it easier to customize the plan in other projects, to support distributed execution:
> {code:java}
> - HashAggregate // final aggregate
>   - MergeExec
>     - DistributedExec
>       - HashAggregate // aggregate per partition
> {code}
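> A minimal Rust sketch of the idea, purely as an illustration: the trait and operator types below (ExecutionPlan, HashAggregateExec, MergeExec, DistributedExec, ScanExec) are hypothetical, simplified stand-ins rather than the real DataFusion API. The point is that once the planner emits the per-partition aggregate, the merge, and the final aggregate as explicit plan nodes, another project can splice its own node into the tree without changing the aggregate operators:
> {code:rust}
> // Hypothetical, simplified stand-ins (not the real DataFusion API).
> use std::sync::Arc;
>
> /// Minimal stand-in for a physical plan node.
> trait ExecutionPlan {
>     fn name(&self) -> String;
>     fn children(&self) -> Vec<Arc<dyn ExecutionPlan>>;
> }
>
> /// Whether this aggregate computes partial results per partition or the final result.
> enum AggregateMode {
>     Partial,
>     Final,
> }
>
> struct ScanExec;
> struct HashAggregateExec {
>     mode: AggregateMode,
>     input: Arc<dyn ExecutionPlan>,
> }
> struct MergeExec {
>     input: Arc<dyn ExecutionPlan>,
> }
> /// Stand-in for a node a distributed engine might insert to run the
> /// per-partition aggregates on a cluster.
> struct DistributedExec {
>     input: Arc<dyn ExecutionPlan>,
> }
>
> impl ExecutionPlan for ScanExec {
>     fn name(&self) -> String { "Scan".to_string() }
>     fn children(&self) -> Vec<Arc<dyn ExecutionPlan>> { vec![] }
> }
> impl ExecutionPlan for HashAggregateExec {
>     fn name(&self) -> String {
>         match self.mode {
>             AggregateMode::Partial => "HashAggregate // aggregate per partition".to_string(),
>             AggregateMode::Final => "HashAggregate // final aggregate".to_string(),
>         }
>     }
>     fn children(&self) -> Vec<Arc<dyn ExecutionPlan>> { vec![self.input.clone()] }
> }
> impl ExecutionPlan for MergeExec {
>     fn name(&self) -> String { "MergeExec".to_string() }
>     fn children(&self) -> Vec<Arc<dyn ExecutionPlan>> { vec![self.input.clone()] }
> }
> impl ExecutionPlan for DistributedExec {
>     fn name(&self) -> String { "DistributedExec".to_string() }
>     fn children(&self) -> Vec<Arc<dyn ExecutionPlan>> { vec![self.input.clone()] }
> }
>
> /// The planner emits both aggregate phases and the merge as explicit plan nodes.
> fn plan_aggregate(input: Arc<dyn ExecutionPlan>) -> Arc<dyn ExecutionPlan> {
>     let partial = Arc::new(HashAggregateExec { mode: AggregateMode::Partial, input });
>     let merge = Arc::new(MergeExec { input: partial });
>     Arc::new(HashAggregateExec { mode: AggregateMode::Final, input: merge })
> }
>
> /// A distributed engine only has to splice its own node between the phases;
> /// the aggregate operators themselves are unchanged.
> fn plan_distributed_aggregate(input: Arc<dyn ExecutionPlan>) -> Arc<dyn ExecutionPlan> {
>     let partial = Arc::new(HashAggregateExec { mode: AggregateMode::Partial, input });
>     let distributed = Arc::new(DistributedExec { input: partial });
>     let merge = Arc::new(MergeExec { input: distributed });
>     Arc::new(HashAggregateExec { mode: AggregateMode::Final, input: merge })
> }
>
> /// Print a plan tree in the same indented form as the plans above.
> fn print_plan(plan: &Arc<dyn ExecutionPlan>, indent: usize) {
>     println!("{}- {}", " ".repeat(indent), plan.name());
>     for child in plan.children() {
>         print_plan(&child, indent + 2);
>     }
> }
>
> fn main() {
>     print_plan(&plan_aggregate(Arc::new(ScanExec)), 0);
>     print_plan(&plan_distributed_aggregate(Arc::new(ScanExec)), 0);
> }
> {code}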



--
This message was sent by Atlassian Jira
(v8.3.4#803005)