You are viewing a plain text version of this content. The canonical link for it is here.

Posted to jira@arrow.apache.org by "Andy Grove (Jira)" <ji...@apache.org> on 2020/10/03 18:26:00 UTC

[jira] [Updated] (ARROW-9464) [Rust] [DataFusion] Physical plan refactor to support optimization rules and more efficient use of threads

     [ https://issues.apache.org/jira/browse/ARROW-9464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andy Grove updated ARROW-9464:
------------------------------
    Fix Version/s:     (was: 2.0.0)
                   3.0.0

> [Rust] [DataFusion] Physical plan refactor to support optimization rules and more efficient use of threads
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-9464
>                 URL: https://issues.apache.org/jira/browse/ARROW-9464
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Rust, Rust - DataFusion
>            Reporter: Andy Grove
>            Assignee: Andy Grove
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 3.0.0
>
>          Time Spent: 3h 50m
>  Remaining Estimate: 0h
>
> I would like to propose a refactor of the physical/execution planning based on the experience I have had in implementing distributed execution in Ballista.
> This will likely need subtasks but here is an overview of the changes I am proposing.
> h3. *Introduce physical plan optimization rule to insert "shuffle" operators*
> We should extend the ExecutionPlan trait so that each operator can specify its input and output partitioning needs, and then have an optimization rule that can insert any repartitioning or reordering steps required.
> For example, these are the methods to be added to ExecutionPlan. This design is based on Apache Spark.
>  
> {code:java}
> /// Specifies how data is partitioned across different nodes in the cluster
> fn output_partitioning(&self) -> Partitioning {
>     Partitioning::UnknownPartitioning(0)
> }
> /// Specifies the data distribution requirements of all the children for this operator
> fn required_child_distribution(&self) -> Distribution {
>     Distribution::UnspecifiedDistribution
> }
> /// Specifies how data is ordered in each partition
> fn output_ordering(&self) -> Option<Vec<SortOrder>> {
>     None
> }
> /// Specifies the data distribution requirements of all the children for this operator
> fn required_child_ordering(&self) -> Option<Vec<Vec<SortOrder>>> {
>     None
> }
>  {code}
> A good example of applying this rule would be in the case of hash aggregates where we perform a partial aggregate in parallel across partitions and then coalesce the results and apply a final hash aggregate.
> Another example would be a SortMergeExec specifying the sort order required for its children.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)