Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/08/31 13:54:06 UTC

[GitHub] [arrow-datafusion] rdettai opened a new issue #962: Moving cost based optimizations to physical planning

rdettai opened a new issue #962:
URL: https://github.com/apache/arrow-datafusion/issues/962


   **Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
   Currently, some cost-based optimizations are executed during logical plan optimization. This breaks the intended separation between the logical and physical plans. Typically, the logical plan should be optimized using rule-based optimizations only; cost-based optimization should kick in during physical plan selection.
   
   A more concrete reason to move all cost-based optimizations further down the pipeline is to avoid the need to fetch statistics when building the logical plan. In Ballista, for example, the logical plan is built in the client, which might be off cluster and thus shouldn't be required to have access to the datastore (or at least not the high-performance access that is sometimes required to fetch statistics).
   
   
   **Describe the solution you'd like**
   - Remove `statistics()` and `has_exact_statistics()` from the `TableProvider` trait
   - Add a `statistics()` method to the `ExecutionPlan` trait
  - Each node of the physical plan will try to assess its output statistics itself, according to its inputs and its internal logic. This is easier to maintain than the current approach, where we are trying to reconstruct the statistics of the nodes externally (e.g. `get_num_rows(&LogicalPlan)` in `hash_build_probe_order.rs`).
  - The returned type will be the same as the current `Statistics` struct, except that we use a boolean field `is_exact` instead of having a separate `has_exact_statistics()` method
  - The method is sync because the statistics are computed eagerly when the `ExecutionPlan` is created. Even though some plans might not currently use the statistics, as more optimizations are added, statistics will be required more and more often. You can still opt out of fetching statistics entirely by specifying that on the datasource, and the choice will automatically propagate to the statistics of all the nodes in the `ExecutionPlan`.
     - [can be postponed to a separate PR] Make the `TableProvider.scan(...)` method async to allow the computation of the statistics when creating the source `ExecutionPlan` nodes. This will require async propagation to the `PhysicalPlanner` trait.
     - The `AggregateStatistics` rule (use the statistics directly to provide the aggregation if possible) could be managed:
       - By the `HashAggregateExec` itself when being constructed
       - During the physical plan optimization
  - The `HashBuildProbeOrder` rule (optimize join ordering according to table sizes) could be managed:
    - By the `HashJoinExec` and `CrossJoinExec` themselves when being constructed
       - During the physical plan optimization
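   The shape of the proposed API could be sketched roughly as follows. This is only an illustration of the description above: the `statistics()` method and the `is_exact` field come from this proposal, while the other field names and the `FilterExec` example are hypothetical.

   ```rust
   // Sketch of the proposed statistics API. Only `statistics()` and `is_exact`
   // come from the proposal; the remaining names are illustrative.

   #[derive(Debug, Clone, Default)]
   pub struct Statistics {
       pub num_rows: Option<usize>,
       pub total_byte_size: Option<usize>,
       /// Replaces the separate `has_exact_statistics()` method.
       pub is_exact: bool,
   }

   pub trait ExecutionPlan {
       /// Each node derives its output statistics from its inputs and its
       /// internal logic, computed eagerly when the plan is created.
       fn statistics(&self) -> Statistics;
   }

   /// Hypothetical example: a filter node can propagate its input's row count
   /// only as an upper bound, so it must mark its statistics as inexact.
   pub struct FilterExec {
       pub input_stats: Statistics,
   }

   impl ExecutionPlan for FilterExec {
       fn statistics(&self) -> Statistics {
           Statistics {
               num_rows: self.input_stats.num_rows, // at best an upper bound
               total_byte_size: self.input_stats.total_byte_size,
               is_exact: false, // selectivity of the predicate is unknown
           }
       }
   }
   ```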
   
   **Describe alternatives you've considered**
   While moving it to the `ExecutionPlan`, one possibility would be to make the `statistics()` computation lazy (and at the same time async). This would make it possible to skip building statistics for plans that don't require them. I feel that as the engine becomes more feature complete (in particular in terms of optimizations), all plans will need statistics, and computing them lazily would only complicate the APIs and implementations (e.g. the serialization).
   
   **Additional context**
   The need to move the CBO to the execution plan is linked to problems such as the implementation of table formats:
   - #126 
   - #133
   
   It is also related to the issue of statistics being read multiple times, raised in Ballista:
   - #871
   - #868
   
   Features for which the implementation will likely create merge conflicts:
   - #904
   - #717
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] rdettai edited a comment on issue #962: Moving cost based optimizations to physical planning

Posted by GitBox <gi...@apache.org>.
rdettai edited a comment on issue #962:
URL: https://github.com/apache/arrow-datafusion/issues/962#issuecomment-909268730


   As suggested by @Dandandan, we might also want to consider the fact that the statistics can be updated at runtime (like Spark AQE). In Ballista, an execution plan for a stage that takes a shuffle as input might be re-optimized according to the statistics of the shuffle boundary. For instance, this might change the optimal build/probe order of the tables for a join that has the shuffle boundary as one of its inputs.
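
   The build/probe decision discussed here can be sketched as a simple comparison of estimated sizes. This is a hypothetical helper, not the actual `HashBuildProbeOrder` implementation; with runtime statistics (as in Spark AQE), the same decision could be revisited once the shuffle boundary's actual sizes are known.

   ```rust
   /// Hypothetical sketch: decide whether to swap the join inputs so that the
   /// smaller side becomes the build side, in the spirit of `HashBuildProbeOrder`.
   /// Re-running this with statistics observed at a shuffle boundary is what
   /// would allow an AQE-style runtime re-optimization.
   fn should_swap_join_order(left_rows: Option<usize>, right_rows: Option<usize>) -> bool {
       match (left_rows, right_rows) {
           // Swap when the right side is estimated to be smaller than the left.
           (Some(l), Some(r)) => r < l,
           // Without statistics on both sides, keep the original order.
           _ => false,
       }
   }
   ```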





[GitHub] [arrow-datafusion] alamb closed issue #962: Moving cost based optimizations to physical planning

Posted by GitBox <gi...@apache.org>.
alamb closed issue #962:
URL: https://github.com/apache/arrow-datafusion/issues/962


   


