You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/07/20 16:25:09 UTC

[GitHub] [arrow-datafusion] alamb opened a new pull request #759: Show the result of all optimizer passes in EXPLAIN VERBOSE

alamb opened a new pull request #759:
URL: https://github.com/apache/arrow-datafusion/pull/759


   # Which issue does this PR close?
   
   <!--
   We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes #123` indicates that this PR will close issue #123.
   -->
   
   Resolves https://github.com/apache/arrow-datafusion/issues/733
   
    # Rationale for this Change
   Previously, only some logical optimizer passes (and no physical optimizer passes) were shown in `EXPLAIN VERBOSE` output. This was due to the fact that each optimizer had to special case handling for explain and so unsurprisingly some (especially newly written ones) did not.
   
   # What changes are included in this PR?
   1. Handle capturing logical optimizer output in `ExecutionContext::Optimize`
   3. Remove old "optimize_for_explain" plumbing
   3. Show plans that are no different than the previous as "SAME TEXT AS ABOVE"
   3. Capture physical optimizer output in PhysicalPlanner
   3. Clean up how StringifiedPlans are created using traits
   
   # Are there any user-facing changes?
   Yes. Explain output is different. To see the difference, do
   
   ```shell
   echo "1,2" > /tmp/foo.csv
   cargo run --bin datafusion-cli
   ```
   
   Then run
   ```sql
   CREATE EXTERNAL TABLE foo(c1 int, c2 int)
   STORED AS CSV
   LOCATION '/tmp/foo.csv';
   
   EXPLAIN VERBOSE SELECT * from foo;
   ```
   
   ## Before this change:
   Note the reason the optimizer passes appear to be duplicated in this explain is because that is what actually happens -- optimize is called once as part of `ExecutionContext::sql()` and again as part of `DataFrame_impl::collect()`). If we want to avoid the double optimization, I think we should treat it separately and do so in a follow on PR. This PR faithfully captures what DataFusion is actually doing. 
   
   ```
   +-----------------------------------------+--------------------------------------------------------------------------+
   | plan_type                               | plan                                                                     |
   +-----------------------------------------+--------------------------------------------------------------------------+
   | initial_logical_plan                    | Projection: #foo.c1, #foo.c2                                             |
   |                                         |   TableScan: foo projection=None                                         |
   | logical_plan after projection_push_down | Projection: #foo.c1, #foo.c2                                             |
   |                                         |   TableScan: foo projection=Some([0, 1])                                 |
   | logical_plan after simplify_expressions | Projection: #foo.c1, #foo.c2                                             |
   |                                         |   TableScan: foo projection=Some([0, 1])                                 |
   | logical_plan after limit_push_down      | Projection: #foo.c1, #foo.c2                                             |
   |                                         |   TableScan: foo projection=Some([0, 1])                                 |
   | logical_plan after projection_push_down | Projection: #foo.c1, #foo.c2                                             |
   |                                         |   TableScan: foo projection=Some([0, 1])                                 |
   | logical_plan after simplify_expressions | Projection: #foo.c1, #foo.c2                                             |
   |                                         |   TableScan: foo projection=Some([0, 1])                                 |
   | logical_plan after limit_push_down      | Projection: #foo.c1, #foo.c2                                             |
   |                                         |   TableScan: foo projection=Some([0, 1])                                 |
   | logical_plan                            | Projection: #foo.c1, #foo.c2                                             |
   |                                         |   TableScan: foo projection=Some([0, 1])                                 |
   | initial_physical_plan                   | ProjectionExec: expr=[c1@0 as c1, c2@1 as c2]                            |
   |                                         |   CsvExec: source=Path(/tmp/foo.csv: [/tmp/foo.csv]), has_header=false   |
   | physical_plan                           | ProjectionExec: expr=[c1@0 as c1, c2@1 as c2]                            |
   |                                         |   RepartitionExec: partitioning=RoundRobinBatch(16)                      |
   |                                         |     CsvExec: source=Path(/tmp/foo.csv: [/tmp/foo.csv]), has_header=false |
   +-----------------------------------------+--------------------------------------------------------------------------+
   ```
   
   ## After this change:
   
   ```
   > EXPLAIN VERBOSE SELECT * from foo;
   +-------------------------------------------+--------------------------------------------------------------------------+
   | plan_type                                 | plan                                                                     |
   +-------------------------------------------+--------------------------------------------------------------------------+
   | initial_logical_plan                      | Projection: #foo.c1, #foo.c2                                             |
   |                                           |   TableScan: foo projection=None                                         |
   | logical_plan after constant_folding       | SAME TEXT AS ABOVE                                                       |
   | logical_plan after eliminate_limit        | SAME TEXT AS ABOVE                                                       |
   | logical_plan after aggregate_statistics   | SAME TEXT AS ABOVE                                                       |
   | logical_plan after projection_push_down   | Projection: #foo.c1, #foo.c2                                             |
   |                                           |   TableScan: foo projection=Some([0, 1])                                 |
   | logical_plan after filter_push_down       | SAME TEXT AS ABOVE                                                       |
   | logical_plan after simplify_expressions   | SAME TEXT AS ABOVE                                                       |
   | logical_plan after hash_build_probe_order | SAME TEXT AS ABOVE                                                       |
   | logical_plan after limit_push_down        | SAME TEXT AS ABOVE                                                       |
   | logical_plan after constant_folding       | SAME TEXT AS ABOVE                                                       |
   | logical_plan after eliminate_limit        | SAME TEXT AS ABOVE                                                       |
   | logical_plan after aggregate_statistics   | SAME TEXT AS ABOVE                                                       |
   | logical_plan after projection_push_down   | SAME TEXT AS ABOVE                                                       |
   | logical_plan after filter_push_down       | SAME TEXT AS ABOVE                                                       |
   | logical_plan after simplify_expressions   | SAME TEXT AS ABOVE                                                       |
   | logical_plan after hash_build_probe_order | SAME TEXT AS ABOVE                                                       |
   | logical_plan after limit_push_down        | SAME TEXT AS ABOVE                                                       |
   | logical_plan                              | Projection: #foo.c1, #foo.c2                                             |
   |                                           |   TableScan: foo projection=Some([0, 1])                                 |
   | initial_physical_plan                     | ProjectionExec: expr=[c1@0 as c1, c2@1 as c2]                            |
   |                                           |   CsvExec: source=Path(/tmp/foo.csv: [/tmp/foo.csv]), has_header=false   |
   | physical_plan after coalesce_batches      | SAME TEXT AS ABOVE                                                       |
   | physical_plan after repartition           | ProjectionExec: expr=[c1@0 as c1, c2@1 as c2]                            |
   |                                           |   RepartitionExec: partitioning=RoundRobinBatch(16)                      |
   |                                           |     CsvExec: source=Path(/tmp/foo.csv: [/tmp/foo.csv]), has_header=false |
   | physical_plan after add_merge_exec        | SAME TEXT AS ABOVE                                                       |
   | physical_plan                             | ProjectionExec: expr=[c1@0 as c1, c2@1 as c2]                            |
   |                                           |   RepartitionExec: partitioning=RoundRobinBatch(16)                      |
   |                                           |     CsvExec: source=Path(/tmp/foo.csv: [/tmp/foo.csv]), has_header=false |
   +-------------------------------------------+--------------------------------------------------------------------------+
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on a change in pull request #759: Show the result of all optimizer passes in EXPLAIN VERBOSE

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #759:
URL: https://github.com/apache/arrow-datafusion/pull/759#discussion_r673280244



##########
File path: datafusion/src/optimizer/limit_push_down.rs
##########
@@ -43,25 +42,6 @@ fn limit_push_down(
     execution_props: &ExecutionProps,
 ) -> Result<LogicalPlan> {
     match (plan, upper_limit) {
-        (
-            LogicalPlan::Explain {
-                verbose,
-                schema,
-                plan,
-                stringified_plans,
-            },
-            _,
-        ) => {
-            let schema = schema.as_ref().to_owned().into();
-            optimize_explain(

Review comment:
       this is the old style "explain" implementation that needed to be done for each optimizer for it to correctly show explain plans (and it was missing from several)

##########
File path: datafusion/src/physical_plan/explain.rs
##########
@@ -115,9 +115,20 @@ impl ExecutionPlan for ExplainExec {
             .iter()
             .filter(|s| s.should_display(self.verbose));
 
+        // Identify plans that are not changed
+        let mut prev: Option<&StringifiedPlan> = None;
+
         for p in plans_to_print {
             type_builder.append_value(p.plan_type.to_string())?;
-            plan_builder.append_value(&*p.plan)?;
+            match prev {
+                Some(prev) if !should_show(prev, p) => {
+                    plan_builder.append_value("SAME TEXT AS ABOVE")?;

Review comment:
       Once I started dumping out all the explain plans, the mount of replication was enormous, so I also added code to avoid duplication if the optimizer pass did not make any changes




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] jorgecarleitao merged pull request #759: Show the result of all optimizer passes in EXPLAIN VERBOSE

Posted by GitBox <gi...@apache.org>.
jorgecarleitao merged pull request #759:
URL: https://github.com/apache/arrow-datafusion/pull/759


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] NGA-TRAN commented on a change in pull request #759: Show the result of all optimizer passes in EXPLAIN VERBOSE

Posted by GitBox <gi...@apache.org>.
NGA-TRAN commented on a change in pull request #759:
URL: https://github.com/apache/arrow-datafusion/pull/759#discussion_r673470819



##########
File path: datafusion/src/physical_plan/explain.rs
##########
@@ -115,9 +115,20 @@ impl ExecutionPlan for ExplainExec {
             .iter()
             .filter(|s| s.should_display(self.verbose));
 
+        // Identify plans that are not changed
+        let mut prev: Option<&StringifiedPlan> = None;
+
         for p in plans_to_print {
             type_builder.append_value(p.plan_type.to_string())?;
-            plan_builder.append_value(&*p.plan)?;
+            match prev {
+                Some(prev) if !should_show(prev, p) => {
+                    plan_builder.append_value("SAME TEXT AS ABOVE")?;

Review comment:
       +1

##########
File path: datafusion/src/execution/context.rs
##########
@@ -446,19 +447,31 @@ impl ExecutionContext {
 
     /// Optimizes the logical plan by applying optimizer rules.
     pub fn optimize(&self, plan: &LogicalPlan) -> Result<LogicalPlan> {
-        let state = &mut self.state.lock().unwrap();
-        let execution_props = &mut state.execution_props.clone();
-        let optimizers = &state.config.optimizers;
-
-        let execution_props = execution_props.start_execution();
-
-        let mut new_plan = plan.clone();
-        debug!("Logical plan:\n {:?}", plan);
-        for optimizer in optimizers {
-            new_plan = optimizer.optimize(&new_plan, execution_props)?;
+        if let LogicalPlan::Explain {
+            verbose,
+            plan,
+            stringified_plans,
+            schema,
+        } = plan
+        {
+            let mut stringified_plans = stringified_plans.clone();
+
+            // optimize the child plan, capturing the output of each optimizer
+            let plan = self.optimize_internal(plan, |optimized_plan, optimizer| {
+                let optimizer_name = optimizer.name().to_string();
+                let plan_type = PlanType::OptimizedLogicalPlan { optimizer_name };
+                stringified_plans.push(optimized_plan.to_stringified(plan_type));
+            })?;
+
+            Ok(LogicalPlan::Explain {
+                verbose: *verbose,
+                plan: Arc::new(plan),
+                stringified_plans,
+                schema: schema.clone(),
+            })
+        } else {
+            self.optimize_internal(plan, |_, _| {})

Review comment:
       Nice. Now we have the optimized plan displayed 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org