You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/08/11 18:20:12 UTC

[GitHub] [arrow-datafusion] alamb opened a new pull request #858: Add support for EXPLAIN ANALYZE

alamb opened a new pull request #858:
URL: https://github.com/apache/arrow-datafusion/pull/858


   # Which issue does this PR close?
   Resolves https://github.com/apache/arrow-datafusion/issues/779
   
    # Rationale for this change
   `EXPLAIN PLAN` is great to understand what DataFusion plans to do, but it is hard today using the SQL or dataframe interface to understand in more depth what *actually happened* during execution.
   
   My real usecase is being able to see how many rows flowed through each operator as well as the  "time spent" and "rows produced" by each operator.
   
   # What changes are included in this PR?
   1. Add basic plan nodes for `EXPLAIN ANALYZE` and EXPLAIN ANALYZE VERBOSE` (example below) sql
   2. Refactor special case `ParquetStrream` into `RecordBatchReceiverStream` for reuse
   
   
   # Are there any user-facing changes?
   Yes, `EXPLAIN ANALYZE` now does something different than `EXPLAIN`
   
   ## Example of use
   ```shell
   echo "1,A" > /tmp/foo.csv
   echo "1,B" >> /tmp/foo.csv
   echo "2,A" >> /tmp/foo.csv
   ```
   
   Run CLI
   ```shell
   cargo run --bin datafusion-cli
   ```
   
   ```sql
   CREATE EXTERNAL TABLE foo(x INT, b VARCHAR) STORED AS CSV LOCATION '/tmp/foo.csv';
   ```
   
   ## Example `EXPLAIN ANALYZE` output
   
   ```sql
   > EXPLAIN ANALYZE SELECT SUM(x) FROM foo GROUP BY b;
   +-------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------+
   | plan_type         | plan                                                                                                                                                      |
   +-------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------+
   | Plan with Metrics | CoalescePartitionsExec, metrics=[]                                                                                                                        |
   |                   |   ProjectionExec: expr=[SUM(foo.x)@1 as SUM(x)], metrics=[]                                                                                               |
   |                   |     HashAggregateExec: mode=FinalPartitioned, gby=[b@0 as b], aggr=[SUM(x)], metrics=[outputRows=2]                                                       |
   |                   |       CoalesceBatchesExec: target_batch_size=4096, metrics=[]                                                                                             |
   |                   |         RepartitionExec: partitioning=Hash([Column { name: "b", index: 0 }], 16), metrics=[sendTime=839560, fetchTime=122528525, repartitionTime=5327877] |
   |                   |           HashAggregateExec: mode=Partial, gby=[b@1 as b], aggr=[SUM(x)], metrics=[outputRows=2]                                                          |
   |                   |             RepartitionExec: partitioning=RoundRobinBatch(16), metrics=[fetchTime=5660489, repartitionTime=0, sendTime=8012]                              |
   |                   |               CsvExec: source=Path(/tmp/foo.csv: [/tmp/foo.csv]), has_header=false, metrics=[]                                                            |
   +-------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------+
   1 row in set. Query took 0.012 seconds.
   ```
   
   ## Example `EXPLAIN ANALYZE VERBOSE` output
   
   ```sql
   > EXPLAIN ANALYZE VERBOSE SELECT SUM(x) FROM foo GROUP BY b;
   +-------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------+
   | plan_type         | plan                                                                                                                                                      |
   +-------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------+
   | Plan with Metrics | CoalescePartitionsExec, metrics=[]                                                                                                                        |
   |                   |   ProjectionExec: expr=[SUM(foo.x)@1 as SUM(x)], metrics=[]                                                                                               |
   |                   |     HashAggregateExec: mode=FinalPartitioned, gby=[b@0 as b], aggr=[SUM(x)], metrics=[outputRows=2]                                                       |
   |                   |       CoalesceBatchesExec: target_batch_size=4096, metrics=[]                                                                                             |
   |                   |         RepartitionExec: partitioning=Hash([Column { name: "b", index: 0 }], 16), metrics=[repartitionTime=6584110, fetchTime=132927514, sendTime=904001] |
   |                   |           HashAggregateExec: mode=Partial, gby=[b@1 as b], aggr=[SUM(x)], metrics=[outputRows=2]                                                          |
   |                   |             RepartitionExec: partitioning=RoundRobinBatch(16), metrics=[repartitionTime=0, sendTime=8246, fetchTime=6239096]                              |
   |                   |               CsvExec: source=Path(/tmp/foo.csv: [/tmp/foo.csv]), has_header=false, metrics=[]                                                            |
   | Output Rows       | 2                                                                                                                                                         |
   | Duration          | 10.283764ms                                                                                                                                               |
   +-------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------+
   3 rows in set. Query took 0.014 seconds.
   ```
   
   # Future work:
   Note this PR is designed just to hook up / plumb the existing code and metrics we have into SQL (basically what got added in https://github.com/apache/arrow-datafusion/pull/662). I plan a sequence of follow on PRs to both improve the metrics infrastructure https://github.com/apache/arrow-datafusion/issues/679 and add/fix the metrics that are actually reported so they are consistent. The specific metrics that are displayed are verbose and somewhat ad hoc at the moment.
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] jorgecarleitao commented on pull request #858: Add support for EXPLAIN ANALYZE

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #858:
URL: https://github.com/apache/arrow-datafusion/pull/858#issuecomment-897758179


   I agree that a new variant makes sense here, for the reasons @alamb enumerated.
   
   Also, pretty awesome PR! 💯 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on a change in pull request #858: Add support for EXPLAIN ANALYZE

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #858:
URL: https://github.com/apache/arrow-datafusion/pull/858#discussion_r687069872



##########
File path: datafusion/src/logical_plan/plan.rs
##########
@@ -213,6 +213,16 @@ pub enum LogicalPlan {
         /// The output schema of the explain (2 columns of text)
         schema: DFSchemaRef,
     },
+    /// Runs the actual plan, and then prints the physical plan with
+    /// with execution metrics.
+    Analyze {

Review comment:
       NOTE: I chose a new LogicalPlan node because the implementation for ANALYZE is so different than EXPLAIN. However, it would be possible to re-use the same LogicalPlan node if people prefer




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] NGA-TRAN commented on a change in pull request #858: Add support for EXPLAIN ANALYZE

Posted by GitBox <gi...@apache.org>.
NGA-TRAN commented on a change in pull request #858:
URL: https://github.com/apache/arrow-datafusion/pull/858#discussion_r687809683



##########
File path: ballista/rust/core/src/serde/logical_plan/mod.rs
##########
@@ -661,6 +661,43 @@ mod roundtrip_tests {
         Ok(())
     }
 
+    #[test]
+    fn roundtrip_analyze() -> Result<()> {
+        let schema = Schema::new(vec![
+            Field::new("id", DataType::Int32, false),
+            Field::new("first_name", DataType::Utf8, false),
+            Field::new("last_name", DataType::Utf8, false),
+            Field::new("state", DataType::Utf8, false),
+            Field::new("salary", DataType::Int32, false),
+        ]);
+
+        let verbose_plan = LogicalPlanBuilder::scan_csv(
+            "employee.csv",
+            CsvReadOptions::new().schema(&schema).has_header(true),
+            Some(vec![3, 4]),
+        )
+        .and_then(|plan| plan.sort(vec![col("salary")]))
+        .and_then(|plan| plan.explain(true, true))
+        .and_then(|plan| plan.build())
+        .map_err(BallistaError::DataFusionError)?;
+
+        let plan = LogicalPlanBuilder::scan_csv(
+            "employee.csv",
+            CsvReadOptions::new().schema(&schema).has_header(true),
+            Some(vec![3, 4]),
+        )
+        .and_then(|plan| plan.sort(vec![col("salary")]))
+        .and_then(|plan| plan.explain(false, true))
+        .and_then(|plan| plan.build())
+        .map_err(BallistaError::DataFusionError)?;
+
+        roundtrip_test!(plan);

Review comment:
       When we talk I need to learn the the effectiveness of this round trip test thats convert a logical/physical plan into photo and back. The tests look simple and easy to understand this way.

##########
File path: datafusion/src/logical_plan/plan.rs
##########
@@ -213,6 +213,16 @@ pub enum LogicalPlan {
         /// The output schema of the explain (2 columns of text)
         schema: DFSchemaRef,
     },
+    /// Runs the actual plan, and then prints the physical plan with
+    /// with execution metrics.
+    Analyze {

Review comment:
       I am about to ask when I saw the above code that has analyze as different function. I am worry about future inconsistency and headache of keeping them consistent, as well as redundant work when we change or improve something. I would prefer to keep them in the same LogicalPlan




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on a change in pull request #858: Add support for EXPLAIN ANALYZE

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #858:
URL: https://github.com/apache/arrow-datafusion/pull/858#discussion_r687874429



##########
File path: ballista/rust/core/src/serde/logical_plan/mod.rs
##########
@@ -661,6 +661,43 @@ mod roundtrip_tests {
         Ok(())
     }
 
+    #[test]
+    fn roundtrip_analyze() -> Result<()> {
+        let schema = Schema::new(vec![
+            Field::new("id", DataType::Int32, false),
+            Field::new("first_name", DataType::Utf8, false),
+            Field::new("last_name", DataType::Utf8, false),
+            Field::new("state", DataType::Utf8, false),
+            Field::new("salary", DataType::Int32, false),
+        ]);
+
+        let verbose_plan = LogicalPlanBuilder::scan_csv(
+            "employee.csv",
+            CsvReadOptions::new().schema(&schema).has_header(true),
+            Some(vec![3, 4]),
+        )
+        .and_then(|plan| plan.sort(vec![col("salary")]))
+        .and_then(|plan| plan.explain(true, true))
+        .and_then(|plan| plan.build())
+        .map_err(BallistaError::DataFusionError)?;
+
+        let plan = LogicalPlanBuilder::scan_csv(
+            "employee.csv",
+            CsvReadOptions::new().schema(&schema).has_header(true),
+            Some(vec![3, 4]),
+        )
+        .and_then(|plan| plan.sort(vec![col("salary")]))
+        .and_then(|plan| plan.explain(false, true))
+        .and_then(|plan| plan.build())
+        .map_err(BallistaError::DataFusionError)?;
+
+        roundtrip_test!(plan);

Review comment:
       To be honest I simply copy/pasted the test for `roundtrip_explain` below -- I agree the pattern is quite nice




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on pull request #858: Add support for EXPLAIN ANALYZE

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #858:
URL: https://github.com/apache/arrow-datafusion/pull/858#issuecomment-897756517


   > I am about to ask when I saw the above code that has analyze as different function. I am worry about future inconsistency and headache of keeping them consistent, as well as redundant work when we change or improve something. I would prefer to keep them in the same LogicalPlan
   
   @NGA-TRAN  I also went back and forth on this point. The existing `LogicalPlan::Explain` is special cased several times during planning (so it can capture the results of intermediate passes as strings), and since those intermediate strings aren't used by Analyze we would then have to do an extra check in each special case
   
   And thus even though there is definitely some redundancy, I eventually concluded that a new `LogicalPlan` type made things most clear. 
   
   The physical plan (`ExecutionPlan`) for Analyze is also very different but it would be feasible to use the different physical plans for the same logical plan. 
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb merged pull request #858: Add support for EXPLAIN ANALYZE

Posted by GitBox <gi...@apache.org>.
alamb merged pull request #858:
URL: https://github.com/apache/arrow-datafusion/pull/858


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org