Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/08/20 17:14:29 UTC

[GitHub] [arrow-datafusion] alamb opened a new pull request #909: Add BaselineMetrics, Timestamp metrics, add for CoalescePartitionsExec

alamb opened a new pull request #909:
URL: https://github.com/apache/arrow-datafusion/pull/909


   # Which issue does this PR close?
   
   Built on https://github.com/apache/arrow-datafusion/pull/908, so that PR should be reviewed first
   
   Part of  https://github.com/apache/arrow-datafusion/issues/866
   
   # Rationale for this change
   I would like to get an overall understanding of where time is being spent during query execution via `EXPLAIN ANALYZE` (see https://github.com/apache/arrow-datafusion/pull/858) so that I know where to focus additional performance optimization work
   
   Additionally, I would like to be able to graph a stacked flamechart such as the following (see more details on https://github.com/influxdata/influxdb_iox/issues/2273) that shows when the different operators ran in relation to each other.
   
   <img width="689" alt="Screen Shot 2021-08-12 at 11 14 33 AM" src="https://user-images.githubusercontent.com/490673/129237447-834838c8-aa97-42c4-b905-6114d28ca98b.png">
   
   
   # What changes are included in this PR?
   Begin adding the following data for each operator, where it makes sense:
   1. output_rows: total rows produced at the output of the operator
   2. cpu_nanos: the total time spent (not including any time spent in the input stream or waiting to be scheduled)
   3. start_time: the wall clock time at which `execute` was run
   4. stop_time: the wall clock time at which the last output record batch was produced
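   
   As a rough illustration of the four measurements above (not the code in this PR), the sketch below uses only the standard library. `OperatorStats`, `record_compute`, and `record_batch` are hypothetical names, and the "cpu" figure is approximated by elapsed wall-clock time around the operator's own work, which is an assumption rather than a true CPU-time measurement.
   
   ```rust
   use std::time::{Instant, SystemTime};
   
   /// Hypothetical per-operator record of the four measurements listed above.
   #[derive(Debug)]
   struct OperatorStats {
       /// Total rows produced at the output of the operator.
       output_rows: usize,
       /// Time spent in the operator's own work, in nanoseconds.
       cpu_nanos: u128,
       /// Wall clock time at which execution started.
       start_time: SystemTime,
       /// Wall clock time at which the most recent batch was produced.
       stop_time: Option<SystemTime>,
   }
   
   impl OperatorStats {
       /// Call when the operator's `execute` begins.
       fn start() -> Self {
           OperatorStats {
               output_rows: 0,
               cpu_nanos: 0,
               start_time: SystemTime::now(),
               stop_time: None,
           }
       }
   
       /// Times only the closure, so time spent waiting on the input
       /// stream (outside the closure) is not counted.
       fn record_compute<T>(&mut self, work: impl FnOnce() -> T) -> T {
           let began = Instant::now();
           let out = work();
           self.cpu_nanos += began.elapsed().as_nanos();
           out
       }
   
       /// Call for each output batch; the last call fixes `stop_time`.
       fn record_batch(&mut self, num_rows: usize) {
           self.output_rows += num_rows;
           self.stop_time = Some(SystemTime::now());
       }
   }
   ```
   
   An operator would call `start()` once, wrap its own processing in `record_compute`, and call `record_batch` each time it emits a batch.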
   
   # Changes
   1. Adds a `BaselineMetrics` structure that holds the common metrics, to assist in annotating operators
   2. Adds a `Timestamp` metric type and `StartTimestamp` and `EndTimestamp` values and associated aggregating code
   3. Print out metrics in deterministic order
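   
   To make items 2 and 3 more concrete, here is a minimal sketch under stated assumptions. It assumes that start timestamps from several partitions aggregate to the earliest value and end timestamps to the latest, and that deterministic output is obtained by sorting metrics by name. The function names are hypothetical and this is not the aggregation code in the PR.
   
   ```rust
   use std::collections::HashMap;
   use std::time::SystemTime;
   
   /// Assumed rule: the aggregate start is the earliest start of any partition.
   fn aggregate_start(starts: &[SystemTime]) -> Option<SystemTime> {
       starts.iter().copied().min()
   }
   
   /// Assumed rule: the aggregate end is the latest end of any partition.
   fn aggregate_end(ends: &[SystemTime]) -> Option<SystemTime> {
       ends.iter().copied().max()
   }
   
   /// Renders metrics sorted by name, so the text does not depend on
   /// the iteration order of the hash map.
   fn render_sorted(metrics: &HashMap<String, u64>) -> String {
       let mut entries: Vec<_> = metrics.iter().collect();
       entries.sort_by(|a, b| a.0.cmp(b.0));
       let parts: Vec<String> = entries
           .iter()
           .map(|(name, value)| format!("{}={}", name, value))
           .collect();
       format!("metrics=[{}]", parts.join(", "))
   }
   ```
   
   Sorting at display time keeps the recording path cheap, and only the rendering of `EXPLAIN ANALYZE` pays for the ordering.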
   
   # Are there any user-facing changes?
   Better `EXPLAIN ANALYZE` output.
   
   Using this setup
   
   ```shell
   echo "1,A" > /tmp/foo.csv
   echo "1,B" >> /tmp/foo.csv
   echo "2,A" >> /tmp/foo.csv
   ```
   
   Run CLI
   ```shell
   cargo run --bin datafusion-cli
   ```
   
   And then run this SQL:
   ```SQL
   CREATE EXTERNAL TABLE foo(x INT, b VARCHAR) STORED AS CSV LOCATION '/tmp/foo.csv';
   
   SELECT SUM(x) FROM foo GROUP BY b;
   
   EXPLAIN ANALYZE SELECT SUM(x) FROM foo GROUP BY b;
   
   ```
   
   ## Before this PR
   ```
   > EXPLAIN ANALYZE SELECT SUM(x) FROM foo GROUP BY b;
   +-------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------+
   | plan_type         | plan                                                                                                                                                      |
   +-------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------+
   | Plan with Metrics | CoalescePartitionsExec, metrics=[]                                                                                                                        |
   |                   |   ProjectionExec: expr=[SUM(foo.x)@1 as SUM(x)], metrics=[]                                                                                               |
   |                   |     HashAggregateExec: mode=FinalPartitioned, gby=[b@0 as b], aggr=[SUM(x)], metrics=[outputRows=2]                                                       |
   |                   |       CoalesceBatchesExec: target_batch_size=4096, metrics=[]                                                                                             |
   |                   |         RepartitionExec: partitioning=Hash([Column { name: "b", index: 0 }], 16), metrics=[sendTime=968487, repartitionTime=5686072, fetchTime=110114033] |
   |                   |           HashAggregateExec: mode=Partial, gby=[b@1 as b], aggr=[SUM(x)], metrics=[outputRows=2]                                                          |
   |                   |             RepartitionExec: partitioning=RoundRobinBatch(16), metrics=[sendTime=12090, fetchTime=5106669, repartitionTime=0]                             |
   |                   |               CsvExec: source=Path(/tmp/foo.csv: [/tmp/foo.csv]), has_header=false, metrics=[]                                                            |
   +-------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------+
   ```
   ## After this PR
   (note the presence of `output_rows`,  `cpu_time`, `start_timestamp`, and `end_timestamp`)
   
   ```
   ------------------------+
   | plan_type         | plan                                                                                                                                                                                                                      |
   +-------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
   | Plan with Metrics | CoalescePartitionsExec, metrics=[output_rows=2, cpu_time=NOT RECORDED, start_timestamp=2021-08-20 17:10:51.810651 UTC, end_timestamp=2021-08-20 17:10:51.821645 UTC]                                                      |
   |                   |   ProjectionExec: expr=[SUM(foo.x)@1 as SUM(x)], metrics=[]                                                                                                                                                               |
   |                   |     HashAggregateExec: mode=FinalPartitioned, gby=[b@0 as b], aggr=[SUM(x)], metrics=[output_rows=2]                                                                                                                      |
   |                   |       CoalesceBatchesExec: target_batch_size=4096, metrics=[]                                                                                                                                                             |
   |                   |         RepartitionExec: partitioning=Hash([Column { name: "b", index: 0 }], 16), metrics=[fetch_time{inputPartition=2}=138.444053ms, repart_time{inputPartition=2}=7.445086ms, send_time{inputPartition=2}=NOT RECORDED] |
   |                   |           HashAggregateExec: mode=Partial, gby=[b@1 as b], aggr=[SUM(x)], metrics=[output_rows=2]                                                                                                                         |
   |                   |             RepartitionExec: partitioning=RoundRobinBatch(16), metrics=[send_time{inputPartition=0}=9.378µs, fetch_time{inputPartition=0}=6.57685ms, repart_time{inputPartition=0}=NOT RECORDED]                          |
   |                   |               CsvExec: source=Path(/tmp/foo.csv: [/tmp/foo.csv]), has_header=false, metrics=[]                                                                                                                            |
   +-------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
   1 row in set. Query took 0.019 seconds.
   ```
   
   # Follow on work:
   1. Annotate additional operators
   2. Add an additional wrapper as suggested by @houqp in https://github.com/apache/arrow-datafusion/issues/866#issuecomment-898202701 to help avoid special code for each operator. The `BaselineMetrics` structure is a start in this direction but there is still more to do
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on pull request #909: Add BaselineMetrics, Timestamp metrics, add for `CoalescePartitionsExec`, rename output_time -> elapsed_compute

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #909:
URL: https://github.com/apache/arrow-datafusion/pull/909#issuecomment-906360563


   Thanks all for the input





[GitHub] [arrow-datafusion] alamb merged pull request #909: Add BaselineMetrics, Timestamp metrics, add for `CoalescePartitionsExec`, rename output_time -> elapsed_compute

Posted by GitBox <gi...@apache.org>.
alamb merged pull request #909:
URL: https://github.com/apache/arrow-datafusion/pull/909


   





[GitHub] [arrow-datafusion] alamb commented on a change in pull request #909: Add BaselineMetrics, Timestamp metrics, add for `CoalescePartitionsExec`, rename output_time -> elapsed_compute

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #909:
URL: https://github.com/apache/arrow-datafusion/pull/909#discussion_r696035860



##########
File path: datafusion/src/physical_plan/metrics/baseline.rs
##########
@@ -0,0 +1,183 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Metrics common for almost all operators
+
+use std::task::Poll;
+
+use arrow::{error::ArrowError, record_batch::RecordBatch};
+
+use super::{Count, ExecutionPlanMetricsSet, MetricBuilder, Time, Timestamp};
+
+/// Helper for creating and tracking common "baseline" metrics for

Review comment:
       This is the core new structure / wrapper as suggested by @houqp  -- it doesn't quite wrap everything yet but I think it makes annotating basic information pretty simple







[GitHub] [arrow-datafusion] alamb commented on a change in pull request #909: Add BaselineMetrics, Timestamp metrics, add for `CoalescePartitionsExec`, rename output_time -> elapsed_compute

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #909:
URL: https://github.com/apache/arrow-datafusion/pull/909#discussion_r695031745



##########
File path: datafusion/src/physical_plan/coalesce_partitions.rs
##########
@@ -43,12 +45,17 @@ use pin_project_lite::pin_project;
 pub struct CoalescePartitionsExec {
     /// Input execution plan
     input: Arc<dyn ExecutionPlan>,
+    /// Execution metrics

Review comment:
       This is the pattern that instrumented operators now follow:
   1. They have a new `metrics: ExecutionPlanMetricsSet` field
   2. During `execute()` they create a new `BaselineMetrics`
   3. They add an `elapsed_compute` timer around their CPU-intensive work
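   
   The three steps can be sketched with standard-library stand-ins. `MetricsSet`, `Baseline`, `DoubleExec`, and their methods are hypothetical placeholders for illustration only; they are not the `ExecutionPlanMetricsSet` or `BaselineMetrics` API added in this PR.
   
   ```rust
   use std::sync::atomic::{AtomicU64, AtomicUsize, Ordering};
   use std::sync::Arc;
   use std::time::Instant;
   
   /// Stand-in for the shared metrics set owned by an operator.
   #[derive(Default)]
   struct MetricsSet {
       output_rows: AtomicUsize,
       elapsed_compute_nanos: AtomicU64,
   }
   
   /// Stand-in for the per-execution baseline metrics handle.
   struct Baseline {
       set: Arc<MetricsSet>,
   }
   
   impl Baseline {
       fn new(set: &Arc<MetricsSet>) -> Self {
           Baseline { set: Arc::clone(set) }
       }
   
       /// Adds the time spent inside `work` to the compute timer.
       fn time_compute<T>(&self, work: impl FnOnce() -> T) -> T {
           let began = Instant::now();
           let out = work();
           let nanos = began.elapsed().as_nanos() as u64;
           self.set.elapsed_compute_nanos.fetch_add(nanos, Ordering::Relaxed);
           out
       }
   
       fn record_output(&self, rows: usize) {
           self.set.output_rows.fetch_add(rows, Ordering::Relaxed);
       }
   }
   
   /// Toy operator that doubles every value of one input partition.
   struct DoubleExec {
       input: Vec<Vec<i64>>,
       /// Step 1: the operator carries a metrics field.
       metrics: Arc<MetricsSet>,
   }
   
   impl DoubleExec {
       fn execute(&self, partition: usize) -> Vec<i64> {
           // Step 2: create the baseline metrics for this execution.
           let baseline = Baseline::new(&self.metrics);
           // Step 3: time only the CPU-intensive part.
           let out = baseline.time_compute(|| {
               self.input[partition].iter().map(|v| v * 2).collect::<Vec<_>>()
           });
           baseline.record_output(out.len());
           out
       }
   }
   ```
   
   Because the set is shared through an `Arc`, each partition's `execute` call adds into the same counters, which is one simple way to get per-operator totals.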







[GitHub] [arrow-datafusion] alamb commented on pull request #909: Add BaselineMetrics, Timestamp metrics, add for `CoalescePartitionsExec`, rename output_time -> elapsed_compute

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #909:
URL: https://github.com/apache/arrow-datafusion/pull/909#issuecomment-905799662


   I have a few PRs backed up on this one (e.g. https://github.com/apache/arrow-datafusion/pull/938) so if someone has time to review this PR in the near term I would be most appreciative 🙏 





[GitHub] [arrow-datafusion] alamb commented on pull request #909: Add BaselineMetrics, Timestamp metrics, add for `CoalescePartitionsExec`, rename output_time -> elapsed_compute

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #909:
URL: https://github.com/apache/arrow-datafusion/pull/909#issuecomment-905834242


   I'll give it until tomorrow in case anyone else wants to chime in or say they are interested in reviewing it. Thanks @andygrove





[GitHub] [arrow-datafusion] alamb commented on a change in pull request #909: Add BaselineMetrics, Timestamp metrics, add for `CoalescePartitionsExec`

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #909:
URL: https://github.com/apache/arrow-datafusion/pull/909#discussion_r695031745



##########
File path: datafusion/src/physical_plan/coalesce_partitions.rs
##########
@@ -43,12 +45,17 @@ use pin_project_lite::pin_project;
 pub struct CoalescePartitionsExec {
     /// Input execution plan
     input: Arc<dyn ExecutionPlan>,
+    /// Execution metrics

Review comment:
       This is the pattern that instrumented operators now follow:
   1. They have a new `metrics: ExecutionPlanMetricsSet` field
   2. During `execute()` they create a new `BaselineMetrics`
   3. They add an `elapsed_time` timer around their CPU-intensive work







[GitHub] [arrow-datafusion] alamb commented on pull request #909: Add BaselineMetrics, Timestamp metrics, add for `CoalescePartitionsExec`, rename output_time -> elapsed_compute

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #909:
URL: https://github.com/apache/arrow-datafusion/pull/909#issuecomment-905826892


   cc @returnString 

