You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/12/28 20:47:58 UTC

[GitHub] [arrow] andygrove opened a new pull request #9035: ARROW-11052: [Rust] [DataFusion] Implement metrics for HashJoinExec

andygrove opened a new pull request #9035:
URL: https://github.com/apache/arrow/pull/9035


   Implement metrics for `HashJoinExec` to help debug performance issues.
   
   Example output:
   
   ```
   [2020-12-28T20:46:28Z DEBUG datafusion::physical_plan::hash_join] Built build-side of hash join containing 3115341 rows in 9926 ms
   [2020-12-28T20:46:36Z DEBUG datafusion::physical_plan::hash_join] Processed 144 stream-side input batches containing 18750000 rows and produced 144 output batches containing 388922 rows in 7994 ms
   [2020-12-28T20:46:36Z DEBUG datafusion::physical_plan::hash_join] Processed 144 stream-side input batches containing 18750000 rows and produced 144 output batches containing 389294 rows in 7986 ms
   [2020-12-28T20:46:37Z DEBUG datafusion::physical_plan::hash_join] Processed 144 stream-side input batches containing 18750000 rows and produced 144 output batches containing 389558 rows in 8326 ms
   [2020-12-28T20:46:37Z DEBUG datafusion::physical_plan::hash_join] Processed 144 stream-side input batches containing 18750000 rows and produced 144 output batches containing 388841 rows in 8499 ms
   [2020-12-28T20:46:37Z DEBUG datafusion::physical_plan::hash_join] Processed 144 stream-side input batches containing 18750000 rows and produced 144 output batches containing 390056 rows in 8643 ms
   [2020-12-28T20:46:37Z DEBUG datafusion::physical_plan::hash_join] Processed 144 stream-side input batches containing 18749999 rows and produced 144 output batches containing 389143 rows in 8711 ms
   [2020-12-28T20:46:37Z DEBUG datafusion::physical_plan::hash_join] Processed 144 stream-side input batches containing 18750000 rows and produced 144 output batches containing 389605 rows in 8712 ms
   [2020-12-28T20:46:37Z DEBUG datafusion::physical_plan::hash_join] Processed 144 stream-side input batches containing 18750000 rows and produced 144 output batches containing 389922 rows in 8877 ms
   ```


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on pull request #9035: ARROW-11052: [Rust] [DataFusion] Implement metrics for HashJoinExec

Posted by GitBox <gi...@apache.org>.
Dandandan commented on pull request #9035:
URL: https://github.com/apache/arrow/pull/9035#issuecomment-753451276


   I agree, feel free to address the remaining feedback or to merge it @andygrove👍


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on pull request #9035: ARROW-11052: [Rust] [DataFusion] Implement metrics for HashJoinExec

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #9035:
URL: https://github.com/apache/arrow/pull/9035#issuecomment-752956502


   The full set of Rust CI tests did not run on this PR :(
   
   Can you please rebase this PR against [apache/master](https://github.com/apache/arrow) to pick up the changes in https://github.com/apache/arrow/pull/9056 so that they do? 
   
   I apologize for the inconvenience. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] andygrove closed pull request #9035: ARROW-11052: [Rust] [DataFusion] Implement metrics for HashJoinExec

Posted by GitBox <gi...@apache.org>.
andygrove closed pull request #9035:
URL: https://github.com/apache/arrow/pull/9035


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] codecov-io commented on pull request #9035: ARROW-11052: [Rust] [DataFusion] Implement metrics for HashJoinExec

Posted by GitBox <gi...@apache.org>.
codecov-io commented on pull request #9035:
URL: https://github.com/apache/arrow/pull/9035#issuecomment-753515593


   # [Codecov](https://codecov.io/gh/apache/arrow/pull/9035?src=pr&el=h1) Report
   > Merging [#9035](https://codecov.io/gh/apache/arrow/pull/9035?src=pr&el=desc) (4b5154a) into [master](https://codecov.io/gh/apache/arrow/commit/4b7cdcb9220b6d94b251aef32c21ef9b4097ecfa?el=desc) (4b7cdcb) will **decrease** coverage by `0.02%`.
   > The diff coverage is `52.63%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/9035/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9035?src=pr&el=tree)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #9035      +/-   ##
   ==========================================
   - Coverage   82.61%   82.58%   -0.03%     
   ==========================================
     Files         203      203              
     Lines       50140    50171      +31     
   ==========================================
   + Hits        41422    41436      +14     
   - Misses       8718     8735      +17     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow/pull/9035?src=pr&el=tree) | Coverage Δ | |
   |---|---|---|
   | [rust/datafusion/src/physical\_plan/hash\_join.rs](https://codecov.io/gh/apache/arrow/pull/9035/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2hhc2hfam9pbi5ycw==) | `85.86% <52.63%> (-3.67%)` | :arrow_down: |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/9035?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/9035?src=pr&el=footer). Last update [4b7cdcb...4b5154a](https://codecov.io/gh/apache/arrow/pull/9035?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] codecov-io edited a comment on pull request #9035: ARROW-11052: [Rust] [DataFusion] Implement metrics for HashJoinExec

Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #9035:
URL: https://github.com/apache/arrow/pull/9035#issuecomment-753515593


   # [Codecov](https://codecov.io/gh/apache/arrow/pull/9035?src=pr&el=h1) Report
   > Merging [#9035](https://codecov.io/gh/apache/arrow/pull/9035?src=pr&el=desc) (36b0ebd) into [master](https://codecov.io/gh/apache/arrow/commit/4b7cdcb9220b6d94b251aef32c21ef9b4097ecfa?el=desc) (4b7cdcb) will **decrease** coverage by `0.02%`.
   > The diff coverage is `55.55%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/9035/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9035?src=pr&el=tree)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #9035      +/-   ##
   ==========================================
   - Coverage   82.61%   82.59%   -0.03%     
   ==========================================
     Files         203      204       +1     
     Lines       50140    50169      +29     
   ==========================================
   + Hits        41422    41435      +13     
   - Misses       8718     8734      +16     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow/pull/9035?src=pr&el=tree) | Coverage Δ | |
   |---|---|---|
   | [rust/datafusion/src/physical\_plan/hash\_join.rs](https://codecov.io/gh/apache/arrow/pull/9035/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2hhc2hfam9pbi5ycw==) | `86.32% <55.55%> (-3.21%)` | :arrow_down: |
   | [rust/arrow/src/csv/writer.rs](https://codecov.io/gh/apache/arrow/pull/9035/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvY3N2L3dyaXRlci5ycw==) | `78.82% <0.00%> (-0.56%)` | :arrow_down: |
   | [rust/parquet/src/encodings/encoding.rs](https://codecov.io/gh/apache/arrow/pull/9035/diff?src=pr&el=tree#diff-cnVzdC9wYXJxdWV0L3NyYy9lbmNvZGluZ3MvZW5jb2RpbmcucnM=) | `95.24% <0.00%> (-0.20%)` | :arrow_down: |
   | [rust/arrow/src/util/serialization.rs](https://codecov.io/gh/apache/arrow/pull/9035/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvdXRpbC9zZXJpYWxpemF0aW9uLnJz) | `100.00% <0.00%> (ø)` | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/9035?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/9035?src=pr&el=footer). Last update [4b7cdcb...36b0ebd](https://codecov.io/gh/apache/arrow/pull/9035?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #9035: ARROW-11052: [Rust] [DataFusion] Implement metrics for HashJoinExec

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #9035:
URL: https://github.com/apache/arrow/pull/9035#issuecomment-751861449


   https://issues.apache.org/jira/browse/ARROW-11052


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] andygrove commented on pull request #9035: ARROW-11052: [Rust] [DataFusion] Implement metrics for HashJoinExec

Posted by GitBox <gi...@apache.org>.
andygrove commented on pull request #9035:
URL: https://github.com/apache/arrow/pull/9035#issuecomment-751859553


   @jorgecarleitao fyi


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] andygrove commented on pull request #9035: ARROW-11052: [Rust] [DataFusion] Implement metrics for HashJoinExec

Posted by GitBox <gi...@apache.org>.
andygrove commented on pull request #9035:
URL: https://github.com/apache/arrow/pull/9035#issuecomment-752691811


   I will rebase this and address feedback once https://github.com/apache/arrow/pull/9048 is merged


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on a change in pull request #9035: ARROW-11052: [Rust] [DataFusion] Implement metrics for HashJoinExec

Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #9035:
URL: https://github.com/apache/arrow/pull/9035#discussion_r549486412



##########
File path: rust/datafusion/src/physical_plan/hash_join.rs
##########
@@ -531,14 +560,39 @@ impl Stream for HashJoinStream {
         self.right
             .poll_next_unpin(cx)
             .map(|maybe_batch| match maybe_batch {
-                Some(Ok(batch)) => Some(build_batch(
-                    &batch,
-                    &self.left_data,
-                    &self.on_right,
-                    &self.join_type,
-                    &self.schema,
-                )),
-                other => other,
+                Some(Ok(batch)) => {
+                    let start = Instant::now();
+                    let result = build_batch(
+                        &batch,
+                        &self.left_data,
+                        &self.on_right,
+                        &self.join_type,
+                        &self.schema,
+                    );
+                    self.num_input_batches += 1;
+                    self.num_input_rows += batch.num_rows();
+                    match result {
+                        Ok(ref batch) => {
+                            self.join_time += start.elapsed().as_millis() as usize;
+                            self.num_output_batches += 1;
+                            self.num_output_rows += batch.num_rows();
+                        }
+                        _ => {}
+                    }
+                    Some(result)
+                }
+                other => {
+                    debug!(
+                        "Processed {} probe-side input batches containing {} rows and \

Review comment:
       Maybe we can make this a bit less verbose? `Processed  {} rows probe-side and {} output rows in {} ms` or something like that? 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on pull request #9035: ARROW-11052: [Rust] [DataFusion] Implement metrics for HashJoinExec

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #9035:
URL: https://github.com/apache/arrow/pull/9035#issuecomment-753449360


   @Dandandan , Thanks a lot for the review. I am approving because even though I consider most of your comments relevant, I do not consider them blocking the merge.
   
   For example, summing the number of rows of all batches is `O(number_of_batches)` and negligible when compared to the time it takes to execute the join, thus being too minor to block this.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on a change in pull request #9035: ARROW-11052: [Rust] [DataFusion] Implement metrics for HashJoinExec

Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #9035:
URL: https://github.com/apache/arrow/pull/9035#discussion_r549495107



##########
File path: rust/datafusion/src/physical_plan/hash_join.rs
##########
@@ -531,14 +560,39 @@ impl Stream for HashJoinStream {
         self.right
             .poll_next_unpin(cx)
             .map(|maybe_batch| match maybe_batch {
-                Some(Ok(batch)) => Some(build_batch(
-                    &batch,
-                    &self.left_data,
-                    &self.on_right,
-                    &self.join_type,
-                    &self.schema,
-                )),
-                other => other,
+                Some(Ok(batch)) => {
+                    let start = Instant::now();
+                    let result = build_batch(
+                        &batch,
+                        &self.left_data,
+                        &self.on_right,
+                        &self.join_type,
+                        &self.schema,
+                    );
+                    self.num_input_batches += 1;
+                    self.num_input_rows += batch.num_rows();
+                    match result {
+                        Ok(ref batch) => {
+                            self.join_time += start.elapsed().as_millis() as usize;

Review comment:
       We could avoid rounding errors by using `start.elapsed().as_secs_f64()` store `join_time` var as `f64` and format with 0 digits afterwards.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on a change in pull request #9035: ARROW-11052: [Rust] [DataFusion] Implement metrics for HashJoinExec

Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #9035:
URL: https://github.com/apache/arrow/pull/9035#discussion_r549490463



##########
File path: rust/datafusion/src/physical_plan/hash_join.rs
##########
@@ -186,8 +190,18 @@ impl ExecutionPlan for HashJoinExec {
                         })
                         .await?;
 
+                    let num_rows: usize =
+                        left_data.1.iter().map(|batch| batch.num_rows()).sum();

Review comment:
       Maybe we want to put this behind a flag so we don't do this in the macro so we don't compute the `num_rows` when not being logged?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org