You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/12/28 20:47:58 UTC
[GitHub] [arrow] andygrove opened a new pull request #9035: ARROW-11052: [Rust] [DataFusion] Implement metrics for HashJoinExec
andygrove opened a new pull request #9035:
URL: https://github.com/apache/arrow/pull/9035
Implement metrics for `HashJoinExec` to help debug performance issues.
Example output:
```
[2020-12-28T20:46:28Z DEBUG datafusion::physical_plan::hash_join] Built build-side of hash join containing 3115341 rows in 9926 ms
[2020-12-28T20:46:36Z DEBUG datafusion::physical_plan::hash_join] Processed 144 stream-side input batches containing 18750000 rows and produced 144 output batches containing 388922 rows in 7994 ms
[2020-12-28T20:46:36Z DEBUG datafusion::physical_plan::hash_join] Processed 144 stream-side input batches containing 18750000 rows and produced 144 output batches containing 389294 rows in 7986 ms
[2020-12-28T20:46:37Z DEBUG datafusion::physical_plan::hash_join] Processed 144 stream-side input batches containing 18750000 rows and produced 144 output batches containing 389558 rows in 8326 ms
[2020-12-28T20:46:37Z DEBUG datafusion::physical_plan::hash_join] Processed 144 stream-side input batches containing 18750000 rows and produced 144 output batches containing 388841 rows in 8499 ms
[2020-12-28T20:46:37Z DEBUG datafusion::physical_plan::hash_join] Processed 144 stream-side input batches containing 18750000 rows and produced 144 output batches containing 390056 rows in 8643 ms
[2020-12-28T20:46:37Z DEBUG datafusion::physical_plan::hash_join] Processed 144 stream-side input batches containing 18749999 rows and produced 144 output batches containing 389143 rows in 8711 ms
[2020-12-28T20:46:37Z DEBUG datafusion::physical_plan::hash_join] Processed 144 stream-side input batches containing 18750000 rows and produced 144 output batches containing 389605 rows in 8712 ms
[2020-12-28T20:46:37Z DEBUG datafusion::physical_plan::hash_join] Processed 144 stream-side input batches containing 18750000 rows and produced 144 output batches containing 389922 rows in 8877 ms
```
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] Dandandan commented on pull request #9035: ARROW-11052: [Rust] [DataFusion] Implement metrics for HashJoinExec
Posted by GitBox <gi...@apache.org>.
Dandandan commented on pull request #9035:
URL: https://github.com/apache/arrow/pull/9035#issuecomment-753451276
I agree, feel free to address the remaining feedback or to merge it @andygrove👍
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] alamb commented on pull request #9035: ARROW-11052: [Rust] [DataFusion] Implement metrics for HashJoinExec
Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #9035:
URL: https://github.com/apache/arrow/pull/9035#issuecomment-752956502
The full set of Rust CI tests did not run on this PR :(
Can you please rebase this PR against [apache/master](https://github.com/apache/arrow) to pick up the changes in https://github.com/apache/arrow/pull/9056 so that they do?
I apologize for the inconvenience.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] andygrove closed pull request #9035: ARROW-11052: [Rust] [DataFusion] Implement metrics for HashJoinExec
Posted by GitBox <gi...@apache.org>.
andygrove closed pull request #9035:
URL: https://github.com/apache/arrow/pull/9035
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] codecov-io commented on pull request #9035: ARROW-11052: [Rust] [DataFusion] Implement metrics for HashJoinExec
Posted by GitBox <gi...@apache.org>.
codecov-io commented on pull request #9035:
URL: https://github.com/apache/arrow/pull/9035#issuecomment-753515593
# [Codecov](https://codecov.io/gh/apache/arrow/pull/9035?src=pr&el=h1) Report
> Merging [#9035](https://codecov.io/gh/apache/arrow/pull/9035?src=pr&el=desc) (4b5154a) into [master](https://codecov.io/gh/apache/arrow/commit/4b7cdcb9220b6d94b251aef32c21ef9b4097ecfa?el=desc) (4b7cdcb) will **decrease** coverage by `0.02%`.
> The diff coverage is `52.63%`.
[![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/9035/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9035?src=pr&el=tree)
```diff
@@ Coverage Diff @@
## master #9035 +/- ##
==========================================
- Coverage 82.61% 82.58% -0.03%
==========================================
Files 203 203
Lines 50140 50171 +31
==========================================
+ Hits 41422 41436 +14
- Misses 8718 8735 +17
```
| [Impacted Files](https://codecov.io/gh/apache/arrow/pull/9035?src=pr&el=tree) | Coverage Δ | |
|---|---|---|
| [rust/datafusion/src/physical\_plan/hash\_join.rs](https://codecov.io/gh/apache/arrow/pull/9035/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2hhc2hfam9pbi5ycw==) | `85.86% <52.63%> (-3.67%)` | :arrow_down: |
------
[Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/9035?src=pr&el=continue).
> **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
> `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
> Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/9035?src=pr&el=footer). Last update [4b7cdcb...4b5154a](https://codecov.io/gh/apache/arrow/pull/9035?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] codecov-io edited a comment on pull request #9035: ARROW-11052: [Rust] [DataFusion] Implement metrics for HashJoinExec
Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #9035:
URL: https://github.com/apache/arrow/pull/9035#issuecomment-753515593
# [Codecov](https://codecov.io/gh/apache/arrow/pull/9035?src=pr&el=h1) Report
> Merging [#9035](https://codecov.io/gh/apache/arrow/pull/9035?src=pr&el=desc) (36b0ebd) into [master](https://codecov.io/gh/apache/arrow/commit/4b7cdcb9220b6d94b251aef32c21ef9b4097ecfa?el=desc) (4b7cdcb) will **decrease** coverage by `0.02%`.
> The diff coverage is `55.55%`.
[![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/9035/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9035?src=pr&el=tree)
```diff
@@ Coverage Diff @@
## master #9035 +/- ##
==========================================
- Coverage 82.61% 82.59% -0.03%
==========================================
Files 203 204 +1
Lines 50140 50169 +29
==========================================
+ Hits 41422 41435 +13
- Misses 8718 8734 +16
```
| [Impacted Files](https://codecov.io/gh/apache/arrow/pull/9035?src=pr&el=tree) | Coverage Δ | |
|---|---|---|
| [rust/datafusion/src/physical\_plan/hash\_join.rs](https://codecov.io/gh/apache/arrow/pull/9035/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2hhc2hfam9pbi5ycw==) | `86.32% <55.55%> (-3.21%)` | :arrow_down: |
| [rust/arrow/src/csv/writer.rs](https://codecov.io/gh/apache/arrow/pull/9035/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvY3N2L3dyaXRlci5ycw==) | `78.82% <0.00%> (-0.56%)` | :arrow_down: |
| [rust/parquet/src/encodings/encoding.rs](https://codecov.io/gh/apache/arrow/pull/9035/diff?src=pr&el=tree#diff-cnVzdC9wYXJxdWV0L3NyYy9lbmNvZGluZ3MvZW5jb2RpbmcucnM=) | `95.24% <0.00%> (-0.20%)` | :arrow_down: |
| [rust/arrow/src/util/serialization.rs](https://codecov.io/gh/apache/arrow/pull/9035/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvdXRpbC9zZXJpYWxpemF0aW9uLnJz) | `100.00% <0.00%> (ø)` | |
------
[Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/9035?src=pr&el=continue).
> **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
> `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
> Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/9035?src=pr&el=footer). Last update [4b7cdcb...36b0ebd](https://codecov.io/gh/apache/arrow/pull/9035?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] github-actions[bot] commented on pull request #9035: ARROW-11052: [Rust] [DataFusion] Implement metrics for HashJoinExec
Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #9035:
URL: https://github.com/apache/arrow/pull/9035#issuecomment-751861449
https://issues.apache.org/jira/browse/ARROW-11052
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] andygrove commented on pull request #9035: ARROW-11052: [Rust] [DataFusion] Implement metrics for HashJoinExec
Posted by GitBox <gi...@apache.org>.
andygrove commented on pull request #9035:
URL: https://github.com/apache/arrow/pull/9035#issuecomment-751859553
@jorgecarleitao fyi
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] andygrove commented on pull request #9035: ARROW-11052: [Rust] [DataFusion] Implement metrics for HashJoinExec
Posted by GitBox <gi...@apache.org>.
andygrove commented on pull request #9035:
URL: https://github.com/apache/arrow/pull/9035#issuecomment-752691811
I will rebase this and address feedback once https://github.com/apache/arrow/pull/9048 is merged
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] Dandandan commented on a change in pull request #9035: ARROW-11052: [Rust] [DataFusion] Implement metrics for HashJoinExec
Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #9035:
URL: https://github.com/apache/arrow/pull/9035#discussion_r549486412
##########
File path: rust/datafusion/src/physical_plan/hash_join.rs
##########
@@ -531,14 +560,39 @@ impl Stream for HashJoinStream {
self.right
.poll_next_unpin(cx)
.map(|maybe_batch| match maybe_batch {
- Some(Ok(batch)) => Some(build_batch(
- &batch,
- &self.left_data,
- &self.on_right,
- &self.join_type,
- &self.schema,
- )),
- other => other,
+ Some(Ok(batch)) => {
+ let start = Instant::now();
+ let result = build_batch(
+ &batch,
+ &self.left_data,
+ &self.on_right,
+ &self.join_type,
+ &self.schema,
+ );
+ self.num_input_batches += 1;
+ self.num_input_rows += batch.num_rows();
+ match result {
+ Ok(ref batch) => {
+ self.join_time += start.elapsed().as_millis() as usize;
+ self.num_output_batches += 1;
+ self.num_output_rows += batch.num_rows();
+ }
+ _ => {}
+ }
+ Some(result)
+ }
+ other => {
+ debug!(
+ "Processed {} probe-side input batches containing {} rows and \
Review comment:
Maybe we can make this a bit less verbose? `Processed {} rows probe-side and {} output rows in {} ms` or something like that?
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] jorgecarleitao commented on pull request #9035: ARROW-11052: [Rust] [DataFusion] Implement metrics for HashJoinExec
Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #9035:
URL: https://github.com/apache/arrow/pull/9035#issuecomment-753449360
@Dandandan , Thanks a lot for the review. I am approving because even though I consider most of your comments relevant, I do not consider them blocking the merge.
For example, summing the number of rows of all batches is `O(number_of_batches)` and negligible when compared to the time it takes to execute the join, thus being too minor to block this.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] Dandandan commented on a change in pull request #9035: ARROW-11052: [Rust] [DataFusion] Implement metrics for HashJoinExec
Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #9035:
URL: https://github.com/apache/arrow/pull/9035#discussion_r549495107
##########
File path: rust/datafusion/src/physical_plan/hash_join.rs
##########
@@ -531,14 +560,39 @@ impl Stream for HashJoinStream {
self.right
.poll_next_unpin(cx)
.map(|maybe_batch| match maybe_batch {
- Some(Ok(batch)) => Some(build_batch(
- &batch,
- &self.left_data,
- &self.on_right,
- &self.join_type,
- &self.schema,
- )),
- other => other,
+ Some(Ok(batch)) => {
+ let start = Instant::now();
+ let result = build_batch(
+ &batch,
+ &self.left_data,
+ &self.on_right,
+ &self.join_type,
+ &self.schema,
+ );
+ self.num_input_batches += 1;
+ self.num_input_rows += batch.num_rows();
+ match result {
+ Ok(ref batch) => {
+ self.join_time += start.elapsed().as_millis() as usize;
Review comment:
We could avoid rounding errors by using `start.elapsed().as_secs_f64()` store `join_time` var as `f64` and format with 0 digits afterwards.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] Dandandan commented on a change in pull request #9035: ARROW-11052: [Rust] [DataFusion] Implement metrics for HashJoinExec
Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #9035:
URL: https://github.com/apache/arrow/pull/9035#discussion_r549490463
##########
File path: rust/datafusion/src/physical_plan/hash_join.rs
##########
@@ -186,8 +190,18 @@ impl ExecutionPlan for HashJoinExec {
})
.await?;
+ let num_rows: usize =
+ left_data.1.iter().map(|batch| batch.num_rows()).sum();
Review comment:
Maybe we want to put this behind a flag so we don't do this in the macro so we don't compute the `num_rows` when not being logged?
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org