Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/01/17 16:25:01 UTC
[GitHub] [arrow] Dandandan opened a new pull request #9234: ARROW-11290: Address hash aggregate performance issue with low cardinality keys
Dandandan opened a new pull request #9234:
URL: https://github.com/apache/arrow/pull/9234
Currently, we loop over the whole hash map for every batch.
However, when a batch arrives and most keys in the table have no rows in it (a high number of groups, sorted data, etc.), we could create a lot of empty batches.
In this PR we keep track of which keys we received in the batch and only update the accumulators for those keys.
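A minimal sketch of the idea, not the actual DataFusion code: `GroupState`, `aggregate_batch`, the string keys and the running sum are all made up for illustration, and the real `group_aggregate_batch` works on Arrow arrays and generic accumulators.

```rust
use std::collections::HashMap;

/// Simplified per-group state: a running sum plus the row indices
/// that belong to this group in the batch being processed.
#[derive(Default)]
struct GroupState {
    sum: i64,
    indices: Vec<usize>,
}

/// Hypothetical stand-in for `group_aggregate_batch`.
/// Returns how many groups were updated for this batch.
fn aggregate_batch(
    groups: &mut HashMap<String, GroupState>,
    keys: &[&str],
    values: &[i64],
) -> usize {
    // Keys received in this batch
    let mut batch_keys: Vec<String> = vec![];

    for (row, &key) in keys.iter().enumerate() {
        let state = groups.entry(key.to_string()).or_default();
        // Empty indices: first row with this key in this batch.
        if state.indices.is_empty() {
            batch_keys.push(key.to_string());
        }
        state.indices.push(row);
    }

    // Update only the groups that received rows, instead of the whole map.
    for key in &batch_keys {
        let state = groups.get_mut(key).expect("key was inserted above");
        let batch_sum: i64 = state.indices.iter().map(|&row| values[row]).sum();
        state.sum += batch_sum;
        state.indices.clear();
    }
    batch_keys.len()
}
```

With this shape, the work per batch depends on the number of distinct keys in the batch rather than on the total number of groups seen so far.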
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] Dandandan commented on a change in pull request #9234: ARROW-11290: [Rust][DataFusion] Address hash aggregate performance issue with high number of groups
Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #9234:
URL: https://github.com/apache/arrow/pull/9234#discussion_r559225992
##########
File path: rust/datafusion/src/physical_plan/hash_aggregate.rs
##########
@@ -288,6 +288,9 @@ fn group_aggregate_batch(
// Make sure we can create the accumulators or otherwise return an error
create_accumulators(aggr_expr).map_err(DataFusionError::into_arrow_external_error)?;
+ // Keys received in this batch
+ let mut batch_keys = vec![];
Review comment:
It checks for either an empty indices array (which means no rows with this key yet) or, in `or_insert_with`, the first row with this key.
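To make the two cases concrete, here is a small sketch using the standard `HashMap` entry API. The helper name `collect_batch_keys` and the `u32` keys are assumptions for illustration; the real code keeps accumulators next to the indices and is not reproduced here.

```rust
use std::collections::HashMap;

/// Hypothetical helper: records the row indices of one batch per key and
/// returns the keys that occur in the batch, each exactly once.
fn collect_batch_keys(
    groups: &mut HashMap<u32, Vec<usize>>,
    keys: &[u32],
) -> Vec<u32> {
    let mut batch_keys = vec![];
    for (row, &key) in keys.iter().enumerate() {
        groups
            .entry(key)
            .and_modify(|indices| {
                // Known key: push it only if it has no rows yet in this batch.
                if indices.is_empty() {
                    batch_keys.push(key);
                }
                indices.push(row);
            })
            .or_insert_with(|| {
                // New key: this is its first row.
                batch_keys.push(key);
                vec![row]
            });
    }
    batch_keys
}
```

Because each key is pushed at most once per batch, the plain vec already behaves like a set here, assuming the index lists are emptied before the next batch is processed.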
[GitHub] [arrow] Dandandan commented on pull request #9234: ARROW-11290: [Rust][DataFusion] Address hash aggregate performance issue with high number of groups
Posted by GitBox <gi...@apache.org>.
Dandandan commented on pull request #9234:
URL: https://github.com/apache/arrow/pull/9234#issuecomment-762207326
Added a ticket for remaining work including some profiling information here: https://issues.apache.org/jira/browse/ARROW-11300
[GitHub] [arrow] codecov-io edited a comment on pull request #9234: ARROW-11290: [Rust][DataFusion] Address hash aggregate performance issue with high number of groups
Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #9234:
URL: https://github.com/apache/arrow/pull/9234#issuecomment-761841128
# [Codecov](https://codecov.io/gh/apache/arrow/pull/9234?src=pr&el=h1) Report
> Merging [#9234](https://codecov.io/gh/apache/arrow/pull/9234?src=pr&el=desc) (eaf918e) into [master](https://codecov.io/gh/apache/arrow/commit/e73f205465051cb19cbec6900a01db8837948e9f?el=desc) (e73f205) will **increase** coverage by `0.00%`.
> The diff coverage is `100.00%`.
[![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/9234/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9234?src=pr&el=tree)
```diff
@@           Coverage Diff           @@
##           master    #9234   +/-   ##
=======================================
  Coverage   81.61%   81.61%
=======================================
  Files         215      215
  Lines       51891    51897    +6
=======================================
+ Hits        42353    42358    +5
- Misses       9538     9539    +1
```
| [Impacted Files](https://codecov.io/gh/apache/arrow/pull/9234?src=pr&el=tree) | Coverage Δ | |
|---|---|---|
| [...ust/datafusion/src/physical\_plan/hash\_aggregate.rs](https://codecov.io/gh/apache/arrow/pull/9234/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2hhc2hfYWdncmVnYXRlLnJz) | `85.02% <100.00%> (+0.23%)` | :arrow_up: |
| [rust/parquet/src/encodings/encoding.rs](https://codecov.io/gh/apache/arrow/pull/9234/diff?src=pr&el=tree#diff-cnVzdC9wYXJxdWV0L3NyYy9lbmNvZGluZ3MvZW5jb2RpbmcucnM=) | `94.86% <0.00%> (-0.20%)` | :arrow_down: |
------
[Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/9234?src=pr&el=continue).
> **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
> `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
> Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/9234?src=pr&el=footer). Last update [e73f205...eaf918e](https://codecov.io/gh/apache/arrow/pull/9234?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
[GitHub] [arrow] Dandandan commented on a change in pull request #9234: ARROW-11290: [Rust][DataFusion] Address hash aggregate performance issue with high number of groups
Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #9234:
URL: https://github.com/apache/arrow/pull/9234#discussion_r559225992
##########
File path: rust/datafusion/src/physical_plan/hash_aggregate.rs
##########
@@ -288,6 +288,9 @@ fn group_aggregate_batch(
// Make sure we can create the accumulators or otherwise return an error
create_accumulators(aggr_expr).map_err(DataFusionError::into_arrow_external_error)?;
+ // Keys received in this batch
+ let mut batch_keys = vec![];
Review comment:
It checks for either an empty indices vec (which means no rows with this key yet) or, in `or_insert_with`, the first row with this key.
[GitHub] [arrow] github-actions[bot] commented on pull request #9234: ARROW-11290: Address hash aggregate performance issue with low cardinality keys
Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #9234:
URL: https://github.com/apache/arrow/pull/9234#issuecomment-761838739
https://issues.apache.org/jira/browse/ARROW-11290
[GitHub] [arrow] alamb closed pull request #9234: ARROW-11290: [Rust][DataFusion] Address hash aggregate performance issue with high number of groups
Posted by GitBox <gi...@apache.org>.
alamb closed pull request #9234:
URL: https://github.com/apache/arrow/pull/9234
[GitHub] [arrow] alamb commented on pull request #9234: ARROW-11290: [Rust][DataFusion] Address hash aggregate performance issue with high number of groups
Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #9234:
URL: https://github.com/apache/arrow/pull/9234#issuecomment-763570051
I merged this branch locally into master and re-ran all the tests. Things looked good, so I am merging it in.
[GitHub] [arrow] andygrove commented on a change in pull request #9234: ARROW-11290: [Rust][DataFusion] Address hash aggregate performance issue with high number of groups
Posted by GitBox <gi...@apache.org>.
andygrove commented on a change in pull request #9234:
URL: https://github.com/apache/arrow/pull/9234#discussion_r559225501
##########
File path: rust/datafusion/src/physical_plan/hash_aggregate.rs
##########
@@ -288,6 +288,9 @@ fn group_aggregate_batch(
// Make sure we can create the accumulators or otherwise return an error
create_accumulators(aggr_expr).map_err(DataFusionError::into_arrow_external_error)?;
+ // Keys received in this batch
+ let mut batch_keys = vec![];
Review comment:
Should this be a set rather than a vec since it is intended to track the unique set of keys in the batch?
[GitHub] [arrow] Dandandan commented on a change in pull request #9234: ARROW-11290: [Rust][DataFusion] Address hash aggregate performance issue with high number of groups
Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #9234:
URL: https://github.com/apache/arrow/pull/9234#discussion_r559225738
##########
File path: rust/datafusion/src/physical_plan/hash_aggregate.rs
##########
@@ -288,6 +288,9 @@ fn group_aggregate_batch(
// Make sure we can create the accumulators or otherwise return an error
create_accumulators(aggr_expr).map_err(DataFusionError::into_arrow_external_error)?;
+ // Keys received in this batch
+ let mut batch_keys = vec![];
Review comment:
That's what I thought at first, but uniqueness is already checked when `push`ing the keys to the vec.
[GitHub] [arrow] codecov-io commented on pull request #9234: ARROW-11290: [Rust][DataFusion] Address hash aggregate performance issue with low cardinality keys
Posted by GitBox <gi...@apache.org>.
codecov-io commented on pull request #9234:
URL: https://github.com/apache/arrow/pull/9234#issuecomment-761841128
# [Codecov](https://codecov.io/gh/apache/arrow/pull/9234?src=pr&el=h1) Report
> Merging [#9234](https://codecov.io/gh/apache/arrow/pull/9234?src=pr&el=desc) (45b23e8) into [master](https://codecov.io/gh/apache/arrow/commit/e73f205465051cb19cbec6900a01db8837948e9f?el=desc) (e73f205) will **increase** coverage by `0.00%`.
> The diff coverage is `100.00%`.
[![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/9234/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9234?src=pr&el=tree)
```diff
@@           Coverage Diff           @@
##           master    #9234   +/-   ##
=======================================
  Coverage   81.61%   81.62%
=======================================
  Files         215      215
  Lines       51891    51897    +6
=======================================
+ Hits        42353    42359    +6
  Misses       9538     9538
```
| [Impacted Files](https://codecov.io/gh/apache/arrow/pull/9234?src=pr&el=tree) | Coverage Δ | |
|---|---|---|
| [...ust/datafusion/src/physical\_plan/hash\_aggregate.rs](https://codecov.io/gh/apache/arrow/pull/9234/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2hhc2hfYWdncmVnYXRlLnJz) | `85.02% <100.00%> (+0.23%)` | :arrow_up: |
------
[Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/9234?src=pr&el=continue).
> **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
> `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
> Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/9234?src=pr&el=footer). Last update [e73f205...45b23e8](https://codecov.io/gh/apache/arrow/pull/9234?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).