Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/01/17 16:25:01 UTC

[GitHub] [arrow] Dandandan opened a new pull request #9234: ARROW-11290: Address hash aggregate performance issue with low cardinality keys

Dandandan opened a new pull request #9234:
URL: https://github.com/apache/arrow/pull/9234


   Currently, we loop over the hashmap for every key.
   
   However, when we receive a batch and the table has low-cardinality keys (or sorted data, etc.), we could create a lot of empty batches.
   
   In this PR, we keep track of which keys we received in the batch and only update the accumulators for those keys.
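
The idea described above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the DataFusion implementation: `GroupState`, `aggregate_batch`, the string keys, and the single running-sum "accumulator" are all hypothetical stand-ins chosen for the example.

```rust
use std::collections::HashMap;

/// Hypothetical per-group state: a running sum (standing in for the
/// accumulators) plus the row indices of the current batch for this group.
#[derive(Default)]
struct GroupState {
    sum: i64,
    indices: Vec<usize>,
}

/// Aggregates one batch of (key, value) rows into `groups` and returns the
/// number of groups that were touched by this batch.
fn aggregate_batch(
    groups: &mut HashMap<String, GroupState>,
    keys: &[&str],
    values: &[i64],
) -> usize {
    // Keys received in this batch
    let mut batch_keys: Vec<String> = vec![];

    // First pass: record which rows of the batch belong to which group.
    for (row, key) in keys.iter().enumerate() {
        let state = groups.entry(key.to_string()).or_default();
        // An empty indices vec means this is the first row with this key
        // in the current batch, so each key is remembered exactly once.
        if state.indices.is_empty() {
            batch_keys.push(key.to_string());
        }
        state.indices.push(row);
    }

    // Second pass: visit only the groups seen in this batch, instead of
    // iterating over every entry in the hash map.
    for key in &batch_keys {
        let state = groups.get_mut(key).expect("key was inserted above");
        for &row in &state.indices {
            state.sum += values[row];
        }
        // Reset for the next batch so the emptiness check keeps working.
        state.indices.clear();
    }

    batch_keys.len()
}
```

With this shape, the per-batch update cost depends on the number of distinct keys in the batch rather than on the total number of groups accumulated so far.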


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on a change in pull request #9234: ARROW-11290: [Rust][DataFusion] Address hash aggregate performance issue with high number of groups

Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #9234:
URL: https://github.com/apache/arrow/pull/9234#discussion_r559225992



##########
File path: rust/datafusion/src/physical_plan/hash_aggregate.rs
##########
@@ -288,6 +288,9 @@ fn group_aggregate_batch(
     // Make sure we can create the accumulators or otherwise return an error
     create_accumulators(aggr_expr).map_err(DataFusionError::into_arrow_external_error)?;
 
+    // Keys received in this batch
+    let mut batch_keys = vec![];

Review comment:
       It checks for either an empty indices array (which means no rows yet with this key) or being the first row with this key in `or_insert_with`.







[GitHub] [arrow] Dandandan commented on pull request #9234: ARROW-11290: [Rust][DataFusion] Address hash aggregate performance issue with high number of groups

Posted by GitBox <gi...@apache.org>.
Dandandan commented on pull request #9234:
URL: https://github.com/apache/arrow/pull/9234#issuecomment-762207326


   Added a ticket for remaining work including some profiling information here: https://issues.apache.org/jira/browse/ARROW-11300





[GitHub] [arrow] codecov-io edited a comment on pull request #9234: ARROW-11290: [Rust][DataFusion] Address hash aggregate performance issue with high number of groups

Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #9234:
URL: https://github.com/apache/arrow/pull/9234#issuecomment-761841128


   # [Codecov](https://codecov.io/gh/apache/arrow/pull/9234?src=pr&el=h1) Report
   > Merging [#9234](https://codecov.io/gh/apache/arrow/pull/9234?src=pr&el=desc) (eaf918e) into [master](https://codecov.io/gh/apache/arrow/commit/e73f205465051cb19cbec6900a01db8837948e9f?el=desc) (e73f205) will **increase** coverage by `0.00%`.
   > The diff coverage is `100.00%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/9234/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9234?src=pr&el=tree)
   
   ```diff
   @@           Coverage Diff           @@
   ##           master    #9234   +/-   ##
   =======================================
     Coverage   81.61%   81.61%           
   =======================================
     Files         215      215           
     Lines       51891    51897    +6     
   =======================================
   + Hits        42353    42358    +5     
   - Misses       9538     9539    +1     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow/pull/9234?src=pr&el=tree) | Coverage Δ | |
   |---|---|---|
   | [...ust/datafusion/src/physical\_plan/hash\_aggregate.rs](https://codecov.io/gh/apache/arrow/pull/9234/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2hhc2hfYWdncmVnYXRlLnJz) | `85.02% <100.00%> (+0.23%)` | :arrow_up: |
   | [rust/parquet/src/encodings/encoding.rs](https://codecov.io/gh/apache/arrow/pull/9234/diff?src=pr&el=tree#diff-cnVzdC9wYXJxdWV0L3NyYy9lbmNvZGluZ3MvZW5jb2RpbmcucnM=) | `94.86% <0.00%> (-0.20%)` | :arrow_down: |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/9234?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/9234?src=pr&el=footer). Last update [e73f205...eaf918e](https://codecov.io/gh/apache/arrow/pull/9234?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   





[GitHub] [arrow] Dandandan commented on a change in pull request #9234: ARROW-11290: [Rust][DataFusion] Address hash aggregate performance issue with high number of groups

Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #9234:
URL: https://github.com/apache/arrow/pull/9234#discussion_r559225992



##########
File path: rust/datafusion/src/physical_plan/hash_aggregate.rs
##########
@@ -288,6 +288,9 @@ fn group_aggregate_batch(
     // Make sure we can create the accumulators or otherwise return an error
     create_accumulators(aggr_expr).map_err(DataFusionError::into_arrow_external_error)?;
 
+    // Keys received in this batch
+    let mut batch_keys = vec![];

Review comment:
       It checks for either an empty indices vec (which means no rows yet with this key) or being the first row with this key in `or_insert_with`.
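
A rough sketch of the check described in this comment, assuming a plain `std::collections::HashMap` keyed by `u32`. The function name `index_batch` and the types are hypothetical, and the real code in `hash_aggregate.rs` may be structured differently:

```rust
use std::collections::HashMap;

/// Records the row indices of one batch per key and returns the keys seen in
/// this batch, each exactly once and in order of first appearance.
/// `groups` may already contain keys from earlier batches whose indices vec
/// has been cleared.
fn index_batch(groups: &mut HashMap<u32, Vec<usize>>, keys: &[u32]) -> Vec<u32> {
    // Keys received in this batch
    let mut batch_keys = vec![];

    for (row, &key) in keys.iter().enumerate() {
        groups
            .entry(key)
            // Key already known: an empty indices vec means no rows yet
            // with this key in the current batch.
            .and_modify(|indices| {
                if indices.is_empty() {
                    batch_keys.push(key);
                }
                indices.push(row);
            })
            // Key never seen before: this is its first row overall.
            .or_insert_with(|| {
                batch_keys.push(key);
                vec![row]
            });
    }

    batch_keys
}
```

Because a key is pushed only in those two situations, the vec never receives the same key twice within a batch, so no separate set is needed to deduplicate it.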







[GitHub] [arrow] github-actions[bot] commented on pull request #9234: ARROW-11290: Address hash aggregate performance issue with low cardinality keys

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #9234:
URL: https://github.com/apache/arrow/pull/9234#issuecomment-761838739


   https://issues.apache.org/jira/browse/ARROW-11290





[GitHub] [arrow] alamb closed pull request #9234: ARROW-11290: [Rust][DataFusion] Address hash aggregate performance issue with high number of groups

Posted by GitBox <gi...@apache.org>.
alamb closed pull request #9234:
URL: https://github.com/apache/arrow/pull/9234


   





[GitHub] [arrow] alamb commented on pull request #9234: ARROW-11290: [Rust][DataFusion] Address hash aggregate performance issue with high number of groups

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #9234:
URL: https://github.com/apache/arrow/pull/9234#issuecomment-763570051


   I merged this branch locally to master and re-ran all the tests. Things looked good so merging it in. 





[GitHub] [arrow] andygrove commented on a change in pull request #9234: ARROW-11290: [Rust][DataFusion] Address hash aggregate performance issue with high number of groups

Posted by GitBox <gi...@apache.org>.
andygrove commented on a change in pull request #9234:
URL: https://github.com/apache/arrow/pull/9234#discussion_r559225501



##########
File path: rust/datafusion/src/physical_plan/hash_aggregate.rs
##########
@@ -288,6 +288,9 @@ fn group_aggregate_batch(
     // Make sure we can create the accumulators or otherwise return an error
     create_accumulators(aggr_expr).map_err(DataFusionError::into_arrow_external_error)?;
 
+    // Keys received in this batch
+    let mut batch_keys = vec![];

Review comment:
       Should this be a set rather than a vec since it is intended to track the unique set of keys in the batch? 







[GitHub] [arrow] Dandandan commented on a change in pull request #9234: ARROW-11290: [Rust][DataFusion] Address hash aggregate performance issue with high number of groups

Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #9234:
URL: https://github.com/apache/arrow/pull/9234#discussion_r559225738



##########
File path: rust/datafusion/src/physical_plan/hash_aggregate.rs
##########
@@ -288,6 +288,9 @@ fn group_aggregate_batch(
     // Make sure we can create the accumulators or otherwise return an error
     create_accumulators(aggr_expr).map_err(DataFusionError::into_arrow_external_error)?;
 
+    // Keys received in this batch
+    let mut batch_keys = vec![];

Review comment:
       That's what I thought first, but this is checked already when `push`ing the keys to the vec.







[GitHub] [arrow] codecov-io commented on pull request #9234: ARROW-11290: [Rust][DataFusion] Address hash aggregate performance issue with low cardinality keys

Posted by GitBox <gi...@apache.org>.
codecov-io commented on pull request #9234:
URL: https://github.com/apache/arrow/pull/9234#issuecomment-761841128


   # [Codecov](https://codecov.io/gh/apache/arrow/pull/9234?src=pr&el=h1) Report
   > Merging [#9234](https://codecov.io/gh/apache/arrow/pull/9234?src=pr&el=desc) (45b23e8) into [master](https://codecov.io/gh/apache/arrow/commit/e73f205465051cb19cbec6900a01db8837948e9f?el=desc) (e73f205) will **increase** coverage by `0.00%`.
   > The diff coverage is `100.00%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/9234/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9234?src=pr&el=tree)
   
   ```diff
   @@           Coverage Diff           @@
   ##           master    #9234   +/-   ##
   =======================================
     Coverage   81.61%   81.62%           
   =======================================
     Files         215      215           
     Lines       51891    51897    +6     
   =======================================
   + Hits        42353    42359    +6     
     Misses       9538     9538           
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow/pull/9234?src=pr&el=tree) | Coverage Δ | |
   |---|---|---|
   | [...ust/datafusion/src/physical\_plan/hash\_aggregate.rs](https://codecov.io/gh/apache/arrow/pull/9234/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2hhc2hfYWdncmVnYXRlLnJz) | `85.02% <100.00%> (+0.23%)` | :arrow_up: |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/9234?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/9234?src=pr&el=footer). Last update [e73f205...45b23e8](https://codecov.io/gh/apache/arrow/pull/9234?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   

