Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/11/25 16:49:11 UTC

[GitHub] [arrow] Dandandan commented on pull request #8765: ARROW-10722: [Rust][DataFusion] Reduce overhead of some data types in aggregations / joins, improve benchmarks

Dandandan commented on pull request #8765:
URL: https://github.com/apache/arrow/pull/8765#issuecomment-733824217


   @jorgecarleitao Not really, the current benchmarks / queries don't show a performance difference; I'm just looking at ways to improve aggregate / join performance.
   
   The main thing I wanted to investigate is whether the aggregates / joins themselves can be made faster. One part of that would be to create a key that can be hashed more quickly. Currently the hashing algorithm hashes each individual GroupByValue instead of working on a contiguous byte array; the latter could in principle be faster. Some specialized code could also be added for hashing based on a single column only. A rough sketch of the idea follows.
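   To make that concrete, here is a minimal sketch (not DataFusion's actual code; the `GroupByValue` enum and both helper functions below are hypothetical) contrasting hashing each key value through the `Hash` trait with serializing the key columns into one contiguous byte buffer and hashing that in a single pass:
   
   ```rust
   use std::collections::hash_map::DefaultHasher;
   use std::hash::{Hash, Hasher};
   
   // Hypothetical group-by scalar for illustration; the real type has more variants.
   #[derive(Hash)]
   enum GroupByValue {
       Int64(i64),
       Utf8(String),
   }
   
   // Roughly the current approach: hash each group-by value via the Hash trait.
   fn hash_per_value(key: &[GroupByValue]) -> u64 {
       let mut hasher = DefaultHasher::new();
       for v in key {
           v.hash(&mut hasher);
       }
       hasher.finish()
   }
   
   // Alternative sketch: serialize the row's key columns into one contiguous
   // byte buffer and hash that buffer in a single pass.
   fn hash_byte_key(key: &[GroupByValue]) -> u64 {
       let mut buf = Vec::new();
       for v in key {
           match v {
               GroupByValue::Int64(i) => buf.extend_from_slice(&i.to_le_bytes()),
               GroupByValue::Utf8(s) => {
                   // Length prefix avoids ambiguity between adjacent strings.
                   buf.extend_from_slice(&(s.len() as u64).to_le_bytes());
                   buf.extend_from_slice(s.as_bytes());
               }
           }
       }
       let mut hasher = DefaultHasher::new();
       hasher.write(&buf);
       hasher.finish()
   }
   
   fn main() {
       let key = vec![GroupByValue::Int64(42), GroupByValue::Utf8("nl".to_string())];
       println!("per-value hash: {:x}", hash_per_value(&key));
       println!("byte-key hash:  {:x}", hash_byte_key(&key));
   }
   ```
   
   The byte-buffer variant would also make it easy to special-case a single-column key, since the buffer is then just that one column's bytes.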
   
   It can have a larger impact on _**memory usage**_ though: if you are hashing / aggregating something with high cardinality, each row generates tens of extra bytes of overhead, roughly 16 bytes for each GroupByValue, 8 bytes for using `Vec`, and 8 bytes for boxing the inner `Vec` of the aggregation.
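   As a rough way to sanity-check the sizes of those building blocks on a 64-bit target, `std::mem::size_of` can be used (the enum below is a hypothetical stand-in for the real group-by scalar type, not DataFusion's definition):
   
   ```rust
   use std::mem::size_of;
   
   // Hypothetical stand-in for the group-by scalar; the real enum has more
   // variants, but an 8-byte payload plus discriminant already pads to 16 bytes.
   #[allow(dead_code)]
   enum GroupByValue {
       Int64(i64),
       UInt64(u64),
   }
   
   fn main() {
       // 16 bytes: 8-byte payload + discriminant, padded to 8-byte alignment.
       println!("GroupByValue:      {} bytes", size_of::<GroupByValue>());
       // Boxing adds an 8-byte pointer (plus the heap allocation it points to).
       println!("Box<GroupByValue>: {} bytes", size_of::<Box<GroupByValue>>());
       // A Vec header is pointer + length + capacity = 24 bytes, before counting
       // the heap allocation that actually holds the elements.
       println!("Vec<GroupByValue>: {} bytes", size_of::<Vec<GroupByValue>>());
   }
   ```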


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org