Posted to github@arrow.apache.org by "sunchao (via GitHub)" <gi...@apache.org> on 2023/07/26 04:30:55 UTC

[GitHub] [arrow-datafusion] sunchao opened a new issue, #7095: Evaluate vectorized hash table for group aggregation

sunchao opened a new issue, #7095:
URL: https://github.com/apache/arrow-datafusion/issues/7095

   ### Is your feature request related to a problem or challenge?
   
   Currently DF uses a `RawTable` from hashbrown as the hash table implementation in group aggregations. This requires first converting the input batches into a row format and then processing the converted rows one by one: hash probing, equality checks, and creating new entries as needed.
   
   A different approach, discussed in the [Photon paper](https://cs.stanford.edu/~matei/papers/2022/sigmod_photon.pdf) and also used by DuckDB, is to adopt a vectorized hash table design in which each of the above steps is vectorized. In addition, this allows us to skip the row conversion and operate directly on the input batches.
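
To make the batch-at-a-time idea concrete, here is a minimal sketch of a vectorized probe loop. It is an illustration only, not the draft implementation mentioned in this issue and not DataFusion's API: the type `VectorizedGroupTable`, the function `hash_key`, and the method `intern` are hypothetical names, the key is assumed to be a single `i64` column with no nulls, and the table uses plain linear probing with only the standard library. Each round runs one tight loop per step (hash, probe, compare) over a selection vector of unresolved rows instead of handling rows one by one.

```rust
/// Sentinel marking an empty slot (assumption: fewer than u32::MAX groups).
const EMPTY: u32 = u32::MAX;

/// Hypothetical mixing function; a real engine would use a stronger hash.
fn hash_key(k: i64) -> u64 {
    let h = (k as u64).wrapping_mul(0x9E37_79B9_7F4A_7C15);
    h ^ (h >> 32)
}

/// Hypothetical open-addressing table mapping i64 keys to dense group ids.
struct VectorizedGroupTable {
    slots: Vec<u32>,      // slot -> group id, or EMPTY
    group_keys: Vec<i64>, // group id -> key value
}

impl VectorizedGroupTable {
    fn new(capacity: usize) -> Self {
        assert!(capacity.is_power_of_two());
        Self { slots: vec![EMPTY; capacity], group_keys: Vec::new() }
    }

    /// Returns one group id per input key, creating new groups as needed.
    fn intern(&mut self, keys: &[i64]) -> Vec<u32> {
        // Keep the load factor at or below 1/2 even if every key is new.
        while (self.group_keys.len() + keys.len()) * 2 > self.slots.len() {
            self.grow();
        }
        let mask = self.slots.len() - 1;
        // Step 1: hash the whole column in one pass.
        let mut pos: Vec<usize> =
            keys.iter().map(|&k| (hash_key(k) as usize) & mask).collect();
        let mut group_ids = vec![EMPTY; keys.len()];
        // Selection vector of rows that do not have a group id yet.
        let mut pending: Vec<usize> = (0..keys.len()).collect();
        let mut candidates: Vec<(usize, u32)> = Vec::new();
        let mut next_pending: Vec<usize> = Vec::new();
        while !pending.is_empty() {
            candidates.clear();
            next_pending.clear();
            // Step 2: probe. Empty slots become new groups; occupied
            // slots yield a candidate group to verify in step 3.
            for &row in &pending {
                let g = self.slots[pos[row]];
                if g == EMPTY {
                    let id = self.group_keys.len() as u32;
                    self.group_keys.push(keys[row]);
                    self.slots[pos[row]] = id;
                    group_ids[row] = id;
                } else {
                    candidates.push((row, g));
                }
            }
            // Step 3: compare keys. Mismatches advance to the next slot
            // and stay in the selection vector for another round.
            for &(row, g) in &candidates {
                if self.group_keys[g as usize] == keys[row] {
                    group_ids[row] = g;
                } else {
                    pos[row] = (pos[row] + 1) & mask;
                    next_pending.push(row);
                }
            }
            std::mem::swap(&mut pending, &mut next_pending);
        }
        group_ids
    }

    /// Doubles the slot array and reinserts every existing group.
    fn grow(&mut self) {
        let mask = self.slots.len() * 2 - 1;
        let mut slots = vec![EMPTY; mask + 1];
        for (id, &k) in self.group_keys.iter().enumerate() {
            let mut p = (hash_key(k) as usize) & mask;
            while slots[p] != EMPTY {
                p = (p + 1) & mask;
            }
            slots[p] = id as u32;
        }
        self.slots = slots;
    }
}
```

Calling `intern` on a batch such as `[5, 7, 5]` gives rows 0 and 2 the same group id and row 1 a different one. The sketch reads key values straight from the input slice, which mirrors the point above about skipping the row conversion; multi-column keys, nulls, and SIMD-friendly layouts are out of scope here.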
   
   Internally we have a draft implementation for this and it has shown considerable improvements (although with a lot of `unsafe`s 😂 ) on top of the current hash aggregation approach, so we'd like to contribute it to DF and see if it can help improve its aggregation performance even further.
   
   
   ### Describe the solution you'd like
   
   Design & implement a separate vectorized hash table. It can either replace the existing `RawTable` inside `GroupValuesRows`, or we can have a separate `GroupValues` implementation. 
   
   ### Describe alternatives you've considered
   
   Not to implement this.
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] Evaluate vectorized hash table for group aggregation [arrow-datafusion]

Posted by "sunchao (via GitHub)" <gi...@apache.org>.
sunchao commented on issue #7095:
URL: https://github.com/apache/arrow-datafusion/issues/7095#issuecomment-1995192311

   @doki23 yes I'm still planning to. I have a POC branch for this work: https://github.com/sunchao/arrow-datafusion/commits/vectorized-hash-table/ but I haven't got time to go back to it yet. Will try to do it soon.




Re: [I] Evaluate vectorized hash table for group aggregation [arrow-datafusion]

Posted by "doki23 (via GitHub)" <gi...@apache.org>.
doki23 commented on issue #7095:
URL: https://github.com/apache/arrow-datafusion/issues/7095#issuecomment-1994555814

    Hi @sunchao, are you still moving it forward?




[GitHub] [arrow-datafusion] Dandandan commented on issue #7095: Evaluate vectorized hash table for group aggregation

Posted by "Dandandan (via GitHub)" <gi...@apache.org>.
Dandandan commented on issue #7095:
URL: https://github.com/apache/arrow-datafusion/issues/7095#issuecomment-1651186636

   @sunchao that sounds very exciting 🚀 

