Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/04/24 20:05:04 UTC

[GitHub] [arrow-datafusion] Dandandan opened a new issue #50: Hash join further optimization / vectorization

Dandandan opened a new issue #50:
URL: https://github.com/apache/arrow-datafusion/issues/50


   **Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
   Further optimize the hash join algorithm.
   
   **Describe the solution you'd like**
   There are a couple of optimizations we could implement:
   
   * Vectorize the row-equality check, which currently uses the `equal_rows` functions. Vectorizing it should speed it up, and we could also specialize it for batches that contain no nulls. We can probably use the `take` and `equals` kernels here.
   * Replace the `HashMap` with a `Vec` (or similar) with a certain number of buckets. I tried this before, but because it caused many more collisions than we have currently, it led to a big (3x) slowdown.
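
As a rough illustration of the first bullet, here is a minimal std-only sketch of a column-at-a-time equality check over candidate row pairs. Everything in it is an assumption made for illustration: `Column`, `eq_mask` and `matching_pairs` are hypothetical names, the key type is fixed to `i64`, and none of this is the `arrow` crate API or the existing `equal_rows` code. The idea is to gather both sides with a `take`-like step, compare whole columns at once, and skip validity checks entirely when neither side has nulls.

```rust
/// Simplified stand-in for a primitive array (hypothetical type, not the
/// `arrow` crate): values plus an optional validity mask.
struct Column {
    values: Vec<i64>,
    /// `None` means the column contains no nulls.
    validity: Option<Vec<bool>>,
}

impl Column {
    /// Gather the rows at `indices`, similar in spirit to a `take` kernel.
    fn take(&self, indices: &[usize]) -> Column {
        Column {
            values: indices.iter().map(|&i| self.values[i]).collect(),
            validity: self
                .validity
                .as_ref()
                .map(|v| indices.iter().map(|&i| v[i]).collect()),
        }
    }
}

/// Element-wise equality of two equal-length columns.
/// A null never compares equal, following SQL join semantics.
fn eq_mask(left: &Column, right: &Column) -> Vec<bool> {
    let values_eq = left.values.iter().zip(&right.values).map(|(l, r)| l == r);
    match (&left.validity, &right.validity) {
        // Specialized path: no nulls on either side, so compare values only.
        (None, None) => values_eq.collect(),
        // General path: a pair matches only if both slots are valid.
        (lv, rv) => values_eq
            .enumerate()
            .map(|(i, eq)| {
                let l_valid = lv.as_ref().map_or(true, |v| v[i]);
                let r_valid = rv.as_ref().map_or(true, |v| v[i]);
                eq && l_valid && r_valid
            })
            .collect(),
    }
}

/// For candidate pairs `(left_idx[i], right_idx[i])` produced by a hash
/// lookup, return a mask of the pairs whose key columns are all equal.
fn matching_pairs(
    left_keys: &[Column],
    right_keys: &[Column],
    left_idx: &[usize],
    right_idx: &[usize],
) -> Vec<bool> {
    let mut mask = vec![true; left_idx.len()];
    for (l, r) in left_keys.iter().zip(right_keys) {
        // One gather and one comparison per key column, not per row.
        let col_mask = eq_mask(&l.take(left_idx), &r.take(right_idx));
        for (m, c) in mask.iter_mut().zip(col_mask) {
            *m &= c;
        }
    }
    mask
}
```

The resulting mask would then be used to filter the candidate index pairs before building the output batch. Whether this beats a row-at-a-time check depends on how many candidates each probe produces, so it would need benchmarking.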
   
   **Additional context**
   
   https://www.cockroachlabs.com/blog/vectorized-hash-joiner/
   https://dare.uva.nl/search?identifier=5ccbb60a-38b8-4eeb-858a-e7735dd37487

