Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/08/02 18:17:02 UTC

[GitHub] [arrow-datafusion] Dandandan commented on issue #790: Rework GroupByHash for faster performance and support grouping by nulls

Dandandan commented on issue #790:
URL: https://github.com/apache/arrow-datafusion/issues/790#issuecomment-891233423


   > > Collisions are handled by the fact that the entry in the hash table is a _list_ of indices into the mutable area -- if there are multiple values in that list, each entry in the mutable area needs to be checked for equality to find the correct one.
   > 
   > I don't really see how we can do these equality checks faster than the hashmap itself could do them, but I would be very happy to be proven wrong.
   > 
   > For optimized eq and hash handling, hashbrown has some more low-level functionality that we are not yet using:
   > 
   > * [`from_hash(hash, eq_fn)`](https://docs.rs/hashbrown/0.11.2/hashbrown/hash_map/struct.RawEntryBuilderMut.html#method.from_hash): allows looking up an entry with a given hash and equals function, bypassing the hashbuilder of the map. This allows implementing eq on values that are not directly stored in the map, for example if only an index into another structure is part of the key.
   > * [`insert_with_hasher(hash, key, value, hasher)`](https://docs.rs/hashbrown/0.11.2/hashbrown/hash_map/struct.RawVacantEntryMut.html#method.insert_with_hasher): Again allows bypassing the hashbuilder and looking up the hash value in some other structure. The `hasher` parameter is only called if the table needs rehashing.
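
   A minimal sketch of the "list of indices" layout described in the quoted text above might look like this (hypothetical names, not the actual DataFusion code):

   ```rust
   use std::collections::HashMap;

   /// Each hash value maps to a list of indices into a separate "mutable area"
   /// that holds the group keys; on a collision every candidate index is
   /// checked for real key equality.
   fn find_or_insert_group(
       map: &mut HashMap<u64, Vec<usize>>, // hash -> indices into `group_keys`
       group_keys: &mut Vec<Vec<u8>>,      // the mutable area with the actual keys
       hash: u64,
       key: &[u8],
   ) -> usize {
       let candidates = map.entry(hash).or_default();
       // Several indices can share the same hash: compare each candidate key.
       for &idx in candidates.iter() {
           if group_keys[idx].as_slice() == key {
               return idx;
           }
       }
       // No existing group matched: append a new entry to the mutable area.
       let idx = group_keys.len();
       group_keys.push(key.to_vec());
       candidates.push(idx);
       idx
   }
   ```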
   
   `from_hash` might indeed be interesting to explore.
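
   A hedged sketch of how `from_hash` / `insert_with_hasher` could be used for this, where the map only stores an index and the keys (plus the hash function, `compute_hash` here) live elsewhere -- all names are illustrative, not the real implementation:

   ```rust
   use hashbrown::hash_map::RawEntryMut;
   use hashbrown::HashMap;
   use std::collections::hash_map::DefaultHasher;
   use std::hash::{Hash, Hasher};

   // Stand-in hash function, purely for illustration.
   fn compute_hash(key: &[u8]) -> u64 {
       let mut h = DefaultHasher::new();
       key.hash(&mut h);
       h.finish()
   }

   fn upsert_group(
       map: &mut HashMap<usize, ()>,  // map key = index into `group_keys`
       group_keys: &mut Vec<Vec<u8>>, // actual group keys stored outside the map
       key: &[u8],
   ) -> usize {
       let hash = compute_hash(key);
       // Look up by our own hash and equality check, bypassing the map's hash builder.
       let entry = map
           .raw_entry_mut()
           .from_hash(hash, |&idx| group_keys[idx].as_slice() == key);
       match entry {
           RawEntryMut::Occupied(o) => *o.key(),
           RawEntryMut::Vacant(v) => {
               let idx = group_keys.len();
               group_keys.push(key.to_vec());
               // The hasher closure is only called if the table needs to rehash.
               v.insert_with_hasher(hash, idx, (), |&i| compute_hash(&group_keys[i]));
               idx
           }
       }
   }
   ```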
   
   Some more info on why collisions might not be that bad, even with an alternative approach:
   
   * With good quality `u64` hashes, I found the remaining number of collisions to be very low, as most of the collisions (keys with different `u64` hashes but within the same hashmap "bucket") are already handled by the hashmap implementation. I had to write tests with fixed hashes to trigger hash collisions in the hash join algorithm, because the current approach already gave the correct answers on all of the unit / integration tests without ever hitting a collision, i.e. each key mapped to a different `u64` hash value in those tests.
   * For hash join there are some known ways to do the collision checks in a vectorized way, see e.g. this article: https://www.cockroachlabs.com/blog/vectorized-hash-joiner/ . This might work similarly for hash aggregations; a rough sketch follows below.
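
   A rough sketch of what such a vectorized collision check could look like (illustrative only, loosely following the idea in the linked article): after probing by `u64` hash, the surviving candidate row pairs are verified one key column at a time in tight loops, instead of comparing full keys row by row.

   ```rust
   /// Keep only the (probe row, build row) pairs whose actual key columns match,
   /// filtering out hash collisions. Each pass over a column is a simple,
   /// branch-light loop over contiguous data, which is the "vectorized" part.
   fn verify_candidates(
       probe_cols: &[Vec<u64>],             // key columns of the probe side
       build_cols: &[Vec<u64>],             // key columns of the build side
       mut candidates: Vec<(usize, usize)>, // pairs that matched on the u64 hash
   ) -> Vec<(usize, usize)> {
       for (probe_col, build_col) in probe_cols.iter().zip(build_cols) {
           candidates.retain(|&(p, b)| probe_col[p] == build_col[b]);
       }
       candidates
   }
   ```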

