Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/08/04 16:31:11 UTC

[GitHub] [arrow-datafusion] alamb opened a new issue #822: Reconsider hashing of nulls

alamb opened a new issue #822:
URL: https://github.com/apache/arrow-datafusion/issues/822


   **Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
   
   The `create_hash` function is responsible for hashing values in arrays. At the moment, however, it (effectively) hashes NULL values to `0` for all types, which likely leads to suboptimal behavior. For example, @Dandandan observed in https://github.com/apache/arrow-datafusion/pull/812#discussion_r682319823 that `NULL,1` and `1,NULL` will hash to the same value.
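
To make the collision concrete, here is a minimal sketch. It is not the actual `create_hash` code: `hash_i64`, `combine`, `NULL_MARKER`, and both row-hashing functions are hypothetical stand-ins, and the sketch assumes a combiner in which a NULL column contributes nothing to the row hash.

```rust
/// Hypothetical stand-in for a per-value hash (simple multiplicative mix).
fn hash_i64(v: i64) -> u64 {
    (v as u64).wrapping_mul(0x9E37_79B9_7F4A_7C15)
}

/// Hypothetical order-dependent combiner for per-column hashes.
fn combine(acc: u64, h: u64) -> u64 {
    acc.wrapping_mul(31).wrapping_add(h)
}

/// Assumed current behavior: NULL columns are skipped, so they leave
/// no trace in the row hash.
fn hash_row_ignoring_nulls(row: &[Option<i64>]) -> u64 {
    row.iter()
        .flatten()
        .fold(0, |acc, &v| combine(acc, hash_i64(v)))
}

/// Arbitrary constant reserved for NULL slots (an assumption of this sketch).
const NULL_MARKER: u64 = 0x517C_C1B7_2722_0A95;

/// Alternative: a NULL column mixes in a marker, so its position matters.
fn hash_row_with_marker(row: &[Option<i64>]) -> u64 {
    row.iter().fold(0, |acc, v| {
        let h = v.map_or(NULL_MARKER, hash_i64);
        combine(acc, h)
    })
}
```

Under these assumptions, `hash_row_ignoring_nulls` returns the same value for `[None, Some(1)]` and `[Some(1), None]`, while `hash_row_with_marker` tells the two rows apart.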
   
   **Describe the solution you'd like**
   TBD
   
   **Describe alternatives you've considered**
   @jorgecarleitao's comment (copied below) from https://github.com/apache/arrow-datafusion/issues/790#issuecomment-888516731 offers a few alternatives:
   
   From the hashing side, an unknown to me atm is how to efficiently hash `values+validity`. I.e. given `V = ["a", "", "c"]` and `N = [true, false, true]`, I see some options:
   
   * `hash(V) ^ !N + unique * N` where `unique` is a unique sentinel value exclusive for null values. If `hash` is vectorized, this operation is vectorized.
   
   * `concat(hash(value), is_valid) for value, is_valid in zip(V,N)`
   
   * split the array between nulls and not nulls, i.e. `N -> (non-null indices, null indices)`, perform hashing over valid indices only, and then, at the very end, append all values for the nulls. We do this in the sort kernel, to reduce the number of slots to perform comparisons over.
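
The three options above could be sketched roughly as follows. This is one possible reading, not code from DataFusion or arrow: `hash_value`, `NULL_SENTINEL`, and the three function names are hypothetical, plain slices stand in for Arrow arrays and validity bitmaps, and `N` is read as validity (`true` means the slot is valid).

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Arbitrary constant reserved for NULL slots (an assumption of this sketch).
const NULL_SENTINEL: u64 = 0x9E37_79B9_7F4A_7C15;

/// Stand-in for the real per-value hash function.
fn hash_value(v: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    v.hash(&mut hasher);
    hasher.finish()
}

/// Option 1: hash every slot, then pick either the hash or the sentinel
/// with a branch-free mask.
fn hash_masked(values: &[&str], validity: &[bool]) -> Vec<u64> {
    values
        .iter()
        .zip(validity)
        .map(|(v, &valid)| {
            // All ones when the slot is valid, all zeros when it is null.
            let mask = (valid as u64).wrapping_neg();
            (hash_value(v) & mask) | (NULL_SENTINEL & !mask)
        })
        .collect()
}

/// Option 2: keep the validity flag next to the hash. Null slots use 0 for
/// the hash part so that all nulls compare equal (an assumption; the
/// original comment does not say what the value part of a null should be).
fn hash_with_validity(values: &[&str], validity: &[bool]) -> Vec<(u64, bool)> {
    values
        .iter()
        .zip(validity)
        .map(|(v, &valid)| (if valid { hash_value(v) } else { 0 }, valid))
        .collect()
}

/// Option 3: split indices into valid and null, hash only the valid slots,
/// and fill in the null slots at the very end.
fn hash_split(values: &[&str], validity: &[bool]) -> Vec<u64> {
    let (valid_idx, null_idx): (Vec<usize>, Vec<usize>) =
        (0..values.len()).partition(|&i| validity[i]);
    let mut out = vec![0u64; values.len()];
    for i in valid_idx {
        out[i] = hash_value(values[i]);
    }
    for i in null_idx {
        out[i] = NULL_SENTINEL;
    }
    out
}
```

For `V = ["a", "", "c"]` and `N = [true, false, true]`, options 1 and 3 produce the same output in this sketch (the sentinel at index 1); they differ only in how much work is done on null slots, which is what a benchmark would need to measure.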
   
   If we could write the code in a way that we could "easily" switch between implementations (during dev only, not a conf parameter), we could bench whether one wins over the other, or under which circumstances.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org