Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/07/28 18:12:46 UTC

[GitHub] [arrow-datafusion] jorgecarleitao commented on issue #790: Rework GroupByHash to support grouping by nulls

jorgecarleitao commented on issue #790:
URL: https://github.com/apache/arrow-datafusion/issues/790#issuecomment-888516731


   Great proposal.
   
   From the hashing side, one open question for me atm is how to efficiently hash `values+validity`. I.e. given values `V = ["a", "", "c"]` and validity `N = [true, false, true]` (so the second slot is null), I see some options:
   
   * `hash(V) * N + unique * !N` (treating `N` as a 0/1 validity mask), where `unique` is a sentinel hash reserved for null slots. If `hash` is vectorized, this whole operation is vectorized.
   
   * `concat(hash(value), is_valid) for value, is_valid in zip(V,N)`
   
   * split the array into nulls and non-nulls, i.e. `N -> (non-null indices, null indices)`, hash the valid indices only, and then, at the very end, append the entries for the null slots. We do this in the sort kernel, to reduce the number of slots that need comparisons.
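
The three options above could be sketched roughly as follows. This is only an illustration over plain slices with the standard library's `DefaultHasher`, not DataFusion or Arrow code; `NULL_SENTINEL`, `hash_one`, and the three function names are hypothetical, and zeroing the hash of a null slot in the second option is my own assumption so that whatever sits in the null slot does not leak into the key.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical sentinel reserved for null slots (an arbitrary constant, not an Arrow value).
const NULL_SENTINEL: u64 = 0x9E37_79B9_7F4A_7C15;

/// Hash a single value with the standard library's default hasher.
fn hash_one(value: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    hasher.finish()
}

/// Option 1: branch-free select between the value hash and the sentinel.
fn hash_with_sentinel(values: &[&str], validity: &[bool]) -> Vec<u64> {
    values
        .iter()
        .zip(validity)
        .map(|(&v, &is_valid)| {
            // All ones when the slot is valid, all zeros when it is null.
            let mask = (is_valid as u64).wrapping_neg();
            (hash_one(v) & mask) | (NULL_SENTINEL & !mask)
        })
        .collect()
}

/// Option 2: keep the validity flag next to the hash, so the key is a pair.
fn hash_with_flag(values: &[&str], validity: &[bool]) -> Vec<(u64, bool)> {
    values
        .iter()
        .zip(validity)
        .map(|(&v, &is_valid)| {
            // Assumption: null slots get hash 0 so slot contents are ignored.
            let h = if is_valid { hash_one(v) } else { 0 };
            (h, is_valid)
        })
        .collect()
}

/// Option 3: hash only the valid slots and return the null indices separately.
fn hash_valid_only(values: &[&str], validity: &[bool]) -> (Vec<(usize, u64)>, Vec<usize>) {
    let mut hashed = Vec::new();
    let mut null_indices = Vec::new();
    for (i, (&v, &is_valid)) in values.iter().zip(validity).enumerate() {
        if is_valid {
            hashed.push((i, hash_one(v)));
        } else {
            null_indices.push(i);
        }
    }
    (hashed, null_indices)
}
```

Calling any of these with `&["a", "", "c"]` and `&[true, false, true]` treats index 1 as null. With the sentinel variant, a valid value whose hash happens to equal the sentinel would collide with null, so the group-by would still need an equality check on `values+validity` to resolve collisions.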
   
   If we could write the code in a way that lets us "easily" switch between implementations (at development time only, not as a configuration parameter), we could benchmark whether one wins over the others, or under which circumstances.
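
One way to get that kind of compile-time switch is a small strategy trait plus a type alias that is changed by hand while benchmarking. This is a sketch under my own assumptions: `NullableHash`, `Sentinel`, `FlagInHasher`, `hash_column`, and `ActiveStrategy` are all hypothetical names, and the code works on plain slices rather than Arrow arrays.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical strategy trait: each implementation decides how one nullable slot is hashed.
trait NullableHash {
    fn hash_slot(value: &str, is_valid: bool) -> u64;
}

/// Hash a single value with the standard library's default hasher.
fn hash_str(value: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    hasher.finish()
}

/// Strategy A: nulls map to a fixed sentinel (u64::MAX is an arbitrary choice).
struct Sentinel;

impl NullableHash for Sentinel {
    fn hash_slot(value: &str, is_valid: bool) -> u64 {
        if is_valid { hash_str(value) } else { u64::MAX }
    }
}

/// Strategy B: the validity flag is fed into the hasher ahead of the value.
struct FlagInHasher;

impl NullableHash for FlagInHasher {
    fn hash_slot(value: &str, is_valid: bool) -> u64 {
        let mut hasher = DefaultHasher::new();
        is_valid.hash(&mut hasher);
        if is_valid {
            // Null slots skip the value so their contents never affect the hash.
            value.hash(&mut hasher);
        }
        hasher.finish()
    }
}

/// Generic over the strategy, so each variant is monomorphized with no runtime switch.
fn hash_column<S: NullableHash>(values: &[&str], validity: &[bool]) -> Vec<u64> {
    values
        .iter()
        .zip(validity)
        .map(|(&v, &is_valid)| S::hash_slot(v, is_valid))
        .collect()
}

/// Change this alias during development to benchmark a different strategy.
type ActiveStrategy = Sentinel;
```

The group-by code would call `hash_column::<ActiveStrategy>(...)`, and a benchmark could call `hash_column::<Sentinel>` and `hash_column::<FlagInHasher>` side by side on the same input.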
   
   Regardless, nulls in the group by are so important that IMO any of these gets a +1 at this point xD

