
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/12/06 15:12:59 UTC

[GitHub] [arrow-datafusion] Dandandan commented on issue #1404: Hash partitioning not working properly

Dandandan commented on issue #1404:
URL: https://github.com/apache/arrow-datafusion/issues/1404#issuecomment-986868097


   Hey @andrei-ionescu 
   
   I am not sure if it's really not working properly.
   
   You specified a hash expression `partitioning_columns` and `72` partitions.
   
   If you use `repartition`, the expression will be hashed and the rows divided over those 72 partitions.
   
   The *only* guarantee of hash repartitioning is that equal values (based on the expression) will end up in the same partition.
   
   This is based on a simple formula `hash(expr) % n_partitions`.
   
   However, two things can happen:
   
   * Two different values can end up in the same partition.
   * A partition `n` can have no values, when no `hash(expr) % n_partitions` is equal to `n`.
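
   A minimal sketch of the formula and both effects, written in Python rather than DataFusion's Rust. The helper names `partition_index` and `hash_repartition` are hypothetical, and SHA-256 is only a stand-in for whatever hash function DataFusion uses internally, so the partition numbers will not match a real run:

```python
import hashlib


def partition_index(value: str, n_partitions: int) -> int:
    """Map a value to a partition via hash(value) % n_partitions.

    SHA-256 is an assumption used here for a deterministic hash;
    it is not the hash function DataFusion uses.
    """
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_partitions


def hash_repartition(values, n_partitions):
    """Distribute values over n_partitions buckets; buckets may stay empty."""
    partitions = [[] for _ in range(n_partitions)]
    for value in values:
        partitions[partition_index(value, n_partitions)].append(value)
    return partitions


# Three distinct keys spread over 72 partitions: equal keys always share
# a partition, and at least 69 partitions must remain empty.
rows = ["a", "b", "a", "c", "b", "a"]
partitions = hash_repartition(rows, 72)
non_empty = [p for p in partitions if p]
print(f"{len(non_empty)} of {len(partitions)} partitions hold data")

# Three distinct keys over 2 partitions: by the pigeonhole principle,
# at least one partition holds two different keys.
print(hash_repartition(["a", "b", "c"], 2))
```

   With fewer distinct values than partitions, empty partitions are unavoidable, and with more distinct values than partitions, different values have to share a partition.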
   
   Does this address your issue?
   
   I also opened https://github.com/apache/arrow-datafusion/issues/1405 to have a look at empty batches coming out of repartitioning, but that is more related to performance than to correctness.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org