You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/12/06 16:11:16 UTC
[GitHub] [arrow-datafusion] andrei-ionescu edited a comment on issue #1404: Hash partitioning not working properly

andrei-ionescu edited a comment on issue #1404:
URL: https://github.com/apache/arrow-datafusion/issues/1404#issuecomment-986916144


   @Dandandan A few questions:
   
   1. Why do we have yet another ticket? It will side track from the real issue.
   2. The hash partitioning is not working correctly. The issue is not the fact that there are collected partitions with `0` rows. The issue is the fact that the collected partitions do not contains correct data. If you look at the number of rows on the manual partitioning and repartition by hash there are collected partitions that have more than any manual partition. For example there is the `Partition=35, rows_in_part=943` that has `943` rows while in the table above the maximum is `571`. This means that the `Partition=35, rows_in_part=943` contains rows from other partition.
   
   The expected behaviour of repartition is the following...
   
   Given the dataset on left after repartitioning it by the `nlm_dimension_load_date` and `src_fuel_type` columns I expect the `collect_partitioned` to give me the result on the right:
   ```
   +-------------------------+---------------+-----+       P1 | 2020-10-18T00:29:41Z | Steam    | 11 |
   | nlm_dimension_load_date | src_fuel_type |  id |          | 2020-10-18T00:29:41Z | Steam    | 22 |
   +-------------------------+---------------+-----+          | 2020-10-18T00:29:41Z | Steam    | 33 |
   | 2020-10-18T00:29:41Z    | Steam         |  11 |
   | 2020-10-18T00:29:41Z    | Steam         |  22 |
   | 2020-10-18T00:29:41Z    | Steam         |  33 |       P2 | 2021-06-09T00:32:40Z | Gas      |  3 |
   | 2020-10-18T00:29:41Z    | Gas           |   1 |          | 2021-06-09T00:32:40Z | Gas      |  2 |
   | 2021-06-09T00:32:40Z    | Gas           |   3 |        
   | 2021-06-09T00:32:40Z    | Gas           |   2 |
   | 2021-06-09T00:32:40Z    | Electric      |  a1 |       P3 | 2021-06-09T00:32:40Z | Electric | a1 |
   | 2020-10-18T00:29:41Z    | Electric      |  b1 |                 
   | 2020-10-18T00:29:41Z    | Electric      |  c1 |        
   | 2020-10-18T00:29:41Z    | Electric      |  d1 |       P4 | 2020-10-18T00:29:41Z | Gas      |  1 |
   +-------------------------+---------------+-----+        
                                                            
                                                           P5 | 2020-10-18T00:29:41Z | Electric | b1 |
                                                              | 2020-10-18T00:29:41Z | Electric | c1 |
                                                              | 2020-10-18T00:29:41Z | Electric | d1 |
   ```
   
   Where `P1` to `P5` represents a `Vet<RecordBatch>`.
   
   Hashing by the two column should always give the same result - `hash(2020-10-18T00:29:41Z, Steam)` gives always the same result. First three rows, based on the hashing on `2020-10-18T00:29:41Z, Steam`, should end up in a partition.
   
   This is NOT happening in the current implementation.
   
   In the 72 partition example there are partitions containing different values from `nlm_dimension_load_date`.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org