Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/04/03 22:27:45 UTC

[GitHub] [arrow-datafusion] pjmore opened a new issue, #2145: Join on path partitioned columns fails with error

pjmore opened a new issue, #2145:
URL: https://github.com/apache/arrow-datafusion/issues/2145

   **Describe the bug**
   Dictionary types aren't supported by the hasher. Attempting to join on two partitioned columns panics with:
   ```
   thread 'parquet_join_on_partition_columns' panicked at 'called `Result::unwrap()` on an `Err` value: Internal("Unsupported data type in hasher")', datafusion/src/physical_plan/hash_join.rs:628:6
   ```
   This isn't an issue when joining a partition column with another string column, I assume because the type coercion rules cast the partition column to Utf8.
   
   **To Reproduce**
   I reproduced this in the path_partition tests, using a slightly modified version of register_partitioned_alltypes_parquet that allows setting the name of the registered table.
   ```
   async fn parquet_join_on_partition_columns() -> Result<()> {
       let mut ctx = ExecutionContext::new();
   
       register_partitioned_alltypes_parquet_with_name(
           &mut ctx,
           &[
               "year=2021/month=09/day=09/file.parquet",
               "year=2021/month=10/day=09/file.parquet",
               "year=2021/month=10/day=28/file.parquet",
           ],
           &["year", "month", "day"],
           "",
           "alltypes_plain.parquet",
           "left"
       )
       .await;
    
   
       register_partitioned_alltypes_parquet_with_name(
           &mut ctx,
           &[
               "year=2021/month=09/day=09/file.parquet",
               "year=2021/month=10/day=09/file.parquet",
               "year=2021/month=10/day=28/file.parquet",
           ],
           &["year", "month", "day"],
           "",
           "alltypes_plain.parquet",
           "right"
       )
       .await;
   
       let _results = ctx.sql("SELECT left.id as lid, right.id as rid from left inner join right on left.month = right.month")
           .await?
           .collect()
           .await?;
   
       Ok(())
   }
   ```
   
   **Expected behavior**
   Query should return results without error.
   
   
   I'm not sure how this should be solved. It would be pretty straightforward to add casts to a string array in either the planner or the join implementation, but this would increase memory usage. Another option would be to add dictionary implementations to the hasher, but this will add a large number of additional branches if other partition column types are eventually supported. I know that a row-based format for joins was being implemented, but I think the same need to cast from dictionary -> concrete type would increase memory usage with the row-based version as well. That would be solvable there with a hashmap, external to the row, that globally manages an integer -> T map and encodes the value of the partitioned column as an integer in the row.
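   
   A minimal sketch of that last idea, using only the standard library. The names here (`PartitionValueInterner`, `intern`, `resolve`) are hypothetical and not part of DataFusion, and partition values are assumed to be strings for simplicity:
   
   ```rust
   use std::collections::HashMap;
   
   /// Hypothetical interner: gives each distinct partition value a stable
   /// integer id, so a row only needs to store the id.
   #[derive(Default)]
   pub struct PartitionValueInterner {
       /// value -> id, used when encoding a row
       ids: HashMap<String, u32>,
       /// id -> value, used when decoding a row
       values: Vec<String>,
   }
   
   impl PartitionValueInterner {
       /// Returns the id for `value`, allocating a new id on first sight.
       pub fn intern(&mut self, value: &str) -> u32 {
           if let Some(&id) = self.ids.get(value) {
               return id;
           }
           let id = self.values.len() as u32;
           self.ids.insert(value.to_owned(), id);
           self.values.push(value.to_owned());
           id
       }
   
       /// Looks up the original value for an id, if that id was ever handed out.
       pub fn resolve(&self, id: u32) -> Option<&str> {
           self.values.get(id as usize).map(String::as_str)
       }
   }
   ```
   
   Two rows with the same partition value get the same id, so join keys could be compared or hashed as integers, and each string would be stored once per distinct value instead of once per row.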
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] AssHero commented on issue #2145: Join on path partitioned columns fails with error

Posted by GitBox <gi...@apache.org>.

AssHero commented on issue #2145:
URL: https://github.com/apache/arrow-datafusion/issues/2145#issuecomment-1146984550

   I think we can check whether the data types are supported by the hash join. If they are not, we can use a cross join instead, and add support for more data types to the hash join later. @pjmore 
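   
   A rough sketch of such a check, using stand-in types rather than the real DataFusion or Arrow ones. `KeyType`, `JoinStrategy`, `hasher_supports` and `choose_join_strategy` are all hypothetical names, and the set of supported types below is only an assumption for illustration:
   
   ```rust
   /// Stand-in for the data type of a join key column (not arrow's DataType).
   #[derive(Debug, Clone, PartialEq)]
   pub enum KeyType {
       Int64,
       Utf8,
       Dictionary(Box<KeyType>),
   }
   
   /// Which physical join to plan.
   #[derive(Debug, PartialEq)]
   pub enum JoinStrategy {
       /// Every key type can be hashed, so a hash join is safe.
       Hash,
       /// Fall back to a cross join with the equality applied as a filter.
       CrossThenFilter,
   }
   
   /// Assumed support matrix: dictionary keys are treated as unsupported here.
   pub fn hasher_supports(key_type: &KeyType) -> bool {
       matches!(key_type, KeyType::Int64 | KeyType::Utf8)
   }
   
   /// Picks the hash join only when every key on both sides is hashable.
   pub fn choose_join_strategy(left: &[KeyType], right: &[KeyType]) -> JoinStrategy {
       if left.iter().chain(right).all(hasher_supports) {
           JoinStrategy::Hash
       } else {
           JoinStrategy::CrossThenFilter
       }
   }
   ```
   
   With a check like this the query in the report would be planned as a cross join plus filter instead of panicking, at the cost of materializing the full cross product.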



[GitHub] [arrow-datafusion] alamb closed issue #2145: Join on path partitioned columns fails with error

Posted by GitBox <gi...@apache.org>.

alamb closed issue #2145: Join on path partitioned columns fails with error
URL: https://github.com/apache/arrow-datafusion/issues/2145

