Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/02/18 01:49:59 UTC

[GitHub] [arrow-datafusion] Igosuki opened a new issue #1859: UInt8 isn't enough for partitioning values

Igosuki opened a new issue #1859:
URL: https://github.com/apache/arrow-datafusion/issues/1859


   **Describe the bug**
   I hit a limitation in `PartitionColumnProjector` when there are too many partition values: UInt8 dictionary keys can index at most 256 distinct values. Changing the dictionary keys to UInt16 made it work.
   
   **To Reproduce**
   Steps to reproduce the behavior: 
   For instance, I modified the `parquet_multiple_partitions` test in the `path_partitions` module like this:
   ```
   #[tokio::test]
   async fn parquet_multiple_partitions() -> Result<()> {
       let mut ctx = ExecutionContext::new();
       let store_paths = (0..200)
           .map(|i| {
               format!(
                   "first=what/second=ok/year=2021/month={}/day={}/file.parquet",
                   i, i
               )
           })
           .collect::<Vec<String>>();
       let store_paths_refs = store_paths
           .iter()
           .map(|s| s.as_str())
           .collect::<Vec<&str>>();
       register_partitioned_alltypes_parquet(
           &mut ctx,
           &store_paths_refs,
           &["first", "second", "year", "month", "day"],
           "",
           "alltypes_plain.parquet",
       )
       .await;
   
       let result = ctx
           .sql("SELECT id, day FROM t WHERE day=month and first='what' and second='ok' and year='2021' ORDER BY id")
           .await?
           .collect()
           .await?;
       Ok(())
   }
   ```
   With 200 partition values it runs; with 2000 it fails.
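
To illustrate why the key width matters, here is a minimal sketch using only the Rust standard library. It is not DataFusion's or Arrow's actual code: `dictionary_encode` and `month_values` are hypothetical names made up for this example, and the exact failure mode inside DataFusion is not shown in this report. The sketch only demonstrates that a dictionary indexed by `u8` keys cannot address more than 256 distinct values, while `u16` keys handle 2000 without trouble.

```rust
use std::collections::HashMap;
use std::convert::TryFrom;

/// Hypothetical helper (not part of DataFusion): dictionary-encode `values`,
/// storing each distinct value once and referring to it by a key of type `K`.
/// Returns `None` when a dictionary index does not fit into `K`.
fn dictionary_encode<K: TryFrom<usize>>(values: &[String]) -> Option<(Vec<K>, Vec<String>)> {
    let mut dictionary: Vec<String> = Vec::new();
    let mut positions: HashMap<&str, usize> = HashMap::new();
    let mut keys: Vec<K> = Vec::with_capacity(values.len());

    for value in values {
        // Reuse the index of a value seen before, or append it to the dictionary.
        let index = *positions.entry(value.as_str()).or_insert_with(|| {
            dictionary.push(value.clone());
            dictionary.len() - 1
        });
        // The conversion fails as soon as `index` exceeds the key type's range.
        keys.push(K::try_from(index).ok()?);
    }
    Some((keys, dictionary))
}

/// Hypothetical helper: one distinct partition value per `month=` directory,
/// mirroring the paths generated in the modified test.
fn month_values(count: usize) -> Vec<String> {
    (0..count).map(|i| i.to_string()).collect()
}
```

Under these assumptions, `dictionary_encode::<u8>(&month_values(200))` succeeds, `dictionary_encode::<u8>(&month_values(2000))` returns `None`, and `dictionary_encode::<u16>(&month_values(2000))` succeeds, which matches the behavior observed above.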
   
   **Expected behavior**
   DataFusion should not crash, even with thousands of partition values.
   
   **Additional context**
   I'm using pre-partitioned Parquet files to speed up iteration in notebooks.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb closed issue #1859: UInt8 isn't enough for partitioning values

Posted by GitBox <gi...@apache.org>.
alamb closed issue #1859:
URL: https://github.com/apache/arrow-datafusion/issues/1859


   





[GitHub] [arrow-datafusion] Igosuki commented on issue #1859: UInt8 isn't enough for partitioning values

Posted by GitBox <gi...@apache.org>.
Igosuki commented on issue #1859:
URL: https://github.com/apache/arrow-datafusion/issues/1859#issuecomment-1043730403


   P.S.: this is only a fix for 7.0.0, as listing partitions is broken for me on master.

