Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/03/04 09:41:39 UTC

[GitHub] [arrow-datafusion] Igosuki opened a new issue #1923: Local object store accepts file:/// as base path, but LocalStore returns meta without the prefix.

Igosuki opened a new issue #1923:
URL: https://github.com/apache/arrow-datafusion/issues/1923


   **Describe the bug**
   One can register a table with the `file://` scheme; this in turn allows the listing table to list files and find partitions.
   Unfortunately, LocalStore returns a FileMetaStream in which the SizedFile path has the scheme prefix stripped. This would be fine, except that `datafusion::datasource::listing::helpers::parse_partitions_for_path` calls `strip_prefix` on the file path with the original path used to register the table, which still contains the scheme, so the prefix never matches.
   
   There are two ways to fix this: either strip the scheme off the path in the registered table as well (it would probably be best to let the ObjectStore implementation do that), or enhance FileMeta to use a URI instead of just a path.
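
   The mismatch and the first fix can be illustrated with a minimal, std-only sketch. It is not DataFusion code: `strip_file_scheme` and `partition_values` are hypothetical helpers, and `partition_values` is only a simplified stand-in for what the partition parsing step is described as doing above (strip the table path, then read `col=value` segments).

   ```rust
   /// Hypothetical helper (not a DataFusion API): drop a leading `file://`
   /// scheme if present, otherwise return the path unchanged.
   fn strip_file_scheme(path: &str) -> &str {
       path.strip_prefix("file://").unwrap_or(path)
   }

   /// Simplified stand-in for partition parsing: strip the table path from the
   /// file path, then read one `col=value` segment per partition column.
   fn partition_values<'a>(
       table_path: &str,
       file_path: &'a str,
       cols: &[&str],
   ) -> Option<Vec<&'a str>> {
       // Returns None as soon as the table path is not a prefix of the file path.
       let subpath = file_path.strip_prefix(table_path)?;
       let mut segments = subpath.split('/').filter(|s| !s.is_empty());
       let mut values = Vec::with_capacity(cols.len());
       for col in cols {
           let segment = segments.next()?;
           let (name, value) = segment.split_once('=')?;
           if name != *col {
               return None;
           }
           values.push(value);
       }
       Some(values)
   }
   ```

   With the table registered as `file:///tmp/listing_table` and a listed file path of `/tmp/listing_table/part1=value1/a.parquet` (the file name is made up for illustration), `partition_values` returns `None` in this sketch, because the scheme-prefixed table path is not a prefix of the scheme-less file path. Passing the table path through `strip_file_scheme` first makes it return the `value1` partition value.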
   
   **To Reproduce**
   Steps to reproduce the behavior:
   
   `/tmp/listing_table/part1=value1/` and `/tmp/listing_table/part1=value2/`
   should each contain one parquet file.
   
   ```
   let mut ctx = ExecutionContext::new();
   let listing_options = ListingOptions {
       file_extension: "parquet".to_string(),
       format: Arc::new(ParquetFormat::default()),
       // Partition column names are owned strings.
       table_partition_cols: vec!["part1".to_string()],
       collect_stat: true,
       target_partitions: 8,
   };
   // Register the table using a path that includes the `file://` scheme.
   ctx.register_listing_table(
       "my_table",
       "file:///tmp/listing_table",
       listing_options,
       None,
   )
   .await?;

   let df = ctx.sql("select count(*) from my_table").await?;
   let rb = df.collect().await?;
   eprintln!("rb = {:?}", rb);
   ```
   
   **Expected behavior**
   The above should count the rows in the files correctly; with the current behavior it returns 0.
   
   **Additional context**
   I'm trying to be consistent in my project, so I use schemes for both local and remote files. Finding this bug required a lot of debugging.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] houqp commented on issue #1923: Local object store accepts file:/// as base path, but LocalStore returns meta without the prefix.

Posted by GitBox <gi...@apache.org>.
houqp commented on issue #1923:
URL: https://github.com/apache/arrow-datafusion/issues/1923#issuecomment-1066041247


   We ran into the same issue in delta-rs. I think the ideal solution would be to normalize the table path within the object store implementation when it's being created.
   
   The issue with using a URI in FileMeta is that the object stores' list calls do not return full URIs, so we would have to allocate a lot of strings on the heap to construct them. This could be expensive when we need to deal with millions of files.
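
   As a rough illustration of that trade-off, the sketch below normalizes the table path once at construction and then derives each file's relative path by borrowing from the listed path, so no per-file string is allocated. `NormalizedTablePath` is a hypothetical type, not something from DataFusion or delta-rs, and it only handles the `file://` scheme.

   ```rust
   /// Hypothetical wrapper (not a DataFusion or delta-rs type) that stores a
   /// table path normalized once, at construction time.
   struct NormalizedTablePath {
       base: String,
   }

   impl NormalizedTablePath {
       /// Strip a leading `file://` scheme and any trailing slashes.
       /// This is the only allocation, and it happens once per table.
       fn new(uri: &str) -> Self {
           let without_scheme = uri.strip_prefix("file://").unwrap_or(uri);
           let base = without_scheme.trim_end_matches('/').to_string();
           Self { base }
       }

       /// Return the part of `file_path` below the table path, borrowed from
       /// `file_path` itself, or None if the file is not under the table.
       fn relative<'a>(&self, file_path: &'a str) -> Option<&'a str> {
           let rest = file_path.strip_prefix(self.base.as_str())?;
           // Require a path separator so `/tmp/t2/x` is not treated as under `/tmp/t`.
           rest.strip_prefix('/')
       }
   }
   ```

   In this sketch, a table created from `file:///tmp/listing_table/` keeps `/tmp/listing_table` as its base, and `relative` on a listed path under that directory yields a slice such as `part1=value2/b.parquet` (file name made up for illustration) without building a new string for each file.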

