Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/03/03 14:55:12 UTC

[GitHub] [arrow-datafusion] yjshen edited a comment on pull request #1905: Avoid repeated `open` for one single file and simplify object reader API on the `sync` part

yjshen edited a comment on pull request #1905:
URL: https://github.com/apache/arrow-datafusion/pull/1905#issuecomment-1058121557


   Yes, I'm aware of the parallelism the current API exposes. However, it is hard to express or fully utilize in the current execution plan: how should I trigger the parallel chunk fetch while maintaining a serialized read within a single partition? Instead, we have the `PartitionedFile` abstraction, which can be extended with file-slicing ability.
   
   ```rust
   /// A single file that should be read, along with its schema, statistics
   /// and partition column values that need to be appended to each row.
   pub struct PartitionedFile {
       /// Path for the file (e.g. URL, filesystem path, etc)
       pub file_meta: FileMeta,
       /// Values of partition columns to be appended to each row
       pub partition_values: Vec<ScalarValue>,
       // We may include row group range here for a more fine-grained parallel execution
   }
   ```
   
   For example, once parquet scan with row-group selection is enabled (https://github.com/apache/arrow-rs/pull/1389), we could replace the last comment in `PartitionedFile` above with real ranges when we want a finer-grained fetch and execution. To control the parallelism of FileScan execution, we could just tune a `max_byte_per_split` configuration and partition all input files into `Vec<Vec<PartitionedFile>>`, where each `Vec<PartitionedFile>` sums up to roughly the `max_byte_per_split` size, built either from many individual parquet files or from one big slice of one big parquet file.
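
A minimal sketch of the split planning described above, under stated assumptions: `FileSlice` and `plan_splits` are hypothetical names, not part of DataFusion, and the slices are plain byte ranges, whereas a real parquet scan would presumably cut on row-group boundaries. It greedily packs files into groups of at most `max_byte_per_split` bytes and cuts a large file into several slices when it does not fit.

```rust
/// Hypothetical stand-in for a `PartitionedFile` extended with a byte range.
#[derive(Debug, Clone, PartialEq)]
struct FileSlice {
    path: String,
    /// Inclusive start offset of the slice, in bytes.
    start: u64,
    /// Exclusive end offset of the slice, in bytes.
    end: u64,
}

/// Greedily pack `(path, size_in_bytes)` pairs into splits of at most
/// `max_byte_per_split` bytes each. A file larger than the remaining
/// capacity of the current split is cut into several slices.
fn plan_splits(files: &[(String, u64)], max_byte_per_split: u64) -> Vec<Vec<FileSlice>> {
    assert!(max_byte_per_split > 0, "split size must be positive");

    let mut splits = Vec::new();
    let mut current: Vec<FileSlice> = Vec::new();
    let mut remaining = max_byte_per_split;

    for (path, size) in files {
        let mut offset = 0u64;
        // Empty files contribute no slices.
        while offset < *size {
            let take = remaining.min(*size - offset);
            current.push(FileSlice {
                path: path.clone(),
                start: offset,
                end: offset + take,
            });
            offset += take;
            remaining -= take;

            // The current split is full: close it and start a new one.
            if remaining == 0 {
                splits.push(std::mem::take(&mut current));
                remaining = max_byte_per_split;
            }
        }
    }

    // Keep the last, partially filled split.
    if !current.is_empty() {
        splits.push(current);
    }
    splits
}
```

With a limit of 100 bytes, two small files of 30 and 50 bytes would share the first split with the first 20 bytes of a 250-byte file, and the rest of that file would be spread over the following splits. Each inner `Vec` could then map to one partition of the scan.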

