You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/05/21 11:46:23 UTC
[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #2572: Decouple FileFormat from datafusion_data_access

alamb commented on code in PR #2572:
URL: https://github.com/apache/arrow-datafusion/pull/2572#discussion_r878686331


##########
datafusion/core/src/datasource/file_format/mod.rs:
##########
@@ -63,8 +67,9 @@ pub trait FileFormat: Send + Sync + fmt::Debug {
     /// TODO: should the file source return statistics for only columns referred to in the table schema?
     async fn infer_stats(
         &self,
-        reader: Arc<dyn ObjectReader>,
+        store: &dyn ObjectStore,

Review Comment:
   what is the reason for the change to `&dyn ObjectStore`? Couldn't this make it harder to pass the object store on to an inner implementation (that might need an `Arc<ObjectStore>`)?



##########
datafusion/core/src/datasource/file_format/mod.rs:
##########
@@ -63,8 +67,9 @@ pub trait FileFormat: Send + Sync + fmt::Debug {
     /// TODO: should the file source return statistics for only columns referred to in the table schema?
     async fn infer_stats(
         &self,
-        reader: Arc<dyn ObjectReader>,
+        store: &dyn ObjectStore,

Review Comment:
   ```suggestion
           store: &Arc<dyn ObjectStore>,
   ```



##########
datafusion/core/src/datasource/file_format/mod.rs:
##########
@@ -52,7 +52,11 @@ pub trait FileFormat: Send + Sync + fmt::Debug {
     /// be analysed up to a given number of records or files (as specified in the
     /// format config) then give the estimated common schema. This might fail if
     /// the files have schemas that cannot be merged.
-    async fn infer_schema(&self, readers: ObjectReaderStream) -> Result<SchemaRef>;
+    async fn infer_schema(
+        &self,
+        store: &dyn ObjectStore,

Review Comment:
   ```suggestion
           store: &Arc<dyn ObjectStore>,
   ```



##########
datafusion/core/src/datasource/file_format/mod.rs:
##########
@@ -75,3 +80,51 @@ pub trait FileFormat: Send + Sync + fmt::Debug {
         filters: &[Expr],
     ) -> Result<Arc<dyn ExecutionPlan>>;
 }
+
+#[cfg(test)]
+pub(crate) mod test_util {
+    use super::*;
+    use crate::datasource::listing::PartitionedFile;
+    use datafusion_data_access::object_store::local::{
+        local_unpartitioned_file, LocalFileSystem,
+    };
+
+    pub async fn get_exec_format(

Review Comment:
   ```suggestion
       pub async fn scan_from_format(
   ```
   
   I found this name somewhat confusing as it is producing an execution plan based on a file format



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org