You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/05/18 15:45:46 UTC

[GitHub] [arrow-datafusion] tustvold opened a new pull request, #2572: Decouple FileFormat from datafusion_data_access

tustvold opened a new pull request, #2572:
URL: https://github.com/apache/arrow-datafusion/pull/2572

   _Draft as builds on #2571 which fixes the avro tests_
   
   # Which issue does this PR close?
   
   Part of #2489 
   
    # Rationale for this change
   
   The IOx ObjectStore does not have an equivalent notion of ObjectReader, decoupling FileFormat from such notions is therefore a necessary precondition of any migration. I also think it makes the API significantly easier to use
   
   # What changes are included in this PR?
   
   Reworks FileFormat to be in terms of FileMeta and ObjectStore, instead of ObjectReader and ObjectReaderStream. These concepts translate cleanly, with FileMeta -> ObjectMeta.
   
   # Are there any user-facing changes?
   
   Yes, FileFormat is a public API


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #2572: Decouple FileFormat from datafusion_data_access

Posted by GitBox <gi...@apache.org>.
alamb commented on code in PR #2572:
URL: https://github.com/apache/arrow-datafusion/pull/2572#discussion_r878686331


##########
datafusion/core/src/datasource/file_format/mod.rs:
##########
@@ -63,8 +67,9 @@ pub trait FileFormat: Send + Sync + fmt::Debug {
     /// TODO: should the file source return statistics for only columns referred to in the table schema?
     async fn infer_stats(
         &self,
-        reader: Arc<dyn ObjectReader>,
+        store: &dyn ObjectStore,

Review Comment:
   what is the reason for the change to `&dyn ObjectStore`? Couldn't this make it harder to pass the object store on to an inner implementation (that might need an `Arc<ObjectStore>`)?



##########
datafusion/core/src/datasource/file_format/mod.rs:
##########
@@ -63,8 +67,9 @@ pub trait FileFormat: Send + Sync + fmt::Debug {
     /// TODO: should the file source return statistics for only columns referred to in the table schema?
     async fn infer_stats(
         &self,
-        reader: Arc<dyn ObjectReader>,
+        store: &dyn ObjectStore,

Review Comment:
   ```suggestion
           store: &Arc<dyn ObjectStore>,
   ```



##########
datafusion/core/src/datasource/file_format/mod.rs:
##########
@@ -52,7 +52,11 @@ pub trait FileFormat: Send + Sync + fmt::Debug {
     /// be analysed up to a given number of records or files (as specified in the
     /// format config) then give the estimated common schema. This might fail if
     /// the files have schemas that cannot be merged.
-    async fn infer_schema(&self, readers: ObjectReaderStream) -> Result<SchemaRef>;
+    async fn infer_schema(
+        &self,
+        store: &dyn ObjectStore,

Review Comment:
   ```suggestion
           store: &Arc<dyn ObjectStore>,
   ```



##########
datafusion/core/src/datasource/file_format/mod.rs:
##########
@@ -75,3 +80,51 @@ pub trait FileFormat: Send + Sync + fmt::Debug {
         filters: &[Expr],
     ) -> Result<Arc<dyn ExecutionPlan>>;
 }
+
+#[cfg(test)]
+pub(crate) mod test_util {
+    use super::*;
+    use crate::datasource::listing::PartitionedFile;
+    use datafusion_data_access::object_store::local::{
+        local_unpartitioned_file, LocalFileSystem,
+    };
+
+    pub async fn get_exec_format(

Review Comment:
   ```suggestion
       pub async fn scan_from_format(
   ```
   
   I found this name somewhat confusing as it is producing an execution plan based on a file format



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] thinkharderdev commented on pull request #2572: Decouple FileFormat from datafusion_data_access

Posted by GitBox <gi...@apache.org>.
thinkharderdev commented on PR #2572:
URL: https://github.com/apache/arrow-datafusion/pull/2572#issuecomment-1133703572

   Seems reasonable to me


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] tustvold commented on a diff in pull request #2572: Decouple FileFormat from datafusion_data_access

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #2572:
URL: https://github.com/apache/arrow-datafusion/pull/2572#discussion_r878690223


##########
datafusion/core/src/datasource/file_format/mod.rs:
##########
@@ -63,8 +67,9 @@ pub trait FileFormat: Send + Sync + fmt::Debug {
     /// TODO: should the file source return statistics for only columns referred to in the table schema?
     async fn infer_stats(
         &self,
-        reader: Arc<dyn ObjectReader>,
+        store: &dyn ObjectStore,

Review Comment:
   Predominantly to avoid being one of the implementations "that might need an `Arc<ObjectStore>`". I didn't see a compelling reason why this method would need a "ownable" reference, but given the rest of the codebase is pretty hard-coded to `Arc<dyn ObjectStore>` I don't feel strongly. Will change back :smile: 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] tustvold commented on a diff in pull request #2572: Decouple FileFormat from datafusion_data_access

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #2572:
URL: https://github.com/apache/arrow-datafusion/pull/2572#discussion_r876064584


##########
datafusion/core/src/datasource/listing/mod.rs:
##########
@@ -88,11 +88,12 @@ impl std::fmt::Display for PartitionedFile {
     }
 }
 
-/// Helper method to fetch the file size and date at given path and create a `FileMeta`
-pub fn local_unpartitioned_file(file: String) -> PartitionedFile {
-    PartitionedFile {
-        file_meta: local::local_unpartitioned_file(file),
-        partition_values: vec![],
-        range: None,
+impl From<FileMeta> for PartitionedFile {

Review Comment:
   This not only makes for reduced verbosity, but avoids a confusing situation where there are two local_unpartitioned_file, one which returns FileMeta and one which returns PartitionedFile



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] tustvold commented on a diff in pull request #2572: Decouple FileFormat from datafusion_data_access

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #2572:
URL: https://github.com/apache/arrow-datafusion/pull/2572#discussion_r880621119


##########
datafusion/core/src/datasource/file_format/mod.rs:
##########
@@ -52,7 +52,11 @@ pub trait FileFormat: Send + Sync + fmt::Debug {
     /// be analysed up to a given number of records or files (as specified in the
     /// format config) then give the estimated common schema. This might fail if
     /// the files have schemas that cannot be merged.
-    async fn infer_schema(&self, readers: ObjectReaderStream) -> Result<SchemaRef>;
+    async fn infer_schema(
+        &self,
+        store: &dyn ObjectStore,

Review Comment:
   I've made this change, but it grates with me... There doesn't seem to be a good reason why an implementation would need an Arc reference, and it makes for lots of opaque constructions like
   
   ```
   let store = Arc::new(LocalFileSystem {}) as _;
   ```
   
   I dunno, `&Arc` feels like a bit of a code-smell, either you want a reference or you want an owned type, I dunno...



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] tustvold merged pull request #2572: Decouple FileFormat from datafusion_data_access

Posted by GitBox <gi...@apache.org>.
tustvold merged PR #2572:
URL: https://github.com/apache/arrow-datafusion/pull/2572


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #2572: Decouple FileFormat from datafusion_data_access

Posted by GitBox <gi...@apache.org>.
alamb commented on code in PR #2572:
URL: https://github.com/apache/arrow-datafusion/pull/2572#discussion_r878686694


##########
datafusion/core/src/datasource/file_format/avro.rs:
##########
@@ -359,36 +358,13 @@ mod tests {
 
     async fn get_exec(
         file_name: &str,
-        projection: &Option<Vec<usize>>,
+        projection: Option<&[usize]>,
         limit: Option<usize>,
     ) -> Result<Arc<dyn ExecutionPlan>> {
         let testdata = crate::test_util::arrow_test_data();
-        let filename = format!("{}/avro/{}", testdata, file_name);
+        let store_root = format!("{}/avro", testdata);
         let format = AvroFormat {};
-        let file_schema = format
-            .infer_schema(local_object_reader_stream(vec![filename.clone()]))
-            .await
-            .expect("Schema inference");
-        let statistics = format
-            .infer_stats(local_object_reader(filename.clone()), file_schema.clone())
-            .await
-            .expect("Stats inference");
-        let file_groups = vec![vec![local_unpartitioned_file(filename.to_owned())]];
-        let exec = format
-            .create_physical_plan(
-                FileScanConfig {
-                    object_store: Arc::new(LocalFileSystem {}),
-                    file_schema,
-                    file_groups,
-                    statistics,
-                    projection: projection.clone(),
-                    limit,
-                    table_partition_cols: vec![],
-                },
-                &[],
-            )
-            .await?;
-        Ok(exec)
+        get_exec_format(&format, &store_root, file_name, projection, limit).await

Review Comment:
   (for other readers, it is common logic to all tests)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #2572: Decouple FileFormat from datafusion_data_access

Posted by GitBox <gi...@apache.org>.
alamb commented on code in PR #2572:
URL: https://github.com/apache/arrow-datafusion/pull/2572#discussion_r880599665


##########
.github/pull_request_template.md:
##########
@@ -25,3 +25,23 @@ If there are user-facing changes then we may require documentation to be updated
 <!--
 If there are any breaking changes to public APIs, please add the `api change` label.
 -->
+
+# Does this PR break compatibility with Ballista?

Review Comment:
   This seems like it doesn't belong in the PR 🤔 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] tustvold commented on a diff in pull request #2572: Decouple FileFormat from datafusion_data_access

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #2572:
URL: https://github.com/apache/arrow-datafusion/pull/2572#discussion_r876251108


##########
datafusion/core/src/datasource/file_format/avro.rs:
##########
@@ -359,36 +358,13 @@ mod tests {
 
     async fn get_exec(
         file_name: &str,
-        projection: &Option<Vec<usize>>,
+        projection: Option<&[usize]>,
         limit: Option<usize>,
     ) -> Result<Arc<dyn ExecutionPlan>> {
         let testdata = crate::test_util::arrow_test_data();
-        let filename = format!("{}/avro/{}", testdata, file_name);
+        let store_root = format!("{}/avro", testdata);
         let format = AvroFormat {};
-        let file_schema = format
-            .infer_schema(local_object_reader_stream(vec![filename.clone()]))
-            .await
-            .expect("Schema inference");
-        let statistics = format
-            .infer_stats(local_object_reader(filename.clone()), file_schema.clone())
-            .await
-            .expect("Stats inference");
-        let file_groups = vec![vec![local_unpartitioned_file(filename.to_owned())]];
-        let exec = format
-            .create_physical_plan(
-                FileScanConfig {
-                    object_store: Arc::new(LocalFileSystem {}),
-                    file_schema,
-                    file_groups,
-                    statistics,
-                    projection: projection.clone(),
-                    limit,
-                    table_partition_cols: vec![],
-                },
-                &[],
-            )
-            .await?;
-        Ok(exec)
+        get_exec_format(&format, &store_root, file_name, projection, limit).await

Review Comment:
   This is a drive-by-fix to extract the common scanning logic that was duplicated for each format



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] yjshen commented on pull request #2572: Decouple FileFormat from datafusion_data_access

Posted by GitBox <gi...@apache.org>.
yjshen commented on PR #2572:
URL: https://github.com/apache/arrow-datafusion/pull/2572#issuecomment-1133920452

   cc @richox 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] tustvold commented on a diff in pull request #2572: Decouple FileFormat from datafusion_data_access

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #2572:
URL: https://github.com/apache/arrow-datafusion/pull/2572#discussion_r880667438


##########
.github/pull_request_template.md:
##########
@@ -25,3 +25,23 @@ If there are user-facing changes then we may require documentation to be updated
 <!--
 If there are any breaking changes to public APIs, please add the `api change` label.
 -->
+
+# Does this PR break compatibility with Ballista?

Review Comment:
   That's from the merge commit where ballista was removed



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org