You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/08/09 12:21:46 UTC

[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #3051: Allow Overriding AsyncFileReader used by ParquetExec

alamb commented on code in PR #3051:
URL: https://github.com/apache/arrow-datafusion/pull/3051#discussion_r941261922


##########
datafusion/core/src/physical_plan/file_format/mod.rs:
##########
@@ -401,6 +404,33 @@ fn create_dict_array(
     ))
 }
 
+/// A single file or part of a file that should be read, along with its schema, statistics
+pub struct FileMeta {

Review Comment:
   This struct makes a lot of sense to me



##########
datafusion/core/src/physical_plan/file_format/parquet.rs:
##########
@@ -390,6 +452,50 @@ impl AsyncFileReader for ParquetFileReader {
     }
 }
 
+impl ParquetFileReaderFactory for DefaultParquetFileReaderFactory {
+    fn create_reader(
+        &self,
+        partition_index: usize,
+        file_meta: FileMeta,
+        metadata_size_hint: Option<usize>,
+        metrics: &ExecutionPlanMetricsSet,
+    ) -> Result<Box<dyn AsyncFileReader + Send>> {
+        let parquet_file_metrics = ParquetFileMetrics::new(
+            partition_index,
+            file_meta.location().as_ref(),
+            metrics,
+        );
+
+        Ok(Box::new(ParquetFileReader {
+            meta: file_meta.object_meta,
+            store: Arc::clone(&self.store),
+            metadata_size_hint,
+            metrics: parquet_file_metrics,
+        }))
+    }
+}
+
+///
+/// BoxedAsyncFileReader has been created to satisfy type requirements of
+/// parquet stream builder constructor.

Review Comment:
   Is this necessary after https://github.com/apache/arrow-rs/pull/2366? If not perhaps we can add a reference here so we can eventually clean it up?



##########
datafusion/core/src/physical_plan/file_format/mod.rs:
##########
@@ -401,6 +404,33 @@ fn create_dict_array(
     ))
 }
 
+/// A single file or part of a file that should be read, along with its schema, statistics

Review Comment:
   Maybe I am missing something but I don't see "schema" and "statistics" on this struct
   
   Perhaps we should refer to "extensions" instead?



##########
datafusion/core/src/physical_plan/file_format/parquet.rs:
##########
@@ -342,7 +379,32 @@ impl FileOpener for ParquetOpener {
                 });
 
             Ok(adapted.boxed())
-        })
+        }))
+    }
+}
+
+/// Factory of parquet file readers.
+///
+/// Provides means to implement custom data access interface.
+pub trait ParquetFileReaderFactory: Debug + Send + Sync + 'static {
+    /// Provides `AsyncFileReader` over parquet file specified in `FileMeta`
+    fn create_reader(
+        &self,
+        partition_index: usize,
+        file_meta: FileMeta,

Review Comment:
   the file meta contains user defined extensions too, right? So users can pass information down into their custom readers?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org