Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/12/23 16:25:35 UTC

[GitHub] [arrow] andygrove commented on a change in pull request #8992: ARROW-11014: [Rust] [DataFusion] Use correct statistics for ParquetExec

andygrove commented on a change in pull request #8992:
URL: https://github.com/apache/arrow/pull/8992#discussion_r548039078



##########
File path: rust/datafusion/src/physical_plan/parquet.rs
##########
@@ -67,14 +67,35 @@ impl ParquetExec {
         if filenames.is_empty() {
             Err(DataFusionError::Plan("No files found".to_string()))
         } else {
+            // Calculate statistics for the entire data set. Later, we will probably want to make
+            // statistics available on a per-partition basis.
+            let mut num_rows = 0;
+            let mut total_byte_size = 0;
+            for file in &filenames {
+                let file = File::open(file)?;
+                let file_reader = Arc::new(SerializedFileReader::new(file)?);

Review comment:
       Good point. I've pushed another change that collects the unique schemas during the same scan of the files, which avoids the separate read. It also checks that all of the files share the same schema; I have wasted time in the past tracking down issues caused by incompatible files. I also added a reference to the issue for implementing schema merging, which would be a nice addition.
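
       A minimal sketch of the single-pass idea described above, not the code in this PR. `Schema`, `FileMeta`, `Summary`, `schema` and `summarize` are hypothetical stand-ins so the example compiles with the standard library alone; the real change reads this information from the Parquet file metadata.

```rust
use std::collections::HashSet;

// Hypothetical stand-in for a file schema: (column name, type name) pairs.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
struct Schema(Vec<(String, String)>);

// Hypothetical stand-in for the metadata read from one file's footer.
struct FileMeta {
    num_rows: usize,
    total_byte_size: usize,
    schema: Schema,
}

// Statistics for the entire data set plus the single schema shared by all files.
#[derive(Debug, PartialEq)]
struct Summary {
    num_rows: usize,
    total_byte_size: usize,
    schema: Schema,
}

// Convenience constructor for the stand-in schema type.
fn schema(cols: &[(&str, &str)]) -> Schema {
    Schema(cols.iter().map(|(n, t)| (n.to_string(), t.to_string())).collect())
}

// Accumulate statistics and collect unique schemas in one pass over the files,
// then fail unless exactly one distinct schema was seen.
fn summarize(files: &[FileMeta]) -> Result<Summary, String> {
    let mut num_rows = 0;
    let mut total_byte_size = 0;
    let mut schemas: HashSet<Schema> = HashSet::new();
    for meta in files {
        num_rows += meta.num_rows;
        total_byte_size += meta.total_byte_size;
        schemas.insert(meta.schema.clone());
    }
    let mut unique = schemas.into_iter();
    match (unique.next(), unique.next()) {
        (None, _) => Err("No files found".to_string()),
        (Some(schema), None) => Ok(Summary { num_rows, total_byte_size, schema }),
        (Some(_), Some(_)) => {
            Err("Files have different schemas; schema merging is not supported".to_string())
        }
    }
}
```

       Collecting into a set keeps the compatibility check independent of file order, and the error branch is where schema merging could later be plugged in.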



