You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/07/20 18:39:06 UTC

[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #2946: Add metadata_size_hint for optimistic fetching of parquet metadata

alamb commented on code in PR #2946:
URL: https://github.com/apache/arrow-datafusion/pull/2946#discussion_r925942388


##########
datafusion/core/src/datasource/file_format/parquet.rs:
##########
@@ -316,17 +352,34 @@ pub(crate) async fn fetch_parquet_metadata(
         )));
     }
 
-    let metadata_start = meta.size - length - 8;
-    let metadata = store
-        .get_range(&meta.location, metadata_start..footer_start)
-        .await?;
+    if length > suffix_len - 8 {

Review Comment:
   ```suggestion
       // Did not fetch the entire file metadata in the initial read, need to make a second request
       if length > suffix_len - 8 {
   ```
   



##########
datafusion/core/src/datasource/file_format/parquet.rs:
##########
@@ -509,6 +564,59 @@ mod tests {
         Ok(())
     }
 
+    #[tokio::test]
+    async fn fetch_metadata_with_size_hint() -> Result<()> {
+        let c1: ArrayRef =
+            Arc::new(StringArray::from(vec![Some("Foo"), None, Some("bar")]));
+
+        let c2: ArrayRef = Arc::new(Int64Array::from(vec![Some(1), Some(2), None]));
+
+        let batch1 = RecordBatch::try_from_iter(vec![("c1", c1.clone())]).unwrap();
+        let batch2 = RecordBatch::try_from_iter(vec![("c2", c2)]).unwrap();
+
+        let store = Arc::new(LocalFileSystem::new()) as _;
+        let (meta, _files) = store_parquet(vec![batch1, batch2]).await?;
+
+        // Use a size hint larger than the parquet footer but smaller than the actual metadata, requiring a second fetch
+        // for the remaining metadata
+        let format = ParquetFormat::default().with_metadata_size_hint(9);
+        let schema = format.infer_schema(&store, &meta).await.unwrap();
+
+        fetch_parquet_metadata(store.as_ref(), &meta[0], Some(9))
+            .await
+            .expect("error reading metadata with hint");
+
+        let stats =
+            fetch_statistics(store.as_ref(), schema.clone(), &meta[0], Some(9)).await?;
+
+        assert_eq!(stats.num_rows, Some(3));
+        let c1_stats = &stats.column_statistics.as_ref().expect("missing c1 stats")[0];
+        let c2_stats = &stats.column_statistics.as_ref().expect("missing c2 stats")[1];
+        assert_eq!(c1_stats.null_count, Some(1));
+        assert_eq!(c2_stats.null_count, Some(3));
+
+        // Use the file size as the hint so we can get the full metadata from the first fetch

Review Comment:
   I wonder if there is any way to test that there was a single request made to the object store as well 🤔 



##########
datafusion/core/src/datasource/file_format/parquet.rs:
##########
@@ -298,13 +327,20 @@ pub(crate) async fn fetch_parquet_metadata(
         )));
     }
 
-    let footer_start = meta.size - 8;
+    let footer_start = if let Some(size_hint) = size_hint {

Review Comment:
   ```suggestion
       // If a size hint is provided, read more than the minimum size 
       // to try and avoid a second fetch.
       let footer_start = if let Some(size_hint) = size_hint {
   ```



##########
datafusion/core/src/datasource/file_format/parquet.rs:
##########
@@ -298,13 +327,20 @@ pub(crate) async fn fetch_parquet_metadata(
         )));
     }
 
-    let footer_start = meta.size - 8;
+    let footer_start = if let Some(size_hint) = size_hint {
+        meta.size - size_hint

Review Comment:
   What happens if size_hit is larger than the file size?
   
   Maybe we could used `saturating_sub` here instead: https://doc.rust-lang.org/std/primitive.usize.html#method.saturating_sub



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org