You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "alamb (via GitHub)" <gi...@apache.org> on 2023/05/30 17:36:29 UTC

[GitHub] [arrow-rs] alamb commented on a diff in pull request #4307: Expose page-level arrow reader API (#4298)

alamb commented on code in PR #4307:
URL: https://github.com/apache/arrow-rs/pull/4307#discussion_r1210603684


##########
parquet/src/arrow/schema/complex.rs:
##########
@@ -35,6 +35,7 @@ fn get_repetition(t: &Type) -> Repetition {
 }
 
 /// Representation of a parquet file, in terms of arrow schema elements

Review Comment:
   When reading this PR I see this comment and wonder "is it really a parquet file" or is it more like a "parquet field"? Or a "possibly nested field" 🤔 



##########
parquet/src/arrow/schema/mod.rs:
##########
@@ -88,15 +89,50 @@ pub(crate) fn parquet_to_array_schema_and_fields(
         });
     }
 
-    match complex::convert_schema(parquet_schema, mask, maybe_schema.as_ref())? {
+    let hint = maybe_schema.as_ref().map(|s| s.fields());
+    let field_levels = parquet_to_arrow_field_levels(parquet_schema, mask, hint)?;
+    let schema = Schema::new_with_metadata(field_levels.fields, metadata);
+    Ok((schema, field_levels.levels))
+}
+
+/// Stores the parquet level information for a set of arrow [`Fields`]
+#[derive(Debug, Clone)]
+pub struct FieldLevels {
+    pub(crate) fields: Fields,

Review Comment:
   Is there any way to get access to `fields` and `levels`  outside of the parquet crate? Should there be?
   
   Maybe they need accessors 🤔 Or maybe the point is to pass it opaquely to `try_new_with_row_groups`? If so perhaps that is worth a comment



##########
parquet/src/arrow/schema/mod.rs:
##########
@@ -88,15 +89,50 @@ pub(crate) fn parquet_to_array_schema_and_fields(
         });
     }
 
-    match complex::convert_schema(parquet_schema, mask, maybe_schema.as_ref())? {
+    let hint = maybe_schema.as_ref().map(|s| s.fields());
+    let field_levels = parquet_to_arrow_field_levels(parquet_schema, mask, hint)?;
+    let schema = Schema::new_with_metadata(field_levels.fields, metadata);
+    Ok((schema, field_levels.levels))
+}
+
+/// Stores the parquet level information for a set of arrow [`Fields`]

Review Comment:
   I don't understand why the comment refers to `parquet level information` when the actual structure contains the an optional mapping to a `ParquetField` 🤔 
   
   Maybe the docs are out of date? Or perhaps mean to highlight the fact that the level information  is stored in `ParquetField`?



##########
parquet/src/arrow/arrow_reader/selection.rs:
##########
@@ -173,9 +173,14 @@ impl RowSelection {
         }
     }
 
-    /// Given an offset index, return the offset ranges for all data pages selected by `self`
-    #[cfg(any(test, feature = "async"))]
-    pub(crate) fn scan_ranges(
+    /// Given an offset index, return the byte ranges for all data pages selected by `self`
+    ///
+    /// This is useful for determining what byte ranges to fetch from underlying storage
+    ///
+    /// Note: this method does not make any effort to combine consecutive ranges, nor coalesce

Review Comment:
   👍 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org