You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "tustvold (via GitHub)" <gi...@apache.org> on 2023/05/29 21:36:18 UTC

[GitHub] [arrow-rs] tustvold opened a new pull request, #4307: Expose page-level arrow reader API (#4298)

tustvold opened a new pull request, #4307:
URL: https://github.com/apache/arrow-rs/pull/4307

   # Which issue does this PR close?
   
   <!--
   We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes #123` indicates that this PR will close issue #123.
   -->
   
   Closes #4298
   
   # Rationale for this change
    
   <!--
   Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed.
   Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes.
   -->
   
   This makes it possible to use the arrow conversion plumbing with custom logic for fetching the sourcing the underlying data pages.
   
   # What changes are included in this PR?
   
   <!--
   There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR.
   -->
   
   # Are there any user-facing changes?
   
   The only changes are to experimental APIs, so no breaking changes
   <!--
   If there are user-facing changes then we may require documentation to be updated before approving the PR.
   -->
   
   <!---
   If there are any breaking changes to public APIs, please add the `breaking change` label.
   -->
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] alamb commented on a diff in pull request #4307: Expose page-level arrow reader API (#4298)

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on code in PR #4307:
URL: https://github.com/apache/arrow-rs/pull/4307#discussion_r1210603684


##########
parquet/src/arrow/schema/complex.rs:
##########
@@ -35,6 +35,7 @@ fn get_repetition(t: &Type) -> Repetition {
 }
 
 /// Representation of a parquet file, in terms of arrow schema elements

Review Comment:
   When reading this PR I see this comment and wonder "is it really a parquet file" or is it more like a "parquet field"? Or a "possibly nested field" 🤔 



##########
parquet/src/arrow/schema/mod.rs:
##########
@@ -88,15 +89,50 @@ pub(crate) fn parquet_to_array_schema_and_fields(
         });
     }
 
-    match complex::convert_schema(parquet_schema, mask, maybe_schema.as_ref())? {
+    let hint = maybe_schema.as_ref().map(|s| s.fields());
+    let field_levels = parquet_to_arrow_field_levels(parquet_schema, mask, hint)?;
+    let schema = Schema::new_with_metadata(field_levels.fields, metadata);
+    Ok((schema, field_levels.levels))
+}
+
+/// Stores the parquet level information for a set of arrow [`Fields`]
+#[derive(Debug, Clone)]
+pub struct FieldLevels {
+    pub(crate) fields: Fields,

Review Comment:
   Is there any way to get access to `fields` and `levels`  outside of the parquet crate? Should there be?
   
   Maybe they need accessors 🤔 Or maybe the point is to pass it opaquely to `try_new_with_row_groups`? If so perhaps that is worth a comment



##########
parquet/src/arrow/schema/mod.rs:
##########
@@ -88,15 +89,50 @@ pub(crate) fn parquet_to_array_schema_and_fields(
         });
     }
 
-    match complex::convert_schema(parquet_schema, mask, maybe_schema.as_ref())? {
+    let hint = maybe_schema.as_ref().map(|s| s.fields());
+    let field_levels = parquet_to_arrow_field_levels(parquet_schema, mask, hint)?;
+    let schema = Schema::new_with_metadata(field_levels.fields, metadata);
+    Ok((schema, field_levels.levels))
+}
+
+/// Stores the parquet level information for a set of arrow [`Fields`]

Review Comment:
   I don't understand why the comment refers to `parquet level information` when the actual structure contains the an optional mapping to a `ParquetField` 🤔 
   
   Maybe the docs are out of date? Or perhaps mean to highlight the fact that the level information  is stored in `ParquetField`?



##########
parquet/src/arrow/arrow_reader/selection.rs:
##########
@@ -173,9 +173,14 @@ impl RowSelection {
         }
     }
 
-    /// Given an offset index, return the offset ranges for all data pages selected by `self`
-    #[cfg(any(test, feature = "async"))]
-    pub(crate) fn scan_ranges(
+    /// Given an offset index, return the byte ranges for all data pages selected by `self`
+    ///
+    /// This is useful for determining what byte ranges to fetch from underlying storage
+    ///
+    /// Note: this method does not make any effort to combine consecutive ranges, nor coalesce

Review Comment:
   👍 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on a diff in pull request #4307: Expose page-level arrow reader API (#4298)

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on code in PR #4307:
URL: https://github.com/apache/arrow-rs/pull/4307#discussion_r1209565707


##########
parquet/src/arrow/schema/mod.rs:
##########
@@ -66,11 +67,11 @@ pub fn parquet_to_arrow_schema_by_columns(
     mask: ProjectionMask,
     key_value_metadata: Option<&Vec<KeyValue>>,
 ) -> Result<Schema> {
-    Ok(parquet_to_array_schema_and_fields(parquet_schema, mask, key_value_metadata)?.0)
+    Ok(parquet_to_arrow_schema_and_fields(parquet_schema, mask, key_value_metadata)?.0)
 }
 
 /// Extracts the arrow metadata
-pub(crate) fn parquet_to_array_schema_and_fields(
+pub(crate) fn parquet_to_arrow_schema_and_fields(

Review Comment:
   I realised the name was a typo, so fixed it :smile: 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] alamb commented on a diff in pull request #4307: Expose page-level arrow reader API (#4298)

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on code in PR #4307:
URL: https://github.com/apache/arrow-rs/pull/4307#discussion_r1211602131


##########
parquet/src/arrow/schema/complex.rs:
##########
@@ -34,7 +34,7 @@ fn get_repetition(t: &Type) -> Repetition {
     }
 }
 
-/// Representation of a parquet file, in terms of arrow schema elements
+/// Representation of a parquet schema element, in terms of arrow schema elements

Review Comment:
   👍 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on a diff in pull request #4307: Expose page-level arrow reader API (#4298)

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on code in PR #4307:
URL: https://github.com/apache/arrow-rs/pull/4307#discussion_r1209565643


##########
parquet/src/arrow/schema/mod.rs:
##########
@@ -88,15 +89,50 @@ pub(crate) fn parquet_to_array_schema_and_fields(
         });
     }
 
-    match complex::convert_schema(parquet_schema, mask, maybe_schema.as_ref())? {
+    let hint = maybe_schema.as_ref().map(|s| s.fields());
+    let field_levels = parquet_to_arrow_field_levels(parquet_schema, mask, hint)?;
+    let schema = Schema::new_with_metadata(field_levels.fields, metadata);
+    Ok((schema, field_levels.levels))
+}
+
+/// Stores the parquet level information for a set of arrow [`Fields`]
+#[derive(Debug, Clone)]
+pub struct FieldLevels {
+    pub(crate) fields: Fields,
+    pub(crate) levels: Option<ParquetField>,
+}
+
+/// Convert a parquet [`SchemaDescriptor`] to its corresponding arrow representation
+///
+/// Columns not included within [`ProjectionMask`] will be ignored.
+///
+/// Where a field type in `hint` is compatible with the corresponding parquet type in `schema`, it
+/// will be used, otherwise the default arrow type for the given parquet column type will be used.
+///
+/// This is to accommodate arrow types that cannot be round-tripped through parquet natively.
+/// Depending on the parquet writer, this can lead to a mismatch between a file's parquet schema
+/// and its embedded arrow schema. The parquet `schema` must be treated as authoritative in such
+/// an event. See [#1663](https://github.com/apache/arrow-rs/issues/1663) for more information
+///
+/// Note: this is a low-level API, most users will want to make use of the higher-level
+/// [`parquet_to_arrow_schema`] for decoding metadata from a parquet file.
+pub fn parquet_to_arrow_field_levels(

Review Comment:
   This is the alternative to making `parquet_to_array_schema_and_fields` public, and does not expose the details of what `ParquetField` is.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on a diff in pull request #4307: Expose page-level arrow reader API (#4298)

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on code in PR #4307:
URL: https://github.com/apache/arrow-rs/pull/4307#discussion_r1209566002


##########
parquet/src/arrow/arrow_reader/mod.rs:
##########
@@ -420,6 +417,30 @@ impl ParquetRecordBatchReader {
             .build()
     }
 
+    /// Create a new [`ParquetRecordBatchReader`] from the provided [`RowGroups`]
+    ///
+    /// Note: this is a low-level interface see [`ParquetRecordBatchReader::try_new`] for a
+    /// higher-level interface for reading parquet data from a file
+    pub fn try_new_with_row_groups(
+        levels: &FieldLevels,
+        row_groups: &dyn RowGroups,
+        batch_size: usize,
+        selection: Option<RowSelection>,
+    ) -> Result<Self> {

Review Comment:
   We don't need to include a `ProjectionMask` as this can be passed to `parquet_to_arrow_field_levels` instead



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] sundy-li commented on pull request #4307: Expose page-level arrow reader API (#4298)

Posted by "sundy-li (via GitHub)" <gi...@apache.org>.
sundy-li commented on PR #4307:
URL: https://github.com/apache/arrow-rs/pull/4307#issuecomment-1568106975

   > What do you think of implementing RowGroups for your custom in-memory representation?
   
   Yes, we would to have our own `RowGroups` implementation but we also want to access the function of `RowSelector` to inject the page index, so how about making `scan_ranges ` public.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] sundy-li commented on pull request #4307: Expose page-level arrow reader API (#4298)

Posted by "sundy-li (via GitHub)" <gi...@apache.org>.
sundy-li commented on PR #4307:
URL: https://github.com/apache/arrow-rs/pull/4307#issuecomment-1569290509

   > I also suggest someone write up a tested example of using these new APIs so they don't get inadvertently broken during some future refactor
   
   Sure, I can help later after this pr is merged.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on pull request #4307: Expose page-level arrow reader API (#4298)

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on PR #4307:
URL: https://github.com/apache/arrow-rs/pull/4307#issuecomment-1567535613

   @sundy-li could you perhaps take a look and see if this would suit your use-case


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on pull request #4307: Expose page-level arrow reader API (#4298)

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on PR #4307:
URL: https://github.com/apache/arrow-rs/pull/4307#issuecomment-1567937454

   They're fairly tightly coupled with how ParquetRecordBatchStream does IO, something you indicated you wished to handle differently. What do you think of implementing `RowGroups` for your custom in-memory representation?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on a diff in pull request #4307: Expose page-level arrow reader API (#4298)

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on code in PR #4307:
URL: https://github.com/apache/arrow-rs/pull/4307#discussion_r1209565325


##########
parquet/src/arrow/arrow_reader/mod.rs:
##########
@@ -420,6 +417,30 @@ impl ParquetRecordBatchReader {
             .build()
     }
 
+    /// Create a new [`ParquetRecordBatchReader`] from the provided [`RowGroups`]
+    ///
+    /// Note: this is a low-level interface see [`ParquetRecordBatchReader::try_new`] for a
+    /// higher-level interface for reading parquet data from a file
+    pub fn try_new_with_row_groups(

Review Comment:
   This lets you construct a `ParquetRecordBatchReader` from an arbitrary source of pages



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold merged pull request #4307: Expose page-level arrow reader API (#4298)

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold merged PR #4307:
URL: https://github.com/apache/arrow-rs/pull/4307


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] sundy-li commented on pull request #4307: Expose page-level arrow reader API (#4298)

Posted by "sundy-li (via GitHub)" <gi...@apache.org>.
sundy-li commented on PR #4307:
URL: https://github.com/apache/arrow-rs/pull/4307#issuecomment-1567626887

   Almost LGTM,  how about exposing `InMemoryRowGroup` ?   Otherwise, we need a similar struct to work.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on a diff in pull request #4307: Expose page-level arrow reader API (#4298)

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on code in PR #4307:
URL: https://github.com/apache/arrow-rs/pull/4307#discussion_r1211427640


##########
parquet/src/arrow/schema/complex.rs:
##########
@@ -35,6 +35,7 @@ fn get_repetition(t: &Type) -> Repetition {
 }
 
 /// Representation of a parquet file, in terms of arrow schema elements

Review Comment:
   Reworded, I agree the use of file here is very misleading



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on a diff in pull request #4307: Expose page-level arrow reader API (#4298)

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on code in PR #4307:
URL: https://github.com/apache/arrow-rs/pull/4307#discussion_r1211415156


##########
parquet/src/arrow/schema/mod.rs:
##########
@@ -88,15 +89,50 @@ pub(crate) fn parquet_to_array_schema_and_fields(
         });
     }
 
-    match complex::convert_schema(parquet_schema, mask, maybe_schema.as_ref())? {
+    let hint = maybe_schema.as_ref().map(|s| s.fields());
+    let field_levels = parquet_to_arrow_field_levels(parquet_schema, mask, hint)?;
+    let schema = Schema::new_with_metadata(field_levels.fields, metadata);
+    Ok((schema, field_levels.levels))
+}
+
+/// Stores the parquet level information for a set of arrow [`Fields`]
+#[derive(Debug, Clone)]
+pub struct FieldLevels {
+    pub(crate) fields: Fields,

Review Comment:
   The intention of this container is to hide the internal notion of what is necessary to decode a field. Will clarify
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org