Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/05/21 09:57:56 UTC

[GitHub] [arrow-rs] tustvold opened a new pull request, #1716: Add explicit column mask construction (#1701)

tustvold opened a new pull request, #1716:
URL: https://github.com/apache/arrow-rs/pull/1716

   # Which issue does this PR close?
   
   Closes #1701.
   
   # Rationale for this change
    
   The current API is confusing, surfaces errors only at runtime, and is liable to accidental misuse - see https://github.com/apache/arrow-datafusion/issues/2453 and https://github.com/apache/arrow-datafusion/issues/2543.
   
   # What changes are included in this PR?
   
   This adds an explicit `ColumnMask` that replaces the iterators of indices. This makes for a cleaner API that should hopefully be harder to misuse accidentally.
   
   In particular, it also adds the ability to construct a `RecordReader` based on a mask of root columns, as opposed to leaf columns. This is the core of the fix for https://github.com/apache/arrow-datafusion/issues/2543
   
   # Are there any user-facing changes?
   
   Yes, this changes the arrow projection API.
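   To make the shape of the change concrete, here is a simplified, hypothetical model of the mask semantics described above (not the real `parquet` crate types): `None` selects every column, otherwise there is one bool per leaf column.

   ```rust
   // Hypothetical, simplified model of the mask semantics (not the real
   // parquet crate types).
   #[derive(Debug, Clone, PartialEq)]
   struct ProjectionMask {
       mask: Option<Vec<bool>>,
   }

   impl ProjectionMask {
       /// Select every column.
       fn all() -> Self {
           Self { mask: None }
       }

       /// Select only the given leaf columns out of `num_leaves` total.
       fn leaves(num_leaves: usize, indices: impl IntoIterator<Item = usize>) -> Self {
           let mut mask = vec![false; num_leaves];
           for i in indices {
               mask[i] = true; // duplicates and out-of-order indices are harmless
           }
           Self { mask: Some(mask) }
       }

       /// Is leaf column `i` selected?
       fn leaf_included(&self, i: usize) -> bool {
           self.mask.as_ref().map(|m| m[i]).unwrap_or(true)
       }
   }

   fn main() {
       let all = ProjectionMask::all();
       let some = ProjectionMask::leaves(5, [3, 0, 4]);
       assert!(all.leaf_included(1));
       assert!(some.leaf_included(0) && !some.leaf_included(1));
       println!("{:?}", some);
   }
   ```

   Callers would then pass such a mask to the readers, rather than raw iterators of indices, so the invariants are enforced at construction time.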
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] codecov-commenter commented on pull request #1716: Add explicit column mask construction (#1701)

Posted by GitBox <gi...@apache.org>.
codecov-commenter commented on PR #1716:
URL: https://github.com/apache/arrow-rs/pull/1716#issuecomment-1133595638

   # [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1716?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#1716](https://codecov.io/gh/apache/arrow-rs/pull/1716?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (489d54a) into [master](https://codecov.io/gh/apache/arrow-rs/commit/a30e787b325fd5579699197db1411df52570cc20?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (a30e787) will **decrease** coverage by `0.00%`.
   > The diff coverage is `80.48%`.
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #1716      +/-   ##
   ==========================================
   - Coverage   83.32%   83.31%   -0.01%     
   ==========================================
     Files         195      196       +1     
     Lines       56023    56007      -16     
   ==========================================
   - Hits        46681    46663      -18     
   - Misses       9342     9344       +2     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow-rs/pull/1716?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [parquet/src/arrow/mod.rs](https://codecov.io/gh/apache/arrow-rs/pull/1716/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvYXJyb3cvbW9kLnJz) | `44.44% <44.44%> (ø)` | |
   | [parquet/src/arrow/schema.rs](https://codecov.io/gh/apache/arrow-rs/pull/1716/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvYXJyb3cvc2NoZW1hLnJz) | `96.81% <75.00%> (+0.02%)` | :arrow_up: |
   | [parquet/src/schema/types.rs](https://codecov.io/gh/apache/arrow-rs/pull/1716/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvc2NoZW1hL3R5cGVzLnJz) | `83.73% <80.00%> (-0.10%)` | :arrow_down: |
   | [parquet/src/arrow/array\_reader/builder.rs](https://codecov.io/gh/apache/arrow-rs/pull/1716/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvYXJyb3cvYXJyYXlfcmVhZGVyL2J1aWxkZXIucnM=) | `93.50% <100.00%> (-0.05%)` | :arrow_down: |
   | [parquet/src/arrow/array\_reader/list\_array.rs](https://codecov.io/gh/apache/arrow-rs/pull/1716/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvYXJyb3cvYXJyYXlfcmVhZGVyL2xpc3RfYXJyYXkucnM=) | `93.41% <100.00%> (+0.05%)` | :arrow_up: |
   | [parquet/src/arrow/arrow\_reader.rs](https://codecov.io/gh/apache/arrow-rs/pull/1716/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvYXJyb3cvYXJyb3dfcmVhZGVyLnJz) | `96.88% <100.00%> (+0.97%)` | :arrow_up: |
   | [parquet/src/arrow/schema/complex.rs](https://codecov.io/gh/apache/arrow-rs/pull/1716/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvYXJyb3cvc2NoZW1hL2NvbXBsZXgucnM=) | `73.77% <100.00%> (-1.01%)` | :arrow_down: |
   | [arrow/src/datatypes/datatype.rs](https://codecov.io/gh/apache/arrow-rs/pull/1716/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2RhdGF0eXBlcy9kYXRhdHlwZS5ycw==) | `65.42% <0.00%> (-0.38%)` | :arrow_down: |
   | [arrow/src/array/transform/mod.rs](https://codecov.io/gh/apache/arrow-rs/pull/1716/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FycmF5L3RyYW5zZm9ybS9tb2QucnM=) | `87.02% <0.00%> (+0.22%)` | :arrow_up: |
   | [parquet\_derive/src/parquet\_field.rs](https://codecov.io/gh/apache/arrow-rs/pull/1716/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldF9kZXJpdmUvc3JjL3BhcnF1ZXRfZmllbGQucnM=) | `65.98% <0.00%> (+0.45%)` | :arrow_up: |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1716?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1716?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [a30e787...489d54a](https://codecov.io/gh/apache/arrow-rs/pull/1716?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   




[GitHub] [arrow-rs] tustvold commented on a diff in pull request #1716: Add explicit column mask construction (#1701)

tustvold commented on code in PR #1716:
URL: https://github.com/apache/arrow-rs/pull/1716#discussion_r878679531


##########
parquet/src/arrow/async_reader.rs:
##########
@@ -166,32 +167,17 @@ impl<T: AsyncRead + AsyncSeek + Unpin> ParquetRecordBatchStreamBuilder<T> {
     }
 
     /// Only read data from the provided column indexes
-    pub fn with_projection(self, projection: Vec<usize>) -> Self {
+    pub fn with_projection(self, mask: ProjectionMask) -> Self {
         Self {
-            projection: Some(projection),
+            projection: mask,
             ..self
         }
     }
 
     /// Build a new [`ParquetRecordBatchStream`]
     pub fn build(self) -> Result<ParquetRecordBatchStream<T>> {
-        let num_columns = self.schema.fields().len();
         let num_row_groups = self.metadata.row_groups().len();
 
-        let columns = match self.projection {
-            Some(projection) => {
-                if let Some(col) = projection.iter().find(|x| **x >= num_columns) {
-                    return Err(general_err!(
-                        "column projection {} outside bounds of schema 0..{}",

Review Comment:
   This check was actually incorrect, as it was validating indices against the arrow schema, not the parquet schema. I think this demonstrates the footgun-prone nature of the old API
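   A toy illustration (with made-up numbers, not the real reader code) of why bounds-checking a leaf projection against the arrow schema is wrong for nested data:

   ```rust
   // For nested data, one arrow root field can map to several parquet leaf
   // columns, so the two schemas have different column counts.
   fn main() {
       // schema: struct a { b, c, d }, e  -> two arrow root fields
       let arrow_fields = 2; // `a` and `e`
       let parquet_leaves = 4; // a.b, a.c, a.d, e

       // Projecting leaf column 3 (`e`) is perfectly valid...
       let leaf_index = 3;
       assert!(leaf_index < parquet_leaves);

       // ...but a check against the arrow field count would have
       // reported it as out of bounds.
       assert!(leaf_index >= arrow_fields);
   }
   ```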





[GitHub] [arrow-rs] tustvold commented on a diff in pull request #1716: Add explicit column mask construction (#1701)

tustvold commented on code in PR #1716:
URL: https://github.com/apache/arrow-rs/pull/1716#discussion_r878677472


##########
parquet/src/arrow/schema.rs:
##########
@@ -155,24 +100,24 @@ fn get_arrow_schema_from_metadata(encoded_meta: &str) -> Result<Schema> {
                 Ok(message) => message
                     .header_as_schema()
                     .map(arrow::ipc::convert::fb_to_schema)
-                    .ok_or(ArrowError("the message is not Arrow Schema".to_string())),
+                    .ok_or(arrow_err!("the message is not Arrow Schema")),

Review Comment:
   Drive by cleanup to move to arrow_err! macro



##########
parquet/src/schema/types.rs:
##########
@@ -847,13 +847,13 @@ pub struct SchemaDescriptor {
     // `schema` in DFS order.
     leaves: Vec<ColumnDescPtr>,
 
-    // Mapping from a leaf column's index to the root column type that it
+    // Mapping from a leaf column's index to the root column index that it
     // comes from. For instance: the leaf `a.b.c.d` would have a link back to `a`:
     // -- a  <-----+
     // -- -- b     |
     // -- -- -- c  |
     // -- -- -- -- d
-    leaf_to_base: Vec<TypePtr>,
+    leaf_to_base: Vec<usize>,

Review Comment:
   Storing the root index, as opposed to a copy of the root pointer, makes it easier to convert from a root mask to a leaf mask
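   A sketch of why a per-leaf root index helps (hypothetical helper, not the actual `SchemaDescriptor` API): expanding a root mask into a leaf mask becomes a simple map over `leaf_to_base`.

   ```rust
   // With a per-leaf root *index*, a root mask expands to a leaf mask by lookup.
   fn root_mask_to_leaf_mask(root_mask: &[bool], leaf_to_base: &[usize]) -> Vec<bool> {
       leaf_to_base.iter().map(|&root| root_mask[root]).collect()
   }

   fn main() {
       // leaves: a.b, a.c, a.d, e  -> root indices: [0, 0, 0, 1]
       let leaf_to_base = [0, 0, 0, 1];
       let root_mask = [true, false]; // select root `a` only
       let leaf_mask = root_mask_to_leaf_mask(&root_mask, &leaf_to_base);
       assert_eq!(leaf_mask, vec![true, true, true, false]);
   }
   ```

   With a root pointer instead of an index, the same conversion would require comparing types by name rather than a direct lookup.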



##########
parquet/src/arrow/mod.rs:
##########
@@ -133,11 +140,71 @@ pub use self::arrow_reader::ParquetFileArrowReader;
 pub use self::arrow_writer::ArrowWriter;
 #[cfg(feature = "async")]
 pub use self::async_reader::ParquetRecordBatchStreamBuilder;
+use crate::schema::types::SchemaDescriptor;
 
 pub use self::schema::{
     arrow_to_parquet_schema, parquet_to_arrow_schema, parquet_to_arrow_schema_by_columns,
-    parquet_to_arrow_schema_by_root_columns,
 };
 
 /// Schema metadata key used to store serialized Arrow IPC schema
 pub const ARROW_SCHEMA_META_KEY: &str = "ARROW:schema";
+
+/// A [`ProjectionMask`] identifies a set of columns within a potentially nested schema to project
+#[derive(Debug, Clone)]
+pub struct ProjectionMask {
+    /// A mask of
+    mask: Option<Vec<bool>>,
+}
+
+impl ProjectionMask {
+    /// Create a [`ColumnMask`] which selects all columns
+    pub fn all() -> Self {
+        Self { mask: None }
+    }
+
+    /// Create a [`ColumnMask`] which selects only the specified leaf columns
+    ///
+    /// Note: repeated or out of order indices will not impact the final mask
+    ///
+    /// i.e. `[0, 1, 2]` will construct the same mask as `[1, 0, 0, 2]`
+    pub fn leaves(
+        schema: &SchemaDescriptor,

Review Comment:
   The mask could theoretically carry this along and use it as a sanity check, but I'm inclined to think that if a user constructs a mask from one schema and then uses it on a different schema, it's not something we can reasonably be expected to handle sensibly





[GitHub] [arrow-rs] tustvold commented on a diff in pull request #1716: Add explicit column mask construction in parquet: `ProjectionMask` (#1701)

tustvold commented on code in PR #1716:
URL: https://github.com/apache/arrow-rs/pull/1716#discussion_r879404146


##########
parquet/src/arrow/schema.rs:
##########
@@ -51,74 +52,18 @@ pub fn parquet_to_arrow_schema(
 ) -> Result<Schema> {
     parquet_to_arrow_schema_by_columns(
         parquet_schema,
-        0..parquet_schema.columns().len(),
+        ProjectionMask::all(),
         key_value_metadata,
     )
 }
 
-/// Convert parquet schema to arrow schema including optional metadata,
-/// only preserving some root columns.
-/// This is useful if we have columns `a.b`, `a.c.e` and `a.d`,
-/// and want `a` with all its child fields
-pub fn parquet_to_arrow_schema_by_root_columns<T>(
-    parquet_schema: &SchemaDescriptor,
-    column_indices: T,
-    key_value_metadata: Option<&Vec<KeyValue>>,
-) -> Result<Schema>
-where
-    T: IntoIterator<Item = usize>,
-{
-    // Reconstruct the index ranges of the parent columns
-    // An Arrow struct gets represented by 1+ columns based on how many child fields the
-    // struct has. This means that getting fields 1 and 2 might return the struct twice,
-    // if field 1 is the struct having say 3 fields, and field 2 is a primitive.
-    //
-    // The below gets the parent columns, and counts the number of child fields in each parent,
-    // such that we would end up with:
-    // - field 1 - columns: [0, 1, 2]
-    // - field 2 - columns: [3]
-    let mut parent_columns = vec![];
-    let mut curr_name = "";
-    let mut prev_name = "";
-    let mut indices = vec![];
-    (0..(parquet_schema.num_columns())).for_each(|i| {
-        let p_type = parquet_schema.get_column_root(i);
-        curr_name = p_type.get_basic_info().name();
-        if prev_name.is_empty() {
-            // first index
-            indices.push(i);
-            prev_name = curr_name;
-        } else if curr_name != prev_name {
-            prev_name = curr_name;
-            parent_columns.push((curr_name.to_string(), indices.clone()));
-            indices = vec![i];
-        } else {
-            indices.push(i);
-        }
-    });
-    // push the last column if indices has values
-    if !indices.is_empty() {
-        parent_columns.push((curr_name.to_string(), indices));
-    }
-
-    // gather the required leaf columns
-    let leaf_columns = column_indices
-        .into_iter()
-        .flat_map(|i| parent_columns[i].1.clone());
-
-    parquet_to_arrow_schema_by_columns(parquet_schema, leaf_columns, key_value_metadata)
-}
-
 /// Convert parquet schema to arrow schema including optional metadata,
 /// only preserving some leaf columns.
-pub fn parquet_to_arrow_schema_by_columns<T>(
+pub fn parquet_to_arrow_schema_by_columns(
     parquet_schema: &SchemaDescriptor,
-    column_indices: T,
+    mask: ProjectionMask,

Review Comment:
   Currently yes, it gets moved into the Visitor. Theoretically it could borrow and have lifetimes, but in most cases I suspect we have the mask by value anyway.
   
   Edit: This might be an argument for moving to the arrow `Bitmap`, as that is internally refcounted... a future PR, methinks





[GitHub] [arrow-rs] tustvold commented on a diff in pull request #1716: Add explicit column mask construction in parquet: `ProjectionMask` (#1701)

tustvold commented on code in PR #1716:
URL: https://github.com/apache/arrow-rs/pull/1716#discussion_r879396663


##########
parquet/src/arrow/mod.rs:
##########
@@ -133,11 +140,71 @@ pub use self::arrow_reader::ParquetFileArrowReader;
 pub use self::arrow_writer::ArrowWriter;
 #[cfg(feature = "async")]
 pub use self::async_reader::ParquetRecordBatchStreamBuilder;
+use crate::schema::types::SchemaDescriptor;
 
 pub use self::schema::{
     arrow_to_parquet_schema, parquet_to_arrow_schema, parquet_to_arrow_schema_by_columns,
-    parquet_to_arrow_schema_by_root_columns,
 };
 
 /// Schema metadata key used to store serialized Arrow IPC schema
 pub const ARROW_SCHEMA_META_KEY: &str = "ARROW:schema";
+
+/// A [`ProjectionMask`] identifies a set of columns within a potentially nested schema to project
+#[derive(Debug, Clone)]
+pub struct ProjectionMask {
+    /// A mask of

Review Comment:
   > Also, since we have Bitmap and all the associated handling in Arrow, I wonder if it is worth using that (though a Vec<bool> is nice and simple
   
   Let's stick with simple, and maybe if/when we promote this construct to arrow-rs we can switch to using Bitmap





[GitHub] [arrow-rs] tustvold merged pull request #1716: Add explicit column mask construction in parquet: `ProjectionMask` (#1701)

tustvold merged PR #1716:
URL: https://github.com/apache/arrow-rs/pull/1716




[GitHub] [arrow-rs] tustvold commented on a diff in pull request #1716: Add explicit column mask construction in parquet: `ProjectionMask` (#1701)

tustvold commented on code in PR #1716:
URL: https://github.com/apache/arrow-rs/pull/1716#discussion_r879392488


##########
parquet/src/arrow/schema.rs:
##########
@@ -155,24 +100,24 @@ fn get_arrow_schema_from_metadata(encoded_meta: &str) -> Result<Schema> {
                 Ok(message) => message
                     .header_as_schema()
                     .map(arrow::ipc::convert::fb_to_schema)
-                    .ok_or(ArrowError("the message is not Arrow Schema".to_string())),
+                    .ok_or(arrow_err!("the message is not Arrow Schema")),

Review Comment:
   It's consistent with how errors are handled elsewhere in the crate, with arrow_err!, general_err!, etc...





[GitHub] [arrow-rs] alamb commented on a diff in pull request #1716: Add explicit column mask construction in parquet: `ProjectionMask` (#1701)

alamb commented on code in PR #1716:
URL: https://github.com/apache/arrow-rs/pull/1716#discussion_r878681742


##########
parquet/src/arrow/mod.rs:
##########
@@ -133,11 +140,71 @@ pub use self::arrow_reader::ParquetFileArrowReader;
 pub use self::arrow_writer::ArrowWriter;
 #[cfg(feature = "async")]
 pub use self::async_reader::ParquetRecordBatchStreamBuilder;
+use crate::schema::types::SchemaDescriptor;
 
 pub use self::schema::{
     arrow_to_parquet_schema, parquet_to_arrow_schema, parquet_to_arrow_schema_by_columns,
-    parquet_to_arrow_schema_by_root_columns,
 };
 
 /// Schema metadata key used to store serialized Arrow IPC schema
 pub const ARROW_SCHEMA_META_KEY: &str = "ARROW:schema";
+
+/// A [`ProjectionMask`] identifies a set of columns within a potentially nested schema to project
+#[derive(Debug, Clone)]
+pub struct ProjectionMask {
+    /// A mask of
+    mask: Option<Vec<bool>>,
+}
+
+impl ProjectionMask {
+    /// Create a [`ProjectionMask`] which selects all columns
+    pub fn all() -> Self {
+        Self { mask: None }
+    }
+
+    /// Create a [`ProjectionMask`] which selects only the specified leaf columns

Review Comment:
   Can you please explain (or provide a link to something that explains) leaves and roots and what "order" they are in. I think it refers to the parquet schema (or maybe the arrow schema and the types within Structs / Lists / other nested types?)
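   For illustration, a toy example of the terminology being asked about (assumed semantics, inferred from the surrounding diffs rather than stated docs): "leaf" columns are the primitive columns of the parquet schema in depth-first order, while "root" columns are the top-level fields they descend from.

   ```rust
   // Toy example of leaf vs root columns for a nested parquet schema.
   fn main() {
       // parquet schema:
       //   message {
       //     a { b, c { e }, d }
       //     f
       //   }
       let leaves = ["a.b", "a.c.e", "a.d", "f"]; // leaf indices 0..=3, DFS order
       let roots = ["a", "f"]; // root indices 0..=1
       let leaf_to_root = [0, 0, 0, 1]; // each leaf's root index
       assert_eq!(roots[leaf_to_root[1]], "a"); // leaf `a.c.e` belongs to root `a`
       assert_eq!(roots[leaf_to_root[3]], "f");
       println!("{} leaves over {} roots", leaves.len(), roots.len());
   }
   ```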



##########
parquet/src/arrow/schema.rs:
##########
@@ -51,74 +52,18 @@ pub fn parquet_to_arrow_schema(
 ) -> Result<Schema> {
     parquet_to_arrow_schema_by_columns(
         parquet_schema,
-        0..parquet_schema.columns().len(),
+        ProjectionMask::all(),

Review Comment:
   ❤️ 



##########
parquet/src/arrow/mod.rs:
##########
@@ -133,11 +140,71 @@ pub use self::arrow_reader::ParquetFileArrowReader;
 pub use self::arrow_writer::ArrowWriter;
 #[cfg(feature = "async")]
 pub use self::async_reader::ParquetRecordBatchStreamBuilder;
+use crate::schema::types::SchemaDescriptor;
 
 pub use self::schema::{
     arrow_to_parquet_schema, parquet_to_arrow_schema, parquet_to_arrow_schema_by_columns,
-    parquet_to_arrow_schema_by_root_columns,
 };
 
 /// Schema metadata key used to store serialized Arrow IPC schema
 pub const ARROW_SCHEMA_META_KEY: &str = "ARROW:schema";
+
+/// A [`ProjectionMask`] identifies a set of columns within a potentially nested schema to project
+#[derive(Debug, Clone)]
+pub struct ProjectionMask {
+    /// A mask of
+    mask: Option<Vec<bool>>,
+}
+
+impl ProjectionMask {
+    /// Create a [`ColumnMask`] which selects all columns
+    pub fn all() -> Self {
+        Self { mask: None }
+    }
+
+    /// Create a [`ColumnMask`] which selects only the specified leaf columns
+    ///
+    /// Note: repeated or out of order indices will not impact the final mask
+    ///
+    /// i.e. `[0, 1, 2]` will construct the same mask as `[1, 0, 0, 2]`
+    pub fn leaves(
+        schema: &SchemaDescriptor,

Review Comment:
   As long as a useful error is produced, I agree this is fine behavior



##########
parquet/src/arrow/schema.rs:
##########
@@ -1188,9 +1112,9 @@ mod tests {
         // required int64 leaf5;
 
         let parquet_schema = SchemaDescriptor::new(Arc::new(parquet_group_type));
+        let mask = ProjectionMask::leaves(&parquet_schema, [3, 0, 4]);

Review Comment:
   this is much nicer to read / reason about



##########
parquet/src/arrow/mod.rs:
##########
@@ -133,11 +140,71 @@ pub use self::arrow_reader::ParquetFileArrowReader;
 pub use self::arrow_writer::ArrowWriter;
 #[cfg(feature = "async")]
 pub use self::async_reader::ParquetRecordBatchStreamBuilder;
+use crate::schema::types::SchemaDescriptor;
 
 pub use self::schema::{
     arrow_to_parquet_schema, parquet_to_arrow_schema, parquet_to_arrow_schema_by_columns,
-    parquet_to_arrow_schema_by_root_columns,
 };
 
 /// Schema metadata key used to store serialized Arrow IPC schema
 pub const ARROW_SCHEMA_META_KEY: &str = "ARROW:schema";
+
+/// A [`ProjectionMask`] identifies a set of columns within a potentially nested schema to project
+#[derive(Debug, Clone)]
+pub struct ProjectionMask {
+    /// A mask of

Review Comment:
   The comment seems to be truncated.
   
   Also, since we have `Bitmap` and all the associated handling in Arrow, I wonder if it is worth using that (though a `Vec<bool>` is nice and simple



##########
parquet/src/arrow/schema.rs:
##########
@@ -51,74 +52,18 @@ pub fn parquet_to_arrow_schema(
 ) -> Result<Schema> {
     parquet_to_arrow_schema_by_columns(
         parquet_schema,
-        0..parquet_schema.columns().len(),
+        ProjectionMask::all(),
         key_value_metadata,
     )
 }
 
-/// Convert parquet schema to arrow schema including optional metadata,
-/// only preserving some root columns.
-/// This is useful if we have columns `a.b`, `a.c.e` and `a.d`,
-/// and want `a` with all its child fields
-pub fn parquet_to_arrow_schema_by_root_columns<T>(
-    parquet_schema: &SchemaDescriptor,
-    column_indices: T,
-    key_value_metadata: Option<&Vec<KeyValue>>,
-) -> Result<Schema>
-where
-    T: IntoIterator<Item = usize>,
-{
-    // Reconstruct the index ranges of the parent columns
-    // An Arrow struct gets represented by 1+ columns based on how many child fields the
-    // struct has. This means that getting fields 1 and 2 might return the struct twice,
-    // if field 1 is the struct having say 3 fields, and field 2 is a primitive.
-    //
-    // The below gets the parent columns, and counts the number of child fields in each parent,
-    // such that we would end up with:
-    // - field 1 - columns: [0, 1, 2]
-    // - field 2 - columns: [3]
-    let mut parent_columns = vec![];
-    let mut curr_name = "";
-    let mut prev_name = "";
-    let mut indices = vec![];
-    (0..(parquet_schema.num_columns())).for_each(|i| {
-        let p_type = parquet_schema.get_column_root(i);
-        curr_name = p_type.get_basic_info().name();
-        if prev_name.is_empty() {
-            // first index
-            indices.push(i);
-            prev_name = curr_name;
-        } else if curr_name != prev_name {
-            prev_name = curr_name;
-            parent_columns.push((curr_name.to_string(), indices.clone()));
-            indices = vec![i];
-        } else {
-            indices.push(i);
-        }
-    });
-    // push the last column if indices has values
-    if !indices.is_empty() {
-        parent_columns.push((curr_name.to_string(), indices));
-    }
-
-    // gather the required leaf columns
-    let leaf_columns = column_indices
-        .into_iter()
-        .flat_map(|i| parent_columns[i].1.clone());
-
-    parquet_to_arrow_schema_by_columns(parquet_schema, leaf_columns, key_value_metadata)
-}
-
 /// Convert parquet schema to arrow schema including optional metadata,
 /// only preserving some leaf columns.
-pub fn parquet_to_arrow_schema_by_columns<T>(
+pub fn parquet_to_arrow_schema_by_columns(
     parquet_schema: &SchemaDescriptor,
-    column_indices: T,
+    mask: ProjectionMask,

Review Comment:
   I wonder if this needs an owned mask or if it could be taken by reference
   
   ```suggestion
       mask: &ProjectionMask,
   ```



##########
parquet/src/arrow/mod.rs:
##########
@@ -133,11 +140,71 @@ pub use self::arrow_reader::ParquetFileArrowReader;
 pub use self::arrow_writer::ArrowWriter;
 #[cfg(feature = "async")]
 pub use self::async_reader::ParquetRecordBatchStreamBuilder;
+use crate::schema::types::SchemaDescriptor;
 
 pub use self::schema::{
     arrow_to_parquet_schema, parquet_to_arrow_schema, parquet_to_arrow_schema_by_columns,
-    parquet_to_arrow_schema_by_root_columns,
 };
 
 /// Schema metadata key used to store serialized Arrow IPC schema
 pub const ARROW_SCHEMA_META_KEY: &str = "ARROW:schema";
+
+/// A [`ProjectionMask`] identifies a set of columns within a potentially nested schema to project
+#[derive(Debug, Clone)]
+pub struct ProjectionMask {
+    /// A mask of

Review Comment:
   Describing in the docstring that a mask of `None` means `All` is probably also a good idea as well as which schema a `ProjectionMask`s indexes refer to (I think Parquet)



##########
parquet/src/arrow/mod.rs:
##########
@@ -133,11 +140,71 @@ pub use self::arrow_reader::ParquetFileArrowReader;
 pub use self::arrow_writer::ArrowWriter;
 #[cfg(feature = "async")]
 pub use self::async_reader::ParquetRecordBatchStreamBuilder;
+use crate::schema::types::SchemaDescriptor;
 
 pub use self::schema::{
     arrow_to_parquet_schema, parquet_to_arrow_schema, parquet_to_arrow_schema_by_columns,
-    parquet_to_arrow_schema_by_root_columns,
 };
 
 /// Schema metadata key used to store serialized Arrow IPC schema
 pub const ARROW_SCHEMA_META_KEY: &str = "ARROW:schema";
+
+/// A [`ProjectionMask`] identifies a set of columns within a potentially nested schema to project
+#[derive(Debug, Clone)]
+pub struct ProjectionMask {
+    /// A mask of
+    mask: Option<Vec<bool>>,
+}
+
+impl ProjectionMask {
+    /// Create a [`ProjectionMask`] which selects all columns
+    pub fn all() -> Self {
+        Self { mask: None }
+    }
+
+    /// Create a [`ProjectionMask`] which selects only the specified leaf columns
+    ///
+    /// Note: repeated or out of order indices will not impact the final mask

Review Comment:
   Nice -- so the idea is that you enforce masks having in-order indices by wrapping them in a `ProjectionMask`, which enforces this invariant during construction 👍 
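   A toy demonstration of that invariant (simplified, not the real constructor): because the constructor materializes a boolean mask, index order and duplicates cannot matter.

   ```rust
   // Build a boolean mask from leaf indices; order and repeats are irrelevant.
   fn leaves(num_leaves: usize, indices: &[usize]) -> Vec<bool> {
       let mut mask = vec![false; num_leaves];
       for &i in indices {
           mask[i] = true;
       }
       mask
   }

   fn main() {
       assert_eq!(leaves(4, &[0, 1, 2]), leaves(4, &[1, 0, 0, 2]));
   }
   ```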



##########
parquet/src/arrow/schema.rs:
##########
@@ -155,24 +100,24 @@ fn get_arrow_schema_from_metadata(encoded_meta: &str) -> Result<Schema> {
                 Ok(message) => message
                     .header_as_schema()
                     .map(arrow::ipc::convert::fb_to_schema)
-                    .ok_or(ArrowError("the message is not Arrow Schema".to_string())),
+                    .ok_or(arrow_err!("the message is not Arrow Schema")),

Review Comment:
   Using `arrow_err!` is not obviously better to me: it is the same verbosity, but now requires looking up / knowing what the `arrow_err!` macro does ;)
   
   But it is not worse either





[GitHub] [arrow-rs] tustvold commented on pull request #1716: Add explicit column mask construction (#1701)

tustvold commented on PR #1716:
URL: https://github.com/apache/arrow-rs/pull/1716#issuecomment-1133596264

   As a happy side-effect, this actually found and fixed a bug in the handling of nested projection pushdown in `ParquetRecordBatchStream`



