You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "tustvold (via GitHub)" <gi...@apache.org> on 2023/11/29 13:46:16 UTC

[PR] Support nested schema projection (#5148) [arrow-rs]

tustvold opened a new pull request, #5149:
URL: https://github.com/apache/arrow-rs/pull/5149

   # Which issue does this PR close?
   
   <!--
   We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes #123` indicates that this PR will close issue #123.
   -->
   
   Closes #5148.
   
   # Rationale for this change
    
   <!--
   Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed.
   Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes.
   -->
   
   See ticket
   
   
   # What changes are included in this PR?
   
   <!--
   There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR.
   -->
   
   # Are there any user-facing changes?
   
   
   <!--
   If there are user-facing changes then we may require documentation to be updated before approving the PR.
   -->
   
   <!---
   If there are any breaking changes to public APIs, please add the `breaking change` label.
   -->
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Support nested schema projection (#5148) [arrow-rs]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on PR #5149:
URL: https://github.com/apache/arrow-rs/pull/5149#issuecomment-1832924528

   > > One question I still have is does this API support the usecase in #5149 (comment) (selecting a nested subfield)?
   > 
   > Yes, I responded to the comment as such?
   
   Sorry -- the github UI is confusing me. I see the response in https://github.com/apache/arrow-rs/pull/5149#discussion_r1409762734 now. Thanks
   
   <img width="948" alt="Screenshot 2023-11-29 at 7 41 04 PM" src="https://github.com/apache/arrow-rs/assets/490673/717f230d-34b6-46e7-b287-f43a92d1f274">
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Support nested schema projection (#5148) [arrow-rs]

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold merged PR #5149:
URL: https://github.com/apache/arrow-rs/pull/5149


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Support nested schema projection (#5148) [arrow-rs]

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on code in PR #5149:
URL: https://github.com/apache/arrow-rs/pull/5149#discussion_r1409762734


##########
arrow-schema/src/fields.rs:
##########
@@ -99,6 +100,90 @@ impl Fields {
                 .all(|(a, b)| Arc::ptr_eq(a, b) || a.contains(b))
     }
 
+    /// Performs a depth-first scan of [`Fields`] filtering the [`FieldRef`] with no children
+    ///
+    /// Returns a new [`Fields`] comprising the [`FieldRef`] for which `filter` returned `true`
+    ///
+    /// ```
+    /// # use arrow_schema::{DataType, Field, Fields};
+    /// let fields = Fields::from(vec![
+    ///     Field::new("a", DataType::Int32, true),
+    ///     Field::new("b", DataType::Struct(Fields::from(vec![
+    ///         Field::new("c", DataType::Float32, false),
+    ///         Field::new("d", DataType::Float64, false),
+    ///     ])), false)
+    /// ]);
+    /// let filtered = fields.filter_leaves(|idx, _| idx == 0 || idx == 2);
+    /// let expected = Fields::from(vec![
+    ///     Field::new("a", DataType::Int32, true),
+    ///     Field::new("b", DataType::Struct(Fields::from(vec![
+    ///         Field::new("d", DataType::Float64, false),
+    ///     ])), false)
+    /// ]);
+    /// assert_eq!(filtered, expected);
+    /// ```
+    pub fn filter_leaves<F: FnMut(usize, &FieldRef) -> bool>(&self, mut filter: F) -> Self {

Review Comment:
   You would select b by selecting all its children, if you call filter with intermediate fields you end up in a position of how much context do you provide, just the parent, what about the parent's parent - reasoning in terms of leaves is the only coherent mechanism I can think of



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Support nested schema projection (#5148) [arrow-rs]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on PR #5149:
URL: https://github.com/apache/arrow-rs/pull/5149#issuecomment-1832748799

   Looks very nice @tustvold  -- thank you


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Support nested schema projection (#5148) [arrow-rs]

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on code in PR #5149:
URL: https://github.com/apache/arrow-rs/pull/5149#discussion_r1409822415


##########
arrow-schema/src/fields.rs:
##########
@@ -99,6 +100,93 @@ impl Fields {
                 .all(|(a, b)| Arc::ptr_eq(a, b) || a.contains(b))
     }
 
+    /// Performs a depth-first scan of [`Fields`] filtering the [`FieldRef`] with no children
+    ///
+    /// Invokes `filter` with each leaf [`FieldRef`], i.e. one containing no children, and a
+    /// count of the number of previous calls to `filter` - i.e. the leaf's index.

Review Comment:
   I tried to clarify this in https://github.com/apache/arrow-rs/pull/5149/commits/8d9795069c8244820b2107c76f46d9d484ee4740



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Support nested schema projection (#5148) [arrow-rs]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on code in PR #5149:
URL: https://github.com/apache/arrow-rs/pull/5149#discussion_r1409732355


##########
arrow-schema/src/fields.rs:
##########
@@ -99,6 +100,93 @@ impl Fields {
                 .all(|(a, b)| Arc::ptr_eq(a, b) || a.contains(b))
     }
 
+    /// Performs a depth-first scan of [`Fields`] filtering the [`FieldRef`] with no children

Review Comment:
   Here is a proposal of how to better phrase what this is doing (as well as give a usecase that might not be obvious to the casual reader)
   
   ```suggestion
       /// Returns a copy of this [`Fields`], with only those non nested (leaf) [`FieldRef`]s that 
       /// pass a predicate.  
       ///
       /// This can be used to select only a subset of fields in nested types such as 
       /// [`DataType::Struct`].
       /// Leaf [`FieldRef`]s `DataType`s have no nested `FieldRef`s`. For example
       /// a field with [`DatatType::Int32`] would be a leaf.
   ```



##########
arrow-schema/src/fields.rs:
##########
@@ -99,6 +100,93 @@ impl Fields {
                 .all(|(a, b)| Arc::ptr_eq(a, b) || a.contains(b))
     }
 
+    /// Performs a depth-first scan of [`Fields`] filtering the [`FieldRef`] with no children
+    ///
+    /// Invokes `filter` with each leaf [`FieldRef`], i.e. one containing no children, and a
+    /// count of the number of previous calls to `filter` - i.e. the leaf's index.
+    ///
+    /// Returns a new [`Fields`] comprising the [`FieldRef`] for which `filter` returned `true`
+    ///
+    /// ```
+    /// # use arrow_schema::{DataType, Field, Fields};
+    /// let fields = Fields::from(vec![
+    ///     Field::new("a", DataType::Int32, true),
+    ///     Field::new("b", DataType::Struct(Fields::from(vec![
+    ///         Field::new("c", DataType::Float32, false),
+    ///         Field::new("d", DataType::Float64, false),
+    ///     ])), false)
+    /// ]);
+    /// let filtered = fields.filter_leaves(|idx, _| idx == 0 || idx == 2);
+    /// let expected = Fields::from(vec![
+    ///     Field::new("a", DataType::Int32, true),
+    ///     Field::new("b", DataType::Struct(Fields::from(vec![
+    ///         Field::new("d", DataType::Float64, false),
+    ///     ])), false)
+    /// ]);
+    /// assert_eq!(filtered, expected);
+    /// ```
+    pub fn filter_leaves<F: FnMut(usize, &FieldRef) -> bool>(&self, mut filter: F) -> Self {
+        fn filter_field<F: FnMut(&FieldRef) -> bool>(
+            f: &FieldRef,
+            filter: &mut F,
+        ) -> Option<FieldRef> {
+            use DataType::*;
+
+            let (k, v) = match f.data_type() {
+                Dictionary(k, v) => (Some(k.clone()), v.as_ref()),

Review Comment:
   this presumes the key itself doesn't have any leaves which I think is reasonable, but figured I would point it out.
   
   



##########
arrow-schema/src/fields.rs:
##########
@@ -99,6 +100,93 @@ impl Fields {
                 .all(|(a, b)| Arc::ptr_eq(a, b) || a.contains(b))
     }
 
+    /// Performs a depth-first scan of [`Fields`] filtering the [`FieldRef`] with no children
+    ///
+    /// Invokes `filter` with each leaf [`FieldRef`], i.e. one containing no children, and a
+    /// count of the number of previous calls to `filter` - i.e. the leaf's index.

Review Comment:
   ```suggestion
       /// Invokes `filter` with each leaf [`FieldRef`] and a
       /// count of the depth to the `field` (i.e. the number of previous calls to `filter`, and the leaf's index.)
   ```



##########
arrow-schema/src/fields.rs:
##########
@@ -99,6 +100,90 @@ impl Fields {
                 .all(|(a, b)| Arc::ptr_eq(a, b) || a.contains(b))
     }
 
+    /// Performs a depth-first scan of [`Fields`] filtering the [`FieldRef`] with no children
+    ///
+    /// Returns a new [`Fields`] comprising the [`FieldRef`] for which `filter` returned `true`
+    ///
+    /// ```
+    /// # use arrow_schema::{DataType, Field, Fields};
+    /// let fields = Fields::from(vec![
+    ///     Field::new("a", DataType::Int32, true),
+    ///     Field::new("b", DataType::Struct(Fields::from(vec![
+    ///         Field::new("c", DataType::Float32, false),
+    ///         Field::new("d", DataType::Float64, false),
+    ///     ])), false)
+    /// ]);
+    /// let filtered = fields.filter_leaves(|idx, _| idx == 0 || idx == 2);
+    /// let expected = Fields::from(vec![
+    ///     Field::new("a", DataType::Int32, true),
+    ///     Field::new("b", DataType::Struct(Fields::from(vec![
+    ///         Field::new("d", DataType::Float64, false),
+    ///     ])), false)
+    /// ]);
+    /// assert_eq!(filtered, expected);
+    /// ```
+    pub fn filter_leaves<F: FnMut(usize, &FieldRef) -> bool>(&self, mut filter: F) -> Self {

Review Comment:
   I wonder how we would support selecting a subfield that is itself a field?
   
   For example:
   ```
   { 
     a : {
       b: {
         c: 1, 
       },
       d: 2
   }
   ```
   
   Could I use this API to support only selecting a.b?
   
   ```
   { 
     a : {
       b: {
         c: 1, 
       },
   }
   ```
   
   Maybe we could if the parent FieldRef was also passed 
   ```rust
       /// calls filter(depth, parent_field, child_field)` 
       pub fn filter_leaves<F: FnMut(usize, &FieldRef, &FieldRef) -> bool>(&self, mut filter: F) -> Self {
   ...
   }
   ```
   
   🤔 
   
   



##########
arrow-schema/src/fields.rs:
##########
@@ -99,6 +100,90 @@ impl Fields {
                 .all(|(a, b)| Arc::ptr_eq(a, b) || a.contains(b))
     }
 
+    /// Performs a depth-first scan of [`Fields`] filtering the [`FieldRef`] with no children
+    ///
+    /// Returns a new [`Fields`] comprising the [`FieldRef`] for which `filter` returned `true`
+    ///
+    /// ```
+    /// # use arrow_schema::{DataType, Field, Fields};
+    /// let fields = Fields::from(vec![
+    ///     Field::new("a", DataType::Int32, true),
+    ///     Field::new("b", DataType::Struct(Fields::from(vec![
+    ///         Field::new("c", DataType::Float32, false),
+    ///         Field::new("d", DataType::Float64, false),
+    ///     ])), false)
+    /// ]);
+    /// let filtered = fields.filter_leaves(|idx, _| idx == 0 || idx == 2);
+    /// let expected = Fields::from(vec![
+    ///     Field::new("a", DataType::Int32, true),
+    ///     Field::new("b", DataType::Struct(Fields::from(vec![
+    ///         Field::new("d", DataType::Float64, false),
+    ///     ])), false)
+    /// ]);
+    /// assert_eq!(filtered, expected);
+    /// ```
+    pub fn filter_leaves<F: FnMut(usize, &FieldRef) -> bool>(&self, mut filter: F) -> Self {

Review Comment:
   I wonder how we would support selecting a subfield that is itself a field?
   
   For example:
   ```
   { 
     a : {
       b: {
         c: 1, 
       },
       d: 2
   }
   ```
   
   Could I use this API to support only selecting a.b?
   
   ```
   { 
     a : {
       b: {
         c: 1, 
       },
   }
   ```
   
   Maybe we could if the parent FieldRef was also passed 
   ```rust
       /// calls filter(depth, parent_field, child_field)` 
       pub fn filter_leaves<F: FnMut(usize, &FieldRef, &FieldRef) -> bool>(&self, mut filter: F) -> Self {
   ...
   }
   ```
   
   🤔 
   
   



##########
arrow-schema/src/fields.rs:
##########
@@ -99,6 +100,93 @@ impl Fields {
                 .all(|(a, b)| Arc::ptr_eq(a, b) || a.contains(b))
     }
 
+    /// Performs a depth-first scan of [`Fields`] filtering the [`FieldRef`] with no children
+    ///
+    /// Invokes `filter` with each leaf [`FieldRef`], i.e. one containing no children, and a
+    /// count of the number of previous calls to `filter` - i.e. the leaf's index.
+    ///
+    /// Returns a new [`Fields`] comprising the [`FieldRef`] for which `filter` returned `true`
+    ///
+    /// ```
+    /// # use arrow_schema::{DataType, Field, Fields};
+    /// let fields = Fields::from(vec![
+    ///     Field::new("a", DataType::Int32, true),
+    ///     Field::new("b", DataType::Struct(Fields::from(vec![
+    ///         Field::new("c", DataType::Float32, false),
+    ///         Field::new("d", DataType::Float64, false),
+    ///     ])), false)
+    /// ]);
+    /// let filtered = fields.filter_leaves(|idx, _| idx == 0 || idx == 2);
+    /// let expected = Fields::from(vec![
+    ///     Field::new("a", DataType::Int32, true),
+    ///     Field::new("b", DataType::Struct(Fields::from(vec![
+    ///         Field::new("d", DataType::Float64, false),
+    ///     ])), false)
+    /// ]);
+    /// assert_eq!(filtered, expected);
+    /// ```
+    pub fn filter_leaves<F: FnMut(usize, &FieldRef) -> bool>(&self, mut filter: F) -> Self {
+        fn filter_field<F: FnMut(&FieldRef) -> bool>(
+            f: &FieldRef,
+            filter: &mut F,
+        ) -> Option<FieldRef> {
+            use DataType::*;
+
+            let (k, v) = match f.data_type() {
+                Dictionary(k, v) => (Some(k.clone()), v.as_ref()),
+                d => (None, d),
+            };
+            let d = match v {

Review Comment:
   [Map](https://docs.rs/arrow/latest/arrow/datatypes/enum.DataType.html#variant.Map)  and [RunEndEncoded](https://docs.rs/arrow/latest/arrow/datatypes/enum.DataType.html#variant.RunEndEncoded) appear to have embedded fields too. It seems like they might also need to be handled 🤔 
   
   The usecase might be "I want only the `values` of the map? Or maybe Arrow can't express that concept (Apply the filter to only the values 🤔 )



##########
arrow-schema/src/fields.rs:
##########
@@ -99,6 +100,93 @@ impl Fields {
                 .all(|(a, b)| Arc::ptr_eq(a, b) || a.contains(b))
     }
 
+    /// Performs a depth-first scan of [`Fields`] filtering the [`FieldRef`] with no children
+    ///
+    /// Invokes `filter` with each leaf [`FieldRef`], i.e. one containing no children, and a
+    /// count of the number of previous calls to `filter` - i.e. the leaf's index.
+    ///
+    /// Returns a new [`Fields`] comprising the [`FieldRef`] for which `filter` returned `true`
+    ///
+    /// ```
+    /// # use arrow_schema::{DataType, Field, Fields};
+    /// let fields = Fields::from(vec![
+    ///     Field::new("a", DataType::Int32, true),
+    ///     Field::new("b", DataType::Struct(Fields::from(vec![
+    ///         Field::new("c", DataType::Float32, false),
+    ///         Field::new("d", DataType::Float64, false),
+    ///     ])), false)
+    /// ]);
+    /// let filtered = fields.filter_leaves(|idx, _| idx == 0 || idx == 2);
+    /// let expected = Fields::from(vec![
+    ///     Field::new("a", DataType::Int32, true),
+    ///     Field::new("b", DataType::Struct(Fields::from(vec![
+    ///         Field::new("d", DataType::Float64, false),
+    ///     ])), false)
+    /// ]);
+    /// assert_eq!(filtered, expected);
+    /// ```
+    pub fn filter_leaves<F: FnMut(usize, &FieldRef) -> bool>(&self, mut filter: F) -> Self {
+        fn filter_field<F: FnMut(&FieldRef) -> bool>(
+            f: &FieldRef,
+            filter: &mut F,
+        ) -> Option<FieldRef> {
+            use DataType::*;
+
+            let (k, v) = match f.data_type() {
+                Dictionary(k, v) => (Some(k.clone()), v.as_ref()),
+                d => (None, d),
+            };
+            let d = match v {

Review Comment:
   [Map](https://docs.rs/arrow/latest/arrow/datatypes/enum.DataType.html#variant.Map)  and [RunEndEncoded](https://docs.rs/arrow/latest/arrow/datatypes/enum.DataType.html#variant.RunEndEncoded) appear to have embedded fields too. It seems like they might also need to be handled 🤔 
   
   The usecase might be "I want only the `values` of the map? Or maybe Arrow can't express that concept (Apply the filter to only the values 🤔 )



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Support nested schema projection (#5148) [arrow-rs]

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on code in PR #5149:
URL: https://github.com/apache/arrow-rs/pull/5149#discussion_r1409760985


##########
arrow-schema/src/fields.rs:
##########
@@ -99,6 +100,93 @@ impl Fields {
                 .all(|(a, b)| Arc::ptr_eq(a, b) || a.contains(b))
     }
 
+    /// Performs a depth-first scan of [`Fields`] filtering the [`FieldRef`] with no children
+    ///
+    /// Invokes `filter` with each leaf [`FieldRef`], i.e. one containing no children, and a
+    /// count of the number of previous calls to `filter` - i.e. the leaf's index.

Review Comment:
   This isn't correct, it is the count of leaves encountered, not the depth



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Support nested schema projection (#5148) [arrow-rs]

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on code in PR #5149:
URL: https://github.com/apache/arrow-rs/pull/5149#discussion_r1409308240


##########
arrow-schema/src/fields.rs:
##########
@@ -99,6 +100,90 @@ impl Fields {
                 .all(|(a, b)| Arc::ptr_eq(a, b) || a.contains(b))
     }
 
+    /// Performs a depth-first scan of [`Fields`] filtering the [`FieldRef`] with no children
+    ///
+    /// Returns a new [`Fields`] comprising the [`FieldRef`] for which `filter` returned `true`
+    ///
+    /// ```
+    /// # use arrow_schema::{DataType, Field, Fields};
+    /// let fields = Fields::from(vec![
+    ///     Field::new("a", DataType::Int32, true),
+    ///     Field::new("b", DataType::Struct(Fields::from(vec![
+    ///         Field::new("c", DataType::Float32, false),
+    ///         Field::new("d", DataType::Float64, false),
+    ///     ])), false)
+    /// ]);
+    /// let filtered = fields.filter_leaves(|idx, _| idx == 0 || idx == 2);
+    /// let expected = Fields::from(vec![
+    ///     Field::new("a", DataType::Int32, true),
+    ///     Field::new("b", DataType::Struct(Fields::from(vec![
+    ///         Field::new("d", DataType::Float64, false),
+    ///     ])), false)
+    /// ]);
+    /// assert_eq!(filtered, expected);
+    /// ```
+    pub fn filter_leaves<F: FnMut(usize, &FieldRef) -> bool>(&self, mut filter: F) -> Self {

Review Comment:
   Originally I had two methods, one which was called for all Fields, however, it wasn't clear to me how such an API could be used - as it would lack any context as to where in the schema a given field appeared. I therefore just stuck with filter_leaves which has immediate concrete utility for #5135. A case could be made for a fully-fledged visitor pattern, but I'm not aware of any immediate use-cases for this, and therefore what the requirements might be for such an API.
   
   FYI @Jefffrey 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Support nested schema projection (#5148) [arrow-rs]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on PR #5149:
URL: https://github.com/apache/arrow-rs/pull/5149#issuecomment-1832750174

   One question I have is does this API support the usecase in https://github.com/apache/arrow-rs/pull/5149#discussion_r1409743402 (selecting a nested subfield)? If not, I can file a ticket tracking that feature request


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Support nested schema projection (#5148) [arrow-rs]

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on code in PR #5149:
URL: https://github.com/apache/arrow-rs/pull/5149#discussion_r1409820957


##########
arrow-schema/src/fields.rs:
##########
@@ -99,6 +100,93 @@ impl Fields {
                 .all(|(a, b)| Arc::ptr_eq(a, b) || a.contains(b))
     }
 
+    /// Performs a depth-first scan of [`Fields`] filtering the [`FieldRef`] with no children

Review Comment:
   I incoporated this into https://github.com/apache/arrow-rs/pull/5149/commits/8d9795069c8244820b2107c76f46d9d484ee4740. I tweaked the wording a little as the depth-first part is critical to understanding what the leaf ordering means. I also added some comments to the doctest to highlight



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Support nested schema projection (#5148) [arrow-rs]

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on code in PR #5149:
URL: https://github.com/apache/arrow-rs/pull/5149#discussion_r1409311037


##########
arrow-schema/src/fields.rs:
##########
@@ -99,6 +100,93 @@ impl Fields {
                 .all(|(a, b)| Arc::ptr_eq(a, b) || a.contains(b))
     }
 
+    /// Performs a depth-first scan of [`Fields`] filtering the [`FieldRef`] with no children
+    ///
+    /// Invokes `filter` with each leaf [`FieldRef`], i.e. one containing no children, and a

Review Comment:
   I'm sure this could be expressed better, but I struggled to come up with something :sweat_smile: 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Support nested schema projection (#5148) [arrow-rs]

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on PR #5149:
URL: https://github.com/apache/arrow-rs/pull/5149#issuecomment-1832763278

   > One question I still have is does this API support the usecase in #5149 (comment) (selecting a nested subfield)?
   
   Yes, I responded to the comment as such?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Support nested schema projection (#5148) [arrow-rs]

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on code in PR #5149:
URL: https://github.com/apache/arrow-rs/pull/5149#discussion_r1409765764


##########
arrow-schema/src/fields.rs:
##########
@@ -99,6 +100,93 @@ impl Fields {
                 .all(|(a, b)| Arc::ptr_eq(a, b) || a.contains(b))
     }
 
+    /// Performs a depth-first scan of [`Fields`] filtering the [`FieldRef`] with no children
+    ///
+    /// Invokes `filter` with each leaf [`FieldRef`], i.e. one containing no children, and a
+    /// count of the number of previous calls to `filter` - i.e. the leaf's index.
+    ///
+    /// Returns a new [`Fields`] comprising the [`FieldRef`] for which `filter` returned `true`
+    ///
+    /// ```
+    /// # use arrow_schema::{DataType, Field, Fields};
+    /// let fields = Fields::from(vec![
+    ///     Field::new("a", DataType::Int32, true),
+    ///     Field::new("b", DataType::Struct(Fields::from(vec![
+    ///         Field::new("c", DataType::Float32, false),
+    ///         Field::new("d", DataType::Float64, false),
+    ///     ])), false)
+    /// ]);
+    /// let filtered = fields.filter_leaves(|idx, _| idx == 0 || idx == 2);
+    /// let expected = Fields::from(vec![
+    ///     Field::new("a", DataType::Int32, true),
+    ///     Field::new("b", DataType::Struct(Fields::from(vec![
+    ///         Field::new("d", DataType::Float64, false),
+    ///     ])), false)
+    /// ]);
+    /// assert_eq!(filtered, expected);
+    /// ```
+    pub fn filter_leaves<F: FnMut(usize, &FieldRef) -> bool>(&self, mut filter: F) -> Self {
+        fn filter_field<F: FnMut(&FieldRef) -> bool>(
+            f: &FieldRef,
+            filter: &mut F,
+        ) -> Option<FieldRef> {
+            use DataType::*;
+
+            let (k, v) = match f.data_type() {
+                Dictionary(k, v) => (Some(k.clone()), v.as_ref()),
+                d => (None, d),
+            };
+            let d = match v {

Review Comment:
   Map is there, I missed RunEndEncoded 👍



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Support nested schema projection (#5148) [arrow-rs]

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on code in PR #5149:
URL: https://github.com/apache/arrow-rs/pull/5149#discussion_r1409762734


##########
arrow-schema/src/fields.rs:
##########
@@ -99,6 +100,90 @@ impl Fields {
                 .all(|(a, b)| Arc::ptr_eq(a, b) || a.contains(b))
     }
 
+    /// Performs a depth-first scan of [`Fields`] filtering the [`FieldRef`] with no children
+    ///
+    /// Returns a new [`Fields`] comprising the [`FieldRef`] for which `filter` returned `true`
+    ///
+    /// ```
+    /// # use arrow_schema::{DataType, Field, Fields};
+    /// let fields = Fields::from(vec![
+    ///     Field::new("a", DataType::Int32, true),
+    ///     Field::new("b", DataType::Struct(Fields::from(vec![
+    ///         Field::new("c", DataType::Float32, false),
+    ///         Field::new("d", DataType::Float64, false),
+    ///     ])), false)
+    /// ]);
+    /// let filtered = fields.filter_leaves(|idx, _| idx == 0 || idx == 2);
+    /// let expected = Fields::from(vec![
+    ///     Field::new("a", DataType::Int32, true),
+    ///     Field::new("b", DataType::Struct(Fields::from(vec![
+    ///         Field::new("d", DataType::Float64, false),
+    ///     ])), false)
+    /// ]);
+    /// assert_eq!(filtered, expected);
+    /// ```
+    pub fn filter_leaves<F: FnMut(usize, &FieldRef) -> bool>(&self, mut filter: F) -> Self {

Review Comment:
   You would select b by selecting all its children, if you call with intermediate fields you end up in a position of how much context do you provide, just the parent, what about the parent's parent - reasoning in terms of leaves is the only coherent mechanism I can think of



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Support nested schema projection (#5148) [arrow-rs]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on code in PR #5149:
URL: https://github.com/apache/arrow-rs/pull/5149#discussion_r1409903749


##########
arrow-schema/src/fields.rs:
##########
@@ -99,6 +100,93 @@ impl Fields {
                 .all(|(a, b)| Arc::ptr_eq(a, b) || a.contains(b))
     }
 
+    /// Performs a depth-first scan of [`Fields`] filtering the [`FieldRef`] with no children

Review Comment:
   Thank you -- the comments are most helpful



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Support nested schema projection (#5148) [arrow-rs]

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on code in PR #5149:
URL: https://github.com/apache/arrow-rs/pull/5149#discussion_r1409820310


##########
arrow-schema/src/fields.rs:
##########
@@ -99,6 +100,93 @@ impl Fields {
                 .all(|(a, b)| Arc::ptr_eq(a, b) || a.contains(b))
     }
 
+    /// Performs a depth-first scan of [`Fields`] filtering the [`FieldRef`] with no children
+    ///
+    /// Invokes `filter` with each leaf [`FieldRef`], i.e. one containing no children, and a
+    /// count of the number of previous calls to `filter` - i.e. the leaf's index.
+    ///
+    /// Returns a new [`Fields`] comprising the [`FieldRef`] for which `filter` returned `true`
+    ///
+    /// ```
+    /// # use arrow_schema::{DataType, Field, Fields};
+    /// let fields = Fields::from(vec![
+    ///     Field::new("a", DataType::Int32, true),
+    ///     Field::new("b", DataType::Struct(Fields::from(vec![
+    ///         Field::new("c", DataType::Float32, false),
+    ///         Field::new("d", DataType::Float64, false),
+    ///     ])), false)
+    /// ]);
+    /// let filtered = fields.filter_leaves(|idx, _| idx == 0 || idx == 2);
+    /// let expected = Fields::from(vec![
+    ///     Field::new("a", DataType::Int32, true),
+    ///     Field::new("b", DataType::Struct(Fields::from(vec![
+    ///         Field::new("d", DataType::Float64, false),
+    ///     ])), false)
+    /// ]);
+    /// assert_eq!(filtered, expected);
+    /// ```
+    pub fn filter_leaves<F: FnMut(usize, &FieldRef) -> bool>(&self, mut filter: F) -> Self {
+        fn filter_field<F: FnMut(&FieldRef) -> bool>(
+            f: &FieldRef,
+            filter: &mut F,
+        ) -> Option<FieldRef> {
+            use DataType::*;
+
+            let (k, v) = match f.data_type() {
+                Dictionary(k, v) => (Some(k.clone()), v.as_ref()),

Review Comment:
   Added comments in https://github.com/apache/arrow-rs/pull/5149/commits/8d9795069c8244820b2107c76f46d9d484ee4740



##########
arrow-schema/src/fields.rs:
##########
@@ -99,6 +100,93 @@ impl Fields {
                 .all(|(a, b)| Arc::ptr_eq(a, b) || a.contains(b))
     }
 
+    /// Performs a depth-first scan of [`Fields`] filtering the [`FieldRef`] with no children
+    ///
+    /// Invokes `filter` with each leaf [`FieldRef`], i.e. one containing no children, and a
+    /// count of the number of previous calls to `filter` - i.e. the leaf's index.
+    ///
+    /// Returns a new [`Fields`] comprising the [`FieldRef`] for which `filter` returned `true`
+    ///
+    /// ```
+    /// # use arrow_schema::{DataType, Field, Fields};
+    /// let fields = Fields::from(vec![
+    ///     Field::new("a", DataType::Int32, true),
+    ///     Field::new("b", DataType::Struct(Fields::from(vec![
+    ///         Field::new("c", DataType::Float32, false),
+    ///         Field::new("d", DataType::Float64, false),
+    ///     ])), false)
+    /// ]);
+    /// let filtered = fields.filter_leaves(|idx, _| idx == 0 || idx == 2);
+    /// let expected = Fields::from(vec![
+    ///     Field::new("a", DataType::Int32, true),
+    ///     Field::new("b", DataType::Struct(Fields::from(vec![
+    ///         Field::new("d", DataType::Float64, false),
+    ///     ])), false)
+    /// ]);
+    /// assert_eq!(filtered, expected);
+    /// ```
+    pub fn filter_leaves<F: FnMut(usize, &FieldRef) -> bool>(&self, mut filter: F) -> Self {
+        fn filter_field<F: FnMut(&FieldRef) -> bool>(
+            f: &FieldRef,
+            filter: &mut F,
+        ) -> Option<FieldRef> {
+            use DataType::*;
+
+            let (k, v) = match f.data_type() {
+                Dictionary(k, v) => (Some(k.clone()), v.as_ref()),
+                d => (None, d),
+            };
+            let d = match v {

Review Comment:
   RunEndEncoded support added in https://github.com/apache/arrow-rs/pull/5149/commits/8d9795069c8244820b2107c76f46d9d484ee4740



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Support nested schema projection (#5148) [arrow-rs]

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on code in PR #5149:
URL: https://github.com/apache/arrow-rs/pull/5149#discussion_r1409308240


##########
arrow-schema/src/fields.rs:
##########
@@ -99,6 +100,90 @@ impl Fields {
                 .all(|(a, b)| Arc::ptr_eq(a, b) || a.contains(b))
     }
 
+    /// Performs a depth-first scan of [`Fields`] filtering the [`FieldRef`] with no children
+    ///
+    /// Returns a new [`Fields`] comprising the [`FieldRef`] for which `filter` returned `true`
+    ///
+    /// ```
+    /// # use arrow_schema::{DataType, Field, Fields};
+    /// let fields = Fields::from(vec![
+    ///     Field::new("a", DataType::Int32, true),
+    ///     Field::new("b", DataType::Struct(Fields::from(vec![
+    ///         Field::new("c", DataType::Float32, false),
+    ///         Field::new("d", DataType::Float64, false),
+    ///     ])), false)
+    /// ]);
+    /// let filtered = fields.filter_leaves(|idx, _| idx == 0 || idx == 2);
+    /// let expected = Fields::from(vec![
+    ///     Field::new("a", DataType::Int32, true),
+    ///     Field::new("b", DataType::Struct(Fields::from(vec![
+    ///         Field::new("d", DataType::Float64, false),
+    ///     ])), false)
+    /// ]);
+    /// assert_eq!(filtered, expected);
+    /// ```
+    pub fn filter_leaves<F: FnMut(usize, &FieldRef) -> bool>(&self, mut filter: F) -> Self {

Review Comment:
   Originally I had two methods, one which was called for all Fields, however, it wasn't clear to me how such an API could be used - as it would lack any context as to where in the schema a given field appeared. I therefore just stuck with filter_leaves which has immediate concrete utility for #5135
   
   FYI @Jefffrey 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org