You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "comphead (via GitHub)" <gi...@apache.org> on 2023/12/07 16:56:25 UTC

[PR] Introduce `Schema::fieldNames` method [arrow-rs]

comphead opened a new pull request, #5186:
URL: https://github.com/apache/arrow-rs/pull/5186

   # Which issue does this PR close?
   
   <!--
   We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes #123` indicates that this PR will close issue #123.
   -->
   
   Closes #.
   
   # Rationale for this change
   Introducing `Schema::field_names` method similar to Spark https://spark.apache.org/docs/1.6.1/api/java/org/apache/spark/sql/types/StructType.html#fieldNames()
   
   The method returns list of column names, including nested fields.
   
   Later we can use this code to improve schema output to be more user-friendly  
   or create toDDL() method, (https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/types/StructField.html#toDDL--) etc
   
   <!--
   Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed.
   Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes.
   -->
   
   # What changes are included in this PR?
   Bunch of extensions to query recursively fields and schema. Also changed an error message to have more details when schema and data mismatches
   
   <!--
   There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR.
   -->
   
   # Are there any user-facing changes?
   
   No
   <!--
   If there are user-facing changes then we may require documentation to be updated before approving the PR.
   -->
   
   <!---
   If there are any breaking changes to public APIs, please add the `breaking change` label.
   -->
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Introduce `Schema::field_names` method [arrow-rs]

Posted by "comphead (via GitHub)" <gi...@apache.org>.
comphead commented on PR #5186:
URL: https://github.com/apache/arrow-rs/pull/5186#issuecomment-1852443487

   Hi @tustvold @viirya any plans to review this PR
   It would be also very helpful for DF as when DF faces the error like https://github.com/apache/arrow-datafusion/pull/8402#discussion_r1413200621
   ```
   select rank() over (order by null) from (select 1 a union all select 2 a) q;
   Arrow error: Invalid argument error: number of columns(2) must match number of fields(1) in schema
   ```
   it will give more context without debugging


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Introduce `Schema::field_names` method [arrow-rs]

Posted by "comphead (via GitHub)" <gi...@apache.org>.
comphead commented on code in PR #5186:
URL: https://github.com/apache/arrow-rs/pull/5186#discussion_r1437153677


##########
arrow-schema/src/field.rs:
##########
@@ -328,20 +328,80 @@ impl Field {
     /// within `self` contained within this field (including `self`)
     pub(crate) fn fields(&self) -> Vec<&Field> {
         let mut collected_fields = vec![self];
-        collected_fields.append(&mut Field::_fields(&self.data_type));
+        collected_fields.append(&mut Field::_fields(&self.data_type, true));
 
         collected_fields
     }
 
-    fn _fields(dt: &DataType) -> Vec<&Field> {
+    /// Returns a [`Vec`] direct children [`Field`]s
+    /// within `self`
+    pub(crate) fn nested_fields(&self) -> Vec<&Field> {
+        Field::_fields(&self.data_type, false)
+    }
+
+    /// Return self and direct children field names of the [`Field`]
+    ///
+    /// ```
+    /// # use arrow_schema::*;
+    /// let field = Field::new("nested",
+    ///        DataType::Struct(
+    ///            Fields::from(
+    ///                vec![
+    ///                    Field::new("inner",
+    ///                    DataType::Struct(
+    ///                        Fields::from(
+    ///                            vec![
+    ///                                Field::new("a", DataType::Int32, true)
+    ///                                ])), true)])), true
+    ///                );
+    ///
+    /// assert_eq!(field.children_names(), vec!["nested", "nested.inner", "nested.inner.a"]);
+    /// ```
+    pub fn children_names(&self) -> Vec<String> {
+        fn nested_field_names_inner(f: &Field, parent_name: String, buffer: &mut Vec<String>) {
+            let current_name = format!("{}{}", parent_name, f.name());
+
+            // Push the concatenated name to the result vector
+            buffer.push(current_name.clone());
+
+            // Recursively concatenate child names
+            for child in f.nested_fields() {
+                nested_field_names_inner(child, format!("{}.", current_name), buffer);

Review Comment:
   thanks @tustvold 
   do you prefer another separator or another approach? 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Introduce `Schema::field_names` method [arrow-rs]

Posted by "comphead (via GitHub)" <gi...@apache.org>.
comphead commented on PR #5186:
URL: https://github.com/apache/arrow-rs/pull/5186#issuecomment-1852497287

   > Sorry it is on my list, but I'm wondering if we want a more general mechanism to walk these fields as opposed to adding lots of utility methods. I will have a play over the coming days
   
   Thanks @tustvold I reused 1 inner method instead of 2. 
   Tried to avoid changing pub methods as it can affect downstream. The generalization approach is great, let me know your thoughts whenever you have time


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Introduce `Schema::field_names` method [arrow-rs]

Posted by "comphead (via GitHub)" <gi...@apache.org>.
comphead commented on code in PR #5186:
URL: https://github.com/apache/arrow-rs/pull/5186#discussion_r1437171093


##########
arrow-schema/src/field.rs:
##########
@@ -328,20 +328,80 @@ impl Field {
     /// within `self` contained within this field (including `self`)
     pub(crate) fn fields(&self) -> Vec<&Field> {
         let mut collected_fields = vec![self];
-        collected_fields.append(&mut Field::_fields(&self.data_type));
+        collected_fields.append(&mut Field::_fields(&self.data_type, true));
 
         collected_fields
     }
 
-    fn _fields(dt: &DataType) -> Vec<&Field> {
+    /// Returns a [`Vec`] direct children [`Field`]s
+    /// within `self`
+    pub(crate) fn nested_fields(&self) -> Vec<&Field> {
+        Field::_fields(&self.data_type, false)
+    }
+
+    /// Return self and direct children field names of the [`Field`]
+    ///
+    /// ```
+    /// # use arrow_schema::*;
+    /// let field = Field::new("nested",
+    ///        DataType::Struct(
+    ///            Fields::from(
+    ///                vec![
+    ///                    Field::new("inner",
+    ///                    DataType::Struct(
+    ///                        Fields::from(
+    ///                            vec![
+    ///                                Field::new("a", DataType::Int32, true)
+    ///                                ])), true)])), true
+    ///                );
+    ///
+    /// assert_eq!(field.children_names(), vec!["nested", "nested.inner", "nested.inner.a"]);
+    /// ```
+    pub fn children_names(&self) -> Vec<String> {
+        fn nested_field_names_inner(f: &Field, parent_name: String, buffer: &mut Vec<String>) {
+            let current_name = format!("{}{}", parent_name, f.name());
+
+            // Push the concatenated name to the result vector
+            buffer.push(current_name.clone());
+
+            // Recursively concatenate child names
+            for child in f.nested_fields() {
+                nested_field_names_inner(child, format!("{}.", current_name), buffer);

Review Comment:
   what kind of approach? perhaps I can help here



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Introduce `Schema::field_names` method [arrow-rs]

Posted by "comphead (via GitHub)" <gi...@apache.org>.
comphead commented on PR #5186:
URL: https://github.com/apache/arrow-rs/pull/5186#issuecomment-1992062531

   > Apologies for the delay, coming back to this and honestly a little confused as to the use-case. The updated error message uses the new field_names method, but the RecordBatch method itself is not concerned with nested fields, and so printing the nested fields is actually just confusing? It should only print the first level of field names, as that is all RecordBatch is concerned with?
   
   Thanks @tustvold for having the time for this. The PR is planned to make debugging easier and improve the error text which is currently too unclear.
    ```
   Arrow error: Invalid argument error: number of columns(2) must match number of fields(1) in schema
   ```
   
   The idea is to show which fields exists, and which are missing in the schema. But if the field is nested we should display it in user-friendly way. You are correct about using dots to separate parent field and child fields, this is not very good idea as dots also used in catalog/schema/table qualifiers.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Introduce `Schema::field_names` method [arrow-rs]

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on code in PR #5186:
URL: https://github.com/apache/arrow-rs/pull/5186#discussion_r1437161527


##########
arrow-schema/src/field.rs:
##########
@@ -328,20 +328,80 @@ impl Field {
     /// within `self` contained within this field (including `self`)
     pub(crate) fn fields(&self) -> Vec<&Field> {
         let mut collected_fields = vec![self];
-        collected_fields.append(&mut Field::_fields(&self.data_type));
+        collected_fields.append(&mut Field::_fields(&self.data_type, true));
 
         collected_fields
     }
 
-    fn _fields(dt: &DataType) -> Vec<&Field> {
+    /// Returns a [`Vec`] direct children [`Field`]s
+    /// within `self`
+    pub(crate) fn nested_fields(&self) -> Vec<&Field> {
+        Field::_fields(&self.data_type, false)
+    }
+
+    /// Return self and direct children field names of the [`Field`]
+    ///
+    /// ```
+    /// # use arrow_schema::*;
+    /// let field = Field::new("nested",
+    ///        DataType::Struct(
+    ///            Fields::from(
+    ///                vec![
+    ///                    Field::new("inner",
+    ///                    DataType::Struct(
+    ///                        Fields::from(
+    ///                            vec![
+    ///                                Field::new("a", DataType::Int32, true)
+    ///                                ])), true)])), true
+    ///                );
+    ///
+    /// assert_eq!(field.children_names(), vec!["nested", "nested.inner", "nested.inner.a"]);
+    /// ```
+    pub fn children_names(&self) -> Vec<String> {
+        fn nested_field_names_inner(f: &Field, parent_name: String, buffer: &mut Vec<String>) {
+            let current_name = format!("{}{}", parent_name, f.name());
+
+            // Push the concatenated name to the result vector
+            buffer.push(current_name.clone());
+
+            // Recursively concatenate child names
+            for child in f.nested_fields() {
+                nested_field_names_inner(child, format!("{}.", current_name), buffer);

Review Comment:
   I'm leaning towards a different approach, I don't think there is an unambiguous separator



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Introduce `Schema::field_names` method [arrow-rs]

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on code in PR #5186:
URL: https://github.com/apache/arrow-rs/pull/5186#discussion_r1437189297


##########
arrow-schema/src/field.rs:
##########
@@ -328,20 +328,80 @@ impl Field {
     /// within `self` contained within this field (including `self`)
     pub(crate) fn fields(&self) -> Vec<&Field> {
         let mut collected_fields = vec![self];
-        collected_fields.append(&mut Field::_fields(&self.data_type));
+        collected_fields.append(&mut Field::_fields(&self.data_type, true));
 
         collected_fields
     }
 
-    fn _fields(dt: &DataType) -> Vec<&Field> {
+    /// Returns a [`Vec`] direct children [`Field`]s
+    /// within `self`
+    pub(crate) fn nested_fields(&self) -> Vec<&Field> {
+        Field::_fields(&self.data_type, false)
+    }
+
+    /// Return self and direct children field names of the [`Field`]
+    ///
+    /// ```
+    /// # use arrow_schema::*;
+    /// let field = Field::new("nested",
+    ///        DataType::Struct(
+    ///            Fields::from(
+    ///                vec![
+    ///                    Field::new("inner",
+    ///                    DataType::Struct(
+    ///                        Fields::from(
+    ///                            vec![
+    ///                                Field::new("a", DataType::Int32, true)
+    ///                                ])), true)])), true
+    ///                );
+    ///
+    /// assert_eq!(field.children_names(), vec!["nested", "nested.inner", "nested.inner.a"]);
+    /// ```
+    pub fn children_names(&self) -> Vec<String> {
+        fn nested_field_names_inner(f: &Field, parent_name: String, buffer: &mut Vec<String>) {
+            let current_name = format!("{}{}", parent_name, f.name());
+
+            // Push the concatenated name to the result vector
+            buffer.push(current_name.clone());
+
+            // Recursively concatenate child names
+            for child in f.nested_fields() {
+                nested_field_names_inner(child, format!("{}.", current_name), buffer);

Review Comment:
   My vague thought was something similar to how we display parquet schema, where it might be something like
   
   ```
   struct batch {
      optional int64 column1
       optional struct nested {
           required float64 f64
       }
   }
   ```
   
   But implemented as some `FieldVisitor` abstraction so people can easily do something different should they wish to



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Introduce `Schema::field_names` method [arrow-rs]

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on PR #5186:
URL: https://github.com/apache/arrow-rs/pull/5186#issuecomment-1984966989

   Apologies for the delay, coming back to this and honestly a little confused as to the use-case. The updated error message uses the new field_names method, but the method itself is not concerned with nested fields, and so printing the nested fields is actually just confusing? It should only print the first level of field names?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Introduce `Schema::field_names` method [arrow-rs]

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on code in PR #5186:
URL: https://github.com/apache/arrow-rs/pull/5186#discussion_r1437189297


##########
arrow-schema/src/field.rs:
##########
@@ -328,20 +328,80 @@ impl Field {
     /// within `self` contained within this field (including `self`)
     pub(crate) fn fields(&self) -> Vec<&Field> {
         let mut collected_fields = vec![self];
-        collected_fields.append(&mut Field::_fields(&self.data_type));
+        collected_fields.append(&mut Field::_fields(&self.data_type, true));
 
         collected_fields
     }
 
-    fn _fields(dt: &DataType) -> Vec<&Field> {
+    /// Returns a [`Vec`] direct children [`Field`]s
+    /// within `self`
+    pub(crate) fn nested_fields(&self) -> Vec<&Field> {
+        Field::_fields(&self.data_type, false)
+    }
+
+    /// Return self and direct children field names of the [`Field`]
+    ///
+    /// ```
+    /// # use arrow_schema::*;
+    /// let field = Field::new("nested",
+    ///        DataType::Struct(
+    ///            Fields::from(
+    ///                vec![
+    ///                    Field::new("inner",
+    ///                    DataType::Struct(
+    ///                        Fields::from(
+    ///                            vec![
+    ///                                Field::new("a", DataType::Int32, true)
+    ///                                ])), true)])), true
+    ///                );
+    ///
+    /// assert_eq!(field.children_names(), vec!["nested", "nested.inner", "nested.inner.a"]);
+    /// ```
+    pub fn children_names(&self) -> Vec<String> {
+        fn nested_field_names_inner(f: &Field, parent_name: String, buffer: &mut Vec<String>) {
+            let current_name = format!("{}{}", parent_name, f.name());
+
+            // Push the concatenated name to the result vector
+            buffer.push(current_name.clone());
+
+            // Recursively concatenate child names
+            for child in f.nested_fields() {
+                nested_field_names_inner(child, format!("{}.", current_name), buffer);

Review Comment:
   My vague thought was something similar to how we display parquet schema, where it might be something like
   
   ```
   struct batch {
      optional int64 column1
       optional struct nested {
           required float64 f64
       }
   }
   ```
   
   But implemented as some `FieldVisitor` abstraction so people can easily do something different should they wish to.
   
   I want to take some time to get my thoughts on this together, and will then get back to you



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Introduce `Schema::field_names` method [arrow-rs]

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold closed pull request #5186: Introduce `Schema::field_names` method
URL: https://github.com/apache/arrow-rs/pull/5186


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Introduce `Schema::field_names` method [arrow-rs]

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on PR #5186:
URL: https://github.com/apache/arrow-rs/pull/5186#issuecomment-1852463963

   Sorry it is on my list, but I'm wondering if we want a more general mechanism to walk these fields as opposed to adding lots of utility methods. I will have a play over the coming days


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Introduce `Schema::field_names` method [arrow-rs]

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on code in PR #5186:
URL: https://github.com/apache/arrow-rs/pull/5186#discussion_r1436917399


##########
arrow-schema/src/field.rs:
##########
@@ -328,20 +328,80 @@ impl Field {
     /// within `self` contained within this field (including `self`)
     pub(crate) fn fields(&self) -> Vec<&Field> {
         let mut collected_fields = vec![self];
-        collected_fields.append(&mut Field::_fields(&self.data_type));
+        collected_fields.append(&mut Field::_fields(&self.data_type, true));
 
         collected_fields
     }
 
-    fn _fields(dt: &DataType) -> Vec<&Field> {
+    /// Returns a [`Vec`] direct children [`Field`]s
+    /// within `self`
+    pub(crate) fn nested_fields(&self) -> Vec<&Field> {
+        Field::_fields(&self.data_type, false)
+    }
+
+    /// Return self and direct children field names of the [`Field`]
+    ///
+    /// ```
+    /// # use arrow_schema::*;
+    /// let field = Field::new("nested",
+    ///        DataType::Struct(
+    ///            Fields::from(
+    ///                vec![
+    ///                    Field::new("inner",
+    ///                    DataType::Struct(
+    ///                        Fields::from(
+    ///                            vec![
+    ///                                Field::new("a", DataType::Int32, true)
+    ///                                ])), true)])), true
+    ///                );
+    ///
+    /// assert_eq!(field.children_names(), vec!["nested", "nested.inner", "nested.inner.a"]);
+    /// ```
+    pub fn children_names(&self) -> Vec<String> {
+        fn nested_field_names_inner(f: &Field, parent_name: String, buffer: &mut Vec<String>) {
+            let current_name = format!("{}{}", parent_name, f.name());
+
+            // Push the concatenated name to the result vector
+            buffer.push(current_name.clone());
+
+            // Recursively concatenate child names
+            for child in f.nested_fields() {
+                nested_field_names_inner(child, format!("{}.", current_name), buffer);

Review Comment:
   The use of `.` as a separator makes me uncomfortable, there have been a lot of bugs in the past resulting from things incorrectly treating `.` as a nesting delimiter...
   
   This is something I'll try to address when I iterate on the visitor



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org