You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/06/10 11:58:31 UTC

[GitHub] [arrow-rs] nl5887 opened a new pull request, #1839: Improve comparison of struct fields

nl5887 opened a new pull request, #1839:
URL: https://github.com/apache/arrow-rs/pull/1839

   Currently if the order of fields in struct is different, it will fail.
   This fix will lookup fields, ignoring the field order and length and
   fail if not nullable.
   
   Array data type comparison will use the equals_datatype instead of
   comparing enums.
   
   This fix work towards closing of https://github.com/apache/arrow-datafusion/issues/2326.
   
   @alamb 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on pull request #1839: Improve comparison of struct fields

Posted by GitBox <gi...@apache.org>.
tustvold commented on PR #1839:
URL: https://github.com/apache/arrow-rs/pull/1839#issuecomment-1153733098

   I'm not sure about this, reordering fields in a StructArray is not compatible? To me this feels like a limitation of the SchemaAdapter logic in DataFusion which needs to:
   
   * Understand nested types
   * Re-project fields within nested types
   
   Columns and fields are identified by index and not column name, and so they cannot be arbitrarily reordered?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] alamb commented on a diff in pull request #1839: Improve comparison of struct fields

Posted by GitBox <gi...@apache.org>.
alamb commented on code in PR #1839:
URL: https://github.com/apache/arrow-rs/pull/1839#discussion_r894730475


##########
arrow/src/datatypes/datatype.rs:
##########
@@ -684,10 +684,17 @@ impl DataType {
                     && a.data_type().equals_datatype(b.data_type())
             }
             (DataType::Struct(a), DataType::Struct(b)) => {
-                a.len() == b.len()
-                    && a.iter().zip(b).all(|(a, b)| {
-                        a.is_nullable() == b.is_nullable()
-                            && a.data_type().equals_datatype(b.data_type())
+                    a.iter().all(|a| {

Review Comment:
   I agree that having another method would be good



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] nl5887 commented on a diff in pull request #1839: Improve comparison of struct fields

Posted by GitBox <gi...@apache.org>.
nl5887 commented on code in PR #1839:
URL: https://github.com/apache/arrow-rs/pull/1839#discussion_r894778455


##########
arrow/src/compute/kernels/concat.rs:
##########
@@ -62,7 +62,7 @@ pub fn concat(arrays: &[&dyn Array]) -> Result<ArrayRef> {
 
     if arrays
         .iter()
-        .any(|array| array.data_type() != arrays[0].data_type())
+        .any(|array| !array.data_type().equals_datatype(arrays[0].data_type()))

Review Comment:
   Arrays can be concatted as long as they are compatible?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on pull request #1839: Improve comparison of struct fields

Posted by GitBox <gi...@apache.org>.
tustvold commented on PR #1839:
URL: https://github.com/apache/arrow-rs/pull/1839#issuecomment-1383280987

   This PR has been inactive for a while so closing to clear the backlog, please feel free to reopen if you come back to this


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] codecov-commenter commented on pull request #1839: Improve comparison of struct fields

Posted by GitBox <gi...@apache.org>.
codecov-commenter commented on PR #1839:
URL: https://github.com/apache/arrow-rs/pull/1839#issuecomment-1152890889

   # [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1839?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#1839](https://codecov.io/gh/apache/arrow-rs/pull/1839?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (ed0ed75) into [master](https://codecov.io/gh/apache/arrow-rs/commit/8d787d9c09a12fdb7d576493052f0383f03d84d3?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (8d787d9) will **decrease** coverage by `0.03%`.
   > The diff coverage is `0.00%`.
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #1839      +/-   ##
   ==========================================
   - Coverage   83.48%   83.45%   -0.04%     
   ==========================================
     Files         201      201              
     Lines       56838    56931      +93     
   ==========================================
   + Hits        47452    47510      +58     
   - Misses       9386     9421      +35     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow-rs/pull/1839?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [arrow/src/datatypes/datatype.rs](https://codecov.io/gh/apache/arrow-rs/pull/1839/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2RhdGF0eXBlcy9kYXRhdHlwZS5ycw==) | `59.59% <0.00%> (-6.21%)` | :arrow_down: |
   | [arrow/src/compute/kernels/temporal.rs](https://codecov.io/gh/apache/arrow-rs/pull/1839/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2NvbXB1dGUva2VybmVscy90ZW1wb3JhbC5ycw==) | `95.77% <0.00%> (-1.36%)` | :arrow_down: |
   | [parquet\_derive/src/parquet\_field.rs](https://codecov.io/gh/apache/arrow-rs/pull/1839/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldF9kZXJpdmUvc3JjL3BhcnF1ZXRfZmllbGQucnM=) | `65.75% <0.00%> (-0.23%)` | :arrow_down: |
   | [parquet/src/encodings/encoding.rs](https://codecov.io/gh/apache/arrow-rs/pull/1839/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvZW5jb2RpbmdzL2VuY29kaW5nLnJz) | `93.46% <0.00%> (-0.20%)` | :arrow_down: |
   | [arrow/src/array/mod.rs](https://codecov.io/gh/apache/arrow-rs/pull/1839/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FycmF5L21vZC5ycw==) | `100.00% <0.00%> (ø)` | |
   | [arrow/src/compute/kernels/substring.rs](https://codecov.io/gh/apache/arrow-rs/pull/1839/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2NvbXB1dGUva2VybmVscy9zdWJzdHJpbmcucnM=) | `99.53% <0.00%> (+0.01%)` | :arrow_up: |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1839?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1839?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [8d787d9...ed0ed75](https://codecov.io/gh/apache/arrow-rs/pull/1839?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] viirya commented on a diff in pull request #1839: Improve comparison of struct fields

Posted by GitBox <gi...@apache.org>.
viirya commented on code in PR #1839:
URL: https://github.com/apache/arrow-rs/pull/1839#discussion_r894711663


##########
arrow/src/datatypes/datatype.rs:
##########
@@ -684,10 +684,17 @@ impl DataType {
                     && a.data_type().equals_datatype(b.data_type())
             }
             (DataType::Struct(a), DataType::Struct(b)) => {
-                a.len() == b.len()
-                    && a.iter().zip(b).all(|(a, b)| {
-                        a.is_nullable() == b.is_nullable()
-                            && a.data_type().equals_datatype(b.data_type())
+                    a.iter().all(|a| {

Review Comment:
   I think this is not correct. `equals_datatype` compares exact data type equality (except for metadata and nested field names). Technically speaking, two structs are equal only if the nested field data types are equal.
   
   In https://github.com/apache/arrow-datafusion/issues/2326, seems you want to overcome some errors at column projection in DataFusion. For column projection, it requires two structs to be "compatible", so nested field order can be ignored (because it will be "mapped" later). It is a more loose data type equality than the exact equality of `equals_datatype`.
   
   So I don't think we should change `equals_datatype`. For compatible data type change, maybe we should have another method for that.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] nl5887 commented on a diff in pull request #1839: Improve comparison of struct fields

Posted by GitBox <gi...@apache.org>.
nl5887 commented on code in PR #1839:
URL: https://github.com/apache/arrow-rs/pull/1839#discussion_r895000936


##########
arrow/src/compute/kernels/concat.rs:
##########
@@ -62,7 +62,7 @@ pub fn concat(arrays: &[&dyn Array]) -> Result<ArrayRef> {
 
     if arrays
         .iter()
-        .any(|array| array.data_type() != arrays[0].data_type())
+        .any(|array| !array.data_type().equals_datatype(arrays[0].data_type()))

Review Comment:
   Ok, reverted the concat change.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] viirya commented on a diff in pull request #1839: Improve comparison of struct fields

Posted by GitBox <gi...@apache.org>.
viirya commented on code in PR #1839:
URL: https://github.com/apache/arrow-rs/pull/1839#discussion_r894987275


##########
arrow/src/compute/kernels/concat.rs:
##########
@@ -62,7 +62,7 @@ pub fn concat(arrays: &[&dyn Array]) -> Result<ArrayRef> {
 
     if arrays
         .iter()
-        .any(|array| array.data_type() != arrays[0].data_type())
+        .any(|array| !array.data_type().equals_datatype(arrays[0].data_type()))

Review Comment:
   I think our concat kernel doesn't map out-of-order nested fields currently. It is why it checks data type with exact equality.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold closed pull request #1839: Improve comparison of struct fields

Posted by GitBox <gi...@apache.org>.
tustvold closed pull request #1839: Improve comparison of struct fields
URL: https://github.com/apache/arrow-rs/pull/1839


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] nl5887 commented on a diff in pull request #1839: Improve comparison of struct fields

Posted by GitBox <gi...@apache.org>.
nl5887 commented on code in PR #1839:
URL: https://github.com/apache/arrow-rs/pull/1839#discussion_r894776550


##########
arrow/src/compute/kernels/concat.rs:
##########
@@ -62,7 +62,7 @@ pub fn concat(arrays: &[&dyn Array]) -> Result<ArrayRef> {
 
     if arrays
         .iter()
-        .any(|array| array.data_type() != arrays[0].data_type())
+        .any(|array| !array.data_type().equals_datatype(arrays[0].data_type()))

Review Comment:
   I think this should be changed to compatible_datatype (at least for my own usecase).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] viirya commented on a diff in pull request #1839: Improve comparison of struct fields

Posted by GitBox <gi...@apache.org>.
viirya commented on code in PR #1839:
URL: https://github.com/apache/arrow-rs/pull/1839#discussion_r894987393


##########
arrow/src/compute/kernels/concat.rs:
##########
@@ -62,7 +62,7 @@ pub fn concat(arrays: &[&dyn Array]) -> Result<ArrayRef> {
 
     if arrays
         .iter()
-        .any(|array| array.data_type() != arrays[0].data_type())
+        .any(|array| !array.data_type().equals_datatype(arrays[0].data_type()))

Review Comment:
   It is better to focus on compatible datatype method in this PR. You can definitely propose do compatible concat kernel later.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] viirya commented on a diff in pull request #1839: Improve comparison of struct fields

Posted by GitBox <gi...@apache.org>.
viirya commented on code in PR #1839:
URL: https://github.com/apache/arrow-rs/pull/1839#discussion_r894711663


##########
arrow/src/datatypes/datatype.rs:
##########
@@ -684,10 +684,17 @@ impl DataType {
                     && a.data_type().equals_datatype(b.data_type())
             }
             (DataType::Struct(a), DataType::Struct(b)) => {
-                a.len() == b.len()
-                    && a.iter().zip(b).all(|(a, b)| {
-                        a.is_nullable() == b.is_nullable()
-                            && a.data_type().equals_datatype(b.data_type())
+                    a.iter().all(|a| {

Review Comment:
   I think this is not correct. `equals_datatype` compares exact data type equality (except for metadata and nested field names). Technically speaking, two structs are equal only if the nested field data types are equal.
   
   In #2326, seems you want to overcome some errors at column projection in DataFusion. For column projection, it requires two structs to be "compatible", so nested field order can be ignored (because it will be "mapped" later). It is a more loose data type equality than the exact equality of `equals_datatype`.
   
   So I don't think we should change `equals_datatype`. For compatible data type change, maybe we should have another method for that.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] nl5887 commented on a diff in pull request #1839: Improve comparison of struct fields

Posted by GitBox <gi...@apache.org>.
nl5887 commented on code in PR #1839:
URL: https://github.com/apache/arrow-rs/pull/1839#discussion_r894771359


##########
arrow/src/datatypes/datatype.rs:
##########
@@ -684,10 +684,17 @@ impl DataType {
                     && a.data_type().equals_datatype(b.data_type())
             }
             (DataType::Struct(a), DataType::Struct(b)) => {
-                a.len() == b.len()
-                    && a.iter().zip(b).all(|(a, b)| {
-                        a.is_nullable() == b.is_nullable()
-                            && a.data_type().equals_datatype(b.data_type())
+                    a.iter().all(|a| {

Review Comment:
   Agreed, I've added a method compatible_datatype, and reverted my changes to equal_datatype.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] viirya commented on a diff in pull request #1839: Improve comparison of struct fields

Posted by GitBox <gi...@apache.org>.
viirya commented on code in PR #1839:
URL: https://github.com/apache/arrow-rs/pull/1839#discussion_r894772298


##########
arrow/src/compute/kernels/concat.rs:
##########
@@ -62,7 +62,7 @@ pub fn concat(arrays: &[&dyn Array]) -> Result<ArrayRef> {
 
     if arrays
         .iter()
-        .any(|array| array.data_type() != arrays[0].data_type())
+        .any(|array| !array.data_type().equals_datatype(arrays[0].data_type()))

Review Comment:
   Does this need to be reverted too?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org