You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/11/30 11:08:40 UTC

[GitHub] [arrow] ch-sc opened a new pull request #8715: ARROW-10656: [Rust] Remove Field from (most) DataTypes, add nullability

ch-sc opened a new pull request #8715:
URL: https://github.com/apache/arrow/pull/8715


   Compare two data types based on types instead of strictly all values. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on a change in pull request #8715: ARROW-10656: [Rust] Use DataType comparison without values

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #8715:
URL: https://github.com/apache/arrow/pull/8715#discussion_r532031780



##########
File path: rust/arrow/src/datatypes.rs
##########
@@ -1149,13 +1155,34 @@ impl DataType {
     }
 }
 
+impl NullableDataType {
+    /// Creates a new data type context

Review comment:
       ```suggestion
       /// Creates a new `NullableDataType`
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ch-sc commented on a change in pull request #8715: ARROW-10656: [Rust] Use DataType comparison without values

Posted by GitBox <gi...@apache.org>.
ch-sc commented on a change in pull request #8715:
URL: https://github.com/apache/arrow/pull/8715#discussion_r527811826



##########
File path: rust/arrow/src/datatypes.rs
##########
@@ -1142,6 +1142,44 @@ impl DataType {
                 | Float64
         )
     }
+
+    /// Compares this data type with another data type only based on the data type
+    /// including nested data types, but not based on other values.
+    pub fn cmp_type(&self, other: &Self) -> bool {

Review comment:
       renamed it




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on a change in pull request #8715: ARROW-10656: [Rust] Use DataType comparison without values

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #8715:
URL: https://github.com/apache/arrow/pull/8715#discussion_r529044055



##########
File path: rust/arrow/src/ipc/convert.rs
##########
@@ -268,22 +270,38 @@ pub(crate) fn get_data_type(field: ipc::Field, may_be_dictionary: bool) -> DataT
             if children.len() != 1 {
                 panic!("expect a list to have one child")
             }
-            DataType::List(Box::new(children.get(0).into()))
+            let child_field = children.get(0);
+            // returning int16 for now, to test, not sure how to get data type

Review comment:
       Perhaps you could change this entire function to something like
   ```
   fn get_data_type_context(field: ipc::Field, may_be_dictionary: bool) -> DataTypeContext {
   ...
   }
   ```
   
   And then change 
   ```
   pub(crate) fn get_data_type(field: ipc::Field, may_be_dictionary: bool) -> DataType {
       get_data_type_context(field, may_be_dictionary).data_type
   ```
   
   
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on pull request #8715: ARROW-10656: [Rust] Use DataType comparison without values

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #8715:
URL: https://github.com/apache/arrow/pull/8715#issuecomment-735222106


   If the CI tests pass, I suggest we update the PR's title too to reflect what this PR does now -- something perhaps like "Do not store (unused) field name in DataType, only store nullability"


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #8715: ARROW-10656: [Rust] Use DataType comparison without values

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #8715:
URL: https://github.com/apache/arrow/pull/8715#issuecomment-730507645


   https://issues.apache.org/jira/browse/ARROW-10656


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] vertexclique commented on pull request #8715: ARROW-10656: [Rust] Use DataType comparison without values

Posted by GitBox <gi...@apache.org>.
vertexclique commented on pull request #8715:
URL: https://github.com/apache/arrow/pull/8715#issuecomment-730700392


   > Update the PartialCmp implementation for DataType to ignore Field names for all DataType == DataType comparisons
   
   A very good approach to go forward. I like it.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nevi-me commented on pull request #8715: ARROW-10656: [Rust] Remove Field from (most) DataTypes, add nullability

Posted by GitBox <gi...@apache.org>.
nevi-me commented on pull request #8715:
URL: https://github.com/apache/arrow/pull/8715#issuecomment-745870189


   Hey @ch-sc, I'm going to experiment with a slightly different approach, and I'll let you know how that goes. I'm aiming to have that done over the weekend. I ended up not getting enough time last week.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ch-sc commented on pull request #8715: ARROW-10656: [Rust] Use DataType comparison without values

Posted by GitBox <gi...@apache.org>.
ch-sc commented on pull request #8715:
URL: https://github.com/apache/arrow/pull/8715#issuecomment-731561884


   @jorgecarleitao no concerns from my side. 
   
   Though, if nullability was the only reason why `Field` was introduced in Lists I would prefer your second proposed option: replace `Field` with `(nullable, data_type)`.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ch-sc commented on pull request #8715: ARROW-10656: [Rust] Use DataType comparison without values

Posted by GitBox <gi...@apache.org>.
ch-sc commented on pull request #8715:
URL: https://github.com/apache/arrow/pull/8715#issuecomment-731255930


   Thank you all for the quick feedback!
   
   > Is the only intended change here to not compare the name of fields? Or do you have other ideas in mind for the future?
   
   Well, I guess what approach to take when comparing depends on the scenario. When calling `try_new()` to create a new record batch I didn't expect this behavior. But I can see the benefits of including more information in the comparison logic; e.g. nullable. 
   In the implementation I followed what I thought would be the expected behavior. Also, it aligned with the comment in the code that I cited in the ticket: _"list types can have different names, but we only need the data types to be the same"_
   
   > Is it easy to explain in what circumstance you have DataTypes that are the same except for the field name?
   
   Keeping track of such metadata adds complexity to the user especially in nested scenarios. I would assume in the majority of cases this information is not relevant when comparing. BTW this is not restricted to field names only, but also dictionary ids and dictionary ordering.
   
   I would second the suggestion from @alamb:
   
   > 2. Update the PartialCmp implementation for DataType to ignore Field names for all DataType == DataType comparisons
   
   ...or have 2 implementations for including metadata or not. I saw a similar approach in C++ for RecordBatch: 
   ```c
   bool RecordBatch::Equals(const RecordBatch& other, bool check_metadata)
   ```
   
   However, we should be clear to the users what is actually compared and what is not. If we end up in a scenario where some attributes play a role and some do not this is really hard to understand for anyone.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on pull request #8715: ARROW-10656: [Rust] Use DataType comparison without values

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #8715:
URL: https://github.com/apache/arrow/pull/8715#issuecomment-731581032


   > However, we should be clear to the users what is actually compared and what is not. If we end up in a scenario where some attributes play a role and some do not this is really hard to understand for anyone.
   
   I agree that the current situation is confusing
   
   I  think the core of the issue is that the "can the field be nullable" is logically part of a `DataType`, though the way the code is now that information is only part of the `DataType` for nested fields (for other structures, the nullability information is only encoded in a `Field` wrapper around the datatype. 
   
   I personally think changing `Field` to `(nulable, data_type)` in List would be a reasonable compromise and at least avoid the confusion about field names


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ch-sc commented on pull request #8715: ARROW-10656: [Rust] Use DataType comparison without values

Posted by GitBox <gi...@apache.org>.
ch-sc commented on pull request #8715:
URL: https://github.com/apache/arrow/pull/8715#issuecomment-732992137


   @alamb thank you for your feedback. 
   
   I agree, DataTypeContext is not ideal. What do you think of `NullableDataType`?
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nevi-me commented on pull request #8715: ARROW-10656: [Rust] Use DataType comparison without values

Posted by GitBox <gi...@apache.org>.
nevi-me commented on pull request #8715:
URL: https://github.com/apache/arrow/pull/8715#issuecomment-735380461


   > @nevi-me , since you fielded the proposal to use `Field` on the DataType, are you ok with this change?
   
   We can go ahead and merge it, I'll only be able to determine its impact correctly when I rebase it onto my parquet work.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao edited a comment on pull request #8715: ARROW-10656: [Rust] Use DataType comparison without values

Posted by GitBox <gi...@apache.org>.
jorgecarleitao edited a comment on pull request #8715:
URL: https://github.com/apache/arrow/pull/8715#issuecomment-730555090


   Hey @ch-sc , thanks for your PR!
   
   @nevi-me, could you help here? I am a bit worried about introducing another comparison of datatypes, but I was unable to find anything in the specification stating that a DataType of a ListArray should have a field name.
   
   My main concern here is that this change would allow the following: I receive a `RecordBatch`, and I want to verify that it is consistent. While doing so, I end up reaching to the conclusion that `batch.schema().field(0).data_type() != batch.column(0).data_type()`. IMO this goes against the whole idea of having a `RecordBatch` in the first place.
   
   OTOH, I also understand the motivation for this change: if the field name is irrelevant, then it should not be used in the comparison.
   
   My feeling is that if we need to introduce a different comparison, this often hints that there is useless information on the `DataType` that we should eliminate. If it is useless but needs to stay for some reason, then my suggestion is that we implement a custom `PartialEq` that ignores it, so that there is a common source of truth about whether two datatypes are equal.
   
   What are your thoughts @nevi-me, @alamb and @ch-sc ?
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nevi-me commented on pull request #8715: ARROW-10656: [Rust] Use DataType comparison without values

Posted by GitBox <gi...@apache.org>.
nevi-me commented on pull request #8715:
URL: https://github.com/apache/arrow/pull/8715#issuecomment-731556654


   One potentially last comment from my side @ch-sc @alamb @jorgecarleitao @vertexclique 
   
   We should support checking nullability for list and struct fields. This is important when writing batches to IPC and Parquet. I came across an issue regarding the comparison of nested nullable structs, I plan on creating a test case and a PR in the coming days for this.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on pull request #8715: ARROW-10656: [Rust] Remove Field from (most) DataTypes, add nullability

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #8715:
URL: https://github.com/apache/arrow/pull/8715#issuecomment-735996221


   I am sorry I missed the integration test failures on this PR :(


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ch-sc commented on pull request #8715: ARROW-10656: [Rust] Remove Field from (most) DataTypes, add nullability

Posted by GitBox <gi...@apache.org>.
ch-sc commented on pull request #8715:
URL: https://github.com/apache/arrow/pull/8715#issuecomment-744319300


   Hi @nevi-me, did you have time to look into the failing integration tests? 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] andygrove commented on pull request #8715: ARROW-10656: [Rust] Use DataType comparison without values

Posted by GitBox <gi...@apache.org>.
andygrove commented on pull request #8715:
URL: https://github.com/apache/arrow/pull/8715#issuecomment-735466192


   @nevi-me @ch-sc It looks like the integration tests were failing on this PR and I'm seeing the same failures on other PRs now. I'm not clear if this regression was caused by merging this PR or not. Could you take a look?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nevi-me closed pull request #8715: ARROW-10656: [Rust] Use DataType comparison without values

Posted by GitBox <gi...@apache.org>.
nevi-me closed pull request #8715:
URL: https://github.com/apache/arrow/pull/8715


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on pull request #8715: ARROW-10656: [Rust] Use DataType comparison without values

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #8715:
URL: https://github.com/apache/arrow/pull/8715#issuecomment-733023446


   @ch-sc  -- I think `NullableDataType` sounds good


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ch-sc commented on pull request #8715: ARROW-10656: [Rust] Remove Field from (most) DataTypes, add nullability

Posted by GitBox <gi...@apache.org>.
ch-sc commented on pull request #8715:
URL: https://github.com/apache/arrow/pull/8715#issuecomment-738167115


   The integration tests are still failling. Based on the error message it looks like fields are not serialized like expected. Therefore, I added manual serialization and deserialization for fields so that the same output is produced like before. That didn't help unfortuately. 
   
   Do you have any ideas @alamb @nevi-me?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ch-sc commented on pull request #8715: ARROW-10656: [Rust] Remove Field from (most) DataTypes, add nullability

Posted by GitBox <gi...@apache.org>.
ch-sc commented on pull request #8715:
URL: https://github.com/apache/arrow/pull/8715#issuecomment-738291083


   No it's not blocking me at the moment. Thank you! 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nevi-me commented on pull request #8715: ARROW-10656: [Rust] Use DataType comparison without values

Posted by GitBox <gi...@apache.org>.
nevi-me commented on pull request #8715:
URL: https://github.com/apache/arrow/pull/8715#issuecomment-735559849


   I have opened #8802 to revert the changes here.
   @ch-sc I missed the integration test failure, so that's on me.
   
   I think we'll need to try this with performing a looser comparison on `RecordBatch::try_new()` instead, as we do need the field names for lists in order to interop with other implementations.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nevi-me commented on pull request #8715: ARROW-10656: [Rust] Remove Field from (most) DataTypes, add nullability

Posted by GitBox <gi...@apache.org>.
nevi-me commented on pull request #8715:
URL: https://github.com/apache/arrow/pull/8715#issuecomment-751980283


   Closing in favour of #8988 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on pull request #8715: ARROW-10656: [Rust] Use DataType comparison without values

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #8715:
URL: https://github.com/apache/arrow/pull/8715#issuecomment-731558992


   I agree with @alamb and @nevi-me. IMO we must not have a second comparison.
   
   Valid options for me:
   1. do nothing
   2. replace `Field` by `(nullable, DataType)`, so that we do not track the field's name on the `DataType` (`Field` is a trivial struct and also contains `dict_is_ordered` and `dict_id` that may not be applicable here)
   3. implement `PartialEq` to ignore the field's name (note, `PartialCmp` does not exist)
   
   Along with Hash, equality is one of the most important operations on a `struct` in Rust, and trying to create variations of it adds a lot of confusion. Like the example I gave, users can no longer assume that two arrays on a `RecordBatch` will have equal DataTypes and equal to the RecordBatch's Schema's DataType, which essentially makes a RecordBatch's schema kind of useless. Like @nevi-me's point, disregarding nullability breaks integration with C++.
   
   @ch-sc , what is your concern with updating `PartialEq`?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ch-sc commented on a change in pull request #8715: ARROW-10656: [Rust] Use DataType comparison without values

Posted by GitBox <gi...@apache.org>.
ch-sc commented on a change in pull request #8715:
URL: https://github.com/apache/arrow/pull/8715#discussion_r529612290



##########
File path: rust/arrow/src/ipc/convert.rs
##########
@@ -268,22 +270,38 @@ pub(crate) fn get_data_type(field: ipc::Field, may_be_dictionary: bool) -> DataT
             if children.len() != 1 {
                 panic!("expect a list to have one child")
             }
-            DataType::List(Box::new(children.get(0).into()))
+            let child_field = children.get(0);
+            // returning int16 for now, to test, not sure how to get data type

Review comment:
       Changed it, but seems like some integration tests are still failing. Any ideas?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ch-sc commented on a change in pull request #8715: ARROW-10656: [Rust] Use DataType comparison without values

Posted by GitBox <gi...@apache.org>.
ch-sc commented on a change in pull request #8715:
URL: https://github.com/apache/arrow/pull/8715#discussion_r527793254



##########
File path: rust/arrow/src/datatypes.rs
##########
@@ -1142,6 +1142,44 @@ impl DataType {
                 | Float64
         )
     }
+
+    /// Compares this data type with another data type only based on the data type
+    /// including nested data types, but not based on other values.
+    pub fn cmp_type(&self, other: &Self) -> bool {
+        match (self, other) {
+            (DataType::List(f1), DataType::List(f2)) => {
+                f1.data_type().cmp_type(f2.data_type())
+            }
+            (DataType::FixedSizeList(f1, _), DataType::FixedSizeList(f2, _)) => {
+                f1.data_type().cmp_type(f2.data_type())
+            }
+            (DataType::LargeList(f1), DataType::LargeList(f2)) => {
+                f1.data_type().cmp_type(f2.data_type())
+            }
+            (DataType::Struct(f1), DataType::Struct(f2)) => {
+                if f1.len() == f2.len() {

Review comment:
       Yeah that looks much better.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on pull request #8715: ARROW-10656: [Rust] Use DataType comparison without values

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #8715:
URL: https://github.com/apache/arrow/pull/8715#issuecomment-730555090


   Hey @ch-sc , thanks for your PR!
   
   @nevi-me, could you help here? I am a bit worried about introducing another comparison of datatypes, but I was unable to find anything in the specification stating that a DataType of a ListArray should have a field name.
   
   My main concern here is that this change would allow the following: I receive a `RecordBatch`, and I want to verify that it is consistent. While doing so, I end up reaching to the conclusion that batch.schema().field(0).data_type() != batch.column(0).data_type()`. IMO this goes against the whole idea of having a `RecordBatch` in the first place.
   
   OTOH, I also understand the motivation for this change: if the field name is irrelevant, then it should not be used in the comparison.
   
   My feeling is that if we need to introduce a different comparison, this often hints that there is useless information on the `DataType` that we should eliminate. If it is useless but needs to stay for some reason, then my suggestion is that we implement a custom `PartialEq` that ignores it, so that there is a common source of truth about whether two datatypes are equal.
   
   What are your thoughts @nevi-me, @alamb and @ch-sc ?
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ch-sc commented on pull request #8715: ARROW-10656: [Rust] Use DataType comparison without values

Posted by GitBox <gi...@apache.org>.
ch-sc commented on pull request #8715:
URL: https://github.com/apache/arrow/pull/8715#issuecomment-735682092


   Hey, sorry I didn't look into github last weekend. @nevi-me can you re-open the PR? I'll try to fix the integration tests.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ch-sc commented on a change in pull request #8715: ARROW-10656: [Rust] Use DataType comparison without values

Posted by GitBox <gi...@apache.org>.
ch-sc commented on a change in pull request #8715:
URL: https://github.com/apache/arrow/pull/8715#discussion_r527798357



##########
File path: rust/arrow/src/datatypes.rs
##########
@@ -1142,6 +1142,44 @@ impl DataType {
                 | Float64
         )
     }
+
+    /// Compares this data type with another data type only based on the data type
+    /// including nested data types, but not based on other values.
+    pub fn cmp_type(&self, other: &Self) -> bool {

Review comment:
       True, `cmp()` usually returns `Ordering`. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nevi-me closed pull request #8715: ARROW-10656: [Rust] Remove Field from (most) DataTypes, add nullability

Posted by GitBox <gi...@apache.org>.
nevi-me closed pull request #8715:
URL: https://github.com/apache/arrow/pull/8715


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on a change in pull request #8715: ARROW-10656: [Rust] Use DataType comparison without values

Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #8715:
URL: https://github.com/apache/arrow/pull/8715#discussion_r527093860



##########
File path: rust/arrow/src/datatypes.rs
##########
@@ -1142,6 +1142,44 @@ impl DataType {
                 | Float64
         )
     }
+
+    /// Compares this data type with another data type only based on the data type
+    /// including nested data types, but not based on other values.
+    pub fn cmp_type(&self, other: &Self) -> bool {

Review comment:
       maybe `equals` instead of compare?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on pull request #8715: ARROW-10656: [Rust] Use DataType comparison without values

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #8715:
URL: https://github.com/apache/arrow/pull/8715#issuecomment-735221904


   @ch-sc  -- I think this PR needs a rebase against `master` -- the issue with the integration tests has been fixed I believe


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nevi-me commented on pull request #8715: ARROW-10656: [Rust] Use DataType comparison without values

Posted by GitBox <gi...@apache.org>.
nevi-me commented on pull request #8715:
URL: https://github.com/apache/arrow/pull/8715#issuecomment-730640190


   > My feeling is that if we need to introduce a different comparison, this often hints that there is useless information on the `DataType` that we should eliminate. If it is useless but needs to stay for some reason, then my suggestion is that we implement a custom `PartialEq` that ignores it, so that there is a common source of truth about whether two datatypes are equal.
   
   @jorgecarleitao I've also checked the specification, and it's unclear on what should happen. It looks like the C++ implementation doesn't check the field names (https://github.com/apache/arrow/blob/master/cpp/src/arrow/record_batch.cc).
   
   I made the change for IPC integration purposes, as we were failing tests because of field names.
   I think not checking the names when constructing record batches is fine in that regard (as the tests still pass).
   
   I'm trying to think of what the broader implications of this change would be.
   
   I'm so far leaning on saying "we need not check the field name when constructing record batches, but for lists and structs, we should continue checking nullability of the `Field`. When writing Parquet files, having one batch's column being nullable, then the next batch's not being nullable, would break some things.
   
   I haven't looked at the code yet, but if the change can be isolated to only be used in the `RecordBatch::try_new()` part (at least for now), then it should be a reasonable change.
   
   Interestingly, I went down this (or similar route) with data type equality, before realising that we needed a `Field` on lists.
   
   One last matter, which might not be a concern, is that in the specification, a `Field` can carry metadata. So, in what conditions would we need to compare field metadata? I haven't looked at `ExtensionArray` yet, but I think the field metadata is used there.
   
   > I would personally suggest one of the following:
   > 
   > 1. Keep the comparison code as is (and include the field name) as there is some reasonable story that it is "part" of the type
   > 2. Update the `PartialCmp` implementation for `DataType` to ignore Field names for all `DataType` == `DataType` comparisons
   
   I like this approach from @alamb 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on a change in pull request #8715: ARROW-10656: [Rust] Use DataType comparison without values

Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #8715:
URL: https://github.com/apache/arrow/pull/8715#discussion_r527092867



##########
File path: rust/arrow/src/datatypes.rs
##########
@@ -1142,6 +1142,44 @@ impl DataType {
                 | Float64
         )
     }
+
+    /// Compares this data type with another data type only based on the data type
+    /// including nested data types, but not based on other values.
+    pub fn cmp_type(&self, other: &Self) -> bool {
+        match (self, other) {
+            (DataType::List(f1), DataType::List(f2)) => {
+                f1.data_type().cmp_type(f2.data_type())
+            }
+            (DataType::FixedSizeList(f1, _), DataType::FixedSizeList(f2, _)) => {
+                f1.data_type().cmp_type(f2.data_type())
+            }
+            (DataType::LargeList(f1), DataType::LargeList(f2)) => {
+                f1.data_type().cmp_type(f2.data_type())
+            }
+            (DataType::Struct(f1), DataType::Struct(f2)) => {
+                if f1.len() == f2.len() {

Review comment:
       Can all use the form `f1.len() == f2.len() && f1.iter()...` without if else




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on pull request #8715: ARROW-10656: [Rust] Use DataType comparison without values

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #8715:
URL: https://github.com/apache/arrow/pull/8715#issuecomment-735378990


   @nevi-me , since you fielded the proposal to use `Field` on the DataType, are you ok with this change?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nevi-me commented on pull request #8715: ARROW-10656: [Rust] Remove Field from (most) DataTypes, add nullability

Posted by GitBox <gi...@apache.org>.
nevi-me commented on pull request #8715:
URL: https://github.com/apache/arrow/pull/8715#issuecomment-738274461


   > Do you have any ideas @alamb @nevi-me?
   
   Is this work blocking you? Otherwise I'll be ablet to look in detail over the weekend.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nevi-me commented on pull request #8715: ARROW-10656: [Rust] Use DataType comparison without values

Posted by GitBox <gi...@apache.org>.
nevi-me commented on pull request #8715:
URL: https://github.com/apache/arrow/pull/8715#issuecomment-735466540


   My apologies, I'm not sure of how I missed this. When I merged this, I only saw unrelated failures in CI.
   It's past my bedtime, so I'll look at this before I go to work tomorrow morning


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org