You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/11/04 15:39:46 UTC

[GitHub] [arrow] carols10cents opened a new pull request #8590: ARROW-10042: [Rust] Only compare BufferData's data for equality

carols10cents opened a new pull request #8590:
URL: https://github.com/apache/arrow/pull/8590


   And disregard differences in capacity.
   
   This gets 2 ignored Parquet -> Arrow roundtrip tests to pass because they were only failing due to differences in `BufferData` capacities.
   
   I noticed that `MutableBuffer`'s `PartialEq` also checks `capacity`, and I think that's still correct: two `MutableBuffer`s who differ only in `capacity` could have different behavior if you try to add the same data to each of them and one doesn't have sufficient capacity.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] paddyhoran commented on pull request #8590: ARROW-10042: [Rust] Only compare BufferData's data for equality

Posted by GitBox <gi...@apache.org>.
paddyhoran commented on pull request #8590:
URL: https://github.com/apache/arrow/pull/8590#issuecomment-721893832


   Here is the PR: #4331
   
   See the comment about half way down (I'm can't seem to link to it directly)


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #8590: ARROW-10042: [Rust] Only compare BufferData's data for equality

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #8590:
URL: https://github.com/apache/arrow/pull/8590#issuecomment-721810399


   https://issues.apache.org/jira/browse/ARROW-10042


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on pull request #8590: ARROW-10042: [Rust] Fix tests involving ArrayData/Buffer equality

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #8590:
URL: https://github.com/apache/arrow/pull/8590#issuecomment-725016059


   The tests have passed: 
   ![Screen Shot 2020-11-10 at 5 53 11 PM](https://user-images.githubusercontent.com/490673/98743182-9a354a80-237d-11eb-9ae3-f9a8b5f88e37.png)
   
   I'll merge it


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jhorstmann commented on a change in pull request #8590: ARROW-10042: [Rust] BufferData's equality should depend on its capacity; ArrayData's equality should not depend on its BufferDatas' capacities' equality

Posted by GitBox <gi...@apache.org>.
jhorstmann commented on a change in pull request #8590:
URL: https://github.com/apache/arrow/pull/8590#discussion_r519016891



##########
File path: rust/arrow/src/array/data.rs
##########
@@ -211,57 +211,66 @@ impl ArrayData {
 
 impl PartialEq for ArrayData {
     fn eq(&self, other: &Self) -> bool {
-        assert_eq!(
-            self.data_type(),
-            other.data_type(),
-            "Data types not the same"
-        );
-        assert_eq!(self.len(), other.len(), "Lengths not the same");
-        // TODO: when adding tests for this, test that we can compare with arrays that have offsets
-        assert_eq!(self.offset(), other.offset(), "Offsets not the same");
-        assert_eq!(self.null_count(), other.null_count());
-        // compare buffers excluding padding
-        let self_buffers = self.buffers();
-        let other_buffers = other.buffers();
-        assert_eq!(self_buffers.len(), other_buffers.len());
-        self_buffers.iter().zip(other_buffers).for_each(|(s, o)| {
-            compare_buffer_regions(
-                s,
-                self.offset(), // TODO mul by data length
-                o,
-                other.offset(), // TODO mul by data len
-            );
-        });
-        // assert_eq!(self.buffers(), other.buffers());
-
-        assert_eq!(self.child_data(), other.child_data());
-        // null arrays can skip the null bitmap, thus only compare if there are no nulls
-        if self.null_count() != 0 || other.null_count() != 0 {
-            compare_buffer_regions(
+        self.data_type() == other.data_type()
+            && self.len() == other.len()
+            // TODO: when adding tests for this, test that we can compare with arrays that have
+            // offsets
+            && self.offset() == other.offset()
+            && compare_buffers(
+                    self.buffers(),
+                    self.offset(),
+                    other.buffers(),
+                    other.offset(),
+                )
+            && self.child_data() == other.child_data()
+            && self.null_count() == other.null_count()
+            // null arrays can skip the null bitmap, thus only compare if there are no nulls
+            // previous line would fail if null counts differed, so this check only needs to
+            // check one null_count to see if there are no nulls
+            && (self.null_count() == 0 || compare_buffer_regions(

Review comment:
       Looking at that function again, `compare_buffers_regions` and also `PartialEq for ArrayData` do `assert_eq!` instead of returning true/false, that is probably also not intended.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb closed pull request #8590: ARROW-10042: [Rust] Fix tests involving ArrayData/Buffer equality

Posted by GitBox <gi...@apache.org>.
alamb closed pull request #8590:
URL: https://github.com/apache/arrow/pull/8590


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] carols10cents commented on pull request #8590: ARROW-10042: [Rust] BufferData's equality should depend on its capacity; ArrayData's equality should not depend on its BufferDatas' capacities' equality

Posted by GitBox <gi...@apache.org>.
carols10cents commented on pull request #8590:
URL: https://github.com/apache/arrow/pull/8590#issuecomment-724992839


   @nevi-me done! The only changes left here are in tests :) Thank you all for the work related to this!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] carols10cents commented on pull request #8590: ARROW-10042: [Rust] Only compare BufferData's data for equality

Posted by GitBox <gi...@apache.org>.
carols10cents commented on pull request #8590:
URL: https://github.com/apache/arrow/pull/8590#issuecomment-721903270


   If I remove the `impl PartialEq for BufferData` entirely, compiling then complains that the `#[derive(PartialEq)]` on `Buffer` can't work. If I remove the derived `PartialEq` on `Buffer`, I then get compilation errors in places such as:
   
   - https://github.com/apache/arrow/blob/913cd7612aebd235eaa35bf8b344448e281b2cbe/rust/arrow/src/array/data.rs#L264 the new-ish `compare_buffer_regions` function that I think is relevant to [this discussion](https://github.com/apache/arrow/pull/8541/files#r516989210)
   - https://github.com/apache/arrow/blob/5355f7cbc031b2dd4dcdf55aa15cb5c0ef80bea4/rust/arrow/src/array/array.rs#L2429 and a bunch of other tests in that file
   - https://github.com/apache/arrow/blob/913cd7612aebd235eaa35bf8b344448e281b2cbe/rust/arrow/src/array/builder.rs#L2705 and a bunch of other tests in that file
   - https://github.com/apache/arrow/blob/913cd7612aebd235eaa35bf8b344448e281b2cbe/rust/arrow/src/array/union.rs#L656-L659 and a bunch of other tests in that file 
   - A bunch of tests in arrow/src/buffer.rs of course
   - https://github.com/apache/arrow/blob/913cd7612aebd235eaa35bf8b344448e281b2cbe/rust/arrow/src/compute/kernels/cast.rs#L1365-L1368
   - https://github.com/apache/arrow/blob/913cd7612aebd235eaa35bf8b344448e281b2cbe/rust/arrow/src/compute/kernels/filter.rs#L1117-L1120 and a bunch of other tests in that file
   - https://github.com/apache/arrow/blob/913cd7612aebd235eaa35bf8b344448e281b2cbe/rust/arrow/src/compute/util.rs#L241-L244 and a bunch of other tests in that file
   
   I'm not sure enough of those cases whether they do expect the `capacity` to be equal and mean this `PartialEq` should stay as-is, or if these cases are being tested at too low a level and should be checking `ArrayData` instead... please let me know what you all think!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nevi-me commented on pull request #8590: ARROW-10042: [Rust] Fix tests involving ArrayData/Buffer equality

Posted by GitBox <gi...@apache.org>.
nevi-me commented on pull request #8590:
URL: https://github.com/apache/arrow/pull/8590#issuecomment-725014853


   > @nevi-me and @jorgecarleitao I think we should merge this PR -- is there anything else that needs to be done before we do so?
   
   You may go ahead and merge it :) I was waiting for integration tests to complete, but I'm about to head to sleep now


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] carols10cents edited a comment on pull request #8590: ARROW-10042: [Rust] BufferData's equality should depend on its capacity; ArrayData's equality should not depend on its BufferDatas' capacities' equality

Posted by GitBox <gi...@apache.org>.
carols10cents edited a comment on pull request #8590:
URL: https://github.com/apache/arrow/pull/8590#issuecomment-723248048


   @jorgecarleitao @nevi-me @alamb I've just force pushed an alternative solution to this problem that doesn't change `BufferData`'s `PartialEq` implementation, and I've added more tests to demonstrate the problems I'm trying to solve. I would appreciate another look and to know what you think!
   
   I've updated the PR title and description. I'm still associating this to the same JIRA ticket as I think this change will "resolve" the question posed in the ticket, but it no longer "fixes" the problem as proposed there.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on pull request #8590: ARROW-10042: [Rust] Only compare BufferData's data for equality

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #8590:
URL: https://github.com/apache/arrow/pull/8590#issuecomment-721818365


   Could you describe where do we need to compare buffers?
   
   I am asking because I have worked on this problem before, and in my last iteration of this, I abandoned the idea of comparing buffers: a `Buffer` is purely a physical representation of data without any logical interpretation: two buffers from two different arrays can be equal and the arrays still be different (e.g. due to the nullability), and the opposite is also true: two arrays can be equal but have different buffers (e.g. if the child data is different, or if the datatype is different).
   
   IMO the equality should be done via `ArrayData`, which contains all the relevant data _and_ logical representation of that data -- `DataType`, to make it possible to logically compare two arrays. I fielded #8541 with that idea, which IMO will also help the parquet work. I think that @nevi-me also had some ideas in mind.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao edited a comment on pull request #8590: ARROW-10042: [Rust] Only compare BufferData's data for equality

Posted by GitBox <gi...@apache.org>.
jorgecarleitao edited a comment on pull request #8590:
URL: https://github.com/apache/arrow/pull/8590#issuecomment-721818365


   Could you describe where do we need to compare buffers?
   
   I am asking because I have worked on this problem before, and in my last iteration of this, I abandoned the idea of comparing buffers: a `Buffer` is purely a physical representation of data without any logical interpretation: two buffers from two different arrays can be equal and the arrays still be different (e.g. due to the nullability or offset), and the opposite is also true: two arrays can be equal but have different buffers (e.g. if the child data is different, or if the datatype is different).
   
   My current hypothesis is that equality should be done via `ArrayData`, which contains all the relevant data _and_ logical representation of that data -- `DataType`, to make it possible to logically compare two arrays. I fielded #8541 with that idea, which IMO will also help the parquet work. I think that @nevi-me also had some ideas in mind.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nevi-me commented on pull request #8590: ARROW-10042: [Rust] BufferData's equality should depend on its capacity; ArrayData's equality should not depend on its BufferDatas' capacities' equality

Posted by GitBox <gi...@apache.org>.
nevi-me commented on pull request #8590:
URL: https://github.com/apache/arrow/pull/8590#issuecomment-723429467


   > One idea:
   > 
   > 1. Make the array comparison on the parquet tests be done using the `PartialEq`, like this PR is doing
   > 2. Make `PartialEq` use the function `equal` being introduced on #8541
   > 3. Remove both `#[ignore]`
   
   I also like this approach, as #8541 now accounts for logical equality


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on pull request #8590: ARROW-10042: [Rust] Only compare BufferData's data for equality

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #8590:
URL: https://github.com/apache/arrow/pull/8590#issuecomment-721939420


   @carols10cents , I do not think that we should remove `impl PartialEq for BufferData`. I am only trying to understand why would we would like to compare two `Buffer` and not take the `ArrayData::offset` into account in the equality. Without the offset, comparing buffers is basically comparing physical memory, and IMO has limited use-cases.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] carols10cents commented on a change in pull request #8590: ARROW-10042: [Rust] BufferData's equality should depend on its capacity; ArrayData's equality should not depend on its BufferDatas' capacities' equality

Posted by GitBox <gi...@apache.org>.
carols10cents commented on a change in pull request #8590:
URL: https://github.com/apache/arrow/pull/8590#discussion_r518973664



##########
File path: rust/arrow/src/array/data.rs
##########
@@ -211,57 +211,66 @@ impl ArrayData {
 
 impl PartialEq for ArrayData {
     fn eq(&self, other: &Self) -> bool {
-        assert_eq!(
-            self.data_type(),
-            other.data_type(),
-            "Data types not the same"
-        );
-        assert_eq!(self.len(), other.len(), "Lengths not the same");
-        // TODO: when adding tests for this, test that we can compare with arrays that have offsets
-        assert_eq!(self.offset(), other.offset(), "Offsets not the same");
-        assert_eq!(self.null_count(), other.null_count());
-        // compare buffers excluding padding
-        let self_buffers = self.buffers();
-        let other_buffers = other.buffers();
-        assert_eq!(self_buffers.len(), other_buffers.len());
-        self_buffers.iter().zip(other_buffers).for_each(|(s, o)| {
-            compare_buffer_regions(
-                s,
-                self.offset(), // TODO mul by data length
-                o,
-                other.offset(), // TODO mul by data len
-            );
-        });
-        // assert_eq!(self.buffers(), other.buffers());
-
-        assert_eq!(self.child_data(), other.child_data());
-        // null arrays can skip the null bitmap, thus only compare if there are no nulls
-        if self.null_count() != 0 || other.null_count() != 0 {
-            compare_buffer_regions(
+        self.data_type() == other.data_type()
+            && self.len() == other.len()
+            // TODO: when adding tests for this, test that we can compare with arrays that have
+            // offsets
+            && self.offset() == other.offset()
+            && compare_buffers(
+                    self.buffers(),
+                    self.offset(),
+                    other.buffers(),
+                    other.offset(),
+                )
+            && self.child_data() == other.child_data()
+            && self.null_count() == other.null_count()
+            // null arrays can skip the null bitmap, thus only compare if there are no nulls

Review comment:
       I'm pretty sure those are @nevi-me 's, I just kept them in there :)




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on pull request #8590: ARROW-10042: [Rust] BufferData's equality should depend on its capacity; ArrayData's equality should not depend on its BufferDatas' capacities' equality

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #8590:
URL: https://github.com/apache/arrow/pull/8590#issuecomment-723399529


   Also, I very much agree with @carols10cents that we need a methodology to compare arrays with granularity :)


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jhorstmann commented on a change in pull request #8590: ARROW-10042: [Rust] BufferData's equality should depend on its capacity; ArrayData's equality should not depend on its BufferDatas' capacities' equality

Posted by GitBox <gi...@apache.org>.
jhorstmann commented on a change in pull request #8590:
URL: https://github.com/apache/arrow/pull/8590#discussion_r519011669



##########
File path: rust/arrow/src/array/data.rs
##########
@@ -211,57 +211,66 @@ impl ArrayData {
 
 impl PartialEq for ArrayData {
     fn eq(&self, other: &Self) -> bool {
-        assert_eq!(
-            self.data_type(),
-            other.data_type(),
-            "Data types not the same"
-        );
-        assert_eq!(self.len(), other.len(), "Lengths not the same");
-        // TODO: when adding tests for this, test that we can compare with arrays that have offsets
-        assert_eq!(self.offset(), other.offset(), "Offsets not the same");
-        assert_eq!(self.null_count(), other.null_count());
-        // compare buffers excluding padding
-        let self_buffers = self.buffers();
-        let other_buffers = other.buffers();
-        assert_eq!(self_buffers.len(), other_buffers.len());
-        self_buffers.iter().zip(other_buffers).for_each(|(s, o)| {
-            compare_buffer_regions(
-                s,
-                self.offset(), // TODO mul by data length
-                o,
-                other.offset(), // TODO mul by data len
-            );
-        });
-        // assert_eq!(self.buffers(), other.buffers());
-
-        assert_eq!(self.child_data(), other.child_data());
-        // null arrays can skip the null bitmap, thus only compare if there are no nulls
-        if self.null_count() != 0 || other.null_count() != 0 {
-            compare_buffer_regions(
+        self.data_type() == other.data_type()
+            && self.len() == other.len()
+            // TODO: when adding tests for this, test that we can compare with arrays that have
+            // offsets
+            && self.offset() == other.offset()

Review comment:
       I think the offset does not have to be equal. Consider a buffer `[1,2,3,4,1,2,3,4]` and two arrays referencing this buffer, one with offset 0 and len 4, the other with offset 4 and len 4. Both arrays should be considered to contain the same 4 elements. This also means the `compare_buffers` function needs to take the array length as additional parameter.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jhorstmann commented on a change in pull request #8590: ARROW-10042: [Rust] BufferData's equality should depend on its capacity; ArrayData's equality should not depend on its BufferDatas' capacities' equality

Posted by GitBox <gi...@apache.org>.
jhorstmann commented on a change in pull request #8590:
URL: https://github.com/apache/arrow/pull/8590#discussion_r519016891



##########
File path: rust/arrow/src/array/data.rs
##########
@@ -211,57 +211,66 @@ impl ArrayData {
 
 impl PartialEq for ArrayData {
     fn eq(&self, other: &Self) -> bool {
-        assert_eq!(
-            self.data_type(),
-            other.data_type(),
-            "Data types not the same"
-        );
-        assert_eq!(self.len(), other.len(), "Lengths not the same");
-        // TODO: when adding tests for this, test that we can compare with arrays that have offsets
-        assert_eq!(self.offset(), other.offset(), "Offsets not the same");
-        assert_eq!(self.null_count(), other.null_count());
-        // compare buffers excluding padding
-        let self_buffers = self.buffers();
-        let other_buffers = other.buffers();
-        assert_eq!(self_buffers.len(), other_buffers.len());
-        self_buffers.iter().zip(other_buffers).for_each(|(s, o)| {
-            compare_buffer_regions(
-                s,
-                self.offset(), // TODO mul by data length
-                o,
-                other.offset(), // TODO mul by data len
-            );
-        });
-        // assert_eq!(self.buffers(), other.buffers());
-
-        assert_eq!(self.child_data(), other.child_data());
-        // null arrays can skip the null bitmap, thus only compare if there are no nulls
-        if self.null_count() != 0 || other.null_count() != 0 {
-            compare_buffer_regions(
+        self.data_type() == other.data_type()
+            && self.len() == other.len()
+            // TODO: when adding tests for this, test that we can compare with arrays that have
+            // offsets
+            && self.offset() == other.offset()
+            && compare_buffers(
+                    self.buffers(),
+                    self.offset(),
+                    other.buffers(),
+                    other.offset(),
+                )
+            && self.child_data() == other.child_data()
+            && self.null_count() == other.null_count()
+            // null arrays can skip the null bitmap, thus only compare if there are no nulls
+            // previous line would fail if null counts differed, so this check only needs to
+            // check one null_count to see if there are no nulls
+            && (self.null_count() == 0 || compare_buffer_regions(

Review comment:
       Looking at that function again, <del>`compare_buffers_regions` and also</del> `PartialEq for ArrayData` do `assert_eq!` instead of returning true/false, that is probably also not intended.
   
   Update: Ok, I see the return type is fixed in this PR.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] carols10cents commented on pull request #8590: ARROW-10042: [Rust] Only compare BufferData's data for equality

Posted by GitBox <gi...@apache.org>.
carols10cents commented on pull request #8590:
URL: https://github.com/apache/arrow/pull/8590#issuecomment-721941737


   > I do not think that we should remove `impl PartialEq for BufferData`. I am only trying to understand why would we would like to compare two `Buffer` and not take the `ArrayData::offset` into account in the equality. Without the offset, comparing buffers is basically comparing physical memory, and IMO has limited use-cases.
   
   I was not proposing that we remove `impl PartialEq for BufferData`, I was only doing that to get the compiler to tell me about all the places we're using it.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] carols10cents commented on pull request #8590: ARROW-10042: [Rust] BufferData's equality should depend on its capacity; ArrayData's equality should not depend on its BufferDatas' capacities' equality

Posted by GitBox <gi...@apache.org>.
carols10cents commented on pull request #8590:
URL: https://github.com/apache/arrow/pull/8590#issuecomment-723248048


   @jorgecarleitao @nevi-me @alamb I've just force pushed an alternative solution to this problem that doesn't change `BufferData`'s `PartialEq` implementation, and I've added more tests to demonstrate the problems I'm trying to solve. I would appreciate another look and to know what you think!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on a change in pull request #8590: ARROW-10042: [Rust] BufferData's equality should depend on its capacity; ArrayData's equality should not depend on its BufferDatas' capacities' equality

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #8590:
URL: https://github.com/apache/arrow/pull/8590#discussion_r518949715



##########
File path: rust/arrow/src/array/data.rs
##########
@@ -211,57 +211,66 @@ impl ArrayData {
 
 impl PartialEq for ArrayData {
     fn eq(&self, other: &Self) -> bool {
-        assert_eq!(
-            self.data_type(),
-            other.data_type(),
-            "Data types not the same"
-        );
-        assert_eq!(self.len(), other.len(), "Lengths not the same");
-        // TODO: when adding tests for this, test that we can compare with arrays that have offsets
-        assert_eq!(self.offset(), other.offset(), "Offsets not the same");
-        assert_eq!(self.null_count(), other.null_count());
-        // compare buffers excluding padding
-        let self_buffers = self.buffers();
-        let other_buffers = other.buffers();
-        assert_eq!(self_buffers.len(), other_buffers.len());
-        self_buffers.iter().zip(other_buffers).for_each(|(s, o)| {
-            compare_buffer_regions(
-                s,
-                self.offset(), // TODO mul by data length
-                o,
-                other.offset(), // TODO mul by data len
-            );
-        });
-        // assert_eq!(self.buffers(), other.buffers());
-
-        assert_eq!(self.child_data(), other.child_data());
-        // null arrays can skip the null bitmap, thus only compare if there are no nulls
-        if self.null_count() != 0 || other.null_count() != 0 {
-            compare_buffer_regions(
+        self.data_type() == other.data_type()
+            && self.len() == other.len()
+            // TODO: when adding tests for this, test that we can compare with arrays that have
+            // offsets
+            && self.offset() == other.offset()
+            && compare_buffers(
+                    self.buffers(),
+                    self.offset(),
+                    other.buffers(),
+                    other.offset(),
+                )
+            && self.child_data() == other.child_data()
+            && self.null_count() == other.null_count()
+            // null arrays can skip the null bitmap, thus only compare if there are no nulls

Review comment:
       👍  for comments




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] carols10cents commented on pull request #8590: ARROW-10042: [Rust] Fix tests involving ArrayData/Buffer equality

Posted by GitBox <gi...@apache.org>.
carols10cents commented on pull request #8590:
URL: https://github.com/apache/arrow/pull/8590#issuecomment-725014079


   (updated)


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on pull request #8590: ARROW-10042: [Rust] BufferData's equality should depend on its capacity; ArrayData's equality should not depend on its BufferDatas' capacities' equality

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #8590:
URL: https://github.com/apache/arrow/pull/8590#issuecomment-725012752


   @nevi-me  and @jorgecarleitao  I think we should merge this PR -- is there anything else that needs to be done before we do so?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] carols10cents commented on pull request #8590: ARROW-10042: [Rust] Only compare BufferData's data for equality

Posted by GitBox <gi...@apache.org>.
carols10cents commented on pull request #8590:
URL: https://github.com/apache/arrow/pull/8590#issuecomment-721995452


   Additional data to consider: I'm not confident the implementations are exactly comparable, but the C++ Buffer type does not take its [`capacity`](https://github.com/apache/arrow/blob/1c223f517f30f4d577a00eae3d0dbca929b2ffac/cpp/src/arrow/buffer.h#L64) into account [in its `Equals` implementations](https://github.com/apache/arrow/blob/1c223f517f30f4d577a00eae3d0dbca929b2ffac/cpp/src/arrow/buffer.cc#L91-L101).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nevi-me commented on pull request #8590: ARROW-10042: [Rust] BufferData's equality should depend on its capacity; ArrayData's equality should not depend on its BufferDatas' capacities' equality

Posted by GitBox <gi...@apache.org>.
nevi-me commented on pull request #8590:
URL: https://github.com/apache/arrow/pull/8590#issuecomment-724977488


   Hi @carols10cents, with the IPC integration and logical equality PRs merged, this should be ready (the Parquet changes).
   We can merge it in after you rebase :)


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jhorstmann commented on a change in pull request #8590: ARROW-10042: [Rust] BufferData's equality should depend on its capacity; ArrayData's equality should not depend on its BufferDatas' capacities' equality

Posted by GitBox <gi...@apache.org>.
jhorstmann commented on a change in pull request #8590:
URL: https://github.com/apache/arrow/pull/8590#discussion_r519014044



##########
File path: rust/arrow/src/array/data.rs
##########
@@ -211,57 +211,66 @@ impl ArrayData {
 
 impl PartialEq for ArrayData {
     fn eq(&self, other: &Self) -> bool {
-        assert_eq!(
-            self.data_type(),
-            other.data_type(),
-            "Data types not the same"
-        );
-        assert_eq!(self.len(), other.len(), "Lengths not the same");
-        // TODO: when adding tests for this, test that we can compare with arrays that have offsets
-        assert_eq!(self.offset(), other.offset(), "Offsets not the same");
-        assert_eq!(self.null_count(), other.null_count());
-        // compare buffers excluding padding
-        let self_buffers = self.buffers();
-        let other_buffers = other.buffers();
-        assert_eq!(self_buffers.len(), other_buffers.len());
-        self_buffers.iter().zip(other_buffers).for_each(|(s, o)| {
-            compare_buffer_regions(
-                s,
-                self.offset(), // TODO mul by data length
-                o,
-                other.offset(), // TODO mul by data len
-            );
-        });
-        // assert_eq!(self.buffers(), other.buffers());
-
-        assert_eq!(self.child_data(), other.child_data());
-        // null arrays can skip the null bitmap, thus only compare if there are no nulls
-        if self.null_count() != 0 || other.null_count() != 0 {
-            compare_buffer_regions(
+        self.data_type() == other.data_type()
+            && self.len() == other.len()
+            // TODO: when adding tests for this, test that we can compare with arrays that have
+            // offsets
+            && self.offset() == other.offset()
+            && compare_buffers(
+                    self.buffers(),
+                    self.offset(),
+                    other.buffers(),
+                    other.offset(),
+                )
+            && self.child_data() == other.child_data()
+            && self.null_count() == other.null_count()
+            // null arrays can skip the null bitmap, thus only compare if there are no nulls
+            // previous line would fail if null counts differed, so this check only needs to
+            // check one null_count to see if there are no nulls
+            && (self.null_count() == 0 || compare_buffer_regions(

Review comment:
       `compare_buffers_regions` also needs an additional parameter for the array length to be correct since the array could be a slice of some bigger underlying buffers.
   
   I missed the PR that introduced `compare_buffers_regions`, the way it is written it will only work correctly for bitpacked buffers. That is correct for this call, but I found another usage where it is used for other buffers. It uses the `buffer.bit_slice` method which is documented to take an offset in bits.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] paddyhoran commented on pull request #8590: ARROW-10042: [Rust] Only compare BufferData's data for equality

Posted by GitBox <gi...@apache.org>.
paddyhoran commented on pull request #8590:
URL: https://github.com/apache/arrow/pull/8590#issuecomment-721891808


   I can't find the reference now but this was discussed some time ago on a PR.  The thinking for taking `capacity` into account when comparing `Buffer`'s was that array equality, etc. would be done at a higher level as @jorgecarleitao described.  I think that @sunchao, in particular, favored taking `capacity` into account when comparing `Buffer`'s.  i.e. this is a more low-level equality check.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on pull request #8590: ARROW-10042: [Rust] BufferData's equality should depend on its capacity; ArrayData's equality should not depend on its BufferDatas' capacities' equality

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #8590:
URL: https://github.com/apache/arrow/pull/8590#issuecomment-725013368


   (though the title may need to be updated)


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org