You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/10/27 18:05:59 UTC

[GitHub] [arrow] jorgecarleitao opened a new pull request #8541: ARROW-10402: [Rust] Refactor array equality

jorgecarleitao opened a new pull request #8541:
URL: https://github.com/apache/arrow/pull/8541


   This is a major refactor of the `equal.rs` module. :/
   
   The rational for this change is many fold: 
   
   * currently array comparison requires downcasting the array ref to its concrete types. This is painful and not very ergonomics, as the user must "guess" what to downcast for comparison. We can see this in the hacks around `sort`, `take` and `concatenate` kernel's tests, and some of the tests of the builders.
   * the code in array comparison is hard to follow given the amount of calls that they perform.
   * The implementation currently indirectly uses many of the `unsafe` APIs that we have (via `is_null`, `value` and the like), which makes it risky to operate and mutate.
   * Some code is being repeated.
   
   This PR:
   
   1. adds `impl PartialEq for dyn Array`, to allow `Array` comparison based on `Array::data` (main change)
   2. Makes array equality to only depend on `ArrayData`, i.e. it no longer depends on concrete array types (such as `PrimitiveArray` and related API) to perform comparisons.
   3. Significantly reduces the risk of panics and UB when composite arrays are of different types, by checking the types on `range` comparison
   4. Makes array equality be statically dispatched, via `match datatype`.
   5. DRY the code around array equality
   6. Fixes an error in equality of dictionary with equal values
   7. Added tests to equalities that were not tested (fixed binary, some edge cases of dictionaries)
   8. splits `equal.rs` in smaller, more manageable files.
   9. Removes `ArrayListOps`, since it it no longer needed
   10. Moves Json equality to its own module, for clarity.
   11. removes the need to have two functions per type to compare arrays.
   12. Adds, as part of the specification, the number of buffers and their respective width to datatypes. This was backported from #8401
   
   This is a draft as there is a significant design change here: all our code base so far focused on implementing the logic on the implementers of `Array` instead of `ArrayData`. This PR breaks that design, by using an Array-independent implementation, so that we can use it on `dyn Array == dyn Array`.
   
   Note that this does not implement `PartialEq` for `ArrayData`, only `dyn Array`, as different data does not a different array (due to the nullability). That implementation is being worked on #8200.
   
   IMO this PR significantly simplifies the code around array comparison, to the point where many implementations are 5 lines long.
   
   The first 3 commits are almost independent of the rest. The remaining are historically correct, but not very interesting to read. The functions are divided by modules for readability.
   
   All tests are there, plus new tests for some of the edge cases and untested arrays.
   
   This change is backward incompatible `array1.equals(&array2)` no longer works: use `array1 == array2` instead, which is the idiomatic way of comparing structs and trait objects in rust. :P


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on pull request #8541: ARROW-10402: [Rust] Refactor array equality

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #8541:
URL: https://github.com/apache/arrow/pull/8541#issuecomment-722936811


   I have simplified this code further to allow the compiler to optimize out some functions. The code is now 10-40% faster compared to current master. I updated the description accordingly.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on pull request #8541: ARROW-10402: [Rust] Refactor array equality

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #8541:
URL: https://github.com/apache/arrow/pull/8541#issuecomment-723470287


   @alamb , @nevi-me and @jhorstmann , thanks a lot for taking a close look into this, also for the amazing work yesterday and today on reviewing and merging stuff. Really impressive! 💯 
   
   I rebased this and resolved conflicts.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on a change in pull request #8541: ARROW-10402: [Rust] Refactor array equality

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on a change in pull request #8541:
URL: https://github.com/apache/arrow/pull/8541#discussion_r517105140



##########
File path: rust/arrow/src/array/equal/null.rs
##########
@@ -0,0 +1,31 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::array::ArrayData;
+
+#[inline]
+pub fn null_equal(
+    _lhs: &ArrayData,
+    _rhs: &ArrayData,
+    _lhs_start: usize,
+    _rhs_start: usize,
+    _len: usize,
+) -> bool {
+    // a null buffer's range is always true, as every element is by definition equal (to null).
+    // We only need to compare data_types
+    true

Review comment:
       Note that this function already assumes that me performed that comparison before-hand. We do that on the `equal::equal` and `equal::equal_range` functions.
   
   The reason I moved `base_equal` away from type-specific functions is that it allow us to have a single function per type for both range equality and equality.
   
   To avoid the risk of being misused (i.e. not checking `base_equal` first), I pushed a commit that privatizes all these functions, effectively making `equal::equal` the only public function of the module `equal`.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on a change in pull request #8541: ARROW-10402: [Rust] Refactor array equality

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on a change in pull request #8541:
URL: https://github.com/apache/arrow/pull/8541#discussion_r517811042



##########
File path: rust/arrow/src/array/equal/null.rs
##########
@@ -0,0 +1,31 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::array::ArrayData;
+
+#[inline]
+pub fn null_equal(
+    _lhs: &ArrayData,
+    _rhs: &ArrayData,
+    _lhs_start: usize,
+    _rhs_start: usize,
+    _len: usize,
+) -> bool {
+    // a null buffer's range is always true, as every element is by definition equal (to null).
+    // We only need to compare data_types
+    true

Review comment:
       I added a test for this, just for avoidance of doubts :)




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nevi-me closed pull request #8541: ARROW-10402: [Rust] Refactor array equality

Posted by GitBox <gi...@apache.org>.
nevi-me closed pull request #8541:
URL: https://github.com/apache/arrow/pull/8541


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nevi-me commented on a change in pull request #8541: ARROW-10402: [Rust] Refactor array equality

Posted by GitBox <gi...@apache.org>.
nevi-me commented on a change in pull request #8541:
URL: https://github.com/apache/arrow/pull/8541#discussion_r516989790



##########
File path: rust/arrow/src/array/equal/variable_size.rs
##########
@@ -0,0 +1,90 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::array::{ArrayData, OffsetSizeTrait};
+
+use super::utils::equal_len;
+
+fn offset_value_equal<T: OffsetSizeTrait>(
+    lhs_values: &[u8],
+    rhs_values: &[u8],
+    lhs_offsets: &[T],
+    rhs_offsets: &[T],
+    lhs_pos: usize,
+    rhs_pos: usize,
+    len: usize,
+) -> bool {
+    let lhs_start = lhs_offsets[lhs_pos].to_usize().unwrap();
+    let rhs_start = rhs_offsets[rhs_pos].to_usize().unwrap();
+    let lhs_len = lhs_offsets[lhs_pos + len] - lhs_offsets[lhs_pos];
+    let rhs_len = rhs_offsets[rhs_pos + len] - rhs_offsets[rhs_pos];
+
+    lhs_len == rhs_len
+        && equal_len(
+            lhs_values,
+            rhs_values,
+            lhs_start,
+            rhs_start,
+            lhs_len.to_usize().unwrap(),
+        )
+}
+
+pub fn variable_sized_equal<T: OffsetSizeTrait>(

Review comment:
       May you please add a doc, as it's a public function

##########
File path: rust/arrow/src/array/equal/null.rs
##########
@@ -0,0 +1,31 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::array::ArrayData;
+
+#[inline]
+pub fn null_equal(
+    _lhs: &ArrayData,
+    _rhs: &ArrayData,
+    _lhs_start: usize,
+    _rhs_start: usize,
+    _len: usize,
+) -> bool {
+    // a null buffer's range is always true, as every element is by definition equal (to null).
+    // We only need to compare data_types
+    true

Review comment:
       We also need to compare the lengths so we can achieve logical equality. `NullArray::new(3) != NullArray::new(4)`.
   If I write an IPC file with a single array for both those 2 arrays, it'll be unequal because we still encode lengths in the array's data.

##########
File path: rust/arrow/src/array/data.rs
##########
@@ -26,6 +26,15 @@ use crate::buffer::Buffer;
 use crate::datatypes::DataType;
 use crate::util::bit_util;
 
+fn count_nulls(null_bit_buffer: Option<&Buffer>, offset: usize, len: usize) -> usize {

Review comment:
       Perhaps we could inline this? I haven't peeked at what happens with the assembly of these small functions, but I've read somewhere in the past that it might be beneficial to always inline small functions.

##########
File path: rust/arrow/src/array/data.rs
##########
@@ -207,6 +210,56 @@ impl ArrayData {
 
         size
     }
+
+    /// Creates a zero-copy slice of itself. This creates a new [ArrayData]
+    /// with a different offset, len and a shifted null bitmap.
+    ///
+    /// # Panics
+    ///
+    /// Panics if `offset + length > self.len()`.
+    pub fn slice(&self, offset: usize, length: usize) -> ArrayData {
+        assert!((offset + length) <= self.len());
+
+        let mut new_data = self.clone();
+
+        new_data.len = length;
+        new_data.offset = offset + self.offset;
+
+        new_data.null_count =
+            count_nulls(new_data.null_buffer(), new_data.offset, new_data.len);
+
+        new_data
+    }
+
+    /// Returns the buffer `buffer` as a slice of type `T`. When the expected buffer is bit-packed,
+    /// the slice is not offset.
+    #[inline]
+    pub(super) fn buffer<T>(&self, buffer: usize) -> &[T] {
+        let values = unsafe { self.buffers[buffer].data().align_to::<T>() };
+        if values.0.len() != 0 || values.2.len() != 0 {
+            panic!("The buffer is not byte-aligned with its interpretation")
+        };
+        if self.data_type.width(buffer).unwrap() == 1 {
+            // bitpacked buffers are offset differently and it is the consumers that need to worry about.
+            &values.1
+        } else {
+            &values.1[self.offset..]
+        }
+    }
+
+    /// Returns the buffer `buffer` as a slice of type `T`. When the expected buffer is bit-packed,
+    /// the slice is not offset.
+    #[inline]
+    pub fn buffer_data(&self, buffer: usize) -> &[u8] {
+        let values = self.buffers[buffer].data();
+        let width = self.data_type.width(buffer).unwrap();
+        if width == 1 {
+            // bitpacked buffers are offset differently and it is the consumers that need to worry about.
+            &values
+        } else {
+            &values[self.offset * (width / 8)..]
+        }
+    }
 }
 
 impl PartialEq for ArrayData {

Review comment:
       My changes here muddy the water. While reviewing this, I started to clean this up. I'll try submit a PR on top of this, removing the asserts, and completing the implementation.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on a change in pull request #8541: ARROW-10402: [Rust] Refactor array equality

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #8541:
URL: https://github.com/apache/arrow/pull/8541#discussion_r519169437



##########
File path: rust/arrow/src/array/builder.rs
##########
@@ -3535,14 +3531,15 @@ mod tests {
             None,
             Some(6),
             Some(7),
+            // array.data() end
             Some(3),
             None,
             None,
             Some(6),
-        ])) as ArrayRef;
+        ]);
         assert_eq!(finished.len(), expected.len());
         assert_eq!(finished.null_count(), expected.null_count());
-        assert!(finished.equals(&(*expected)));
+        assert_eq!(finished, expected);

Review comment:
       this is certainly a lot nicer

##########
File path: rust/arrow/src/array/builder.rs
##########
@@ -531,7 +531,7 @@ impl<T: ArrowPrimitiveType> ArrayBuilder for PrimitiveBuilder<T> {
 
             for i in 0..len {
                 // account for offset as `ArrayData` does not

Review comment:
       this comment may need to be updated as the code seems to no longer account for offset

##########
File path: rust/arrow/src/array/equal/mod.rs
##########
@@ -0,0 +1,829 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Module containing functionality to compute array equality.
+//! This module uses [ArrayData] and does not
+//! depend on dynamic casting of `Array`.
+
+use super::{
+    array::BinaryOffsetSizeTrait, Array, ArrayData, FixedSizeBinaryArray,
+    GenericBinaryArray, GenericListArray, GenericStringArray, OffsetSizeTrait,
+    PrimitiveArray, StringOffsetSizeTrait, StructArray,
+};
+
+use crate::datatypes::{ArrowPrimitiveType, DataType, IntervalUnit};
+
+mod boolean;
+mod dictionary;
+mod fixed_binary;
+mod fixed_list;
+mod list;
+mod null;
+mod primitive;
+mod structure;
+mod utils;
+mod variable_size;
+
+// these methods assume the same type, len and null count.
+// For this reason, they are not exposed and are instead used
+// to build the generic functions below (`equal_range` and `equal`).
+use boolean::boolean_equal;
+use dictionary::dictionary_equal;
+use fixed_binary::fixed_binary_equal;
+use fixed_list::fixed_list_equal;
+use list::list_equal;
+use null::null_equal;
+use primitive::primitive_equal;
+use structure::struct_equal;
+use variable_size::variable_sized_equal;
+
+impl PartialEq for dyn Array {
+    fn eq(&self, other: &Self) -> bool {
+        equal(self.data().as_ref(), other.data().as_ref())
+    }
+}
+
+impl<T: Array> PartialEq<T> for dyn Array {
+    fn eq(&self, other: &T) -> bool {
+        equal(self.data().as_ref(), other.data().as_ref())
+    }
+}
+
+impl<T: ArrowPrimitiveType> PartialEq for PrimitiveArray<T> {
+    fn eq(&self, other: &PrimitiveArray<T>) -> bool {
+        equal(self.data().as_ref(), other.data().as_ref())
+    }
+}
+
+impl<OffsetSize: StringOffsetSizeTrait> PartialEq for GenericStringArray<OffsetSize> {
+    fn eq(&self, other: &Self) -> bool {
+        equal(self.data().as_ref(), other.data().as_ref())
+    }
+}
+
+impl<OffsetSize: BinaryOffsetSizeTrait> PartialEq for GenericBinaryArray<OffsetSize> {
+    fn eq(&self, other: &Self) -> bool {
+        equal(self.data().as_ref(), other.data().as_ref())
+    }
+}
+
+impl PartialEq for FixedSizeBinaryArray {
+    fn eq(&self, other: &Self) -> bool {
+        equal(self.data().as_ref(), other.data().as_ref())
+    }
+}
+
+impl<OffsetSize: OffsetSizeTrait> PartialEq for GenericListArray<OffsetSize> {
+    fn eq(&self, other: &Self) -> bool {
+        equal(self.data().as_ref(), other.data().as_ref())
+    }
+}
+
+impl PartialEq for StructArray {
+    fn eq(&self, other: &Self) -> bool {
+        equal(self.data().as_ref(), other.data().as_ref())
+    }
+}
+
+#[inline]
+fn equal_values(

Review comment:
       ```suggestion
   /// Compares two arrays for equality starting at `lhs_start` and `rhs_start`
   /// `lhs` and `rhs` *must* have the same data type
   fn equal_values(
   ```

##########
File path: rust/arrow/src/array/data.rs
##########
@@ -207,6 +206,41 @@ impl ArrayData {
 
         size
     }
+
+    /// Creates a zero-copy slice of itself. This creates a new [ArrayData]
+    /// with a different offset, len and a shifted null bitmap.
+    ///
+    /// # Panics
+    ///
+    /// Panics if `offset + length > self.len()`.
+    pub fn slice(&self, offset: usize, length: usize) -> ArrayData {
+        assert!((offset + length) <= self.len());
+
+        let mut new_data = self.clone();
+
+        new_data.len = length;
+        new_data.offset = offset + self.offset;
+
+        new_data.null_count =
+            count_nulls(new_data.null_buffer(), new_data.offset, new_data.len);
+
+        new_data
+    }
+
+    /// Returns the buffer `buffer` as a slice of type `T`. The slice is already offset.

Review comment:
       ```suggestion
       /// Returns self.buffers at index `buffer` as a slice of type `T` starting at self.offset
   ```

##########
File path: rust/arrow/src/array/data.rs
##########
@@ -207,6 +206,41 @@ impl ArrayData {
 
         size
     }
+
+    /// Creates a zero-copy slice of itself. This creates a new [ArrayData]
+    /// with a different offset, len and a shifted null bitmap.
+    ///
+    /// # Panics
+    ///
+    /// Panics if `offset + length > self.len()`.
+    pub fn slice(&self, offset: usize, length: usize) -> ArrayData {
+        assert!((offset + length) <= self.len());
+
+        let mut new_data = self.clone();
+
+        new_data.len = length;
+        new_data.offset = offset + self.offset;
+
+        new_data.null_count =
+            count_nulls(new_data.null_buffer(), new_data.offset, new_data.len);
+
+        new_data
+    }
+
+    /// Returns the buffer `buffer` as a slice of type `T`. The slice is already offset.
+    /// # Panics
+    /// This function panics if:
+    /// * the buffer is not byte-aligned with type T, or
+    /// * the datatype is `Boolean` (it corresponds to a bit-packed buffer where the offset is not applicable)
+    #[inline]
+    pub(super) fn buffer<T: ArrowNativeType>(&self, buffer: usize) -> &[T] {
+        let values = unsafe { self.buffers[buffer].data().align_to::<T>() };
+        if values.0.len() != 0 || values.2.len() != 0 {
+            panic!("The buffer is not byte-aligned with its interpretation")

Review comment:
       👍 for the panic

##########
File path: rust/arrow/src/array/builder.rs
##########
@@ -3999,7 +3996,7 @@ mod tests {
         let expected_list =
             FixedSizeListArray::from(Arc::new(expected_list_data) as ArrayDataRef);
         let expected_list = FixedSizeBinaryArray::from(expected_list);
-        // assert!(expected_list.values().equals(&*finished.values()));
+        // assert_eq!(expected_list.values(), finished.values());

Review comment:
       not that you did it, but I wonder why this code is commented out -- maybe your changes will have fixed it

##########
File path: rust/arrow/src/array/array.rs
##########
@@ -112,10 +112,10 @@ pub trait Array: fmt::Debug + Send + Sync + ArrayEqual + JsonEqual {
     /// // Make slice over the values [2, 3, 4]
     /// let array_slice = array.slice(1, 3);
     ///
-    /// assert!(array_slice.equals(&Int32Array::from(vec![2, 3, 4])));
+    /// assert_eq!(array_slice.as_ref(), &Int32Array::from(vec![2, 3, 4]));

Review comment:
       :heart




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nevi-me commented on a change in pull request #8541: ARROW-10402: [Rust] Refactor array equality

Posted by GitBox <gi...@apache.org>.
nevi-me commented on a change in pull request #8541:
URL: https://github.com/apache/arrow/pull/8541#discussion_r519115194



##########
File path: rust/arrow/src/array/equal/structure.rs
##########
@@ -0,0 +1,59 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::array::ArrayData;
+
+use super::equal_range;
+
+fn equal_values(
+    lhs: &ArrayData,
+    rhs: &ArrayData,
+    lhs_start: usize,
+    rhs_start: usize,
+    len: usize,
+) -> bool {
+    lhs.child_data()
+        .iter()
+        .zip(rhs.child_data())
+        .all(|(lhs_values, rhs_values)| {
+            equal_range(lhs_values, rhs_values, lhs_start, rhs_start, len)
+        })
+}
+
+pub(super) fn struct_equal(
+    lhs: &ArrayData,
+    rhs: &ArrayData,
+    lhs_start: usize,
+    rhs_start: usize,
+    len: usize,
+) -> bool {

Review comment:
       Something that we can address later, is that if a struct slot is null, we should carry that nullness across to its children. This is actually the problem that @carols10cents and I encountered on both parquet roundtrips and integration tests.
   
   The spec (http://arrow.apache.org/docs/format/Columnar.html#struct-layout) says:
   
   > While a struct does not have physical storage for each of its semantic slots (i.e. each scalar C-like struct), an entire struct slot can be set to null via the validity bitmap. Any of the child field arrays can have null values according to their respective independent validity bitmaps. This implies that for a particular struct slot the validity bitmap for the struct array might indicate a null slot when one or more of its child arrays has a non-null value in their corresponding slot. When reading the struct array the parent validity bitmap takes priority.
   
   I think we can address this in a separate PR, if we still find that `struct_equal` doesn't cover that edge case.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nevi-me commented on pull request #8541: ARROW-10402: [Rust] Refactor array equality

Posted by GitBox <gi...@apache.org>.
nevi-me commented on pull request #8541:
URL: https://github.com/apache/arrow/pull/8541#issuecomment-723471732


   > @alamb , @nevi-me and @jhorstmann , thanks a lot for taking a close look into this, also for the amazing work yesterday and today on reviewing and merging stuff. Really impressive! 💯
   > 
   > I rebased this and resolved conflicts.
   
   I'm super-excited, because this likely unlocks us on the failures that I've been getting on #8200.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #8541: ARROW-10402: [Rust] Refactor array equality

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #8541:
URL: https://github.com/apache/arrow/pull/8541#issuecomment-717427176


   https://issues.apache.org/jira/browse/ARROW-10402


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on a change in pull request #8541: ARROW-10402: [Rust] Refactor array equality

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on a change in pull request #8541:
URL: https://github.com/apache/arrow/pull/8541#discussion_r517095720



##########
File path: rust/arrow/src/array/data.rs
##########
@@ -207,6 +210,56 @@ impl ArrayData {
 
         size
     }
+
+    /// Creates a zero-copy slice of itself. This creates a new [ArrayData]
+    /// with a different offset, len and a shifted null bitmap.
+    ///
+    /// # Panics
+    ///
+    /// Panics if `offset + length > self.len()`.
+    pub fn slice(&self, offset: usize, length: usize) -> ArrayData {
+        assert!((offset + length) <= self.len());
+
+        let mut new_data = self.clone();
+
+        new_data.len = length;
+        new_data.offset = offset + self.offset;
+
+        new_data.null_count =
+            count_nulls(new_data.null_buffer(), new_data.offset, new_data.len);
+
+        new_data
+    }
+
+    /// Returns the buffer `buffer` as a slice of type `T`. When the expected buffer is bit-packed,
+    /// the slice is not offset.
+    #[inline]
+    pub(super) fn buffer<T>(&self, buffer: usize) -> &[T] {
+        let values = unsafe { self.buffers[buffer].data().align_to::<T>() };
+        if values.0.len() != 0 || values.2.len() != 0 {
+            panic!("The buffer is not byte-aligned with its interpretation")
+        };
+        if self.data_type.width(buffer).unwrap() == 1 {
+            // bitpacked buffers are offset differently and it is the consumers that need to worry about.
+            &values.1
+        } else {
+            &values.1[self.offset..]
+        }
+    }
+
+    /// Returns the buffer `buffer` as a slice of type `T`. When the expected buffer is bit-packed,
+    /// the slice is not offset.
+    #[inline]
+    pub fn buffer_data(&self, buffer: usize) -> &[u8] {
+        let values = self.buffers[buffer].data();
+        let width = self.data_type.width(buffer).unwrap();
+        if width == 1 {
+            // bitpacked buffers are offset differently and it is the consumers that need to worry about.
+            &values
+        } else {
+            &values[self.offset * (width / 8)..]
+        }
+    }
 }
 
 impl PartialEq for ArrayData {

Review comment:
       One Idea would be to use the equality code from this PR, but sprinkle it with `debug_assert_eq`, so that during debug, it is more clear where the difference is. An alternative is to use `assert_eq`, so that the two arrays are shown side-by-side (this only works for small arrays, though).




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on a change in pull request #8541: ARROW-10402: [Rust] Refactor array equality

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on a change in pull request #8541:
URL: https://github.com/apache/arrow/pull/8541#discussion_r517105719



##########
File path: rust/arrow/src/array/equal/variable_size.rs
##########
@@ -0,0 +1,90 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::array::{ArrayData, OffsetSizeTrait};
+
+use super::utils::equal_len;
+
+fn offset_value_equal<T: OffsetSizeTrait>(
+    lhs_values: &[u8],
+    rhs_values: &[u8],
+    lhs_offsets: &[T],
+    rhs_offsets: &[T],
+    lhs_pos: usize,
+    rhs_pos: usize,
+    len: usize,
+) -> bool {
+    let lhs_start = lhs_offsets[lhs_pos].to_usize().unwrap();
+    let rhs_start = rhs_offsets[rhs_pos].to_usize().unwrap();
+    let lhs_len = lhs_offsets[lhs_pos + len] - lhs_offsets[lhs_pos];
+    let rhs_len = rhs_offsets[rhs_pos + len] - rhs_offsets[rhs_pos];
+
+    lhs_len == rhs_len
+        && equal_len(
+            lhs_values,
+            rhs_values,
+            lhs_start,
+            rhs_start,
+            lhs_len.to_usize().unwrap(),
+        )
+}
+
+pub fn variable_sized_equal<T: OffsetSizeTrait>(

Review comment:
       I have privatized this function, as it assumes that the user already used `base_equal`. This reminded me to document the `equal::equal` function, which is the public one of the module.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jhorstmann commented on a change in pull request #8541: ARROW-10402: [Rust] Refactor array equality

Posted by GitBox <gi...@apache.org>.
jhorstmann commented on a change in pull request #8541:
URL: https://github.com/apache/arrow/pull/8541#discussion_r519187056



##########
File path: rust/arrow/src/array/data.rs
##########
@@ -121,7 +120,7 @@ impl ArrayData {
     /// Returns whether the element at index `i` is null
     pub fn is_null(&self, i: usize) -> bool {
         if let Some(ref b) = self.null_bitmap {
-            return !b.is_set(i);
+            return !b.is_set(self.offset + i);

Review comment:
       Thanks a lot, I think I have another ticket that would be fixed by this change




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org