You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/01/05 06:34:33 UTC

[GitHub] [arrow] jorgecarleitao commented on a change in pull request #9093: ARROW-11125: [Rust] Logical equality for list arrays

jorgecarleitao commented on a change in pull request #9093:
URL: https://github.com/apache/arrow/pull/9093#discussion_r551738974



##########
File path: rust/arrow/src/array/equal/structure.rs
##########
@@ -55,21 +56,24 @@ fn equal_values(
                     temp_lhs.as_ref()
                 }
             };
-            let rhs_merged_nulls = match (rhs_nulls, rhs_values.null_buffer()) {
-                (None, None) => None,
-                (None, Some(c)) => Some(c),
-                (Some(p), None) => Some(p),
-                (Some(p), Some(c)) => {
-                    let merged = (p & c).unwrap();
-                    temp_rhs = Some(merged);
-                    temp_rhs.as_ref()
-                }
-            };
+            // TODO: this is intentional, looking at which is the better option
+            let rhs_merged_nulls =
+                rhs.child_logical_null_buffer(rhs_nulls.cloned(), index);

Review comment:
       pass `rhs_values` instead of `index` here?

##########
File path: rust/arrow/src/array/data.rs
##########
@@ -136,6 +137,84 @@ impl ArrayData {
         &self.null_bitmap
     }
 
+    /// Computes the logical validity bitmap of the array data using the
+    /// parent's array data. The parent should be a list or struct, else
+    /// the logical bitmap of the array is returned unaltered.
+    ///
+    /// Parent data is passed along with the parent's logical bitmap, as
+    /// nested arrays could have a logical bitmap different to the physical
+    /// one on the `ArrayData`.
+    ///
+    /// Safety

Review comment:
       `# safety` is used for `unsafe`, this is `# Panics`.

##########
File path: rust/arrow/src/array/equal/structure.rs
##########
@@ -38,12 +38,13 @@ fn equal_values(
     len: usize,
 ) -> bool {
     let mut temp_lhs: Option<Buffer> = None;
-    let mut temp_rhs: Option<Buffer> = None;
+    // let mut temp_rhs: Option<Buffer> = None;
 
     lhs.child_data()
         .iter()
         .zip(rhs.child_data())
-        .all(|(lhs_values, rhs_values)| {
+        .enumerate()

Review comment:
       which allows to remove this `enumerate` here.

##########
File path: rust/arrow/src/array/equal/primitive.rs
##########
@@ -32,7 +36,10 @@ pub(super) fn primitive_equal<T>(
     let lhs_values = &lhs.buffers()[0].as_slice()[lhs.offset() * byte_width..];
     let rhs_values = &rhs.buffers()[0].as_slice()[rhs.offset() * byte_width..];
 
-    if lhs.null_count() == 0 && rhs.null_count() == 0 {
+    let lhs_null_count = count_nulls(lhs_nulls, lhs_start, len);
+    let rhs_null_count = count_nulls(rhs_nulls, rhs_start, len);
+
+    if lhs_null_count == 0 && rhs_null_count == 0 {

Review comment:
       This is even better: if the buffer only has 1s outside of the range, we short-circuit to the fast case. 👍 

##########
File path: rust/arrow/src/array/equal/mod.rs
##########
@@ -146,118 +146,103 @@ fn equal_values(
     rhs_start: usize,
     len: usize,
 ) -> bool {
-    // compute the nested buffer of the parent and child
-    // if the array has no parent, the child is computed with itself
-    #[allow(unused_assignments)]
-    let mut temp_lhs: Option<Buffer> = None;
-    #[allow(unused_assignments)]
-    let mut temp_rhs: Option<Buffer> = None;
-    let lhs_merged_nulls = match (lhs_nulls, lhs.null_buffer()) {
-        (None, None) => None,
-        (None, Some(c)) => Some(c),
-        (Some(p), None) => Some(p),
-        (Some(p), Some(c)) => {
-            let merged = (p & c).unwrap();
-            temp_lhs = Some(merged);
-            temp_lhs.as_ref()
-        }
-    };

Review comment:
       I liked this more. A bit more explicit what was happening here.

##########
File path: rust/arrow/src/array/data.rs
##########
@@ -136,6 +137,84 @@ impl ArrayData {
         &self.null_bitmap
     }
 
+    /// Computes the logical validity bitmap of the array data using the
+    /// parent's array data. The parent should be a list or struct, else
+    /// the logical bitmap of the array is returned unaltered.
+    ///
+    /// Parent data is passed along with the parent's logical bitmap, as
+    /// nested arrays could have a logical bitmap different to the physical
+    /// one on the `ArrayData`.
+    ///
+    /// Safety
+    ///
+    /// As we index into [`ArrayData::child_data`], this function panics if
+    /// array data is not a nested type, as it will not have child data.
+    pub fn child_logical_null_buffer(

Review comment:
       I would have placed this function outside of `impl ArrayData`, add the corresponding arguments, and place it under `equal/` for now. Alternatively, use `pub(crate)` to not make this public.

##########
File path: rust/arrow/src/array/data.rs
##########
@@ -136,6 +137,84 @@ impl ArrayData {
         &self.null_bitmap
     }
 
+    /// Computes the logical validity bitmap of the array data using the
+    /// parent's array data. The parent should be a list or struct, else
+    /// the logical bitmap of the array is returned unaltered.
+    ///
+    /// Parent data is passed along with the parent's logical bitmap, as
+    /// nested arrays could have a logical bitmap different to the physical
+    /// one on the `ArrayData`.
+    ///
+    /// Safety
+    ///
+    /// As we index into [`ArrayData::child_data`], this function panics if
+    /// array data is not a nested type, as it will not have child data.
+    pub fn child_logical_null_buffer(
+        &self,
+        logical_null_buffer: Option<Buffer>,
+        child_index: usize,
+    ) -> Option<Buffer> {
+        // This function should only be called when having nested data types.
+        // However, as a convenience, we return the parent's logical buffer if
+        // we do not encounter a nested type.
+        let child_data = self.child_data().get(child_index).unwrap();
+
+        // TODO: rationalise this logic, prefer not creating populated bitmaps, use Option matching
+        // I found taking a Bitmap that's populated to be more convenient, but I was more concerned first
+        // about accuracy, so I'll explore using Option<&Bitmap> directly.
+        let parent_bitmap = logical_null_buffer.map(Bitmap::from).unwrap_or_else(|| {
+            let ceil = bit_util::ceil(self.len(), 8);
+            Bitmap::from(Buffer::from(vec![0b11111111; ceil]))
+        });
+        let self_null_bitmap = child_data.null_bitmap().clone().unwrap_or_else(|| {
+            let ceil = bit_util::ceil(child_data.len(), 8);
+            Bitmap::from(Buffer::from(vec![0b11111111; ceil]))
+        });
+        match self.data_type() {
+            DataType::List(_) => Some(logical_list_bitmap::<i32>(
+                self,
+                parent_bitmap,
+                self_null_bitmap,
+            )),
+            DataType::LargeList(_) => Some(logical_list_bitmap::<i64>(
+                self,
+                parent_bitmap,
+                self_null_bitmap,
+            )),
+            DataType::FixedSizeList(_, len) => {
+                let len = *len as usize;
+                let array_len = self.len();
+                let array_offset = self.offset();
+                let bitmap_len = bit_util::ceil(array_len * len, 8);
+                let mut buffer =
+                    MutableBuffer::new(bitmap_len).with_bitset(bitmap_len, false);
+                let mut null_slice = buffer.as_slice_mut();
+                (array_offset..array_len + array_offset).for_each(|index| {
+                    let start = index * len;
+                    let end = start + len;
+                    let mask = parent_bitmap.is_set(index);
+                    (start..end).for_each(|child_index| {
+                        if mask && self_null_bitmap.is_set(child_index) {
+                            bit_util::set_bit(&mut null_slice, child_index);
+                        }
+                    });
+                });
+                Some(buffer.into())
+            }
+            DataType::Struct(_) => (&parent_bitmap & &self_null_bitmap)
+                .ok()
+                .map(|bitmap| bitmap.bits),
+            DataType::Union(_) => {
+                panic!("Logical equality not yet implemented for union arrays")

Review comment:
       `unimplemented`

##########
File path: rust/arrow/src/array/data.rs
##########
@@ -136,6 +137,84 @@ impl ArrayData {
         &self.null_bitmap
     }
 
+    /// Computes the logical validity bitmap of the array data using the
+    /// parent's array data. The parent should be a list or struct, else
+    /// the logical bitmap of the array is returned unaltered.
+    ///
+    /// Parent data is passed along with the parent's logical bitmap, as
+    /// nested arrays could have a logical bitmap different to the physical
+    /// one on the `ArrayData`.
+    ///
+    /// Safety
+    ///
+    /// As we index into [`ArrayData::child_data`], this function panics if
+    /// array data is not a nested type, as it will not have child data.
+    pub fn child_logical_null_buffer(
+        &self,
+        logical_null_buffer: Option<Buffer>,
+        child_index: usize,

Review comment:
       I would have pass `&ArrayData` instead of `child_index`. This is available in all instances where this is called.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org