You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/04/19 21:40:32 UTC

[GitHub] [arrow-rs] tustvold opened a new pull request, #1589: Fix ListArray and StructArray equality (#626)

tustvold opened a new pull request, #1589:
URL: https://github.com/apache/arrow-rs/pull/1589

   # Which issue does this PR close?
   
   Closes #626.
   
   # Rationale for this change
    
   Logically equal ListArray should compare equal, also fixes a bug in comparisons of null arrays with non-zero offsets
   
   # What changes are included in this PR?
   
   Alters ListArray equality logic to correctly handle non-empty null slices
   
   Alters equal_nulls to correctly handle offset arrays
   
   # Are there any user-facing changes?
   
   Array equality should be more correct
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] viirya commented on a diff in pull request #1589: Fix Null Mask Handling in ArrayData And UnionArray

Posted by GitBox <gi...@apache.org>.
viirya commented on code in PR #1589:
URL: https://github.com/apache/arrow-rs/pull/1589#discussion_r854817015


##########
arrow/src/array/array_union.rs:
##########
@@ -522,29 +516,29 @@ mod tests {
             match i {
                 0 => {
                     let slot = slot.as_any().downcast_ref::<Int32Array>().unwrap();
-                    assert!(!union.is_null(i));
+                    assert!(!slot.is_null(0));

Review Comment:
   I think it is good to check null in the current slot, but does `union.is_null(index)` not work anymore? Can we check both?



##########
arrow/src/array/array_union.rs:
##########
@@ -886,4 +876,12 @@ mod tests {
             }
         }
     }
+
+    #[test]
+    fn test_type_check() {
+        let mut builder = UnionBuilder::new_sparse(2);
+        builder.append::<Float32Type>("a", 1.0).unwrap();
+        let err = builder.append::<Int32Type>("a", 1).unwrap_err().to_string();
+        assert!(err.contains("Attempt to write col \"a\" with type Int32 doesn't match existing type Float32"), "{}", err);

Review Comment:
   👍 



##########
arrow/src/array/builder.rs:
##########
@@ -1895,22 +1895,18 @@ struct FieldData {
     ///  The number of array slots represented by the buffer
     slots: usize,
     /// A builder for the bitmap if required (for Sparse Unions)
-    bitmap_builder: Option<BooleanBufferBuilder>,
+    bitmap_builder: BooleanBufferBuilder,

Review Comment:
   The comment seems out-of-date now?



##########
arrow/src/array/builder.rs:
##########
@@ -2182,16 +2152,10 @@ impl UnionBuilder {
                 .into();
             let arr_data_builder = ArrayDataBuilder::new(data_type.clone())
                 .add_buffer(buffer)
-                .len(slots);
+                .len(slots)
+                .null_bit_buffer(bitmap_builder.finish());
             //                .build();
-            let arr_data_ref = unsafe {
-                match bitmap_builder {
-                    Some(mut bb) => arr_data_builder
-                        .null_bit_buffer(bb.finish())
-                        .build_unchecked(),
-                    None => arr_data_builder.build_unchecked(),
-                }

Review Comment:
   more clear now.



##########
arrow/src/array/equal/boolean.rs:
##########
@@ -16,25 +16,22 @@
 // under the License.
 
 use crate::array::{data::count_nulls, ArrayData};
-use crate::buffer::Buffer;
 use crate::util::bit_util::get_bit;
 
 use super::utils::{equal_bits, equal_len};
 
 pub(super) fn boolean_equal(
     lhs: &ArrayData,
     rhs: &ArrayData,
-    lhs_nulls: Option<&Buffer>,
-    rhs_nulls: Option<&Buffer>,
     mut lhs_start: usize,
     mut rhs_start: usize,
     mut len: usize,
 ) -> bool {
     let lhs_values = lhs.buffers()[0].as_slice();
     let rhs_values = rhs.buffers()[0].as_slice();
 
-    let lhs_null_count = count_nulls(lhs_nulls, lhs_start, len);
-    let rhs_null_count = count_nulls(rhs_nulls, rhs_start, len);
+    let lhs_null_count = count_nulls(lhs.null_buffer(), lhs_start + lhs.offset(), len);
+    let rhs_null_count = count_nulls(rhs.null_buffer(), rhs_start + rhs.offset(), len);

Review Comment:
   Hmm, do we assume `offset` is zero previously?



##########
arrow/src/array/equal/mod.rs:
##########
@@ -282,8 +282,7 @@ fn equal_range(
     rhs_start: usize,
     len: usize,
 ) -> bool {
-    utils::base_equal(lhs, rhs)

Review Comment:
   Actually `equal` already does `utils::base_equal`. I guess it is okay to remove it from here, although I'm not pretty sure.



##########
arrow/src/array/builder.rs:
##########
@@ -2182,16 +2152,10 @@ impl UnionBuilder {
                 .into();
             let arr_data_builder = ArrayDataBuilder::new(data_type.clone())
                 .add_buffer(buffer)
-                .len(slots);
+                .len(slots)
+                .null_bit_buffer(bitmap_builder.finish());
             //                .build();

Review Comment:
   `// .build();` seems redundant and can be removed?



##########
arrow/src/array/data.rs:
##########
@@ -1308,21 +1320,28 @@ fn layout(data_type: &DataType) -> DataTypeLayout {
 struct DataTypeLayout {
     /// A vector of buffer layout specifications, one for each expected buffer
     pub buffers: Vec<BufferSpec>,
+
+    /// Can contain a null bitmask
+    pub can_contain_null_mask: bool,

Review Comment:
   Although the C++ structure definition doesn't have this, I think it is fine to have it here.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on pull request #1589: Fix ListArray and StructArray equality (#626)

Posted by GitBox <gi...@apache.org>.
tustvold commented on PR #1589:
URL: https://github.com/apache/arrow-rs/pull/1589#issuecomment-1103205474

   More stuff is broken :sob: 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] viirya commented on pull request #1589: Fix Null Mask Handling in ArrayData And UnionArray

Posted by GitBox <gi...@apache.org>.
viirya commented on PR #1589:
URL: https://github.com/apache/arrow-rs/pull/1589#issuecomment-1104201677

   Thank you @tustvold for fixing these. I will begin to review this today.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] alamb merged pull request #1589: Fix Null Mask Handling in ArrayData And UnionArray

Posted by GitBox <gi...@apache.org>.
alamb merged PR #1589:
URL: https://github.com/apache/arrow-rs/pull/1589


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on a diff in pull request #1589: Fix ListArray and StructArray equality (#626)

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #1589:
URL: https://github.com/apache/arrow-rs/pull/1589#discussion_r853513902


##########
arrow/src/array/equal/utils.rs:
##########
@@ -47,8 +47,13 @@ pub(super) fn equal_nulls(
     rhs_start: usize,
     len: usize,
 ) -> bool {
-    let lhs_null_count = count_nulls(lhs_nulls, lhs_start, len);
-    let rhs_null_count = count_nulls(rhs_nulls, rhs_start, len);
+    let lhs_null_count = count_nulls(lhs_nulls, lhs_start + lhs.offset(), len);

Review Comment:
   equal_bits a few lines is correctly taking into account the offset, but this line wasn't => excitement



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on a diff in pull request #1589: Fix Null Mask Handling in ArrayData And UnionArray

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #1589:
URL: https://github.com/apache/arrow-rs/pull/1589#discussion_r854871960


##########
arrow/src/array/equal/utils.rs:
##########
@@ -240,98 +233,3 @@ pub(super) fn union_child_logical_null_buffer(
         }
     }
 }
-
-// Calculate a list child's logical bitmap/buffer
-#[inline]
-fn logical_list_bitmap<OffsetSize: OffsetSizeTrait>(

Review Comment:
   We only call equal_ranges on the non null ranges



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on a diff in pull request #1589: Fix ListArray and StructArray equality (#626)

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #1589:
URL: https://github.com/apache/arrow-rs/pull/1589#discussion_r853510330


##########
arrow/src/array/equal/utils.rs:
##########
@@ -183,10 +178,8 @@ pub(super) fn child_logical_null_buffer(
             let array_offset = parent_data.offset();
             let mut buffer = MutableBuffer::new_null(parent_len);
             let null_slice = buffer.as_slice_mut();
-            (0..parent_len).for_each(|index| {
-                if parent_bitmap.is_set(index + array_offset)
-                    && self_null_bitmap.is_set(index + array_offset)
-                {
+            (array_offset..parent_len + array_offset).for_each(|index| {

Review Comment:
   As the Buffer is interchangeable with the buffer from ArrayData it needs to be created with the array offset, see FixedSizeList above



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on a diff in pull request #1589: Fix ListArray and StructArray equality (#626)

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #1589:
URL: https://github.com/apache/arrow-rs/pull/1589#discussion_r853514135


##########
arrow/src/array/equal/utils.rs:
##########
@@ -137,16 +142,6 @@ pub(super) fn child_logical_null_buffer(
         Bitmap::from(Buffer::from(vec![0b11111111; ceil]))
     });
     match parent_data.data_type() {
-        DataType::List(_) | DataType::Map(_, _) => Some(logical_list_bitmap::<i32>(

Review Comment:
   No longer needed, as null slices are skipped over in parent



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on a diff in pull request #1589: Fix Null Mask Handling in ArrayData And UnionArray

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #1589:
URL: https://github.com/apache/arrow-rs/pull/1589#discussion_r854874177


##########
arrow/src/array/array_union.rs:
##########
@@ -522,29 +516,29 @@ mod tests {
             match i {
                 0 => {
                     let slot = slot.as_any().downcast_ref::<Int32Array>().unwrap();
-                    assert!(!union.is_null(i));
+                    assert!(!slot.is_null(0));

Review Comment:
   The nullability is only a property of the slot, and not the Union itself, at least that is my interpretation of the specification



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] alamb commented on pull request #1589: Fix Null Mask Handling in ArrayData And UnionArray

Posted by GitBox <gi...@apache.org>.
alamb commented on PR #1589:
URL: https://github.com/apache/arrow-rs/pull/1589#issuecomment-1112483164

   🚀 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] jhorstmann commented on pull request #1589: Fix Null Mask Handling in ArrayData And UnionArray

Posted by GitBox <gi...@apache.org>.
jhorstmann commented on PR #1589:
URL: https://github.com/apache/arrow-rs/pull/1589#issuecomment-1107435883

   Looks good. This is some really gnarly code and last time I tried to improve it I gave up since it was never clear whether the start variable already include offsets or where the offsets have to be applied.
   
   I have two tests in #1499 where `assert_eq` was not working, the one comparing list arrays works with these changes, I'm currently looking into the other one comparing two struct arrays.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on a diff in pull request #1589: Fix ListArray and StructArray equality (#626)

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #1589:
URL: https://github.com/apache/arrow-rs/pull/1589#discussion_r853512785


##########
arrow/src/array/equal/mod.rs:
##########
@@ -282,8 +282,7 @@ fn equal_range(
     rhs_start: usize,
     len: usize,
 ) -> bool {
-    utils::base_equal(lhs, rhs)

Review Comment:
   `utils::base_equal` does two things:
   
   1. Compares that the data types are equal
   2. Compares that the lengths are equal
   
   The first is not necessarily correct, as MapArrays are agnostic to field names. The second is also not necessarily correct in the presence of non-empty null slices. I therefore decided to just remove this and simplify the code. 
   
   Ultimately it isn't clear to me what this check could catch that wouldn't represent some violation of a variant somewhere else. If a nested type is inconsistent with the length, type, etc... of its parent then you've got UB anyway



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on a diff in pull request #1589: Fix ListArray and StructArray equality (#626)

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #1589:
URL: https://github.com/apache/arrow-rs/pull/1589#discussion_r853510515


##########
arrow/src/array/equal/utils.rs:
##########
@@ -240,98 +233,3 @@ pub(super) fn union_child_logical_null_buffer(
         }
     }
 }
-
-// Calculate a list child's logical bitmap/buffer
-#[inline]
-fn logical_list_bitmap<OffsetSize: OffsetSizeTrait>(

Review Comment:
   We no longer need this code, as we simply skip over null slices



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] jhorstmann commented on pull request #1589: Fix Null Mask Handling in ArrayData And UnionArray

Posted by GitBox <gi...@apache.org>.
jhorstmann commented on PR #1589:
URL: https://github.com/apache/arrow-rs/pull/1589#issuecomment-1107441412

   After the suggestion above, my remaining test failure was because of the `nullable` flag on `StructArray` fields and not directly related to the equality comparison. Whether that flag needs to be equal for two struct or map arrays to be considered equal could be discussed, but it seems also some kernels like `nullif` would need to update the flag to be consistent.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] alamb commented on pull request #1589: Fix Null Mask Handling in ArrayData And UnionArray

Posted by GitBox <gi...@apache.org>.
alamb commented on PR #1589:
URL: https://github.com/apache/arrow-rs/pull/1589#issuecomment-1105657516

   I will try and review this more carefully tomorrow. cc @jhorstmann who might also be interested


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] alamb commented on a diff in pull request #1589: Fix Null Mask Handling in ArrayData And UnionArray

Posted by GitBox <gi...@apache.org>.
alamb commented on code in PR #1589:
URL: https://github.com/apache/arrow-rs/pull/1589#discussion_r861207025


##########
arrow/src/array/builder.rs:
##########
@@ -2070,39 +2061,13 @@ impl UnionBuilder {
             fields: HashMap::default(),
             type_id_builder: Int8BufferBuilder::new(capacity),
             value_offset_builder: None,
-            bitmap_builder: None,
         }
     }
 
     /// Appends a null to this builder.
     #[inline]
-    pub fn append_null(&mut self) -> Result<()> {
-        if self.bitmap_builder.is_none() {
-            let mut builder = BooleanBufferBuilder::new(self.len + 1);
-            for _ in 0..self.len {
-                builder.append(true);
-            }
-            self.bitmap_builder = Some(builder)
-        }
-        self.bitmap_builder
-            .as_mut()
-            .expect("Cannot be None")
-            .append(false);
-
-        self.type_id_builder.append(i8::default());
-
-        match &mut self.value_offset_builder {
-            // Handle dense union
-            Some(value_offset_builder) => value_offset_builder.append(i32::default()),
-            // Handle sparse union
-            None => {
-                for (_, fd) in self.fields.iter_mut() {
-                    fd.append_null_dynamic()?;
-                }
-            }
-        };
-        self.len += 1;
-        Ok(())
+    pub fn append_null<T: ArrowPrimitiveType>(&mut self, type_name: &str) -> Result<()> {

Review Comment:
   Makes sense -- tried to encode that insight in comments as part of https://github.com/apache/arrow-rs/pull/1628



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on pull request #1589: Fix Null Mask Handling in ArrayData And UnionArray

Posted by GitBox <gi...@apache.org>.
tustvold commented on PR #1589:
URL: https://github.com/apache/arrow-rs/pull/1589#issuecomment-1107444994

   > Whether that flag needs to be equal for two struct or map arrays to be considered equal could be discussed
   
   Yes I have found the handling of this field to be very inconsistent, imo it shouldn't be possible to associate a null bitmask with a field that says it isn't nullable. Would you be able to create a ticket?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on a diff in pull request #1589: Fix Null Mask Handling in ArrayData And UnionArray

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #1589:
URL: https://github.com/apache/arrow-rs/pull/1589#discussion_r854874816


##########
arrow/src/array/equal/boolean.rs:
##########
@@ -16,25 +16,22 @@
 // under the License.
 
 use crate::array::{data::count_nulls, ArrayData};
-use crate::buffer::Buffer;
 use crate::util::bit_util::get_bit;
 
 use super::utils::{equal_bits, equal_len};
 
 pub(super) fn boolean_equal(
     lhs: &ArrayData,
     rhs: &ArrayData,
-    lhs_nulls: Option<&Buffer>,
-    rhs_nulls: Option<&Buffer>,
     mut lhs_start: usize,
     mut rhs_start: usize,
     mut len: usize,
 ) -> bool {
     let lhs_values = lhs.buffers()[0].as_slice();
     let rhs_values = rhs.buffers()[0].as_slice();
 
-    let lhs_null_count = count_nulls(lhs_nulls, lhs_start, len);
-    let rhs_null_count = count_nulls(rhs_nulls, rhs_start, len);
+    let lhs_null_count = count_nulls(lhs.null_buffer(), lhs_start + lhs.offset(), len);
+    let rhs_null_count = count_nulls(rhs.null_buffer(), rhs_start + rhs.offset(), len);

Review Comment:
   Yes, in lots of places 😅



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] codecov-commenter commented on pull request #1589: Fix Null Mask Handling in ArrayData And UnionArray

Posted by GitBox <gi...@apache.org>.
codecov-commenter commented on PR #1589:
URL: https://github.com/apache/arrow-rs/pull/1589#issuecomment-1103814127

   # [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1589?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#1589](https://codecov.io/gh/apache/arrow-rs/pull/1589?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (feda7b6) into [master](https://codecov.io/gh/apache/arrow-rs/commit/786792c750b893781939d8d507b464db09ccc634?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (786792c) will **increase** coverage by `0.15%`.
   > The diff coverage is `88.92%`.
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #1589      +/-   ##
   ==========================================
   + Coverage   82.88%   83.03%   +0.15%     
   ==========================================
     Files         193      193              
     Lines       55307    55221      -86     
   ==========================================
   + Hits        45839    45853      +14     
   + Misses       9468     9368     -100     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow-rs/pull/1589?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [arrow/src/ipc/writer.rs](https://codecov.io/gh/apache/arrow-rs/pull/1589/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2lwYy93cml0ZXIucnM=) | `81.55% <ø> (-0.25%)` | :arrow_down: |
   | [arrow/src/array/data.rs](https://codecov.io/gh/apache/arrow-rs/pull/1589/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FycmF5L2RhdGEucnM=) | `83.05% <50.00%> (-0.20%)` | :arrow_down: |
   | [arrow/src/array/equal/mod.rs](https://codecov.io/gh/apache/arrow-rs/pull/1589/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FycmF5L2VxdWFsL21vZC5ycw==) | `96.09% <70.76%> (+2.10%)` | :arrow_up: |
   | [arrow/src/array/transform/union.rs](https://codecov.io/gh/apache/arrow-rs/pull/1589/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FycmF5L3RyYW5zZm9ybS91bmlvbi5ycw==) | `83.33% <78.94%> (+8.33%)` | :arrow_up: |
   | [arrow/src/array/builder.rs](https://codecov.io/gh/apache/arrow-rs/pull/1589/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FycmF5L2J1aWxkZXIucnM=) | `86.79% <94.11%> (+0.10%)` | :arrow_up: |
   | [arrow/src/array/equal/list.rs](https://codecov.io/gh/apache/arrow-rs/pull/1589/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FycmF5L2VxdWFsL2xpc3QucnM=) | `95.77% <95.55%> (+16.39%)` | :arrow_up: |
   | [arrow/src/array/array\_union.rs](https://codecov.io/gh/apache/arrow-rs/pull/1589/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FycmF5L2FycmF5X3VuaW9uLnJz) | `90.71% <96.55%> (ø)` | |
   | [arrow/src/array/equal/boolean.rs](https://codecov.io/gh/apache/arrow-rs/pull/1589/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FycmF5L2VxdWFsL2Jvb2xlYW4ucnM=) | `97.14% <100.00%> (ø)` | |
   | [arrow/src/array/equal/decimal.rs](https://codecov.io/gh/apache/arrow-rs/pull/1589/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FycmF5L2VxdWFsL2RlY2ltYWwucnM=) | `100.00% <100.00%> (ø)` | |
   | [arrow/src/array/equal/dictionary.rs](https://codecov.io/gh/apache/arrow-rs/pull/1589/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FycmF5L2VxdWFsL2RpY3Rpb25hcnkucnM=) | `100.00% <100.00%> (ø)` | |
   | ... and [20 more](https://codecov.io/gh/apache/arrow-rs/pull/1589/diff?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1589?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1589?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [786792c...feda7b6](https://codecov.io/gh/apache/arrow-rs/pull/1589?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on a diff in pull request #1589: Fix ListArray and StructArray equality (#626)

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #1589:
URL: https://github.com/apache/arrow-rs/pull/1589#discussion_r853512785


##########
arrow/src/array/equal/mod.rs:
##########
@@ -282,8 +282,7 @@ fn equal_range(
     rhs_start: usize,
     len: usize,
 ) -> bool {
-    utils::base_equal(lhs, rhs)

Review Comment:
   `utils::base_equal` does two things:
   
   1. Compares that the data types are equal
   2. Compares that the lengths are equal
   
   The first is not necessarily correct, as MapArrays are agnostic to field names. The second is also not necessarily correct in the presence of non-empty null slices. I therefore decided to just remove this and simplify the code. 
   
   Ultimately it isn't clear to me what this check could catch that wouldn't represent some violation of a variant somewhere else. If a child of a StructArray doesn't have the same length as the parent, or the corresponding data type, etc... you're into UB anyway.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on pull request #1589: Fix Null Mask Handling in ArrayData And UnionArray

Posted by GitBox <gi...@apache.org>.
tustvold commented on PR #1589:
URL: https://github.com/apache/arrow-rs/pull/1589#issuecomment-1103797885

   I will continue to add tests, and file bugs for the various things this fixes, but I think the core is ready for review now


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] jhorstmann commented on a diff in pull request #1589: Fix Null Mask Handling in ArrayData And UnionArray

Posted by GitBox <gi...@apache.org>.
jhorstmann commented on code in PR #1589:
URL: https://github.com/apache/arrow-rs/pull/1589#discussion_r856871037


##########
arrow/src/array/equal/structure.rs:
##########
@@ -84,7 +63,7 @@ pub(super) fn struct_equal(
 
             lhs_is_null
                 || (lhs_is_null == rhs_is_null)
-                    && equal_values(lhs, rhs, lhs_nulls, rhs_nulls, lhs_pos, rhs_pos, 1)
+                    && equal_child_values(lhs, rhs, lhs_pos, rhs_pos, 1)

Review Comment:
   I think this needs the same logic as for list arrays:
   
   ```suggestion
               if lhs_is_null != rhs_is_null {
                   return false;
               }
   
               lhs_is_null || equal_child_values(lhs, rhs, lhs_pos, rhs_pos, 1)
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] viirya commented on a diff in pull request #1589: Fix Null Mask Handling in ArrayData And UnionArray

Posted by GitBox <gi...@apache.org>.
viirya commented on code in PR #1589:
URL: https://github.com/apache/arrow-rs/pull/1589#discussion_r854852546


##########
arrow/src/array/equal/utils.rs:
##########
@@ -240,98 +233,3 @@ pub(super) fn union_child_logical_null_buffer(
         }
     }
 }
-
-// Calculate a list child's logical bitmap/buffer
-#[inline]
-fn logical_list_bitmap<OffsetSize: OffsetSizeTrait>(

Review Comment:
   What you mean we skip over null slices?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] jhorstmann commented on pull request #1589: Fix Null Mask Handling in ArrayData And UnionArray

Posted by GitBox <gi...@apache.org>.
jhorstmann commented on PR #1589:
URL: https://github.com/apache/arrow-rs/pull/1589#issuecomment-1107475810

   
   > Yes I have found the handling of this field to be very inconsistent, imo it shouldn't be possible to associate a null bitmask with a field that says it isn't nullable. Would you be able to create a ticket?
   
   I created #1611 for validating the `nullable` flag on `StructArray`. Not sure how good my description there is, feel free to add more information or other ideas.
   
   > I presume you mean that the previous behaviour was wrong, and is now fixed, and not that this has broken behaviour that was previously correct?
   
   Sorry, my description probably was not very clear. In one test I [had to compare individual elements of a ListArray](https://github.com/apache/arrow-rs/pull/1499/files#diff-036c91a4b91e350837448f27da4ac4c0b43cfb477571330ac1f37654b0e2c172R1510 ) because the previous equality implementation did not work with different offsets. With the changes from this PR that test can now use a simple `assert_eq`.
   
   The other test with `StructArray` now also works with these changes and `assert_eq`, the other missing piece was explicitly creating the array with nullable fields.
   
   Whether we should change the comparision semantic to exclude the nullable flag is probably out of scope for this PR, I don't have a strong opinion either way.
   
   A big "Thank You" for diving into this code and fixing it!
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] alamb commented on pull request #1589: Fix Null Mask Handling in ArrayData And UnionArray

Posted by GitBox <gi...@apache.org>.
alamb commented on PR #1589:
URL: https://github.com/apache/arrow-rs/pull/1589#issuecomment-1112162974

   After some thought, I would like to merge this PR to handle all the undefined behavior bugs
   
   While there are more potential follow on things to do to improve union and null handling, this seems like a significant improvement. 
   
   Unless there are any objections, I will plan to merge it later today


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] alamb commented on a diff in pull request #1589: Fix Null Mask Handling in ArrayData And UnionArray

Posted by GitBox <gi...@apache.org>.
alamb commented on code in PR #1589:
URL: https://github.com/apache/arrow-rs/pull/1589#discussion_r860755500


##########
arrow/src/array/builder.rs:
##########
@@ -1894,23 +1894,19 @@ struct FieldData {
     values_buffer: Option<MutableBuffer>,
     ///  The number of array slots represented by the buffer
     slots: usize,
-    /// A builder for the bitmap if required (for Sparse Unions)
-    bitmap_builder: Option<BooleanBufferBuilder>,
+    /// A builder for the null bitmap

Review Comment:
   👍 



##########
arrow/src/array/builder.rs:
##########
@@ -2070,39 +2061,13 @@ impl UnionBuilder {
             fields: HashMap::default(),
             type_id_builder: Int8BufferBuilder::new(capacity),
             value_offset_builder: None,
-            bitmap_builder: None,
         }
     }
 
     /// Appends a null to this builder.
     #[inline]
-    pub fn append_null(&mut self) -> Result<()> {
-        if self.bitmap_builder.is_none() {
-            let mut builder = BooleanBufferBuilder::new(self.len + 1);
-            for _ in 0..self.len {
-                builder.append(true);
-            }
-            self.bitmap_builder = Some(builder)
-        }
-        self.bitmap_builder
-            .as_mut()
-            .expect("Cannot be None")
-            .append(false);
-
-        self.type_id_builder.append(i8::default());
-
-        match &mut self.value_offset_builder {
-            // Handle dense union
-            Some(value_offset_builder) => value_offset_builder.append(i32::default()),
-            // Handle sparse union
-            None => {
-                for (_, fd) in self.fields.iter_mut() {
-                    fd.append_null_dynamic()?;
-                }
-            }
-        };
-        self.len += 1;
-        Ok(())
+    pub fn append_null<T: ArrowPrimitiveType>(&mut self, type_name: &str) -> Result<()> {

Review Comment:
   I realize from an implementation perspective why you chose to require appending a null by field name, but it seems kind of a messy API -- what do you think about potentially appending a null to the first child or something by default?
   
   Perhaps something to think about for a follow on PR



##########
arrow/src/array/array_union.rs:
##########
@@ -560,7 +554,7 @@ mod tests {
         builder.append::<Int32Type>("a", 1).unwrap();
         builder.append::<Int64Type>("c", 3).unwrap();
         builder.append::<Int32Type>("a", 10).unwrap();
-        builder.append_null().unwrap();
+        builder.append_null::<Int32Type>("a").unwrap();

Review Comment:
   it is strange to require a field name to append nulls -- I left a comment on how to improve things perhaps in the future



##########
arrow/src/array/equal/fixed_binary.rs:
##########
@@ -39,8 +36,8 @@ pub(super) fn fixed_binary_equal(
     let lhs_values = &lhs.buffers()[0].as_slice()[lhs.offset() * size..];
     let rhs_values = &rhs.buffers()[0].as_slice()[rhs.offset() * size..];
 
-    let lhs_null_count = count_nulls(lhs_nulls, lhs_start, len);
-    let rhs_null_count = count_nulls(rhs_nulls, rhs_start, len);
+    let lhs_null_count = count_nulls(lhs.null_buffer(), lhs_start + lhs.offset(), len);

Review Comment:
   it makes a lot more sense to look at the buffer's nulls directly



##########
arrow/src/compute/kernels/filter.rs:
##########
@@ -1707,17 +1738,17 @@ mod tests {
         let mut builder = UnionBuilder::new_sparse(4);
         builder.append::<Int32Type>("A", 1).unwrap();
         builder.append::<Float64Type>("B", 3.2).unwrap();
-        builder.append_null().unwrap();
+        builder.append_null::<Float64Type>("B").unwrap();
         builder.append::<Int32Type>("A", 34).unwrap();
         let array = builder.build().unwrap();
 
         let filter_array = BooleanArray::from(vec![true, false, true, false]);
         let c = filter(&array, &filter_array).unwrap();
         let filtered = c.as_any().downcast_ref::<UnionArray>().unwrap();
 
-        let mut builder = UnionBuilder::new_dense(1);
+        let mut builder = UnionBuilder::new_sparse(2);

Review Comment:
   I assume this was changed because the name of the test is `test_filter_union_array_sparse_with_nulls` but we were using a `dense` array previously



##########
arrow/src/array/data.rs:
##########
@@ -621,6 +621,13 @@ impl ArrayData {
         // Check that the data layout conforms to the spec
         let layout = layout(&self.data_type);
 
+        if !layout.can_contain_null_mask && self.null_bitmap.is_some() {

Review Comment:
   👍 



##########
arrow/src/array/array_union.rs:
##########
@@ -522,29 +516,29 @@ mod tests {
             match i {
                 0 => {
                     let slot = slot.as_any().downcast_ref::<Int32Array>().unwrap();
-                    assert!(!union.is_null(i));
+                    assert!(!slot.is_null(0));

Review Comment:
   https://arrow.apache.org/docs/format/Columnar.html#union-layout
   
   Says
   
   > Unlike other data types, unions do not have their own validity bitmap. Instead, the nullness of each slot is determined exclusively by the child arrays which are composed to create the union.
   
   Which seems consistent 👍 
   
   However, it seems like `UnionArray` doesn't override `is_null` to take this into account.  Filed https://github.com/apache/arrow-rs/issues/1625 to track



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on a diff in pull request #1589: Fix Null Mask Handling in ArrayData And UnionArray

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #1589:
URL: https://github.com/apache/arrow-rs/pull/1589#discussion_r860772343


##########
arrow/src/array/builder.rs:
##########
@@ -2070,39 +2061,13 @@ impl UnionBuilder {
             fields: HashMap::default(),
             type_id_builder: Int8BufferBuilder::new(capacity),
             value_offset_builder: None,
-            bitmap_builder: None,
         }
     }
 
     /// Appends a null to this builder.
     #[inline]
-    pub fn append_null(&mut self) -> Result<()> {
-        if self.bitmap_builder.is_none() {
-            let mut builder = BooleanBufferBuilder::new(self.len + 1);
-            for _ in 0..self.len {
-                builder.append(true);
-            }
-            self.bitmap_builder = Some(builder)
-        }
-        self.bitmap_builder
-            .as_mut()
-            .expect("Cannot be None")
-            .append(false);
-
-        self.type_id_builder.append(i8::default());
-
-        match &mut self.value_offset_builder {
-            // Handle dense union
-            Some(value_offset_builder) => value_offset_builder.append(i32::default()),
-            // Handle sparse union
-            None => {
-                for (_, fd) in self.fields.iter_mut() {
-                    fd.append_null_dynamic()?;
-                }
-            }
-        };
-        self.len += 1;
-        Ok(())
+    pub fn append_null<T: ArrowPrimitiveType>(&mut self, type_name: &str) -> Result<()> {

Review Comment:
   The short answer is that UnionArray nulls are typed and so two arrays are not equal if they have nulls in different child arrays.
   
   There is also the case of appending a null to an empty UnionArray...



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org