You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "tustvold (via GitHub)" <gi...@apache.org> on 2023/03/17 17:47:48 UTC

[GitHub] [arrow-rs] tustvold opened a new pull request, #3885: Flesh out NullBuffer abstraction (#3880)

tustvold opened a new pull request, #3885:
URL: https://github.com/apache/arrow-rs/pull/3885

   # Which issue does this PR close?
   
   <!--
   We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes #123` indicates that this PR will close issue #123.
   -->
   
   Part of #3880
   
   # Rationale for this change
    
   <!--
   Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed.
   Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes.
   -->
   
   This continues the process of moving to more expressive buffer abstractions
   
   # What changes are included in this PR?
   
   <!--
   There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR.
   -->
   
   # Are there any user-facing changes?
   
   
   <!--
   If there are user-facing changes then we may require documentation to be updated before approving the PR.
   -->
   
   <!---
   If there are any breaking changes to public APIs, please add the `breaking change` label.
   -->
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on a diff in pull request #3885: Flesh out NullBuffer abstraction (#3880)

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on code in PR #3885:
URL: https://github.com/apache/arrow-rs/pull/3885#discussion_r1143274050


##########
arrow-arith/src/arity.rs:
##########
@@ -412,14 +388,8 @@ where
             Err(err) => return Ok(Err(err)),
         };
 
-        let array_builder = builder
-            .finish()
-            .into_data()
-            .into_builder()
-            .null_bit_buffer(null_buffer)
-            .null_count(null_count);
-
-        let array_data = unsafe { array_builder.build_unchecked() };
+        let array_builder = builder.finish().into_data().into_builder();

Review Comment:
   PrimitiveBuilder vs ArrayBuilder, this should be greatly simplified with the ongoing work in #3880 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] viirya commented on a diff in pull request #3885: Flesh out NullBuffer abstraction (#3880)

Posted by "viirya (via GitHub)" <gi...@apache.org>.
viirya commented on code in PR #3885:
URL: https://github.com/apache/arrow-rs/pull/3885#discussion_r1142734366


##########
arrow-buffer/src/buffer/null.rs:
##########
@@ -50,6 +51,18 @@ impl NullBuffer {
         Self { buffer, null_count }
     }
 
+    /// Computes the union of two optional [`NullBuffer`]
+    pub fn union(

Review Comment:
   This is bit and, but I think `union` sounds like bit or? Maybe `bitand`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on a diff in pull request #3885: Flesh out NullBuffer abstraction (#3880)

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on code in PR #3885:
URL: https://github.com/apache/arrow-rs/pull/3885#discussion_r1142751190


##########
arrow-buffer/src/buffer/null.rs:
##########
@@ -50,6 +51,18 @@ impl NullBuffer {
         Self { buffer, null_count }
     }
 
+    /// Computes the union of two optional [`NullBuffer`]
+    pub fn union(

Review Comment:
   Its union of the nulls, its confusing because a 0 bit is a null... I'll add further docs



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on a diff in pull request #3885: Flesh out NullBuffer abstraction (#3880)

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on code in PR #3885:
URL: https://github.com/apache/arrow-rs/pull/3885#discussion_r1140551043


##########
arrow-arith/src/arity.rs:
##########
@@ -220,11 +215,7 @@ where
         return Ok(PrimitiveArray::from(ArrayData::new_empty(&O::DATA_TYPE)));
     }
 
-    let null_buffer = combine_option_bitmap(&[a.data(), b.data()], len);
-    let null_count = null_buffer
-        .as_ref()
-        .map(|x| len - x.count_set_bits_offset(0, len))
-        .unwrap_or_default();
+    let nulls = NullBuffer::union(a.nulls(), b.nulls());

Review Comment:
   Not only is this less code, but it can now preserve the source null buffers even if they have an offset



##########
arrow-arith/src/arity.rs:
##########
@@ -220,11 +215,7 @@ where
         return Ok(PrimitiveArray::from(ArrayData::new_empty(&O::DATA_TYPE)));
     }
 
-    let null_buffer = combine_option_bitmap(&[a.data(), b.data()], len);
-    let null_count = null_buffer
-        .as_ref()
-        .map(|x| len - x.count_set_bits_offset(0, len))
-        .unwrap_or_default();
+    let nulls = NullBuffer::union(a.data().nulls(), b.data().nulls());

Review Comment:
   Not only is this less code, but it can now preserve the source null buffers even if they have an offset



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on a diff in pull request #3885: Flesh out NullBuffer abstraction (#3880)

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on code in PR #3885:
URL: https://github.com/apache/arrow-rs/pull/3885#discussion_r1140534089


##########
arrow-string/src/regexp.rs:
##########
@@ -100,15 +99,11 @@ pub fn regexp_is_match_utf8<OffsetSize: OffsetSizeTrait>(
         .collect::<Result<Vec<()>, ArrowError>>()?;
 
     let data = unsafe {
-        ArrayData::new_unchecked(
-            DataType::Boolean,
-            array.len(),
-            None,
-            null_bit_buffer,
-            0,
-            vec![result.finish()],
-            vec![],
-        )
+        ArrayDataBuilder::new(DataType::Boolean)

Review Comment:
   A future `BooleanArray::new_unchecked` should clean this up :smile: 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold merged pull request #3885: Flesh out NullBuffer abstraction (#3880)

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold merged PR #3885:
URL: https://github.com/apache/arrow-rs/pull/3885


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on a diff in pull request #3885: Flesh out NullBuffer abstraction (#3880)

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on code in PR #3885:
URL: https://github.com/apache/arrow-rs/pull/3885#discussion_r1140533115


##########
arrow-arith/src/arity.rs:
##########
@@ -21,30 +21,25 @@ use arrow_array::builder::BufferBuilder;
 use arrow_array::iterator::ArrayIter;
 use arrow_array::types::ArrowDictionaryKeyType;
 use arrow_array::*;
-use arrow_buffer::buffer::{BooleanBuffer, NullBuffer};
+use arrow_buffer::buffer::NullBuffer;
 use arrow_buffer::{Buffer, MutableBuffer};
-use arrow_data::bit_iterator::try_for_each_valid_idx;
-use arrow_data::bit_mask::combine_option_bitmap;
-use arrow_data::ArrayData;
+use arrow_data::{ArrayData, ArrayDataBuilder};
 use arrow_schema::ArrowError;
 use std::sync::Arc;
 
 #[inline]
 unsafe fn build_primitive_array<O: ArrowPrimitiveType>(
     len: usize,
     buffer: Buffer,
-    null_count: usize,
-    null_buffer: Option<Buffer>,
+    nulls: Option<NullBuffer>,

Review Comment:
   The astute will notice this is precisely the `PrimitiveArray::new_unchecked` that #3880 proposes to add :tada: 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] alamb commented on a diff in pull request #3885: Flesh out NullBuffer abstraction (#3880)

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on code in PR #3885:
URL: https://github.com/apache/arrow-rs/pull/3885#discussion_r1142594006


##########
arrow-arith/src/arity.rs:
##########
@@ -412,14 +388,8 @@ where
             Err(err) => return Ok(Err(err)),
         };
 
-        let array_builder = builder
-            .finish()
-            .into_data()
-            .into_builder()
-            .null_bit_buffer(null_buffer)
-            .null_count(null_count);
-
-        let array_data = unsafe { array_builder.build_unchecked() };
+        let array_builder = builder.finish().into_data().into_builder();

Review Comment:
   I wonder why we need to make a new builder (rather than just update the one we had) -- though I see this is what the previous code did too



##########
arrow-buffer/src/buffer/boolean.rs:
##########
@@ -145,3 +146,41 @@ impl BooleanBuffer {
         self.buffer
     }
 }
+
+impl BitAnd<&BooleanBuffer> for &BooleanBuffer {

Review Comment:
   This is very nice 👍  -- it will be really nice to apply such bit operations on BooleanBuffer / NullBuffers directly and not via some function
   



##########
arrow-arith/src/arity.rs:
##########
@@ -391,17 +372,12 @@ where
     if a.null_count() == 0 && b.null_count() == 0 {
         try_binary_no_nulls_mut(len, a, b, op)
     } else {
-        let null_buffer = combine_option_bitmap(&[a.data(), b.data()], len);
-        let null_count = null_buffer

Review Comment:
   it is nice that the null count now is encapsulated in `NullBuffer` rather than needing to be calculated explicitly (and potentially inconsistently)



##########
arrow-buffer/src/buffer/boolean.rs:
##########
@@ -145,3 +146,41 @@ impl BooleanBuffer {
         self.buffer
     }
 }
+
+impl BitAnd<&BooleanBuffer> for &BooleanBuffer {
+    type Output = BooleanBuffer;
+
+    fn bitand(self, rhs: &BooleanBuffer) -> Self::Output {
+        assert_eq!(self.len, rhs.len);
+        BooleanBuffer {
+            buffer: buffer_bin_and(

Review Comment:
   I noticed above that buffers are is combined using `combine_option_bitmap`  which is different than this. 
   
   However, when I double checked that `combine_option_bitmap` calls eventually to `buffer_bin_and`
   
   https://github.com/apache/arrow-rs/blob/e41e0ca82ce4967a66a7eda04cb1630487aaeca9/arrow-data/src/bit_mask.rs#L71-L98



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on pull request #3885: Flesh out NullBuffer abstraction (#3880)

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on PR #3885:
URL: https://github.com/apache/arrow-rs/pull/3885#issuecomment-1477744171

   I've confirmed this to have no major impact on benchmarks


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on a diff in pull request #3885: Flesh out NullBuffer abstraction (#3880)

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on code in PR #3885:
URL: https://github.com/apache/arrow-rs/pull/3885#discussion_r1140533594


##########
arrow-arith/src/arity.rs:
##########
@@ -220,11 +215,7 @@ where
         return Ok(PrimitiveArray::from(ArrayData::new_empty(&O::DATA_TYPE)));
     }
 
-    let null_buffer = combine_option_bitmap(&[a.data(), b.data()], len);
-    let null_count = null_buffer
-        .as_ref()
-        .map(|x| len - x.count_set_bits_offset(0, len))
-        .unwrap_or_default();
+    let nulls = NullBuffer::union(a.data().nulls(), b.data().nulls());

Review Comment:
   Not only is this less code, but it can now preserve the source null buffers even if they have an offset



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on a diff in pull request #3885: Flesh out NullBuffer abstraction (#3880)

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on code in PR #3885:
URL: https://github.com/apache/arrow-rs/pull/3885#discussion_r1140534435


##########
arrow-data/src/lib.rs:
##########
@@ -23,7 +23,7 @@ pub use data::*;
 mod equal;
 pub mod transform;
 
-pub mod bit_iterator;
+pub use arrow_buffer::bit_iterator;

Review Comment:
   This is moved to arrow_buffer



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org