You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/07/27 12:55:26 UTC

[GitHub] [arrow-rs] tustvold opened a new pull request, #2194: Use BitChunks in equal_bits

tustvold opened a new pull request, #2194:
URL: https://github.com/apache/arrow-rs/pull/2194

   # Which issue does this PR close?
   
   <!---
   We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes #123` indicates that this PR will close issue #123.
   -->
   
   Closes #2186 
   Part of #2188 
   
   # Rationale for this change
    
    <!---
    Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed.
    Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes.
   -->
   ```
   equal_512               time:   [33.149 ns 33.153 ns 33.157 ns]                       
                           change: [-2.0324% -1.9669% -1.9098%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 6 outliers among 100 measurements (6.00%)
     2 (2.00%) low mild
     2 (2.00%) high mild
     2 (2.00%) high severe
   
   equal_nulls_512         time:   [948.52 ns 948.88 ns 949.33 ns]                             
                           change: [-45.807% -45.682% -45.573%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 17 outliers among 100 measurements (17.00%)
     5 (5.00%) high mild
     12 (12.00%) high severe
   
   equal_string_512        time:   [46.418 ns 46.430 ns 46.442 ns]                              
                           change: [-1.5368% -1.4796% -1.4286%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 3 outliers among 100 measurements (3.00%)
     1 (1.00%) low severe
     2 (2.00%) high mild
   
   equal_string_nulls_512  time:   [4.1180 us 4.1189 us 4.1199 us]                                    
                           change: [-19.389% -19.355% -19.320%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 6 outliers among 100 measurements (6.00%)
     5 (5.00%) high mild
     1 (1.00%) high severe
   
   equal_bool_512          time:   [15.277 ns 15.286 ns 15.295 ns]                            
                           change: [-23.554% -23.428% -23.320%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 4 outliers among 100 measurements (4.00%)
     2 (2.00%) low mild
     2 (2.00%) high mild
   
   equal_bool_513          time:   [23.229 ns 23.249 ns 23.268 ns]                            
                           change: [+4.4352% +4.6082% +4.7715%] (p = 0.00 < 0.05)
                           Performance has regressed.
   Found 7 outliers among 100 measurements (7.00%)
     1 (1.00%) low severe
     1 (1.00%) low mild
     5 (5.00%) high mild
   
   ```
   
   # What changes are included in this PR?
   
   <!---
   There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR.
   -->
   
   # Are there any user-facing changes?
   
   
   <!---
   If there are user-facing changes then we may require documentation to be updated before approving the PR.
   -->
   
   <!---
   If there are any breaking changes to public APIs, please add the `breaking change` label.
   -->
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] alamb commented on a diff in pull request #2194: Use BitChunks in equal_bits

Posted by GitBox <gi...@apache.org>.
alamb commented on code in PR #2194:
URL: https://github.com/apache/arrow-rs/pull/2194#discussion_r932058241


##########
arrow/src/array/equal/boolean.rs:
##########
@@ -75,20 +77,14 @@ pub(super) fn boolean_equal(
     } else {
         // get a ref of the null buffer bytes, to use in testing for nullness
         let lhs_null_bytes = lhs.null_buffer().as_ref().unwrap().as_slice();
-        let rhs_null_bytes = rhs.null_buffer().as_ref().unwrap().as_slice();
 
         let lhs_start = lhs.offset() + lhs_start;
         let rhs_start = rhs.offset() + rhs_start;
 
-        (0..len).all(|i| {
+        BitIndexIterator::new(lhs_null_bytes, lhs_start, len).all(|i| {

Review Comment:
   Ah, I think I missed that this was checking `BooleanArray` (and hence this `get_bit` is getting the __value__ as you say). 👍 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on a diff in pull request #2194: Use BitChunks in equal_bits

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #2194:
URL: https://github.com/apache/arrow-rs/pull/2194#discussion_r931565979


##########
arrow/src/array/equal/boolean.rs:
##########
@@ -75,20 +77,14 @@ pub(super) fn boolean_equal(
     } else {
         // get a ref of the null buffer bytes, to use in testing for nullness
         let lhs_null_bytes = lhs.null_buffer().as_ref().unwrap().as_slice();
-        let rhs_null_bytes = rhs.null_buffer().as_ref().unwrap().as_slice();
 
         let lhs_start = lhs.offset() + lhs_start;
         let rhs_start = rhs.offset() + rhs_start;
 
-        (0..len).all(|i| {
+        BitIndexIterator::new(lhs_null_bytes, lhs_start, len).all(|i| {

Review Comment:
   These methods are purely checking value equality, at this point the null masks have already been checked for equality. Otherwise the previous logic would have been ill-formed. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold merged pull request #2194: Use BitChunks in equal_bits

Posted by GitBox <gi...@apache.org>.
tustvold merged PR #2194:
URL: https://github.com/apache/arrow-rs/pull/2194


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] alamb commented on a diff in pull request #2194: Use BitChunks in equal_bits

Posted by GitBox <gi...@apache.org>.
alamb commented on code in PR #2194:
URL: https://github.com/apache/arrow-rs/pull/2194#discussion_r931562098


##########
arrow/src/array/equal/boolean.rs:
##########
@@ -75,20 +77,14 @@ pub(super) fn boolean_equal(
     } else {
         // get a ref of the null buffer bytes, to use in testing for nullness
         let lhs_null_bytes = lhs.null_buffer().as_ref().unwrap().as_slice();
-        let rhs_null_bytes = rhs.null_buffer().as_ref().unwrap().as_slice();
 
         let lhs_start = lhs.offset() + lhs_start;
         let rhs_start = rhs.offset() + rhs_start;
 
-        (0..len).all(|i| {
+        BitIndexIterator::new(lhs_null_bytes, lhs_start, len).all(|i| {

Review Comment:
   Is this correct? It seems like it will would not find positions where `rhs` was null but `lhs` was not. Maybe I mis understand something
   
   It seems like we need to iterate over the positions where either is set -- like `lhs_null_bytes | rhs_null_bytes`?
   
   Maybe there is a lack of test coverage 🤔 



##########
arrow/src/array/data.rs:
##########
@@ -2865,4 +2881,15 @@ mod tests {
         let err = data.validate_values().unwrap_err();
         assert_eq!(err.to_string(), "Invalid argument error: Offset invariant failure: offset at position 1 out of bounds: 3 > 2");
     }
+
+    #[test]
+    fn test_contains_nulls() {
+        let buffer: Buffer =

Review Comment:
   I wonder if using a buffer with more than 64 bits is important (or is that already well enough covered in `BitSliceIterator` tests)?



##########
arrow/src/array/equal/boolean.rs:
##########
@@ -75,20 +77,14 @@ pub(super) fn boolean_equal(
     } else {
         // get a ref of the null buffer bytes, to use in testing for nullness
         let lhs_null_bytes = lhs.null_buffer().as_ref().unwrap().as_slice();
-        let rhs_null_bytes = rhs.null_buffer().as_ref().unwrap().as_slice();
 
         let lhs_start = lhs.offset() + lhs_start;
         let rhs_start = rhs.offset() + rhs_start;
 
-        (0..len).all(|i| {
+        BitIndexIterator::new(lhs_null_bytes, lhs_start, len).all(|i| {

Review Comment:
   Or maybe you can just call `equal_bits` here?



##########
arrow/src/array/data.rs:
##########
@@ -37,6 +38,21 @@ use std::sync::Arc;
 
 use super::equal::equal;
 
+#[inline]
+pub(crate) fn contains_nulls(

Review Comment:
   Given the potential usefulness (and non obviousness) of using `BitSliceIterator` to check for nulls, I wonder what you think about making this a function on `BitSliceIterator` such as `BitSliceIterator::contains_only_unset_bits` or something? 



##########
arrow/src/array/data.rs:
##########
@@ -37,6 +38,21 @@ use std::sync::Arc;
 
 use super::equal::equal;
 
+#[inline]
+pub(crate) fn contains_nulls(

Review Comment:
   Given the never ending confusion about "is this a size in bits or bytes" I recommend making it explicit -- either add a docstring that says `offset` and `len` are in `bits` or perhaps name them `bit_offset` and `bit_len`. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] ursabot commented on pull request #2194: Use BitChunks in equal_bits

Posted by GitBox <gi...@apache.org>.
ursabot commented on PR #2194:
URL: https://github.com/apache/arrow-rs/pull/2194#issuecomment-1198020076

   Benchmark runs are scheduled for baseline = 6c77cd56e213d8e0cda05d6dd3af1d7022be387b and contender = 7199b1b2f67ba6c1ebcf343c6bbb671737cfeded. 7199b1b2f67ba6c1ebcf343c6bbb671737cfeded is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Skipped :warning: Benchmarking of arrow-rs-commits is not supported on ec2-t3-xlarge-us-east-2] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/2dae78e2d9b84c70bdb88cbf2002ecd8...4b0b080e258b4fdd9243fa0861d56add/)
   [Skipped :warning: Benchmarking of arrow-rs-commits is not supported on test-mac-arm] [test-mac-arm](https://conbench.ursa.dev/compare/runs/b615c3db5d5940b482bd27cdee171966...c528f449548b4535886fe344b5ce60e4/)
   [Skipped :warning: Benchmarking of arrow-rs-commits is not supported on ursa-i9-9960x] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/d29d7c1ec9634a1e8b60e0e88b33fc82...dc16ab3779934178bb5eecaa75c533ec/)
   [Skipped :warning: Benchmarking of arrow-rs-commits is not supported on ursa-thinkcentre-m75q] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/e2c65bc8d8ea434dbfcc84ae13cf73b4...37cf6eaaccc84d1c91659aa9ad3d86fc/)
   Buildkite builds:
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on a diff in pull request #2194: Use BitChunks in equal_bits

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #2194:
URL: https://github.com/apache/arrow-rs/pull/2194#discussion_r931566829


##########
arrow/src/array/data.rs:
##########
@@ -2865,4 +2881,15 @@ mod tests {
         let err = data.validate_values().unwrap_err();
         assert_eq!(err.to_string(), "Invalid argument error: Offset invariant failure: offset at position 1 out of bounds: 3 > 2");
     }
+
+    #[test]
+    fn test_contains_nulls() {
+        let buffer: Buffer =

Review Comment:
   I'm happy they are well covered by BitSliceIterator



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on a diff in pull request #2194: Use BitChunks in equal_bits

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #2194:
URL: https://github.com/apache/arrow-rs/pull/2194#discussion_r931567377


##########
arrow/src/array/data.rs:
##########
@@ -37,6 +38,21 @@ use std::sync::Arc;
 
 use super::equal::equal;
 
+#[inline]
+pub(crate) fn contains_nulls(

Review Comment:
   I could definitely see this being promoted to a method on BitMap, once that supports slicing #1802 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] alamb commented on pull request #2194: Use BitChunks in equal_bits

Posted by GitBox <gi...@apache.org>.
alamb commented on PR #2194:
URL: https://github.com/apache/arrow-rs/pull/2194#issuecomment-1197335157

   I took the liberty of fixing the `fmt` lint diff and merging the branch from master


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org