You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/08/25 18:50:25 UTC

[GitHub] [arrow-rs] mathiaspeters-sig opened a new pull request #716: Optimize array::transform::utils::set_bits

mathiaspeters-sig opened a new pull request #716:
URL: https://github.com/apache/arrow-rs/pull/716


   # Which issue does this PR close?
   
   Closes #397 
   
   # Rationale for this change
    
   See issue.
   
   # What changes are included in this PR?
   
   Two changes:
   
   1. I added unit tests to make sure the function behaves the same as before my changes
   2. I updated the function body to follow the proposed algorithm in the issue which in a nutshell is:
       - Set individual bytes to the next byte offset
       - Use a `BitChunkIterator` on the remaining bytes to set entire `u8`s
       - Set remaining bits individually
   
   # Are there any user-facing changes?
   
   Sort of.
   
   The way `set_bits` is called currently is that it's always applied to byte arrays where all bits are set to 0. As long as that is true there are no user facing changes. However, the old implementation would not overwrite a `1` with a `0` if that is what the data says but in the current implementation it will sometimes do that. Where it's setting individual bits (initial bits to get to a byte alignment and remainder bits after bit chunk iterator has run out of full u64) it behaves like the old implementation, but the section where it's setting full bytes sets the bytes regardless of the value in `data` and `write_data`.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] alamb commented on pull request #716: Optimize array::transform::utils::set_bits

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #716:
URL: https://github.com/apache/arrow-rs/pull/716#issuecomment-916451919


   > @alamb anything that is left to do here in your opinion?
   
   
   Nope -- thanks @mathiaspeters-sig  I am happy with @nevi-me 's review. I am sorry for the delay in merging I have been away and am now catching up 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] Dandandan commented on a change in pull request #716: Optimize array::transform::utils::set_bits

Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #716:
URL: https://github.com/apache/arrow-rs/pull/716#discussion_r697698498



##########
File path: arrow/src/array/transform/utils.rs
##########
@@ -74,3 +103,130 @@ pub(super) unsafe fn get_last_offset<T: OffsetSizeTrait>(
     debug_assert!(prefix.is_empty() && suffix.is_empty());
     *offsets.get_unchecked(offsets.len() - 1)
 }
+
+#[cfg(test)]
+mod tests {
+    use super::*;
+
+    #[test]
+    fn test_set_bits_aligned() {
+        let mut destination: Vec<u8> = vec![0, 0, 0, 0, 0, 0, 0, 0, 0, 0];
+        let source: &[u8] = &[
+            0b11100111, 0b11100111, 0b11100111, 0b11100111, 0b11100111, 0b11100111,
+            0b11100111, 0b11100111,
+        ];
+
+        let destination_offset = 8;
+        let source_offset = 0;
+
+        let len = 64;
+
+        let expected_data: &[u8] = &[
+            0, 0b11100111, 0b11100111, 0b11100111, 0b11100111, 0b11100111, 0b11100111,
+            0b11100111, 0b11100111, 0,
+        ];
+        let expected_null_count = 16;
+        let result = set_bits(
+            destination.as_mut_slice(),
+            source,
+            destination_offset,
+            source_offset,
+            len,
+        );
+
+        assert_eq!(destination, expected_data);
+        assert_eq!(result, expected_null_count);
+    }
+
+    #[test]
+    fn test_set_bits_unaligned_destination_start() {
+        let mut destination: Vec<u8> = vec![0, 0, 0, 0, 0, 0, 0, 0, 0, 0];
+        let source: &[u8] = &[
+            0b11100111, 0b11100111, 0b11100111, 0b11100111, 0b11100111, 0b11100111,
+            0b11100111, 0b11100111,
+        ];
+
+        let destination_offset = 3;
+        let source_offset = 0;
+
+        let len = 64;
+
+        let expected_data: &[u8] = &[
+            0b00111000, 0b00111111, 0b00111111, 0b00111111, 0b00111111, 0b00111111,
+            0b00111111, 0b00111111, 0b00000111, 0b00000000,

Review comment:
       To me this looks to be correct, the last 3 bits on the left from the source byte end up on the right of the last destination byte (set bits go from right to left)




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] mathiaspeters-sig commented on pull request #716: Optimize array::transform::utils::set_bits

Posted by GitBox <gi...@apache.org>.
mathiaspeters-sig commented on pull request #716:
URL: https://github.com/apache/arrow-rs/pull/716#issuecomment-914106180


   @alamb anything that is left to do here in your opinion?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] mathiaspeters-sig commented on a change in pull request #716: Optimize array::transform::utils::set_bits

Posted by GitBox <gi...@apache.org>.
mathiaspeters-sig commented on a change in pull request #716:
URL: https://github.com/apache/arrow-rs/pull/716#discussion_r698527363



##########
File path: arrow/src/array/transform/utils.rs
##########
@@ -74,3 +103,130 @@ pub(super) unsafe fn get_last_offset<T: OffsetSizeTrait>(
     debug_assert!(prefix.is_empty() && suffix.is_empty());
     *offsets.get_unchecked(offsets.len() - 1)
 }
+
+#[cfg(test)]
+mod tests {
+    use super::*;
+
+    #[test]
+    fn test_set_bits_aligned() {
+        let mut destination: Vec<u8> = vec![0, 0, 0, 0, 0, 0, 0, 0, 0, 0];
+        let source: &[u8] = &[
+            0b11100111, 0b11100111, 0b11100111, 0b11100111, 0b11100111, 0b11100111,
+            0b11100111, 0b11100111,

Review comment:
       That is a good idea. I changed the patterns, although it know repeats every 7 bytes, but I doubt there will be any "off by 48" errors and we don't have enough data for an "off by 64" error

##########
File path: arrow/src/array/transform/utils.rs
##########
@@ -74,3 +103,130 @@ pub(super) unsafe fn get_last_offset<T: OffsetSizeTrait>(
     debug_assert!(prefix.is_empty() && suffix.is_empty());
     *offsets.get_unchecked(offsets.len() - 1)
 }
+
+#[cfg(test)]
+mod tests {
+    use super::*;
+
+    #[test]
+    fn test_set_bits_aligned() {
+        let mut destination: Vec<u8> = vec![0, 0, 0, 0, 0, 0, 0, 0, 0, 0];
+        let source: &[u8] = &[
+            0b11100111, 0b11100111, 0b11100111, 0b11100111, 0b11100111, 0b11100111,
+            0b11100111, 0b11100111,
+        ];
+
+        let destination_offset = 8;
+        let source_offset = 0;
+
+        let len = 64;
+
+        let expected_data: &[u8] = &[
+            0, 0b11100111, 0b11100111, 0b11100111, 0b11100111, 0b11100111, 0b11100111,
+            0b11100111, 0b11100111, 0,
+        ];
+        let expected_null_count = 16;
+        let result = set_bits(
+            destination.as_mut_slice(),
+            source,
+            destination_offset,
+            source_offset,
+            len,
+        );
+
+        assert_eq!(destination, expected_data);
+        assert_eq!(result, expected_null_count);
+    }
+
+    #[test]
+    fn test_set_bits_unaligned_destination_start() {
+        let mut destination: Vec<u8> = vec![0, 0, 0, 0, 0, 0, 0, 0, 0, 0];
+        let source: &[u8] = &[
+            0b11100111, 0b11100111, 0b11100111, 0b11100111, 0b11100111, 0b11100111,
+            0b11100111, 0b11100111,
+        ];
+
+        let destination_offset = 3;
+        let source_offset = 0;
+
+        let len = 64;
+
+        let expected_data: &[u8] = &[
+            0b00111000, 0b00111111, 0b00111111, 0b00111111, 0b00111111, 0b00111111,
+            0b00111111, 0b00111111, 0b00000111, 0b00000000,

Review comment:
       I can also mention here that I double checked that the old implementation returned the specified result as well, for all 4 tests




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] jhorstmann commented on a change in pull request #716: Optimize array::transform::utils::set_bits

Posted by GitBox <gi...@apache.org>.
jhorstmann commented on a change in pull request #716:
URL: https://github.com/apache/arrow-rs/pull/716#discussion_r697322266



##########
File path: arrow/src/array/transform/utils.rs
##########
@@ -35,15 +42,37 @@ pub(super) fn set_bits(
     offset_read: usize,
     len: usize,
 ) -> usize {
-    let mut count = 0;
-    (0..len).for_each(|i| {
-        if bit_util::get_bit(data, offset_read + i) {
-            bit_util::set_bit(write_data, offset_write + i);
-        } else {
-            count += 1;
-        }
+    let mut null_count = 0;
+
+    let mut bits_to_align = offset_write % 8;
+    if bits_to_align > 0 {
+        bits_to_align = std::cmp::min(len, 8 - bits_to_align);
+    }
+    let mut byte_index = ceil(offset_write + bits_to_align, 8);
+
+    // Set full bytes provided by bit chunk iterator
+    let chunks = BitChunks::new(data, offset_read + bits_to_align, len - bits_to_align);
+    chunks.iter().for_each(|chunk| {
+        null_count += chunk.count_zeros();
+        chunk.to_ne_bytes().iter().for_each(|b| {

Review comment:
       I think this needs to use `to_le_bytes`, for example see the comments in `ops.rs`, `bitwise_bin_op_helper` (which has some typos):
   
       // we are counting bits starting from the least significant bit, so to_le_bytes should be correct




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] jhorstmann commented on pull request #716: Optimize array::transform::utils::set_bits

Posted by GitBox <gi...@apache.org>.
jhorstmann commented on pull request #716:
URL: https://github.com/apache/arrow-rs/pull/716#issuecomment-906620829


   Code looks good to me. The chained loop with bits_to_align and remainder could be split into two loops to write more sequential, but the code looks a bit simpler with one loop. I benchmarked this locally by adding to `concatenate_kernels`:
   
   ```
       let v1 = create_boolean_array(1024, 0.5, 0.0);
       let v2 = create_boolean_array(1024, 0.5, 0.0);
       c.bench_function("concat bool 1024", |b| {
           b.iter(|| bench_concat(&v1, &v2))
       });
   
       let v1 = create_boolean_array(1024, 0.5, 0.5);
       let v2 = create_boolean_array(1024, 0.5, 0.5);
       c.bench_function("concat bool nulls 1024", |b| {
           b.iter(|| bench_concat(&v1, &v2))
       });
   ```
   
   The results are very good, speedup between factor 3-4, improvement on bigger batches could be even better. Interestingly the benchmark setup seems to always create a null bitmap, also for the tests that are supposed to be non-null. Otherwise I can't explain why those benches also see a big speedup.
   
   There is minimal additional overhead in "concat 1024 arrays i32 4" but that is probably the worst case, concatenating 1024 arrays of length 4.
   
   ```
   concat i32 1024         time:   [1.4334 us 1.4357 us 1.4382 us]                             
                           change: [-67.795% -67.484% -67.188%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 12 outliers among 100 measurements (12.00%)
     7 (7.00%) high mild
     5 (5.00%) high severe
   
   concat i32 nulls 1024   time:   [1.6528 us 1.6549 us 1.6572 us]                                   
                           change: [-58.407% -57.885% -57.194%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 15 outliers among 100 measurements (15.00%)
     7 (7.00%) high mild
     8 (8.00%) high severe
   
   concat 1024 arrays i32 4                                                                            
                           time:   [162.80 us 162.99 us 163.23 us]
                           change: [+4.3373% +6.1785% +7.9774%] (p = 0.00 < 0.05)
                           Performance has regressed.
   Found 11 outliers among 100 measurements (11.00%)
     4 (4.00%) high mild
     7 (7.00%) high severe
   
   concat str 1024         time:   [4.1305 us 4.1378 us 4.1471 us]                             
                           change: [-40.416% -40.067% -39.739%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 14 outliers among 100 measurements (14.00%)
     5 (5.00%) high mild
     9 (9.00%) high severe
   
   concat str nulls 1024   time:   [21.156 us 21.181 us 21.208 us]                                   
                           change: [-3.1958% -2.3516% -1.5638%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 7 outliers among 100 measurements (7.00%)
     3 (3.00%) high mild
     4 (4.00%) high severe
   
   concat bool 1024        time:   [1.4137 us 1.4203 us 1.4281 us]                              
                           change: [-74.572% -74.403% -74.216%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 8 outliers among 100 measurements (8.00%)
     2 (2.00%) high mild
     6 (6.00%) high severe
   
   concat bool nulls 1024  time:   [1.4999 us 1.5033 us 1.5070 us]                                    
                           change: [-74.566% -74.398% -74.230%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 7 outliers among 100 measurements (7.00%)
     3 (3.00%) high mild
     4 (4.00%) high severe
   ```
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] codecov-commenter edited a comment on pull request #716: Optimize array::transform::utils::set_bits

Posted by GitBox <gi...@apache.org>.
codecov-commenter edited a comment on pull request #716:
URL: https://github.com/apache/arrow-rs/pull/716#issuecomment-906739846


   # [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/716?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#716](https://codecov.io/gh/apache/arrow-rs/pull/716?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (39c109a) into [master](https://codecov.io/gh/apache/arrow-rs/commit/8308615d40e14caa5cdbee118ecc2f46696b920f?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (8308615) will **increase** coverage by `0.05%`.
   > The diff coverage is `94.42%`.
   
   > :exclamation: Current head 39c109a differs from pull request most recent head fe1456d. Consider uploading reports for the commit fe1456d to get more accurate results
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow-rs/pull/716/graphs/tree.svg?width=650&height=150&src=pr&token=pq9V9qWZ1N&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/arrow-rs/pull/716?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master     #716      +/-   ##
   ==========================================
   + Coverage   82.46%   82.52%   +0.05%     
   ==========================================
     Files         168      168              
     Lines       47419    47647     +228     
   ==========================================
   + Hits        39104    39319     +215     
   - Misses       8315     8328      +13     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow-rs/pull/716?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [arrow-flight/src/arrow.flight.protocol.rs](https://codecov.io/gh/apache/arrow-rs/pull/716/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3ctZmxpZ2h0L3NyYy9hcnJvdy5mbGlnaHQucHJvdG9jb2wucnM=) | `0.00% <ø> (ø)` | |
   | [arrow/src/alloc/types.rs](https://codecov.io/gh/apache/arrow-rs/pull/716/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FsbG9jL3R5cGVzLnJz) | `0.00% <ø> (ø)` | |
   | [parquet/src/data\_type.rs](https://codecov.io/gh/apache/arrow-rs/pull/716/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvZGF0YV90eXBlLnJz) | `77.32% <ø> (ø)` | |
   | [parquet\_derive/src/lib.rs](https://codecov.io/gh/apache/arrow-rs/pull/716/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldF9kZXJpdmUvc3JjL2xpYi5ycw==) | `0.00% <ø> (ø)` | |
   | [arrow/src/compute/kernels/comparison.rs](https://codecov.io/gh/apache/arrow-rs/pull/716/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2NvbXB1dGUva2VybmVscy9jb21wYXJpc29uLnJz) | `95.08% <91.74%> (-0.76%)` | :arrow_down: |
   | [parquet/src/arrow/arrow\_array\_reader.rs](https://codecov.io/gh/apache/arrow-rs/pull/716/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvYXJyb3cvYXJyb3dfYXJyYXlfcmVhZGVyLnJz) | `79.71% <93.44%> (+1.28%)` | :arrow_up: |
   | [arrow/src/array/transform/utils.rs](https://codecov.io/gh/apache/arrow-rs/pull/716/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FycmF5L3RyYW5zZm9ybS91dGlscy5ycw==) | `98.71% <100.00%> (+3.71%)` | :arrow_up: |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow-rs/pull/716?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/716?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [8308615...fe1456d](https://codecov.io/gh/apache/arrow-rs/pull/716?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] alamb merged pull request #716: Optimize array::transform::utils::set_bits

Posted by GitBox <gi...@apache.org>.
alamb merged pull request #716:
URL: https://github.com/apache/arrow-rs/pull/716


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] codecov-commenter edited a comment on pull request #716: Optimize array::transform::utils::set_bits

Posted by GitBox <gi...@apache.org>.
codecov-commenter edited a comment on pull request #716:
URL: https://github.com/apache/arrow-rs/pull/716#issuecomment-906739846


   # [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/716?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#716](https://codecov.io/gh/apache/arrow-rs/pull/716?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (8b9c28b) into [master](https://codecov.io/gh/apache/arrow-rs/commit/8308615d40e14caa5cdbee118ecc2f46696b920f?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (8308615) will **increase** coverage by `0.05%`.
   > The diff coverage is `94.42%`.
   
   > :exclamation: Current head 8b9c28b differs from pull request most recent head 6387c14. Consider uploading reports for the commit 6387c14 to get more accurate results
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow-rs/pull/716/graphs/tree.svg?width=650&height=150&src=pr&token=pq9V9qWZ1N&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/arrow-rs/pull/716?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master     #716      +/-   ##
   ==========================================
   + Coverage   82.46%   82.52%   +0.05%     
   ==========================================
     Files         168      168              
     Lines       47419    47647     +228     
   ==========================================
   + Hits        39104    39319     +215     
   - Misses       8315     8328      +13     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow-rs/pull/716?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [arrow-flight/src/arrow.flight.protocol.rs](https://codecov.io/gh/apache/arrow-rs/pull/716/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3ctZmxpZ2h0L3NyYy9hcnJvdy5mbGlnaHQucHJvdG9jb2wucnM=) | `0.00% <ø> (ø)` | |
   | [arrow/src/alloc/types.rs](https://codecov.io/gh/apache/arrow-rs/pull/716/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FsbG9jL3R5cGVzLnJz) | `0.00% <ø> (ø)` | |
   | [parquet/src/data\_type.rs](https://codecov.io/gh/apache/arrow-rs/pull/716/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvZGF0YV90eXBlLnJz) | `77.32% <ø> (ø)` | |
   | [parquet\_derive/src/lib.rs](https://codecov.io/gh/apache/arrow-rs/pull/716/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldF9kZXJpdmUvc3JjL2xpYi5ycw==) | `0.00% <ø> (ø)` | |
   | [arrow/src/compute/kernels/comparison.rs](https://codecov.io/gh/apache/arrow-rs/pull/716/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2NvbXB1dGUva2VybmVscy9jb21wYXJpc29uLnJz) | `95.08% <91.74%> (-0.76%)` | :arrow_down: |
   | [parquet/src/arrow/arrow\_array\_reader.rs](https://codecov.io/gh/apache/arrow-rs/pull/716/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvYXJyb3cvYXJyb3dfYXJyYXlfcmVhZGVyLnJz) | `79.71% <93.44%> (+1.28%)` | :arrow_up: |
   | [arrow/src/array/transform/utils.rs](https://codecov.io/gh/apache/arrow-rs/pull/716/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FycmF5L3RyYW5zZm9ybS91dGlscy5ycw==) | `98.71% <100.00%> (+3.71%)` | :arrow_up: |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow-rs/pull/716?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/716?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [8308615...6387c14](https://codecov.io/gh/apache/arrow-rs/pull/716?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] codecov-commenter edited a comment on pull request #716: Optimize array::transform::utils::set_bits

Posted by GitBox <gi...@apache.org>.
codecov-commenter edited a comment on pull request #716:
URL: https://github.com/apache/arrow-rs/pull/716#issuecomment-906739846


   # [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/716?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#716](https://codecov.io/gh/apache/arrow-rs/pull/716?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (ba6a71d) into [master](https://codecov.io/gh/apache/arrow-rs/commit/8308615d40e14caa5cdbee118ecc2f46696b920f?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (8308615) will **increase** coverage by `0.07%`.
   > The diff coverage is `100.00%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow-rs/pull/716/graphs/tree.svg?width=650&height=150&src=pr&token=pq9V9qWZ1N&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/arrow-rs/pull/716?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master     #716      +/-   ##
   ==========================================
   + Coverage   82.46%   82.53%   +0.07%     
   ==========================================
     Files         168      168              
     Lines       47419    47705     +286     
   ==========================================
   + Hits        39104    39375     +271     
   - Misses       8315     8330      +15     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow-rs/pull/716?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [arrow/src/array/transform/utils.rs](https://codecov.io/gh/apache/arrow-rs/pull/716/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FycmF5L3RyYW5zZm9ybS91dGlscy5ycw==) | `98.71% <100.00%> (+3.71%)` | :arrow_up: |
   | [arrow/src/compute/kernels/comparison.rs](https://codecov.io/gh/apache/arrow-rs/pull/716/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2NvbXB1dGUva2VybmVscy9jb21wYXJpc29uLnJz) | `95.08% <0.00%> (-0.76%)` | :arrow_down: |
   | [parquet\_derive\_test/src/lib.rs](https://codecov.io/gh/apache/arrow-rs/pull/716/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldF9kZXJpdmVfdGVzdC9zcmMvbGliLnJz) | `100.00% <0.00%> (ø)` | |
   | [arrow/src/array/array\_binary.rs](https://codecov.io/gh/apache/arrow-rs/pull/716/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FycmF5L2FycmF5X2JpbmFyeS5ycw==) | `92.46% <0.00%> (+0.23%)` | :arrow_up: |
   | [parquet/src/arrow/arrow\_array\_reader.rs](https://codecov.io/gh/apache/arrow-rs/pull/716/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvYXJyb3cvYXJyb3dfYXJyYXlfcmVhZGVyLnJz) | `79.71% <0.00%> (+1.28%)` | :arrow_up: |
   | [parquet\_derive/src/parquet\_field.rs](https://codecov.io/gh/apache/arrow-rs/pull/716/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldF9kZXJpdmUvc3JjL3BhcnF1ZXRfZmllbGQucnM=) | `68.86% <0.00%> (+2.69%)` | :arrow_up: |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow-rs/pull/716?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/716?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [8308615...ba6a71d](https://codecov.io/gh/apache/arrow-rs/pull/716?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] mathiaspeters-sig commented on a change in pull request #716: Optimize array::transform::utils::set_bits

Posted by GitBox <gi...@apache.org>.
mathiaspeters-sig commented on a change in pull request #716:
URL: https://github.com/apache/arrow-rs/pull/716#discussion_r698527363



##########
File path: arrow/src/array/transform/utils.rs
##########
@@ -74,3 +103,130 @@ pub(super) unsafe fn get_last_offset<T: OffsetSizeTrait>(
     debug_assert!(prefix.is_empty() && suffix.is_empty());
     *offsets.get_unchecked(offsets.len() - 1)
 }
+
+#[cfg(test)]
+mod tests {
+    use super::*;
+
+    #[test]
+    fn test_set_bits_aligned() {
+        let mut destination: Vec<u8> = vec![0, 0, 0, 0, 0, 0, 0, 0, 0, 0];
+        let source: &[u8] = &[
+            0b11100111, 0b11100111, 0b11100111, 0b11100111, 0b11100111, 0b11100111,
+            0b11100111, 0b11100111,

Review comment:
       That is a good idea. I changed the patterns, although it know repeats every 7 bytes, but I doubt there will be any "off by 48" errors and we don't have enough data for an "off by 64" error




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] alamb commented on pull request #716: Optimize array::transform::utils::set_bits

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #716:
URL: https://github.com/apache/arrow-rs/pull/716#issuecomment-906729311


   I plan to review this tomorrow


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] mathiaspeters-sig commented on a change in pull request #716: Optimize array::transform::utils::set_bits

Posted by GitBox <gi...@apache.org>.
mathiaspeters-sig commented on a change in pull request #716:
URL: https://github.com/apache/arrow-rs/pull/716#discussion_r698528135



##########
File path: arrow/src/array/transform/utils.rs
##########
@@ -74,3 +103,130 @@ pub(super) unsafe fn get_last_offset<T: OffsetSizeTrait>(
     debug_assert!(prefix.is_empty() && suffix.is_empty());
     *offsets.get_unchecked(offsets.len() - 1)
 }
+
+#[cfg(test)]
+mod tests {
+    use super::*;
+
+    #[test]
+    fn test_set_bits_aligned() {
+        let mut destination: Vec<u8> = vec![0, 0, 0, 0, 0, 0, 0, 0, 0, 0];
+        let source: &[u8] = &[
+            0b11100111, 0b11100111, 0b11100111, 0b11100111, 0b11100111, 0b11100111,
+            0b11100111, 0b11100111,
+        ];
+
+        let destination_offset = 8;
+        let source_offset = 0;
+
+        let len = 64;
+
+        let expected_data: &[u8] = &[
+            0, 0b11100111, 0b11100111, 0b11100111, 0b11100111, 0b11100111, 0b11100111,
+            0b11100111, 0b11100111, 0,
+        ];
+        let expected_null_count = 16;
+        let result = set_bits(
+            destination.as_mut_slice(),
+            source,
+            destination_offset,
+            source_offset,
+            len,
+        );
+
+        assert_eq!(destination, expected_data);
+        assert_eq!(result, expected_null_count);
+    }
+
+    #[test]
+    fn test_set_bits_unaligned_destination_start() {
+        let mut destination: Vec<u8> = vec![0, 0, 0, 0, 0, 0, 0, 0, 0, 0];
+        let source: &[u8] = &[
+            0b11100111, 0b11100111, 0b11100111, 0b11100111, 0b11100111, 0b11100111,
+            0b11100111, 0b11100111,
+        ];
+
+        let destination_offset = 3;
+        let source_offset = 0;
+
+        let len = 64;
+
+        let expected_data: &[u8] = &[
+            0b00111000, 0b00111111, 0b00111111, 0b00111111, 0b00111111, 0b00111111,
+            0b00111111, 0b00111111, 0b00000111, 0b00000000,

Review comment:
       I can also mention here that I double checked that the old implementation returned the specified result as well, for all 4 tests




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] codecov-commenter commented on pull request #716: Optimize array::transform::utils::set_bits

Posted by GitBox <gi...@apache.org>.
codecov-commenter commented on pull request #716:
URL: https://github.com/apache/arrow-rs/pull/716#issuecomment-906739846


   # [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/716?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#716](https://codecov.io/gh/apache/arrow-rs/pull/716?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (bcb6991) into [master](https://codecov.io/gh/apache/arrow-rs/commit/8308615d40e14caa5cdbee118ecc2f46696b920f?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (8308615) will **increase** coverage by `0.04%`.
   > The diff coverage is `98.43%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow-rs/pull/716/graphs/tree.svg?width=650&height=150&src=pr&token=pq9V9qWZ1N&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/arrow-rs/pull/716?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master     #716      +/-   ##
   ==========================================
   + Coverage   82.46%   82.51%   +0.04%     
   ==========================================
     Files         168      168              
     Lines       47419    47653     +234     
   ==========================================
   + Hits        39104    39320     +216     
   - Misses       8315     8333      +18     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow-rs/pull/716?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [arrow/src/array/transform/utils.rs](https://codecov.io/gh/apache/arrow-rs/pull/716/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FycmF5L3RyYW5zZm9ybS91dGlscy5ycw==) | `92.85% <98.43%> (-2.15%)` | :arrow_down: |
   | [arrow/src/compute/kernels/comparison.rs](https://codecov.io/gh/apache/arrow-rs/pull/716/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2NvbXB1dGUva2VybmVscy9jb21wYXJpc29uLnJz) | `95.08% <0.00%> (-0.76%)` | :arrow_down: |
   | [parquet/src/arrow/arrow\_array\_reader.rs](https://codecov.io/gh/apache/arrow-rs/pull/716/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvYXJyb3cvYXJyb3dfYXJyYXlfcmVhZGVyLnJz) | `79.71% <0.00%> (+1.28%)` | :arrow_up: |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow-rs/pull/716?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/716?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [8308615...bcb6991](https://codecov.io/gh/apache/arrow-rs/pull/716?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] alamb commented on a change in pull request #716: Optimize array::transform::utils::set_bits

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #716:
URL: https://github.com/apache/arrow-rs/pull/716#discussion_r697690544



##########
File path: arrow/src/array/transform/utils.rs
##########
@@ -35,15 +42,37 @@ pub(super) fn set_bits(
     offset_read: usize,
     len: usize,
 ) -> usize {
-    let mut count = 0;
-    (0..len).for_each(|i| {
-        if bit_util::get_bit(data, offset_read + i) {
-            bit_util::set_bit(write_data, offset_write + i);
-        } else {
-            count += 1;
-        }
+    let mut null_count = 0;
+
+    let mut bits_to_align = offset_write % 8;
+    if bits_to_align > 0 {
+        bits_to_align = std::cmp::min(len, 8 - bits_to_align);
+    }
+    let mut byte_index = ceil(offset_write + bits_to_align, 8);

Review comment:
       Maybe callig this `write_byte_index` would help make it clearer what it represented

##########
File path: arrow/src/array/transform/utils.rs
##########
@@ -35,15 +42,37 @@ pub(super) fn set_bits(
     offset_read: usize,
     len: usize,
 ) -> usize {
-    let mut count = 0;
-    (0..len).for_each(|i| {
-        if bit_util::get_bit(data, offset_read + i) {
-            bit_util::set_bit(write_data, offset_write + i);
-        } else {
-            count += 1;
-        }
+    let mut null_count = 0;
+
+    let mut bits_to_align = offset_write % 8;
+    if bits_to_align > 0 {
+        bits_to_align = std::cmp::min(len, 8 - bits_to_align);
+    }
+    let mut byte_index = ceil(offset_write + bits_to_align, 8);
+
+    // Set full bytes provided by bit chunk iterator
+    let chunks = BitChunks::new(data, offset_read + bits_to_align, len - bits_to_align);
+    chunks.iter().for_each(|chunk| {
+        null_count += chunk.count_zeros();
+        chunk.to_le_bytes().iter().for_each(|b| {

Review comment:
       👍  bit chunk iterator seems to use little endian as well: https://github.com/mathiaspeters-sig/arrow-rs/blob/improve-set-bits/arrow/src/util/bit_chunk_iterator.rs#L138

##########
File path: arrow/src/array/transform/utils.rs
##########
@@ -74,3 +103,130 @@ pub(super) unsafe fn get_last_offset<T: OffsetSizeTrait>(
     debug_assert!(prefix.is_empty() && suffix.is_empty());
     *offsets.get_unchecked(offsets.len() - 1)
 }
+
+#[cfg(test)]
+mod tests {
+    use super::*;
+
+    #[test]
+    fn test_set_bits_aligned() {
+        let mut destination: Vec<u8> = vec![0, 0, 0, 0, 0, 0, 0, 0, 0, 0];
+        let source: &[u8] = &[
+            0b11100111, 0b11100111, 0b11100111, 0b11100111, 0b11100111, 0b11100111,
+            0b11100111, 0b11100111,

Review comment:
       I think one danger of using the same values for all of the source data is that it may mask offset calculation errors
   
   What would you think of using a different pattern for each byte? Perhaps something like
   
   
   ```suggestion
               0b10101010, 0b11001100, 0b11100011, 0b1111000, 0b11111000, 0b11111100,
               0b11111110, 0b1111111,
   ```
   
   Or some other pattern where having "off by 8 errors" is not as likely to be masked?

##########
File path: arrow/src/array/transform/utils.rs
##########
@@ -74,3 +103,130 @@ pub(super) unsafe fn get_last_offset<T: OffsetSizeTrait>(
     debug_assert!(prefix.is_empty() && suffix.is_empty());
     *offsets.get_unchecked(offsets.len() - 1)
 }
+
+#[cfg(test)]
+mod tests {
+    use super::*;
+
+    #[test]
+    fn test_set_bits_aligned() {
+        let mut destination: Vec<u8> = vec![0, 0, 0, 0, 0, 0, 0, 0, 0, 0];
+        let source: &[u8] = &[
+            0b11100111, 0b11100111, 0b11100111, 0b11100111, 0b11100111, 0b11100111,
+            0b11100111, 0b11100111,
+        ];
+
+        let destination_offset = 8;
+        let source_offset = 0;
+
+        let len = 64;
+
+        let expected_data: &[u8] = &[
+            0, 0b11100111, 0b11100111, 0b11100111, 0b11100111, 0b11100111, 0b11100111,
+            0b11100111, 0b11100111, 0,
+        ];
+        let expected_null_count = 16;
+        let result = set_bits(
+            destination.as_mut_slice(),
+            source,
+            destination_offset,
+            source_offset,
+            len,
+        );
+
+        assert_eq!(destination, expected_data);
+        assert_eq!(result, expected_null_count);
+    }
+
+    #[test]
+    fn test_set_bits_unaligned_destination_start() {
+        let mut destination: Vec<u8> = vec![0, 0, 0, 0, 0, 0, 0, 0, 0, 0];
+        let source: &[u8] = &[
+            0b11100111, 0b11100111, 0b11100111, 0b11100111, 0b11100111, 0b11100111,
+            0b11100111, 0b11100111,
+        ];
+
+        let destination_offset = 3;
+        let source_offset = 0;
+
+        let len = 64;
+
+        let expected_data: &[u8] = &[
+            0b00111000, 0b00111111, 0b00111111, 0b00111111, 0b00111111, 0b00111111,
+            0b00111111, 0b00111111, 0b00000111, 0b00000000,

Review comment:
       i think the 3rd byte looks suspicious -- should it be?
   
   ```suggestion
               0b00111111, 0b00111111, 0b00111100, 0b00000000,
   ```

##########
File path: arrow/src/array/transform/utils.rs
##########
@@ -35,15 +42,37 @@ pub(super) fn set_bits(
     offset_read: usize,
     len: usize,
 ) -> usize {

Review comment:
       It would be cool to update the doc strings here too to explain what the return value of `set_bits` is supposed to be (seems like it is supposed to be the total number of `0` bits in  `data[offset_read..offset_read+len]`

##########
File path: arrow/src/array/transform/utils.rs
##########
@@ -35,15 +42,37 @@ pub(super) fn set_bits(
     offset_read: usize,
     len: usize,
 ) -> usize {
-    let mut count = 0;
-    (0..len).for_each(|i| {
-        if bit_util::get_bit(data, offset_read + i) {
-            bit_util::set_bit(write_data, offset_write + i);
-        } else {
-            count += 1;
-        }
+    let mut null_count = 0;
+
+    let mut bits_to_align = offset_write % 8;
+    if bits_to_align > 0 {
+        bits_to_align = std::cmp::min(len, 8 - bits_to_align);
+    }
+    let mut byte_index = ceil(offset_write + bits_to_align, 8);
+
+    // Set full bytes provided by bit chunk iterator

Review comment:
       ```suggestion
       // Set full bytes provided by bit chunk iterator (which iterates in 64 bits at a time)
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] codecov-commenter edited a comment on pull request #716: Optimize array::transform::utils::set_bits

Posted by GitBox <gi...@apache.org>.
codecov-commenter edited a comment on pull request #716:
URL: https://github.com/apache/arrow-rs/pull/716#issuecomment-906739846


   # [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/716?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#716](https://codecov.io/gh/apache/arrow-rs/pull/716?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (ba6a71d) into [master](https://codecov.io/gh/apache/arrow-rs/commit/8308615d40e14caa5cdbee118ecc2f46696b920f?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (8308615) will **increase** coverage by `0.07%`.
   > The diff coverage is `100.00%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow-rs/pull/716/graphs/tree.svg?width=650&height=150&src=pr&token=pq9V9qWZ1N&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/arrow-rs/pull/716?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master     #716      +/-   ##
   ==========================================
   + Coverage   82.46%   82.53%   +0.07%     
   ==========================================
     Files         168      168              
     Lines       47419    47705     +286     
   ==========================================
   + Hits        39104    39375     +271     
   - Misses       8315     8330      +15     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow-rs/pull/716?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [arrow/src/array/transform/utils.rs](https://codecov.io/gh/apache/arrow-rs/pull/716/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FycmF5L3RyYW5zZm9ybS91dGlscy5ycw==) | `98.71% <100.00%> (+3.71%)` | :arrow_up: |
   | [arrow/src/compute/kernels/comparison.rs](https://codecov.io/gh/apache/arrow-rs/pull/716/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2NvbXB1dGUva2VybmVscy9jb21wYXJpc29uLnJz) | `95.08% <0.00%> (-0.76%)` | :arrow_down: |
   | [parquet\_derive\_test/src/lib.rs](https://codecov.io/gh/apache/arrow-rs/pull/716/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldF9kZXJpdmVfdGVzdC9zcmMvbGliLnJz) | `100.00% <0.00%> (ø)` | |
   | [arrow/src/array/array\_binary.rs](https://codecov.io/gh/apache/arrow-rs/pull/716/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FycmF5L2FycmF5X2JpbmFyeS5ycw==) | `92.46% <0.00%> (+0.23%)` | :arrow_up: |
   | [parquet/src/arrow/arrow\_array\_reader.rs](https://codecov.io/gh/apache/arrow-rs/pull/716/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvYXJyb3cvYXJyb3dfYXJyYXlfcmVhZGVyLnJz) | `79.71% <0.00%> (+1.28%)` | :arrow_up: |
   | [parquet\_derive/src/parquet\_field.rs](https://codecov.io/gh/apache/arrow-rs/pull/716/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldF9kZXJpdmUvc3JjL3BhcnF1ZXRfZmllbGQucnM=) | `68.86% <0.00%> (+2.69%)` | :arrow_up: |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow-rs/pull/716?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/716?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [8308615...ba6a71d](https://codecov.io/gh/apache/arrow-rs/pull/716?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org