You are viewing a plain text version of this content. The canonical link for it is here.

Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/12/12 12:34:03 UTC

[GitHub] [arrow] Dandandan opened a new pull request #8900: ARROW-10810: Comparison kernels performance

Dandandan opened a new pull request #8900:
URL: https://github.com/apache/arrow/pull/8900


   This PR shows that there is still about a ~2x performance (compared to ~8x earlier) difference between using a builder vs using a mutable buffer directly after https://github.com/apache/arrow/pull/8842 .
   This also account for a ~5% difference on some queries (when not using the simd feature, where the implementation doesn't use the buffer). Also the bounds checks are a bit expensive. In some `value` functions they are explicitly not there whereas in other (like for string) they are there.
   
   I guess there will be always _some_ overhead in the builder as it does need to do some bookkeeping, but I think it's a good idea to see how we can write kernels while not losing too much performance.
   
   FYI @jorgecarleitao 
   
   ```
   Gnuplot not found, using plotters backend
   eq Float32              time:   [107.02 us 107.29 us 107.60 us]                       
                           change: [-54.994% -54.839% -54.681%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 2 outliers among 100 measurements (2.00%)
     1 (1.00%) high mild
     1 (1.00%) high severe
   
   eq scalar Float32       time:   [70.271 us 70.356 us 70.446 us]                              
                           change: [-48.540% -48.392% -48.258%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 4 outliers among 100 measurements (4.00%)
     2 (2.00%) high mild
     2 (2.00%) high severe
   
   neq Float32             time:   [71.580 us 71.655 us 71.732 us]                        
                           change: [-58.072% -58.001% -57.931%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 7 outliers among 100 measurements (7.00%)
     1 (1.00%) low mild
     4 (4.00%) high mild
     2 (2.00%) high severe
   
   neq scalar Float32      time:   [70.011 us 70.079 us 70.155 us]                               
                           change: [-59.055% -58.980% -58.908%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 5 outliers among 100 measurements (5.00%)
     2 (2.00%) low severe
     3 (3.00%) high mild
   
   lt Float32              time:   [70.945 us 70.991 us 71.038 us]                       
                           change: [-55.834% -55.757% -55.683%] (p = 0.00 < 0.05)
                           Performance has improved.
   
   lt scalar Float32       time:   [50.708 us 50.789 us 50.882 us]                               
                           change: [-62.939% -62.825% -62.689%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 5 outliers among 100 measurements (5.00%)
     3 (3.00%) high mild
     2 (2.00%) high severe
   
   lt_eq Float32           time:   [106.29 us 106.40 us 106.52 us]                          
                           change: [-42.593% -42.470% -42.350%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 6 outliers among 100 measurements (6.00%)
     1 (1.00%) low severe
     1 (1.00%) low mild
     2 (2.00%) high mild
     2 (2.00%) high severe
   
   lt_eq scalar Float32    time:   [71.089 us 71.170 us 71.261 us]                                 
                           change: [-52.021% -51.941% -51.857%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 2 outliers among 100 measurements (2.00%)
     1 (1.00%) high mild
     1 (1.00%) high severe
   
   gt Float32              time:   [71.759 us 71.939 us 72.131 us]                       
                           change: [-58.319% -58.190% -58.067%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 7 outliers among 100 measurements (7.00%)
     5 (5.00%) high mild
     2 (2.00%) high severe
   
   gt scalar Float32       time:   [38.748 us 38.782 us 38.821 us]                               
                           change: [-73.757% -73.691% -73.624%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 5 outliers among 100 measurements (5.00%)
     4 (4.00%) high mild
     1 (1.00%) high severe
   
   gt_eq Float32           time:   [102.79 us 102.87 us 102.96 us]                          
                           change: [-53.103% -52.953% -52.805%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 8 outliers among 100 measurements (8.00%)
     1 (1.00%) low severe
     4 (4.00%) high mild
     3 (3.00%) high severe
   
   gt_eq scalar Float32    time:   [55.034 us 55.109 us 55.201 us]                                 
                           change: [-59.706% -59.544% -59.381%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 7 outliers among 100 measurements (7.00%)
     7 (7.00%) high mild
   ```


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] Dandandan commented on pull request #8900: ARROW-10810: [Rust] Improve comparison kernels performance

Posted by GitBox <gi...@apache.org>.

Dandandan commented on pull request #8900:
URL: https://github.com/apache/arrow/pull/8900#issuecomment-744033619


   @alamb 
   
   Took it a bit further and the `assert!` is also having the same effect on your benchmark:
   
   commit 23c8ff28e56ccb381bcbb321dcbbb946d8fd7db0
   
   ```
   Hello, world!
   example_with_vec
   created array with 20000000 elements in 625.670877ms
   Completed finding bitset: 20000000 elements in 62.028751ms
   Completed finding bitset: 20000000 elements in 61.464365ms
   Completed finding bitset: 20000000 elements in 59.990842ms
   Completed finding bitset: 20000000 elements in 56.594087ms
   Completed finding bitset: 20000000 elements in 55.969562ms
   Completed finding bitset: 20000000 elements in 56.594535ms
   Completed finding bitset: 20000000 elements in 55.987112ms
   Completed finding bitset: 20000000 elements in 56.379664ms
   Completed finding bitset: 20000000 elements in 56.47164ms
   Completed finding bitset: 20000000 elements in 56.342637ms
   example_with_arrow
   created array with 20000000 elements in 773.245663ms
   Found 20000000 not in west in 72.449124ms
   Found 20000000 not in west in 74.481405ms
   Found 20000000 not in west in 75.160204ms
   Found 20000000 not in west in 75.033379ms
   Found 20000000 not in west in 75.064686ms
   Found 20000000 not in west in 75.221221ms
   Found 20000000 not in west in 75.979403ms
   Found 20000000 not in west in 76.288346ms
   Found 20000000 not in west in 74.639085ms
   Found 20000000 not in west in 73.664288ms
   ```
   
   commit 3b67f70d670c8ba76f327fb4c8a757155d021f67 :
   
   ```
   Hello, world!
   example_with_vec
   created array with 20000000 elements in 550.194933ms
   Completed finding bitset: 20000000 elements in 58.936386ms
   Completed finding bitset: 20000000 elements in 57.900239ms
   Completed finding bitset: 20000000 elements in 54.868759ms
   Completed finding bitset: 20000000 elements in 52.913171ms
   Completed finding bitset: 20000000 elements in 52.707473ms
   Completed finding bitset: 20000000 elements in 53.600913ms
   Completed finding bitset: 20000000 elements in 53.315755ms
   Completed finding bitset: 20000000 elements in 53.188747ms
   Completed finding bitset: 20000000 elements in 52.896293ms
   Completed finding bitset: 20000000 elements in 54.134426ms
   example_with_arrow
   created array with 20000000 elements in 666.080968ms
   Found 20000000 not in west in 58.209991ms
   Found 20000000 not in west in 58.496839ms
   Found 20000000 not in west in 58.294964ms
   Found 20000000 not in west in 57.771822ms
   Found 20000000 not in west in 58.266169ms
   Found 20000000 not in west in 58.028011ms
   Found 20000000 not in west in 58.431954ms
   Found 20000000 not in west in 58.545879ms
   Found 20000000 not in west in 58.131794ms
   Found 20000000 not in west in 59.211415ms
   ```
   
   
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] codecov-io commented on pull request #8900: ARROW-10810: Comparison kernels performance

Posted by GitBox <gi...@apache.org>.

codecov-io commented on pull request #8900:
URL: https://github.com/apache/arrow/pull/8900#issuecomment-743750244


   # [Codecov](https://codecov.io/gh/apache/arrow/pull/8900?src=pr&el=h1) Report
   > Merging [#8900](https://codecov.io/gh/apache/arrow/pull/8900?src=pr&el=desc) (c64e6f0) into [master](https://codecov.io/gh/apache/arrow/commit/1378c20bad350dba484ae21aa8a2588f44012452?el=desc) (1378c20) will **increase** coverage by `22.84%`.
   > The diff coverage is `100.00%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/8900/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/8900?src=pr&el=tree)
   
   ```diff
   @@             Coverage Diff             @@
   ##           master    #8900       +/-   ##
   ===========================================
   + Coverage   53.96%   76.81%   +22.84%     
   ===========================================
     Files         170      181       +11     
     Lines       30707    40987    +10280     
   ===========================================
   + Hits        16571    31483    +14912     
   + Misses      14136     9504     -4632     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow/pull/8900?src=pr&el=tree) | Coverage Δ | |
   |---|---|---|
   | [rust/arrow/src/compute/kernels/comparison.rs](https://codecov.io/gh/apache/arrow/pull/8900/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvY29tcHV0ZS9rZXJuZWxzL2NvbXBhcmlzb24ucnM=) | `96.27% <ø> (ø)` | |
   | [rust/arrow/src/array/array\_binary.rs](https://codecov.io/gh/apache/arrow/pull/8900/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvYXJyYXlfYmluYXJ5LnJz) | `90.47% <100.00%> (ø)` | |
   | [rust/arrow/src/array/array\_string.rs](https://codecov.io/gh/apache/arrow/pull/8900/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvYXJyYXlfc3RyaW5nLnJz) | `90.37% <100.00%> (+1.06%)` | :arrow_up: |
   | [...ntegration-testing/src/bin/arrow-stream-to-file.rs](https://codecov.io/gh/apache/arrow/pull/8900/diff?src=pr&el=tree#diff-cnVzdC9pbnRlZ3JhdGlvbi10ZXN0aW5nL3NyYy9iaW4vYXJyb3ctc3RyZWFtLXRvLWZpbGUucnM=) | `0.00% <0.00%> (ø)` | |
   | [rust/parquet/src/bin/parquet-schema.rs](https://codecov.io/gh/apache/arrow/pull/8900/diff?src=pr&el=tree#diff-cnVzdC9wYXJxdWV0L3NyYy9iaW4vcGFycXVldC1zY2hlbWEucnM=) | `0.00% <0.00%> (ø)` | |
   | [rust/parquet/src/util/test\_common/file\_util.rs](https://codecov.io/gh/apache/arrow/pull/8900/diff?src=pr&el=tree#diff-cnVzdC9wYXJxdWV0L3NyYy91dGlsL3Rlc3RfY29tbW9uL2ZpbGVfdXRpbC5ycw==) | `77.77% <0.00%> (ø)` | |
   | [rust/parquet/src/file/footer.rs](https://codecov.io/gh/apache/arrow/pull/8900/diff?src=pr&el=tree#diff-cnVzdC9wYXJxdWV0L3NyYy9maWxlL2Zvb3Rlci5ycw==) | `96.22% <0.00%> (ø)` | |
   | [...ion-testing/src/bin/arrow-json-integration-test.rs](https://codecov.io/gh/apache/arrow/pull/8900/diff?src=pr&el=tree#diff-cnVzdC9pbnRlZ3JhdGlvbi10ZXN0aW5nL3NyYy9iaW4vYXJyb3ctanNvbi1pbnRlZ3JhdGlvbi10ZXN0LnJz) | `0.00% <0.00%> (ø)` | |
   | [rust/parquet/src/util/hash\_util.rs](https://codecov.io/gh/apache/arrow/pull/8900/diff?src=pr&el=tree#diff-cnVzdC9wYXJxdWV0L3NyYy91dGlsL2hhc2hfdXRpbC5ycw==) | `95.77% <0.00%> (ø)` | |
   | [...ntegration-testing/src/bin/arrow-file-to-stream.rs](https://codecov.io/gh/apache/arrow/pull/8900/diff?src=pr&el=tree#diff-cnVzdC9pbnRlZ3JhdGlvbi10ZXN0aW5nL3NyYy9iaW4vYXJyb3ctZmlsZS10by1zdHJlYW0ucnM=) | `0.00% <0.00%> (ø)` | |
   | ... and [54 more](https://codecov.io/gh/apache/arrow/pull/8900/diff?src=pr&el=tree-more) | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/8900?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/8900?src=pr&el=footer). Last update [1378c20...c64e6f0](https://codecov.io/gh/apache/arrow/pull/8900?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] Dandandan commented on a change in pull request #8900: ARROW-10810: [Rust] Comparison kernels performance

Posted by GitBox <gi...@apache.org>.

Dandandan commented on a change in pull request #8900:
URL: https://github.com/apache/arrow/pull/8900#discussion_r541769546



##########
File path: rust/arrow/src/array/array_binary.rs
##########
@@ -317,7 +317,7 @@ pub struct FixedSizeBinaryArray {
 impl FixedSizeBinaryArray {
     /// Returns the element at index `i` as a byte slice.
     pub fn value(&self, i: usize) -> &[u8] {
-        assert!(
+        debug_assert!(

Review comment:
       I think we can think more about safe/unsafe and being more clear about it. I will move the partial change from this PR for now. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] Dandandan commented on pull request #8900: ARROW-10810: [Rust] Comparison kernels performance

Posted by GitBox <gi...@apache.org>.

Dandandan commented on pull request #8900:
URL: https://github.com/apache/arrow/pull/8900#issuecomment-743831579


   Will revert the `assert` changes tomorrow, will create a ticket for the iterators over `T` to avoid bounds checking.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] github-actions[bot] commented on pull request #8900: ARROW-10810: Comparison kernels performance

Posted by GitBox <gi...@apache.org>.

github-actions[bot] commented on pull request #8900:
URL: https://github.com/apache/arrow/pull/8900#issuecomment-743749965


   https://issues.apache.org/jira/browse/ARROW-10810


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] codecov-io edited a comment on pull request #8900: ARROW-10810: [Rust] Comparison kernels performance

Posted by GitBox <gi...@apache.org>.

codecov-io edited a comment on pull request #8900:
URL: https://github.com/apache/arrow/pull/8900#issuecomment-743750244


   # [Codecov](https://codecov.io/gh/apache/arrow/pull/8900?src=pr&el=h1) Report
   > Merging [#8900](https://codecov.io/gh/apache/arrow/pull/8900?src=pr&el=desc) (23c8ff2) into [master](https://codecov.io/gh/apache/arrow/commit/1378c20bad350dba484ae21aa8a2588f44012452?el=desc) (1378c20) will **increase** coverage by `22.81%`.
   > The diff coverage is `n/a`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/8900/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/8900?src=pr&el=tree)
   
   ```diff
   @@             Coverage Diff             @@
   ##           master    #8900       +/-   ##
   ===========================================
   + Coverage   53.96%   76.77%   +22.81%     
   ===========================================
     Files         170      181       +11     
     Lines       30707    41009    +10302     
   ===========================================
   + Hits        16571    31485    +14914     
   + Misses      14136     9524     -4612     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow/pull/8900?src=pr&el=tree) | Coverage Δ | |
   |---|---|---|
   | [rust/arrow/src/compute/kernels/comparison.rs](https://codecov.io/gh/apache/arrow/pull/8900/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvY29tcHV0ZS9rZXJuZWxzL2NvbXBhcmlzb24ucnM=) | `96.27% <ø> (ø)` | |
   | [rust/arrow/src/ipc/writer.rs](https://codecov.io/gh/apache/arrow/pull/8900/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvaXBjL3dyaXRlci5ycw==) | `83.65% <0.00%> (-3.44%)` | :arrow_down: |
   | [rust/arrow/src/json/reader.rs](https://codecov.io/gh/apache/arrow/pull/8900/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvanNvbi9yZWFkZXIucnM=) | `83.17% <0.00%> (-0.09%)` | :arrow_down: |
   | [rust/arrow/src/buffer.rs](https://codecov.io/gh/apache/arrow/pull/8900/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYnVmZmVyLnJz) | `97.96% <0.00%> (ø)` | |
   | [rust/arrow/src/datatypes.rs](https://codecov.io/gh/apache/arrow/pull/8900/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvZGF0YXR5cGVzLnJz) | `76.85% <0.00%> (ø)` | |
   | [rust/arrow-flight/src/utils.rs](https://codecov.io/gh/apache/arrow/pull/8900/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy1mbGlnaHQvc3JjL3V0aWxzLnJz) | `0.00% <0.00%> (ø)` | |
   | [rust/datafusion/examples/flight\_server.rs](https://codecov.io/gh/apache/arrow/pull/8900/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL2V4YW1wbGVzL2ZsaWdodF9zZXJ2ZXIucnM=) | `0.00% <0.00%> (ø)` | |
   | [rust/arrow/src/compute/kernels/arithmetic.rs](https://codecov.io/gh/apache/arrow/pull/8900/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvY29tcHV0ZS9rZXJuZWxzL2FyaXRobWV0aWMucnM=) | `98.45% <0.00%> (ø)` | |
   | [...ion-testing/src/bin/arrow-json-integration-test.rs](https://codecov.io/gh/apache/arrow/pull/8900/diff?src=pr&el=tree#diff-cnVzdC9pbnRlZ3JhdGlvbi10ZXN0aW5nL3NyYy9iaW4vYXJyb3ctanNvbi1pbnRlZ3JhdGlvbi10ZXN0LnJz) | `0.00% <0.00%> (ø)` | |
   | [...ntegration-testing/src/bin/arrow-stream-to-file.rs](https://codecov.io/gh/apache/arrow/pull/8900/diff?src=pr&el=tree#diff-cnVzdC9pbnRlZ3JhdGlvbi10ZXN0aW5nL3NyYy9iaW4vYXJyb3ctc3RyZWFtLXRvLWZpbGUucnM=) | `0.00% <0.00%> (ø)` | |
   | ... and [60 more](https://codecov.io/gh/apache/arrow/pull/8900/diff?src=pr&el=tree-more) | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/8900?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/8900?src=pr&el=footer). Last update [1378c20...23c8ff2](https://codecov.io/gh/apache/arrow/pull/8900?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] Dandandan commented on a change in pull request #8900: ARROW-10810: [Rust] Comparison kernels performance

Posted by GitBox <gi...@apache.org>.

Dandandan commented on a change in pull request #8900:
URL: https://github.com/apache/arrow/pull/8900#discussion_r541769546



##########
File path: rust/arrow/src/array/array_binary.rs
##########
@@ -317,7 +317,7 @@ pub struct FixedSizeBinaryArray {
 impl FixedSizeBinaryArray {
     /// Returns the element at index `i` as a byte slice.
     pub fn value(&self, i: usize) -> &[u8] {
-        assert!(
+        debug_assert!(

Review comment:
       I think we can think more about safe/unsafe and being more clear about it. I will move this change from this PR for now. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] Dandandan commented on a change in pull request #8900: ARROW-10810: [Rust] Comparison kernels performance

Posted by GitBox <gi...@apache.org>.

Dandandan commented on a change in pull request #8900:
URL: https://github.com/apache/arrow/pull/8900#discussion_r541772220



##########
File path: rust/arrow/src/compute/kernels/comparison.rs
##########
@@ -47,9 +47,18 @@ macro_rules! compare_op {
         let null_bit_buffer =
             combine_option_bitmap($left.data_ref(), $right.data_ref(), $left.len())?;
 
-        let mut result = BooleanBufferBuilder::new($left.len());
+        let byte_capacity = bit_util::ceil($left.len(), 8);
+        let actual_capacity = bit_util::round_upto_multiple_of_64(byte_capacity);
+        let mut buffer = MutableBuffer::new(actual_capacity);
+        buffer.resize(byte_capacity);
+        let data = buffer.raw_data_mut();
+
         for i in 0..$left.len() {
-            result.append($op($left.value(i), $right.value(i)))?;
+            if $op($left.value(i), $right.value(i)) {
+                unsafe {

Review comment:
       I will revert the change with the asserts for now. I think the `value` functions should be marked unsafe that don't perform bound checks, and other functions that can trigger UB. The macro's themselves should be "relatively" safe I think, as long as the `.len()` value is correct.
   I think a better solution for the future would be to have a safe iterator that doesn't do bound checking, so I think it's better to move the particular change out of this PR for now.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] Dandandan commented on pull request #8900: ARROW-10810: [Rust] Improve comparison kernels performance

Posted by GitBox <gi...@apache.org>.

Dandandan commented on pull request #8900:
URL: https://github.com/apache/arrow/pull/8900#issuecomment-744024794


   Cool to know / read @alamb to see that we are getting closer to "optimal" performance on a single thread. The PRs I create are also based on profiling DataFusion, showing that it has "real world" improvement on those changes as well besides getting good result on the benchmarks. I hope they are useful for influxdb as well.
   
   The docs / presentation are looking very interesting, it probably isn't recorded?
   
   As mentioned there is for some datatypes (like strings) there is some overhead still related to `assert!` that I reverted for now. Probably the bitmap code also could be improved which could close the gap. I think it is cool to compare it against a "native" rust implementation.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] nevi-me closed pull request #8900: ARROW-10810: [Rust] Improve comparison kernels performance

Posted by GitBox <gi...@apache.org>.

nevi-me closed pull request #8900:
URL: https://github.com/apache/arrow/pull/8900


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] andygrove commented on a change in pull request #8900: ARROW-10810: [Rust] Comparison kernels performance

Posted by GitBox <gi...@apache.org>.

andygrove commented on a change in pull request #8900:
URL: https://github.com/apache/arrow/pull/8900#discussion_r541727214



##########
File path: rust/arrow/src/compute/kernels/comparison.rs
##########
@@ -47,9 +47,18 @@ macro_rules! compare_op {
         let null_bit_buffer =
             combine_option_bitmap($left.data_ref(), $right.data_ref(), $left.len())?;
 
-        let mut result = BooleanBufferBuilder::new($left.len());
+        let byte_capacity = bit_util::ceil($left.len(), 8);
+        let actual_capacity = bit_util::round_upto_multiple_of_64(byte_capacity);
+        let mut buffer = MutableBuffer::new(actual_capacity);
+        buffer.resize(byte_capacity);
+        let data = buffer.raw_data_mut();
+
         for i in 0..$left.len() {
-            result.append($op($left.value(i), $right.value(i)))?;
+            if $op($left.value(i), $right.value(i)) {
+                unsafe {

Review comment:
       Functions that use this macro should now be declared `unsafe`?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] andygrove commented on a change in pull request #8900: ARROW-10810: [Rust] Comparison kernels performance

Posted by GitBox <gi...@apache.org>.

andygrove commented on a change in pull request #8900:
URL: https://github.com/apache/arrow/pull/8900#discussion_r541726473



##########
File path: rust/arrow/src/array/array_binary.rs
##########
@@ -317,7 +317,7 @@ pub struct FixedSizeBinaryArray {
 impl FixedSizeBinaryArray {
     /// Returns the element at index `i` as a byte slice.
     pub fn value(&self, i: usize) -> &[u8] {
-        assert!(
+        debug_assert!(

Review comment:
       Does this change mean that we should mark the function as `unsafe`? Should we provide safe and unsafe versions of these functions?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] Dandandan commented on pull request #8900: ARROW-10810: [Rust] Comparison kernels performance

Posted by GitBox <gi...@apache.org>.

Dandandan commented on pull request #8900:
URL: https://github.com/apache/arrow/pull/8900#issuecomment-743974209


   @andygrove @jorgecarleitao 
   Reverted the assert changes and created 
   https://issues.apache.org/jira/browse/ARROW-10892 for the iterator.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] codecov-io edited a comment on pull request #8900: ARROW-10810: Comparison kernels performance

Posted by GitBox <gi...@apache.org>.

codecov-io edited a comment on pull request #8900:
URL: https://github.com/apache/arrow/pull/8900#issuecomment-743750244


   # [Codecov](https://codecov.io/gh/apache/arrow/pull/8900?src=pr&el=h1) Report
   > Merging [#8900](https://codecov.io/gh/apache/arrow/pull/8900?src=pr&el=desc) (3b67f70) into [master](https://codecov.io/gh/apache/arrow/commit/1378c20bad350dba484ae21aa8a2588f44012452?el=desc) (1378c20) will **not change** coverage.
   > The diff coverage is `100.00%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/8900/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/8900?src=pr&el=tree)
   
   ```diff
   @@           Coverage Diff           @@
   ##           master    #8900   +/-   ##
   =======================================
     Coverage   53.96%   53.96%           
   =======================================
     Files         170      170           
     Lines       30707    30707           
   =======================================
     Hits        16571    16571           
     Misses      14136    14136           
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow/pull/8900?src=pr&el=tree) | Coverage Δ | |
   |---|---|---|
   | [rust/arrow/src/compute/kernels/comparison.rs](https://codecov.io/gh/apache/arrow/pull/8900/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvY29tcHV0ZS9rZXJuZWxzL2NvbXBhcmlzb24ucnM=) | `96.27% <ø> (ø)` | |
   | [rust/arrow/src/array/array\_binary.rs](https://codecov.io/gh/apache/arrow/pull/8900/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvYXJyYXlfYmluYXJ5LnJz) | `90.47% <100.00%> (ø)` | |
   | [rust/arrow/src/array/array\_string.rs](https://codecov.io/gh/apache/arrow/pull/8900/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvYXJyYXlfc3RyaW5nLnJz) | `89.30% <100.00%> (ø)` | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/8900?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/8900?src=pr&el=footer). Last update [1378c20...3b67f70](https://codecov.io/gh/apache/arrow/pull/8900?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] alamb commented on pull request #8900: ARROW-10810: [Rust] Comparison kernels performance

Posted by GitBox <gi...@apache.org>.

alamb commented on pull request #8900:
URL: https://github.com/apache/arrow/pull/8900#issuecomment-743998066


   FWIW I tested this branch with a local micro benchmark I had and it shows significant improvement. I wrote up some details here: https://docs.google.com/document/d/15DRpIr1EUqo7zVR1psBhoX95ML1PO55kkHwt2EasM88/edit#heading=h.pfa677dojqtv
   
   Backstory was I was looking at the performance of these kernels (inspired by @rdettai ) as `utf8_neq_scalar` came up in a talk I was giving last week. What a wonderful surprise that there is a PR actively improving them (and it happens to also fit the narrative of my talk well!)
   
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] Dandandan commented on pull request #8900: ARROW-10810: [Rust] Improve comparison kernels performance

Posted by GitBox <gi...@apache.org>.

Dandandan commented on pull request #8900:
URL: https://github.com/apache/arrow/pull/8900#issuecomment-744025315


   Ah just saw the link to the YouTube video, thanks @alamb 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] Dandandan edited a comment on pull request #8900: ARROW-10810: [Rust] Improve comparison kernels performance

Posted by GitBox <gi...@apache.org>.

Dandandan edited a comment on pull request #8900:
URL: https://github.com/apache/arrow/pull/8900#issuecomment-744024794


   Cool to know / read @alamb to see that we are getting closer to "optimal" performance on a single thread. The PRs I create are also based on profiling DataFusion, showing that it has "real world" improvement on those changes as well besides getting good result on the benchmarks. I hope they are useful for influxdb as well.
   
   The docs / presentation are looking very interesting, it probably isn't recorded?
   
   As mentioned there is for some datatypes (like strings) there is some overhead still related to the `assert!`s that I reverted for now. Probably the bitmap code also could be improved which could close the gap. I think it is cool to compare it against a "native" rust implementation.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] jorgecarleitao commented on a change in pull request #8900: ARROW-10810: [Rust] Comparison kernels performance

Posted by GitBox <gi...@apache.org>.

jorgecarleitao commented on a change in pull request #8900:
URL: https://github.com/apache/arrow/pull/8900#discussion_r541774429



##########
File path: rust/arrow/src/compute/kernels/comparison.rs
##########
@@ -47,9 +47,18 @@ macro_rules! compare_op {
         let null_bit_buffer =
             combine_option_bitmap($left.data_ref(), $right.data_ref(), $left.len())?;
 
-        let mut result = BooleanBufferBuilder::new($left.len());
+        let byte_capacity = bit_util::ceil($left.len(), 8);
+        let actual_capacity = bit_util::round_upto_multiple_of_64(byte_capacity);
+        let mut buffer = MutableBuffer::new(actual_capacity);
+        buffer.resize(byte_capacity);
+        let data = buffer.raw_data_mut();
+
         for i in 0..$left.len() {
-            result.append($op($left.value(i), $right.value(i)))?;
+            if $op($left.value(i), $right.value(i)) {
+                unsafe {

Review comment:
       I agree, @Dandandan . Note that all primitives and strings have iterators and `FromIterator`: https://github.com/apache/arrow/blob/master/rust/arrow/src/array/iterator.rs , but they are for `Option<T>`, not `T`.
   
   I agree with you that we should mark that `fn value` as `unsafe` and offer an iterator over `T` (besides the one over `Option<T>`). That UB is really obvious and it is also a security vulnerability causing an escalation of privileges as it allows privileged access to the application's memory via out of bounds accesses.
   
   I usually see the iterator over `T` when they can mask the result or OR / AND the null bitmaps, while `Option<T>` is used when that is not possible / useful.
   
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org