You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/12/12 12:34:03 UTC
[GitHub] [arrow] Dandandan opened a new pull request #8900: ARROW-10810: Comparison kernels performance
Dandandan opened a new pull request #8900:
URL: https://github.com/apache/arrow/pull/8900
This PR shows that there is still about a ~2x performance (compared to ~8x earlier) difference between using a builder vs using a mutable buffer directly after https://github.com/apache/arrow/pull/8842 .
This also account for a ~5% difference on some queries (when not using the simd feature, where the implementation doesn't use the buffer). Also the bounds checks are a bit expensive. In some `value` functions they are explicitly not there whereas in other (like for string) they are there.
I guess there will be always _some_ overhead in the builder as it does need to do some bookkeeping, but I think it's a good idea to see how we can write kernels while not losing too much performance.
FYI @jorgecarleitao
```
Gnuplot not found, using plotters backend
eq Float32 time: [107.02 us 107.29 us 107.60 us]
change: [-54.994% -54.839% -54.681%] (p = 0.00 < 0.05)
Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
1 (1.00%) high mild
1 (1.00%) high severe
eq scalar Float32 time: [70.271 us 70.356 us 70.446 us]
change: [-48.540% -48.392% -48.258%] (p = 0.00 < 0.05)
Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
2 (2.00%) high mild
2 (2.00%) high severe
neq Float32 time: [71.580 us 71.655 us 71.732 us]
change: [-58.072% -58.001% -57.931%] (p = 0.00 < 0.05)
Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
1 (1.00%) low mild
4 (4.00%) high mild
2 (2.00%) high severe
neq scalar Float32 time: [70.011 us 70.079 us 70.155 us]
change: [-59.055% -58.980% -58.908%] (p = 0.00 < 0.05)
Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
2 (2.00%) low severe
3 (3.00%) high mild
lt Float32 time: [70.945 us 70.991 us 71.038 us]
change: [-55.834% -55.757% -55.683%] (p = 0.00 < 0.05)
Performance has improved.
lt scalar Float32 time: [50.708 us 50.789 us 50.882 us]
change: [-62.939% -62.825% -62.689%] (p = 0.00 < 0.05)
Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
3 (3.00%) high mild
2 (2.00%) high severe
lt_eq Float32 time: [106.29 us 106.40 us 106.52 us]
change: [-42.593% -42.470% -42.350%] (p = 0.00 < 0.05)
Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
1 (1.00%) low severe
1 (1.00%) low mild
2 (2.00%) high mild
2 (2.00%) high severe
lt_eq scalar Float32 time: [71.089 us 71.170 us 71.261 us]
change: [-52.021% -51.941% -51.857%] (p = 0.00 < 0.05)
Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
1 (1.00%) high mild
1 (1.00%) high severe
gt Float32 time: [71.759 us 71.939 us 72.131 us]
change: [-58.319% -58.190% -58.067%] (p = 0.00 < 0.05)
Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
5 (5.00%) high mild
2 (2.00%) high severe
gt scalar Float32 time: [38.748 us 38.782 us 38.821 us]
change: [-73.757% -73.691% -73.624%] (p = 0.00 < 0.05)
Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
4 (4.00%) high mild
1 (1.00%) high severe
gt_eq Float32 time: [102.79 us 102.87 us 102.96 us]
change: [-53.103% -52.953% -52.805%] (p = 0.00 < 0.05)
Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
1 (1.00%) low severe
4 (4.00%) high mild
3 (3.00%) high severe
gt_eq scalar Float32 time: [55.034 us 55.109 us 55.201 us]
change: [-59.706% -59.544% -59.381%] (p = 0.00 < 0.05)
Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
7 (7.00%) high mild
```
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] Dandandan commented on pull request #8900: ARROW-10810: [Rust] Improve comparison kernels performance
Posted by GitBox <gi...@apache.org>.
Dandandan commented on pull request #8900:
URL: https://github.com/apache/arrow/pull/8900#issuecomment-744033619
@alamb
Took it a bit further and the `assert!` is also having the same effect on your benchmark:
commit 23c8ff28e56ccb381bcbb321dcbbb946d8fd7db0
```
Hello, world!
example_with_vec
created array with 20000000 elements in 625.670877ms
Completed finding bitset: 20000000 elements in 62.028751ms
Completed finding bitset: 20000000 elements in 61.464365ms
Completed finding bitset: 20000000 elements in 59.990842ms
Completed finding bitset: 20000000 elements in 56.594087ms
Completed finding bitset: 20000000 elements in 55.969562ms
Completed finding bitset: 20000000 elements in 56.594535ms
Completed finding bitset: 20000000 elements in 55.987112ms
Completed finding bitset: 20000000 elements in 56.379664ms
Completed finding bitset: 20000000 elements in 56.47164ms
Completed finding bitset: 20000000 elements in 56.342637ms
example_with_arrow
created array with 20000000 elements in 773.245663ms
Found 20000000 not in west in 72.449124ms
Found 20000000 not in west in 74.481405ms
Found 20000000 not in west in 75.160204ms
Found 20000000 not in west in 75.033379ms
Found 20000000 not in west in 75.064686ms
Found 20000000 not in west in 75.221221ms
Found 20000000 not in west in 75.979403ms
Found 20000000 not in west in 76.288346ms
Found 20000000 not in west in 74.639085ms
Found 20000000 not in west in 73.664288ms
```
commit 3b67f70d670c8ba76f327fb4c8a757155d021f67 :
```
Hello, world!
example_with_vec
created array with 20000000 elements in 550.194933ms
Completed finding bitset: 20000000 elements in 58.936386ms
Completed finding bitset: 20000000 elements in 57.900239ms
Completed finding bitset: 20000000 elements in 54.868759ms
Completed finding bitset: 20000000 elements in 52.913171ms
Completed finding bitset: 20000000 elements in 52.707473ms
Completed finding bitset: 20000000 elements in 53.600913ms
Completed finding bitset: 20000000 elements in 53.315755ms
Completed finding bitset: 20000000 elements in 53.188747ms
Completed finding bitset: 20000000 elements in 52.896293ms
Completed finding bitset: 20000000 elements in 54.134426ms
example_with_arrow
created array with 20000000 elements in 666.080968ms
Found 20000000 not in west in 58.209991ms
Found 20000000 not in west in 58.496839ms
Found 20000000 not in west in 58.294964ms
Found 20000000 not in west in 57.771822ms
Found 20000000 not in west in 58.266169ms
Found 20000000 not in west in 58.028011ms
Found 20000000 not in west in 58.431954ms
Found 20000000 not in west in 58.545879ms
Found 20000000 not in west in 58.131794ms
Found 20000000 not in west in 59.211415ms
```
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] codecov-io commented on pull request #8900: ARROW-10810: Comparison kernels performance
Posted by GitBox <gi...@apache.org>.
codecov-io commented on pull request #8900:
URL: https://github.com/apache/arrow/pull/8900#issuecomment-743750244
# [Codecov](https://codecov.io/gh/apache/arrow/pull/8900?src=pr&el=h1) Report
> Merging [#8900](https://codecov.io/gh/apache/arrow/pull/8900?src=pr&el=desc) (c64e6f0) into [master](https://codecov.io/gh/apache/arrow/commit/1378c20bad350dba484ae21aa8a2588f44012452?el=desc) (1378c20) will **increase** coverage by `22.84%`.
> The diff coverage is `100.00%`.
[![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/8900/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/8900?src=pr&el=tree)
```diff
@@ Coverage Diff @@
## master #8900 +/- ##
===========================================
+ Coverage 53.96% 76.81% +22.84%
===========================================
Files 170 181 +11
Lines 30707 40987 +10280
===========================================
+ Hits 16571 31483 +14912
+ Misses 14136 9504 -4632
```
| [Impacted Files](https://codecov.io/gh/apache/arrow/pull/8900?src=pr&el=tree) | Coverage Δ | |
|---|---|---|
| [rust/arrow/src/compute/kernels/comparison.rs](https://codecov.io/gh/apache/arrow/pull/8900/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvY29tcHV0ZS9rZXJuZWxzL2NvbXBhcmlzb24ucnM=) | `96.27% <ø> (ø)` | |
| [rust/arrow/src/array/array\_binary.rs](https://codecov.io/gh/apache/arrow/pull/8900/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvYXJyYXlfYmluYXJ5LnJz) | `90.47% <100.00%> (ø)` | |
| [rust/arrow/src/array/array\_string.rs](https://codecov.io/gh/apache/arrow/pull/8900/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvYXJyYXlfc3RyaW5nLnJz) | `90.37% <100.00%> (+1.06%)` | :arrow_up: |
| [...ntegration-testing/src/bin/arrow-stream-to-file.rs](https://codecov.io/gh/apache/arrow/pull/8900/diff?src=pr&el=tree#diff-cnVzdC9pbnRlZ3JhdGlvbi10ZXN0aW5nL3NyYy9iaW4vYXJyb3ctc3RyZWFtLXRvLWZpbGUucnM=) | `0.00% <0.00%> (ø)` | |
| [rust/parquet/src/bin/parquet-schema.rs](https://codecov.io/gh/apache/arrow/pull/8900/diff?src=pr&el=tree#diff-cnVzdC9wYXJxdWV0L3NyYy9iaW4vcGFycXVldC1zY2hlbWEucnM=) | `0.00% <0.00%> (ø)` | |
| [rust/parquet/src/util/test\_common/file\_util.rs](https://codecov.io/gh/apache/arrow/pull/8900/diff?src=pr&el=tree#diff-cnVzdC9wYXJxdWV0L3NyYy91dGlsL3Rlc3RfY29tbW9uL2ZpbGVfdXRpbC5ycw==) | `77.77% <0.00%> (ø)` | |
| [rust/parquet/src/file/footer.rs](https://codecov.io/gh/apache/arrow/pull/8900/diff?src=pr&el=tree#diff-cnVzdC9wYXJxdWV0L3NyYy9maWxlL2Zvb3Rlci5ycw==) | `96.22% <0.00%> (ø)` | |
| [...ion-testing/src/bin/arrow-json-integration-test.rs](https://codecov.io/gh/apache/arrow/pull/8900/diff?src=pr&el=tree#diff-cnVzdC9pbnRlZ3JhdGlvbi10ZXN0aW5nL3NyYy9iaW4vYXJyb3ctanNvbi1pbnRlZ3JhdGlvbi10ZXN0LnJz) | `0.00% <0.00%> (ø)` | |
| [rust/parquet/src/util/hash\_util.rs](https://codecov.io/gh/apache/arrow/pull/8900/diff?src=pr&el=tree#diff-cnVzdC9wYXJxdWV0L3NyYy91dGlsL2hhc2hfdXRpbC5ycw==) | `95.77% <0.00%> (ø)` | |
| [...ntegration-testing/src/bin/arrow-file-to-stream.rs](https://codecov.io/gh/apache/arrow/pull/8900/diff?src=pr&el=tree#diff-cnVzdC9pbnRlZ3JhdGlvbi10ZXN0aW5nL3NyYy9iaW4vYXJyb3ctZmlsZS10by1zdHJlYW0ucnM=) | `0.00% <0.00%> (ø)` | |
| ... and [54 more](https://codecov.io/gh/apache/arrow/pull/8900/diff?src=pr&el=tree-more) | |
------
[Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/8900?src=pr&el=continue).
> **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
> `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
> Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/8900?src=pr&el=footer). Last update [1378c20...c64e6f0](https://codecov.io/gh/apache/arrow/pull/8900?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] Dandandan commented on a change in pull request #8900: ARROW-10810: [Rust] Comparison kernels performance
Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #8900:
URL: https://github.com/apache/arrow/pull/8900#discussion_r541769546
##########
File path: rust/arrow/src/array/array_binary.rs
##########
@@ -317,7 +317,7 @@ pub struct FixedSizeBinaryArray {
impl FixedSizeBinaryArray {
/// Returns the element at index `i` as a byte slice.
pub fn value(&self, i: usize) -> &[u8] {
- assert!(
+ debug_assert!(
Review comment:
I think we can think more about safe/unsafe and being more clear about it. I will move the partial change from this PR for now.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] Dandandan commented on pull request #8900: ARROW-10810: [Rust] Comparison kernels performance
Posted by GitBox <gi...@apache.org>.
Dandandan commented on pull request #8900:
URL: https://github.com/apache/arrow/pull/8900#issuecomment-743831579
Will revert the `assert` changes tomorrow, will create a ticket for the iterators over `T` to avoid bounds checking.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] github-actions[bot] commented on pull request #8900: ARROW-10810: Comparison kernels performance
Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #8900:
URL: https://github.com/apache/arrow/pull/8900#issuecomment-743749965
https://issues.apache.org/jira/browse/ARROW-10810
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] codecov-io edited a comment on pull request #8900: ARROW-10810: [Rust] Comparison kernels performance
Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #8900:
URL: https://github.com/apache/arrow/pull/8900#issuecomment-743750244
# [Codecov](https://codecov.io/gh/apache/arrow/pull/8900?src=pr&el=h1) Report
> Merging [#8900](https://codecov.io/gh/apache/arrow/pull/8900?src=pr&el=desc) (23c8ff2) into [master](https://codecov.io/gh/apache/arrow/commit/1378c20bad350dba484ae21aa8a2588f44012452?el=desc) (1378c20) will **increase** coverage by `22.81%`.
> The diff coverage is `n/a`.
[![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/8900/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/8900?src=pr&el=tree)
```diff
@@ Coverage Diff @@
## master #8900 +/- ##
===========================================
+ Coverage 53.96% 76.77% +22.81%
===========================================
Files 170 181 +11
Lines 30707 41009 +10302
===========================================
+ Hits 16571 31485 +14914
+ Misses 14136 9524 -4612
```
| [Impacted Files](https://codecov.io/gh/apache/arrow/pull/8900?src=pr&el=tree) | Coverage Δ | |
|---|---|---|
| [rust/arrow/src/compute/kernels/comparison.rs](https://codecov.io/gh/apache/arrow/pull/8900/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvY29tcHV0ZS9rZXJuZWxzL2NvbXBhcmlzb24ucnM=) | `96.27% <ø> (ø)` | |
| [rust/arrow/src/ipc/writer.rs](https://codecov.io/gh/apache/arrow/pull/8900/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvaXBjL3dyaXRlci5ycw==) | `83.65% <0.00%> (-3.44%)` | :arrow_down: |
| [rust/arrow/src/json/reader.rs](https://codecov.io/gh/apache/arrow/pull/8900/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvanNvbi9yZWFkZXIucnM=) | `83.17% <0.00%> (-0.09%)` | :arrow_down: |
| [rust/arrow/src/buffer.rs](https://codecov.io/gh/apache/arrow/pull/8900/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYnVmZmVyLnJz) | `97.96% <0.00%> (ø)` | |
| [rust/arrow/src/datatypes.rs](https://codecov.io/gh/apache/arrow/pull/8900/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvZGF0YXR5cGVzLnJz) | `76.85% <0.00%> (ø)` | |
| [rust/arrow-flight/src/utils.rs](https://codecov.io/gh/apache/arrow/pull/8900/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy1mbGlnaHQvc3JjL3V0aWxzLnJz) | `0.00% <0.00%> (ø)` | |
| [rust/datafusion/examples/flight\_server.rs](https://codecov.io/gh/apache/arrow/pull/8900/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL2V4YW1wbGVzL2ZsaWdodF9zZXJ2ZXIucnM=) | `0.00% <0.00%> (ø)` | |
| [rust/arrow/src/compute/kernels/arithmetic.rs](https://codecov.io/gh/apache/arrow/pull/8900/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvY29tcHV0ZS9rZXJuZWxzL2FyaXRobWV0aWMucnM=) | `98.45% <0.00%> (ø)` | |
| [...ion-testing/src/bin/arrow-json-integration-test.rs](https://codecov.io/gh/apache/arrow/pull/8900/diff?src=pr&el=tree#diff-cnVzdC9pbnRlZ3JhdGlvbi10ZXN0aW5nL3NyYy9iaW4vYXJyb3ctanNvbi1pbnRlZ3JhdGlvbi10ZXN0LnJz) | `0.00% <0.00%> (ø)` | |
| [...ntegration-testing/src/bin/arrow-stream-to-file.rs](https://codecov.io/gh/apache/arrow/pull/8900/diff?src=pr&el=tree#diff-cnVzdC9pbnRlZ3JhdGlvbi10ZXN0aW5nL3NyYy9iaW4vYXJyb3ctc3RyZWFtLXRvLWZpbGUucnM=) | `0.00% <0.00%> (ø)` | |
| ... and [60 more](https://codecov.io/gh/apache/arrow/pull/8900/diff?src=pr&el=tree-more) | |
------
[Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/8900?src=pr&el=continue).
> **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
> `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
> Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/8900?src=pr&el=footer). Last update [1378c20...23c8ff2](https://codecov.io/gh/apache/arrow/pull/8900?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] Dandandan commented on a change in pull request #8900: ARROW-10810: [Rust] Comparison kernels performance
Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #8900:
URL: https://github.com/apache/arrow/pull/8900#discussion_r541769546
##########
File path: rust/arrow/src/array/array_binary.rs
##########
@@ -317,7 +317,7 @@ pub struct FixedSizeBinaryArray {
impl FixedSizeBinaryArray {
/// Returns the element at index `i` as a byte slice.
pub fn value(&self, i: usize) -> &[u8] {
- assert!(
+ debug_assert!(
Review comment:
I think we can think more about safe/unsafe and being more clear about it. I will move this change from this PR for now.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] Dandandan commented on a change in pull request #8900: ARROW-10810: [Rust] Comparison kernels performance
Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #8900:
URL: https://github.com/apache/arrow/pull/8900#discussion_r541772220
##########
File path: rust/arrow/src/compute/kernels/comparison.rs
##########
@@ -47,9 +47,18 @@ macro_rules! compare_op {
let null_bit_buffer =
combine_option_bitmap($left.data_ref(), $right.data_ref(), $left.len())?;
- let mut result = BooleanBufferBuilder::new($left.len());
+ let byte_capacity = bit_util::ceil($left.len(), 8);
+ let actual_capacity = bit_util::round_upto_multiple_of_64(byte_capacity);
+ let mut buffer = MutableBuffer::new(actual_capacity);
+ buffer.resize(byte_capacity);
+ let data = buffer.raw_data_mut();
+
for i in 0..$left.len() {
- result.append($op($left.value(i), $right.value(i)))?;
+ if $op($left.value(i), $right.value(i)) {
+ unsafe {
Review comment:
I will revert the change with the asserts for now. I think the `value` functions should be marked unsafe that don't perform bound checks, and other functions that can trigger UB. The macro's themselves should be "relatively" safe I think, as long as the `.len()` value is correct.
I think a better solution for the future would be to have a safe iterator that doesn't do bound checking, so I think it's better to move the particular change out of this PR for now.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] Dandandan commented on pull request #8900: ARROW-10810: [Rust] Improve comparison kernels performance
Posted by GitBox <gi...@apache.org>.
Dandandan commented on pull request #8900:
URL: https://github.com/apache/arrow/pull/8900#issuecomment-744024794
Cool to know / read @alamb to see that we are getting closer to "optimal" performance on a single thread. The PRs I create are also based on profiling DataFusion, showing that it has "real world" improvement on those changes as well besides getting good result on the benchmarks. I hope they are useful for influxdb as well.
The docs / presentation are looking very interesting, it probably isn't recorded?
As mentioned there is for some datatypes (like strings) there is some overhead still related to `assert!` that I reverted for now. Probably the bitmap code also could be improved which could close the gap. I think it is cool to compare it against a "native" rust implementation.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] nevi-me closed pull request #8900: ARROW-10810: [Rust] Improve comparison kernels performance
Posted by GitBox <gi...@apache.org>.
nevi-me closed pull request #8900:
URL: https://github.com/apache/arrow/pull/8900
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] andygrove commented on a change in pull request #8900: ARROW-10810: [Rust] Comparison kernels performance
Posted by GitBox <gi...@apache.org>.
andygrove commented on a change in pull request #8900:
URL: https://github.com/apache/arrow/pull/8900#discussion_r541727214
##########
File path: rust/arrow/src/compute/kernels/comparison.rs
##########
@@ -47,9 +47,18 @@ macro_rules! compare_op {
let null_bit_buffer =
combine_option_bitmap($left.data_ref(), $right.data_ref(), $left.len())?;
- let mut result = BooleanBufferBuilder::new($left.len());
+ let byte_capacity = bit_util::ceil($left.len(), 8);
+ let actual_capacity = bit_util::round_upto_multiple_of_64(byte_capacity);
+ let mut buffer = MutableBuffer::new(actual_capacity);
+ buffer.resize(byte_capacity);
+ let data = buffer.raw_data_mut();
+
for i in 0..$left.len() {
- result.append($op($left.value(i), $right.value(i)))?;
+ if $op($left.value(i), $right.value(i)) {
+ unsafe {
Review comment:
Functions that use this macro should now be declared `unsafe`?
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] andygrove commented on a change in pull request #8900: ARROW-10810: [Rust] Comparison kernels performance
Posted by GitBox <gi...@apache.org>.
andygrove commented on a change in pull request #8900:
URL: https://github.com/apache/arrow/pull/8900#discussion_r541726473
##########
File path: rust/arrow/src/array/array_binary.rs
##########
@@ -317,7 +317,7 @@ pub struct FixedSizeBinaryArray {
impl FixedSizeBinaryArray {
/// Returns the element at index `i` as a byte slice.
pub fn value(&self, i: usize) -> &[u8] {
- assert!(
+ debug_assert!(
Review comment:
Does this change mean that we should mark the function as `unsafe`? Should we provide safe and unsafe versions of these functions?
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] Dandandan commented on pull request #8900: ARROW-10810: [Rust] Comparison kernels performance
Posted by GitBox <gi...@apache.org>.
Dandandan commented on pull request #8900:
URL: https://github.com/apache/arrow/pull/8900#issuecomment-743974209
@andygrove @jorgecarleitao
Reverted the assert changes and created
https://issues.apache.org/jira/browse/ARROW-10892 for the iterator.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] codecov-io edited a comment on pull request #8900: ARROW-10810: Comparison kernels performance
Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #8900:
URL: https://github.com/apache/arrow/pull/8900#issuecomment-743750244
# [Codecov](https://codecov.io/gh/apache/arrow/pull/8900?src=pr&el=h1) Report
> Merging [#8900](https://codecov.io/gh/apache/arrow/pull/8900?src=pr&el=desc) (3b67f70) into [master](https://codecov.io/gh/apache/arrow/commit/1378c20bad350dba484ae21aa8a2588f44012452?el=desc) (1378c20) will **not change** coverage.
> The diff coverage is `100.00%`.
[![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/8900/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/8900?src=pr&el=tree)
```diff
@@ Coverage Diff @@
## master #8900 +/- ##
=======================================
Coverage 53.96% 53.96%
=======================================
Files 170 170
Lines 30707 30707
=======================================
Hits 16571 16571
Misses 14136 14136
```
| [Impacted Files](https://codecov.io/gh/apache/arrow/pull/8900?src=pr&el=tree) | Coverage Δ | |
|---|---|---|
| [rust/arrow/src/compute/kernels/comparison.rs](https://codecov.io/gh/apache/arrow/pull/8900/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvY29tcHV0ZS9rZXJuZWxzL2NvbXBhcmlzb24ucnM=) | `96.27% <ø> (ø)` | |
| [rust/arrow/src/array/array\_binary.rs](https://codecov.io/gh/apache/arrow/pull/8900/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvYXJyYXlfYmluYXJ5LnJz) | `90.47% <100.00%> (ø)` | |
| [rust/arrow/src/array/array\_string.rs](https://codecov.io/gh/apache/arrow/pull/8900/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvYXJyYXlfc3RyaW5nLnJz) | `89.30% <100.00%> (ø)` | |
------
[Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/8900?src=pr&el=continue).
> **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
> `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
> Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/8900?src=pr&el=footer). Last update [1378c20...3b67f70](https://codecov.io/gh/apache/arrow/pull/8900?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] alamb commented on pull request #8900: ARROW-10810: [Rust] Comparison kernels performance
Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #8900:
URL: https://github.com/apache/arrow/pull/8900#issuecomment-743998066
FWIW I tested this branch with a local micro benchmark I had and it shows significant improvement. I wrote up some details here: https://docs.google.com/document/d/15DRpIr1EUqo7zVR1psBhoX95ML1PO55kkHwt2EasM88/edit#heading=h.pfa677dojqtv
Backstory was I was looking at the performance of these kernels (inspired by @rdettai ) as `utf8_neq_scalar` came up in a talk I was giving last week. What a wonderful surprise that there is a PR actively improving them (and it happens to also fit the narrative of my talk well!)
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] Dandandan commented on pull request #8900: ARROW-10810: [Rust] Improve comparison kernels performance
Posted by GitBox <gi...@apache.org>.
Dandandan commented on pull request #8900:
URL: https://github.com/apache/arrow/pull/8900#issuecomment-744025315
Ah just saw the link to the YouTube video, thanks @alamb
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] Dandandan edited a comment on pull request #8900: ARROW-10810: [Rust] Improve comparison kernels performance
Posted by GitBox <gi...@apache.org>.
Dandandan edited a comment on pull request #8900:
URL: https://github.com/apache/arrow/pull/8900#issuecomment-744024794
Cool to know / read @alamb to see that we are getting closer to "optimal" performance on a single thread. The PRs I create are also based on profiling DataFusion, showing that it has "real world" improvement on those changes as well besides getting good result on the benchmarks. I hope they are useful for influxdb as well.
The docs / presentation are looking very interesting, it probably isn't recorded?
As mentioned there is for some datatypes (like strings) there is some overhead still related to the `assert!`s that I reverted for now. Probably the bitmap code also could be improved which could close the gap. I think it is cool to compare it against a "native" rust implementation.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] jorgecarleitao commented on a change in pull request #8900: ARROW-10810: [Rust] Comparison kernels performance
Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on a change in pull request #8900:
URL: https://github.com/apache/arrow/pull/8900#discussion_r541774429
##########
File path: rust/arrow/src/compute/kernels/comparison.rs
##########
@@ -47,9 +47,18 @@ macro_rules! compare_op {
let null_bit_buffer =
combine_option_bitmap($left.data_ref(), $right.data_ref(), $left.len())?;
- let mut result = BooleanBufferBuilder::new($left.len());
+ let byte_capacity = bit_util::ceil($left.len(), 8);
+ let actual_capacity = bit_util::round_upto_multiple_of_64(byte_capacity);
+ let mut buffer = MutableBuffer::new(actual_capacity);
+ buffer.resize(byte_capacity);
+ let data = buffer.raw_data_mut();
+
for i in 0..$left.len() {
- result.append($op($left.value(i), $right.value(i)))?;
+ if $op($left.value(i), $right.value(i)) {
+ unsafe {
Review comment:
I agree, @Dandandan . Note that all primitives and strings have iterators and `FromIterator`: https://github.com/apache/arrow/blob/master/rust/arrow/src/array/iterator.rs , but they are for `Option<T>`, not `T`.
I agree with you that we should mark that `fn value` as `unsafe` and offer an iterator over `T` (besides the one over `Option<T>`). That UB is really obvious and it is also a security vulnerability causing an escalation of privileges as it allows privileged access to the application's memory via out of bounds accesses.
I usually see the iterator over `T` when they can mask the result or OR / AND the null bitmaps, while `Option<T>` is used when that is not possible / useful.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org