You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/07/21 14:08:29 UTC
[GitHub] [arrow-rs] Jimexist opened a new pull request #585: use exponential search in lexico partition to speed up
Jimexist opened a new pull request #585:
URL: https://github.com/apache/arrow-rs/pull/585
# Which issue does this PR close?
Closes #.
# Rationale for this change
benchmark:
```
lexicographical_partition_ranges(u8) 2^10
time: [13.430 us 13.624 us 13.846 us]
change: [-20.446% -19.617% -18.718%] (p = 0.00 < 0.05)
Performance has improved.
Found 18 outliers among 100 measurements (18.00%)
9 (9.00%) low mild
3 (3.00%) high mild
6 (6.00%) high severe
lexicographical_partition_ranges(u8) 2^12
time: [21.706 us 22.377 us 23.029 us]
change: [-10.809% -8.7265% -6.6501%] (p = 0.00 < 0.05)
Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
6 (6.00%) high mild
6 (6.00%) high severe
lexicographical_partition_ranges(u8) 2^10 with nulls
time: [12.534 us 12.701 us 12.869 us]
change: [-21.677% -20.203% -18.676%] (p = 0.00 < 0.05)
Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
3 (3.00%) high mild
2 (2.00%) high severe
lexicographical_partition_ranges(u8) 2^12 with nulls
time: [21.408 us 21.631 us 21.883 us]
change: [-9.3607% -7.8667% -6.3528%] (p = 0.00 < 0.05)
Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
6 (6.00%) high mild
2 (2.00%) high severe
lexicographical_partition_ranges(f64) 2^10
time: [20.639 us 20.846 us 21.084 us]
change: [-64.686% -64.138% -63.561%] (p = 0.00 < 0.05)
Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
1 (1.00%) low mild
5 (5.00%) high mild
4 (4.00%) high severe
lexicographical_partition_ranges(low cardinality) 1024
time: [1.2718 us 1.2830 us 1.2953 us]
change: [+7.0358% +8.4813% +9.9578%] (p = 0.00 < 0.05)
Performance has regressed.
Found 4 outliers among 100 measurements (4.00%)
4 (4.00%) high mild
```
# What changes are included in this PR?
adopting an exponential search
# Are there any user-facing changes?
<!---
If there are user-facing changes then we may require documentation to be updated before approving the PR.
-->
<!---
If there are any breaking changes to public APIs, please add the `breaking change` label.
-->
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-rs] alamb merged pull request #585: use exponential search to speed up lexico partition
Posted by GitBox <gi...@apache.org>.
alamb merged pull request #585:
URL: https://github.com/apache/arrow-rs/pull/585
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-rs] Jimexist commented on a change in pull request #585: use exponential search in lexico partition to speed up
Posted by GitBox <gi...@apache.org>.
Jimexist commented on a change in pull request #585:
URL: https://github.com/apache/arrow-rs/pull/585#discussion_r674016802
##########
File path: arrow/src/compute/kernels/partition.rs
##########
@@ -73,6 +73,25 @@ impl<'a> LexicographicalPartitionIterator<'a> {
}
}
+/// Exponential search is to remedy for the case when array size and cardinality are both large
+/// see <https://en.wikipedia.org/wiki/Exponential_search>
+#[inline]
+fn exponential_search(
+ indices: &[usize],
+ target: &usize,
+ comparator: &LexicographicalComparator<'_>,
+) -> usize {
+ let mut bound = 1;
+ while bound < indices.len()
+ && comparator.compare(&indices[bound], target) != Ordering::Greater
+ {
+ bound *= 2;
+ }
+ (bound / 2)
Review comment:
invariant:
`indices[bound / 2] <= target < indices[min(indices.len(), bound + 1)]` where `<=` and `<` are defined by the `comparator`; note the `bound + 1` because `while` might exit when `target = indices[bound]`
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-rs] alamb commented on a change in pull request #585: use exponential search to speed up lexico partition
Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #585:
URL: https://github.com/apache/arrow-rs/pull/585#discussion_r676125985
##########
File path: arrow/src/compute/kernels/partition.rs
##########
@@ -73,6 +73,30 @@ impl<'a> LexicographicalPartitionIterator<'a> {
}
}
+/// Exponential search is to remedy for the case when array size and cardinality are both large
+/// see <https://en.wikipedia.org/wiki/Exponential_search>
+#[inline]
+fn exponential_search(
+ indices: &[usize],
+ target: &usize,
+ comparator: &LexicographicalComparator<'_>,
+) -> usize {
+ let mut bound = 1;
+ while bound < indices.len()
+ && comparator.compare(&indices[bound], target) != Ordering::Greater
+ {
+ bound *= 2;
+ }
+ // invariant after while loop:
+ // indices[bound / 2] <= target < indices[min(indices.len(), bound + 1)]
+ // where <= and < are defined by the comparator;
+ // note here we have right = min(indices.len(), bound + 1) because indices[bound] might
+ // actually be considered and must be included.
+ (bound / 2)
+ + indices[(bound / 2)..indices.len().min(bound + 1)]
Review comment:
I wonder if some of the performance improvement also comes from potentially making a smaller sequence -- aka `(bound/2) .. len` rather than `partition_point .. len`.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-rs] codecov-commenter commented on pull request #585: use exponential search in lexico partition to speed up
Posted by GitBox <gi...@apache.org>.
codecov-commenter commented on pull request #585:
URL: https://github.com/apache/arrow-rs/pull/585#issuecomment-884246254
# [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/585?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
> Merging [#585](https://codecov.io/gh/apache/arrow-rs/pull/585?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (cafcac6) into [master](https://codecov.io/gh/apache/arrow-rs/commit/d8da82610d377a993eba4ac919c97942895c02f8?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (d8da826) will **increase** coverage by `0.00%`.
> The diff coverage is `91.66%`.
[![Impacted file tree graph](https://codecov.io/gh/apache/arrow-rs/pull/585/graphs/tree.svg?width=650&height=150&src=pr&token=pq9V9qWZ1N&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/arrow-rs/pull/585?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
```diff
@@ Coverage Diff @@
## master #585 +/- ##
=======================================
Coverage 82.46% 82.46%
=======================================
Files 167 167
Lines 46205 46213 +8
=======================================
+ Hits 38101 38108 +7
- Misses 8104 8105 +1
```
| [Impacted Files](https://codecov.io/gh/apache/arrow-rs/pull/585?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
|---|---|---|
| [arrow/src/compute/kernels/partition.rs](https://codecov.io/gh/apache/arrow-rs/pull/585/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2NvbXB1dGUva2VybmVscy9wYXJ0aXRpb24ucnM=) | `97.65% <91.66%> (+0.15%)` | :arrow_up: |
| [parquet/src/encodings/encoding.rs](https://codecov.io/gh/apache/arrow-rs/pull/585/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvZW5jb2RpbmdzL2VuY29kaW5nLnJz) | `94.85% <0.00%> (-0.20%)` | :arrow_down: |
------
[Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow-rs/pull/585?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
> **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
> `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
> Powered by [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/585?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [d8da826...cafcac6](https://codecov.io/gh/apache/arrow-rs/pull/585?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-rs] Jimexist commented on a change in pull request #585: use exponential search in lexico partition to speed up
Posted by GitBox <gi...@apache.org>.
Jimexist commented on a change in pull request #585:
URL: https://github.com/apache/arrow-rs/pull/585#discussion_r674016802
##########
File path: arrow/src/compute/kernels/partition.rs
##########
@@ -73,6 +73,25 @@ impl<'a> LexicographicalPartitionIterator<'a> {
}
}
+/// Exponential search is to remedy for the case when array size and cardinality are both large
+/// see <https://en.wikipedia.org/wiki/Exponential_search>
+#[inline]
+fn exponential_search(
+ indices: &[usize],
+ target: &usize,
+ comparator: &LexicographicalComparator<'_>,
+) -> usize {
+ let mut bound = 1;
+ while bound < indices.len()
+ && comparator.compare(&indices[bound], target) != Ordering::Greater
+ {
+ bound *= 2;
+ }
+ (bound / 2)
Review comment:
invariant:
`indices[bound / 2] <= target < indices[min(indices.len(), bound + 1)]` where `<=` and `<` are defined by the `comparator`; here `bound + 1` because `indices[bound]` might actually be considered.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-rs] Jimexist commented on a change in pull request #585: use exponential search in lexico partition to speed up
Posted by GitBox <gi...@apache.org>.
Jimexist commented on a change in pull request #585:
URL: https://github.com/apache/arrow-rs/pull/585#discussion_r674016802
##########
File path: arrow/src/compute/kernels/partition.rs
##########
@@ -73,6 +73,25 @@ impl<'a> LexicographicalPartitionIterator<'a> {
}
}
+/// Exponential search is to remedy for the case when array size and cardinality are both large
+/// see <https://en.wikipedia.org/wiki/Exponential_search>
+#[inline]
+fn exponential_search(
+ indices: &[usize],
+ target: &usize,
+ comparator: &LexicographicalComparator<'_>,
+) -> usize {
+ let mut bound = 1;
+ while bound < indices.len()
+ && comparator.compare(&indices[bound], target) != Ordering::Greater
+ {
+ bound *= 2;
+ }
+ (bound / 2)
Review comment:
invariant:
`indices[bound / 2] <= target < indices[min(indices.len(), bound)]` where `<=` and `<` are defined by the `comparator`
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org