You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/07/21 14:08:29 UTC

[GitHub] [arrow-rs] Jimexist opened a new pull request #585: use exponential search in lexico partition to speed up

Jimexist opened a new pull request #585:
URL: https://github.com/apache/arrow-rs/pull/585


   # Which issue does this PR close?
   
   Closes #.
   
   # Rationale for this change
    
   benchmark:
   
   ```
   lexicographical_partition_ranges(u8) 2^10
                           time:   [13.430 us 13.624 us 13.846 us]
                           change: [-20.446% -19.617% -18.718%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 18 outliers among 100 measurements (18.00%)
     9 (9.00%) low mild
     3 (3.00%) high mild
     6 (6.00%) high severe
   
   lexicographical_partition_ranges(u8) 2^12
                           time:   [21.706 us 22.377 us 23.029 us]
                           change: [-10.809% -8.7265% -6.6501%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 12 outliers among 100 measurements (12.00%)
     6 (6.00%) high mild
     6 (6.00%) high severe
   
   lexicographical_partition_ranges(u8) 2^10 with nulls
                           time:   [12.534 us 12.701 us 12.869 us]
                           change: [-21.677% -20.203% -18.676%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 5 outliers among 100 measurements (5.00%)
     3 (3.00%) high mild
     2 (2.00%) high severe
   
   lexicographical_partition_ranges(u8) 2^12 with nulls
                           time:   [21.408 us 21.631 us 21.883 us]
                           change: [-9.3607% -7.8667% -6.3528%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 8 outliers among 100 measurements (8.00%)
     6 (6.00%) high mild
     2 (2.00%) high severe
   
   lexicographical_partition_ranges(f64) 2^10
                           time:   [20.639 us 20.846 us 21.084 us]
                           change: [-64.686% -64.138% -63.561%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 10 outliers among 100 measurements (10.00%)
     1 (1.00%) low mild
     5 (5.00%) high mild
     4 (4.00%) high severe
   
   lexicographical_partition_ranges(low cardinality) 1024
                           time:   [1.2718 us 1.2830 us 1.2953 us]
                           change: [+7.0358% +8.4813% +9.9578%] (p = 0.00 < 0.05)
                           Performance has regressed.
   Found 4 outliers among 100 measurements (4.00%)
     4 (4.00%) high mild
   ```
   
   # What changes are included in this PR?
   
   adopting an exponential search
   
   # Are there any user-facing changes?
   
   
   <!---
   If there are user-facing changes then we may require documentation to be updated before approving the PR.
   -->
   
   <!---
   If there are any breaking changes to public APIs, please add the `breaking change` label.
   -->
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] alamb merged pull request #585: use exponential search to speed up lexico partition

Posted by GitBox <gi...@apache.org>.
alamb merged pull request #585:
URL: https://github.com/apache/arrow-rs/pull/585


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] Jimexist commented on a change in pull request #585: use exponential search in lexico partition to speed up

Posted by GitBox <gi...@apache.org>.
Jimexist commented on a change in pull request #585:
URL: https://github.com/apache/arrow-rs/pull/585#discussion_r674016802



##########
File path: arrow/src/compute/kernels/partition.rs
##########
@@ -73,6 +73,25 @@ impl<'a> LexicographicalPartitionIterator<'a> {
     }
 }
 
+/// Exponential search is to remedy for the case when array size and cardinality are both large
+/// see <https://en.wikipedia.org/wiki/Exponential_search>
+#[inline]
+fn exponential_search(
+    indices: &[usize],
+    target: &usize,
+    comparator: &LexicographicalComparator<'_>,
+) -> usize {
+    let mut bound = 1;
+    while bound < indices.len()
+        && comparator.compare(&indices[bound], target) != Ordering::Greater
+    {
+        bound *= 2;
+    }
+    (bound / 2)

Review comment:
       invariant:
   
   `indices[bound / 2] <= target < indices[min(indices.len(), bound + 1)]` where `<=` and `<` are defined by the `comparator`; note the `bound + 1` because `while` might exit when `target = indices[bound]`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] alamb commented on a change in pull request #585: use exponential search to speed up lexico partition

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #585:
URL: https://github.com/apache/arrow-rs/pull/585#discussion_r676125985



##########
File path: arrow/src/compute/kernels/partition.rs
##########
@@ -73,6 +73,30 @@ impl<'a> LexicographicalPartitionIterator<'a> {
     }
 }
 
+/// Exponential search is to remedy for the case when array size and cardinality are both large
+/// see <https://en.wikipedia.org/wiki/Exponential_search>
+#[inline]
+fn exponential_search(
+    indices: &[usize],
+    target: &usize,
+    comparator: &LexicographicalComparator<'_>,
+) -> usize {
+    let mut bound = 1;
+    while bound < indices.len()
+        && comparator.compare(&indices[bound], target) != Ordering::Greater
+    {
+        bound *= 2;
+    }
+    // invariant after while loop:
+    // indices[bound / 2] <= target < indices[min(indices.len(), bound + 1)]
+    // where <= and < are defined by the comparator;
+    // note here we have right = min(indices.len(), bound + 1) because indices[bound] might
+    // actually be considered and must be included.
+    (bound / 2)
+        + indices[(bound / 2)..indices.len().min(bound + 1)]

Review comment:
       I wonder if some of the performance improvement also comes from potentially making a smaller sequence -- aka `(bound/2) .. len` rather than `partition_point .. len`.
   
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] codecov-commenter commented on pull request #585: use exponential search in lexico partition to speed up

Posted by GitBox <gi...@apache.org>.
codecov-commenter commented on pull request #585:
URL: https://github.com/apache/arrow-rs/pull/585#issuecomment-884246254


   # [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/585?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#585](https://codecov.io/gh/apache/arrow-rs/pull/585?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (cafcac6) into [master](https://codecov.io/gh/apache/arrow-rs/commit/d8da82610d377a993eba4ac919c97942895c02f8?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (d8da826) will **increase** coverage by `0.00%`.
   > The diff coverage is `91.66%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow-rs/pull/585/graphs/tree.svg?width=650&height=150&src=pr&token=pq9V9qWZ1N&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/arrow-rs/pull/585?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   
   ```diff
   @@           Coverage Diff           @@
   ##           master     #585   +/-   ##
   =======================================
     Coverage   82.46%   82.46%           
   =======================================
     Files         167      167           
     Lines       46205    46213    +8     
   =======================================
   + Hits        38101    38108    +7     
   - Misses       8104     8105    +1     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow-rs/pull/585?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [arrow/src/compute/kernels/partition.rs](https://codecov.io/gh/apache/arrow-rs/pull/585/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2NvbXB1dGUva2VybmVscy9wYXJ0aXRpb24ucnM=) | `97.65% <91.66%> (+0.15%)` | :arrow_up: |
   | [parquet/src/encodings/encoding.rs](https://codecov.io/gh/apache/arrow-rs/pull/585/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvZW5jb2RpbmdzL2VuY29kaW5nLnJz) | `94.85% <0.00%> (-0.20%)` | :arrow_down: |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow-rs/pull/585?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/585?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [d8da826...cafcac6](https://codecov.io/gh/apache/arrow-rs/pull/585?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] Jimexist commented on a change in pull request #585: use exponential search in lexico partition to speed up

Posted by GitBox <gi...@apache.org>.
Jimexist commented on a change in pull request #585:
URL: https://github.com/apache/arrow-rs/pull/585#discussion_r674016802



##########
File path: arrow/src/compute/kernels/partition.rs
##########
@@ -73,6 +73,25 @@ impl<'a> LexicographicalPartitionIterator<'a> {
     }
 }
 
+/// Exponential search is to remedy for the case when array size and cardinality are both large
+/// see <https://en.wikipedia.org/wiki/Exponential_search>
+#[inline]
+fn exponential_search(
+    indices: &[usize],
+    target: &usize,
+    comparator: &LexicographicalComparator<'_>,
+) -> usize {
+    let mut bound = 1;
+    while bound < indices.len()
+        && comparator.compare(&indices[bound], target) != Ordering::Greater
+    {
+        bound *= 2;
+    }
+    (bound / 2)

Review comment:
       invariant:
   
   `indices[bound / 2] <= target < indices[min(indices.len(), bound + 1)]` where `<=` and `<` are defined by the `comparator`; here `bound + 1` because `indices[bound]` might actually be considered.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] Jimexist commented on a change in pull request #585: use exponential search in lexico partition to speed up

Posted by GitBox <gi...@apache.org>.
Jimexist commented on a change in pull request #585:
URL: https://github.com/apache/arrow-rs/pull/585#discussion_r674016802



##########
File path: arrow/src/compute/kernels/partition.rs
##########
@@ -73,6 +73,25 @@ impl<'a> LexicographicalPartitionIterator<'a> {
     }
 }
 
+/// Exponential search is to remedy for the case when array size and cardinality are both large
+/// see <https://en.wikipedia.org/wiki/Exponential_search>
+#[inline]
+fn exponential_search(
+    indices: &[usize],
+    target: &usize,
+    comparator: &LexicographicalComparator<'_>,
+) -> usize {
+    let mut bound = 1;
+    while bound < indices.len()
+        && comparator.compare(&indices[bound], target) != Ordering::Greater
+    {
+        bound *= 2;
+    }
+    (bound / 2)

Review comment:
       invariant:
   
   `indices[bound / 2] <= target < indices[min(indices.len(), bound)]` where `<=` and `<` are defined by the `comparator`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org