You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2021/11/18 14:38:02 UTC
[GitHub] [lucene] costin opened a new pull request #453: Investigate whether should Packed64 use a byte[] plus VarHandles instead of a long[]
costin opened a new pull request #453:
URL: https://github.com/apache/lucene/pull/453
Add VarHandle variants for `Packed64` to investigate whether using VarHandles
improved performance or not.
This PR introduces two new Packed64 variants:
~ Packed64LongLong
Same as Packed64 however instead of using direct array access, it
relies on `VarHandle`. It's main purpose benchmarking.
~ Packed64LongByte
A different implementation of Packed64 backed by a byte[]. It reads the data as a
long and in case of overflow reads an extra byte. The algorithm is slightly different
since the overflowing bits need to be computed on a different block size (8 vs 64)
which requires extra cycles.
Related LUCENE-10205
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] costin commented on pull request #453: Investigate whether should Packed64 use a byte[] plus VarHandles instead of a long[]
Posted by GitBox <gi...@apache.org>.
costin commented on pull request #453:
URL: https://github.com/apache/lucene/pull/453#issuecomment-973193877
Thanks to @jpountz I've added two more tests beside JMH.
`PackedRandomTest` which does a brute force for testing performance of random locations instead of consecutive ones. Surprisingly the VHLongByte implementation comes ahead, followed by Packed64 and finally Packed64VHLongLong - the results on my machine are roughly:
```
1. VHLongAndByte 939100
2. Packed64 1358500
3. VHLongLong 1510400
```
Adrien suggested another workload, more realistic `PackedMacroTest` which does `SortedDocValues` merging, which creates an `OrdinalMap` backed by `PackedInts` and then recorded the merge times.
In all cases I manually updated the PackedInt class to returned the desired implementation. Based on these set of results, Packed64 is on the first place, followed by VHLongAndByte with VHLongLong coming in third.
I'm attaching the merging times found for reference:
[p64vhll.log](https://github.com/apache/lucene/files/7565523/p64vhll.log)
[p64.log](https://github.com/apache/lucene/files/7565524/p64.log)
[p64vhlb.log](https://github.com/apache/lucene/files/7565525/p64vhlb.log)
.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] costin commented on pull request #453: Investigate whether should Packed64 use a byte[] plus VarHandles instead of a long[]
Posted by GitBox <gi...@apache.org>.
costin commented on pull request #453:
URL: https://github.com/apache/lucene/pull/453#issuecomment-984534767
Updated the benchmark to use JMH `blackhole` and tested it with/without the blackhole flag enabled on the Java 17.0.1 compiler and got the following results:
```
Benchmark (bpv) (size) Mode Cnt Score Error Units
Packed64Benchmark.packed64VarHandleLongByte_Consecutive 47 10240 thrpt 3 12381.811 ± 6265.438 ops/s
Packed64Benchmark.packed64VarHandleLongByte_Sparse 47 10240 thrpt 3 19846.767 ± 1419.612 ops/s
Packed64Benchmark.packed64VarHandleLongLong_Consecutive 47 10240 thrpt 3 19840.317 ± 11542.155 ops/s
Packed64Benchmark.packed64VarHandleLongLong_Sparse 47 10240 thrpt 3 21213.913 ± 592.893 ops/s
Packed64Benchmark.packed64_Consecutive 47 10240 thrpt 3 22211.687 ± 2118.587 ops/s
Packed64Benchmark.packed64_Sparse 47 10240 thrpt 3 22870.002 ± 418.572 ops/s
Compiler blackhole:
Packed64Benchmark.packed64VarHandleLongByte_Consecutive 47 10240 thrpt 3 15021.054 ± 2538.483 ops/s
Packed64Benchmark.packed64VarHandleLongByte_Sparse 47 10240 thrpt 3 25409.229 ± 5009.585 ops/s
Packed64Benchmark.packed64VarHandleLongLong_Consecutive 47 10240 thrpt 3 25273.456 ± 21212.884 ops/s
Packed64Benchmark.packed64VarHandleLongLong_Sparse 47 10240 thrpt 3 28486.621 ± 1019.290 ops/s
Packed64Benchmark.packed64_Consecutive 47 10240 thrpt 3 28826.916 ± 7247.116 ops/s
Packed64Benchmark.packed64_Sparse 47 10240 thrpt 3 29529.579 ± 12940.266 ops/s
```
Also plugged Packed64VHLB into PackedInts so it gets used for bpv up to 56 (included) and ran the macro test to check performance and got the following results
[dynamic.log](https://github.com/apache/lucene/files/7641422/dynamic.log)
Overall it's a mixed bag - with some results better especially at the beginning of the test and then mixed results in terms of performance.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] costin edited a comment on pull request #453: Investigate whether should Packed64 use a byte[] plus VarHandles instead of a long[]
Posted by GitBox <gi...@apache.org>.
costin edited a comment on pull request #453:
URL: https://github.com/apache/lucene/pull/453#issuecomment-973999178
I have tighten the implementation a bit, removing an extra field adding some constants and following more the style of Packed64 with regards to the conditionals.
In addition updated the benchmark to differentiate between consecutive get/set and spare (get/set) where different parts of memory are being read.
This has a big impact on the VHLB performance (almost double) while the other implementations don't exhibit much difference, making them more consistent, for example:
```
Packed64Benchmark.packed64VarHandleLongByte_Consecutive 23 10240 thrpt 3 16121.142 ± 836.602 ops/s
Packed64Benchmark.packed64VarHandleLongByte_Sparse 23 10240 thrpt 3 28436.567 ± 771.609 ops/s
Packed64Benchmark.packed64VarHandleLongLong_Consecutive 23 10240 thrpt 3 42751.522 ± 2367.785 ops/s
Packed64Benchmark.packed64VarHandleLongLong_Sparse 23 10240 thrpt 3 40369.377 ± 1737.285 ops/s
Packed64Benchmark.packed64_Consecutive 23 10240 thrpt 3 52004.882 ± 2006.942 ops/s
Packed64Benchmark.packed64_Sparse 23 10240 thrpt 3 44671.486 ± 1567.467 ops/s
```
It might be that the sparse benchmark is not adequate enough (the operations happen from the outside in, which should give slight advantage towards the middle due to data locality).
Below is the full benchmark:
```
Benchmark (bpv) (size) Mode Cnt Score Error Units
Packed64Benchmark.packed64VarHandleLongByte_Consecutive 1 10240 thrpt 3 38301.741 ± 655.443 ops/s
Packed64Benchmark.packed64VarHandleLongByte_Consecutive 4 10240 thrpt 3 25817.622 ± 16839.449 ops/s
Packed64Benchmark.packed64VarHandleLongByte_Consecutive 5 10240 thrpt 3 21086.108 ± 1738.288 ops/s
Packed64Benchmark.packed64VarHandleLongByte_Consecutive 8 10240 thrpt 3 16114.364 ± 1042.073 ops/s
Packed64Benchmark.packed64VarHandleLongByte_Consecutive 11 10240 thrpt 3 16056.271 ± 2397.599 ops/s
Packed64Benchmark.packed64VarHandleLongByte_Consecutive 16 10240 thrpt 3 17125.401 ± 1574.413 ops/s
Packed64Benchmark.packed64VarHandleLongByte_Consecutive 23 10240 thrpt 3 16063.316 ± 453.476 ops/s
Packed64Benchmark.packed64VarHandleLongByte_Consecutive 25 10240 thrpt 3 16046.605 ± 315.952 ops/s
Packed64Benchmark.packed64VarHandleLongByte_Consecutive 31 10240 thrpt 3 16017.969 ± 894.789 ops/s
Packed64Benchmark.packed64VarHandleLongByte_Consecutive 32 10240 thrpt 3 19653.331 ± 1231.635 ops/s
Packed64Benchmark.packed64VarHandleLongByte_Consecutive 47 10240 thrpt 3 15992.127 ± 1909.132 ops/s
Packed64Benchmark.packed64VarHandleLongByte_Consecutive 59 10240 thrpt 3 17114.009 ± 6263.882 ops/s
Packed64Benchmark.packed64VarHandleLongByte_Consecutive 61 10240 thrpt 3 15822.847 ± 8200.733 ops/s
Packed64Benchmark.packed64VarHandleLongByte_Consecutive 64 10240 thrpt 3 40686.026 ± 2245.477 ops/s
Packed64Benchmark.packed64VarHandleLongByte_Sparse 1 10240 thrpt 3 49795.721 ± 1016.448 ops/s
Packed64Benchmark.packed64VarHandleLongByte_Sparse 4 10240 thrpt 3 37455.051 ± 1059.990 ops/s
Packed64Benchmark.packed64VarHandleLongByte_Sparse 5 10240 thrpt 3 34629.635 ± 730.416 ops/s
Packed64Benchmark.packed64VarHandleLongByte_Sparse 8 10240 thrpt 3 28438.560 ± 364.251 ops/s
Packed64Benchmark.packed64VarHandleLongByte_Sparse 11 10240 thrpt 3 28240.196 ± 752.703 ops/s
Packed64Benchmark.packed64VarHandleLongByte_Sparse 16 10240 thrpt 3 29998.199 ± 1275.620 ops/s
Packed64Benchmark.packed64VarHandleLongByte_Sparse 23 10240 thrpt 3 28481.596 ± 1537.821 ops/s
Packed64Benchmark.packed64VarHandleLongByte_Sparse 25 10240 thrpt 3 28585.727 ± 948.670 ops/s
Packed64Benchmark.packed64VarHandleLongByte_Sparse 31 10240 thrpt 3 28002.335 ± 1701.436 ops/s
Packed64Benchmark.packed64VarHandleLongByte_Sparse 32 10240 thrpt 3 34116.362 ± 421.667 ops/s
Packed64Benchmark.packed64VarHandleLongByte_Sparse 47 10240 thrpt 3 28258.341 ± 1065.642 ops/s
Packed64Benchmark.packed64VarHandleLongByte_Sparse 59 10240 thrpt 3 29776.379 ± 469.763 ops/s
Packed64Benchmark.packed64VarHandleLongByte_Sparse 61 10240 thrpt 3 28820.101 ± 3651.373 ops/s
Packed64Benchmark.packed64VarHandleLongByte_Sparse 64 10240 thrpt 3 57477.947 ± 3698.974 ops/s
Packed64Benchmark.packed64VarHandleLongLong_Consecutive 1 10240 thrpt 3 32689.162 ± 1387.629 ops/s
Packed64Benchmark.packed64VarHandleLongLong_Consecutive 4 10240 thrpt 3 35393.931 ± 1447.491 ops/s
Packed64Benchmark.packed64VarHandleLongLong_Consecutive 5 10240 thrpt 3 40258.860 ± 2152.352 ops/s
Packed64Benchmark.packed64VarHandleLongLong_Consecutive 8 10240 thrpt 3 35385.111 ± 347.894 ops/s
Packed64Benchmark.packed64VarHandleLongLong_Consecutive 11 10240 thrpt 3 45596.088 ± 990.686 ops/s
Packed64Benchmark.packed64VarHandleLongLong_Consecutive 16 10240 thrpt 3 35012.112 ± 9437.142 ops/s
Packed64Benchmark.packed64VarHandleLongLong_Consecutive 23 10240 thrpt 3 47095.570 ± 905.039 ops/s
Packed64Benchmark.packed64VarHandleLongLong_Consecutive 25 10240 thrpt 3 31985.949 ± 590.707 ops/s
Packed64Benchmark.packed64VarHandleLongLong_Consecutive 31 10240 thrpt 3 42513.815 ± 1896.300 ops/s
Packed64Benchmark.packed64VarHandleLongLong_Consecutive 32 10240 thrpt 3 35779.268 ± 956.749 ops/s
Packed64Benchmark.packed64VarHandleLongLong_Consecutive 47 10240 thrpt 3 30137.376 ± 516.086 ops/s
Packed64Benchmark.packed64VarHandleLongLong_Consecutive 59 10240 thrpt 3 25869.023 ± 1372.035 ops/s
Packed64Benchmark.packed64VarHandleLongLong_Consecutive 61 10240 thrpt 3 24951.079 ± 345.293 ops/s
Packed64Benchmark.packed64VarHandleLongLong_Consecutive 64 10240 thrpt 3 35496.562 ± 2744.344 ops/s
Packed64Benchmark.packed64VarHandleLongLong_Sparse 1 10240 thrpt 3 53103.799 ± 564.971 ops/s
Packed64Benchmark.packed64VarHandleLongLong_Sparse 4 10240 thrpt 3 53134.834 ± 763.798 ops/s
Packed64Benchmark.packed64VarHandleLongLong_Sparse 5 10240 thrpt 3 42840.502 ± 547.945 ops/s
Packed64Benchmark.packed64VarHandleLongLong_Sparse 8 10240 thrpt 3 53661.284 ± 2760.469 ops/s
Packed64Benchmark.packed64VarHandleLongLong_Sparse 11 10240 thrpt 3 42976.618 ± 4369.925 ops/s
Packed64Benchmark.packed64VarHandleLongLong_Sparse 16 10240 thrpt 3 53319.723 ± 8308.326 ops/s
Packed64Benchmark.packed64VarHandleLongLong_Sparse 23 10240 thrpt 3 40894.483 ± 772.369 ops/s
Packed64Benchmark.packed64VarHandleLongLong_Sparse 25 10240 thrpt 3 40646.463 ± 1482.981 ops/s
Packed64Benchmark.packed64VarHandleLongLong_Sparse 31 10240 thrpt 3 40995.711 ± 1436.172 ops/s
Packed64Benchmark.packed64VarHandleLongLong_Sparse 32 10240 thrpt 3 53884.020 ± 2006.964 ops/s
Packed64Benchmark.packed64VarHandleLongLong_Sparse 47 10240 thrpt 3 29708.915 ± 657.222 ops/s
Packed64Benchmark.packed64VarHandleLongLong_Sparse 59 10240 thrpt 3 35063.658 ± 1556.863 ops/s
Packed64Benchmark.packed64VarHandleLongLong_Sparse 61 10240 thrpt 3 28871.592 ± 366.623 ops/s
Packed64Benchmark.packed64VarHandleLongLong_Sparse 64 10240 thrpt 3 53333.751 ± 15511.237 ops/s
Packed64Benchmark.packed64_Consecutive 1 10240 thrpt 3 36595.316 ± 1213.323 ops/s
Packed64Benchmark.packed64_Consecutive 4 10240 thrpt 3 39777.028 ± 953.027 ops/s
Packed64Benchmark.packed64_Consecutive 5 10240 thrpt 3 46119.465 ± 211.767 ops/s
Packed64Benchmark.packed64_Consecutive 8 10240 thrpt 3 39886.892 ± 1574.036 ops/s
Packed64Benchmark.packed64_Consecutive 11 10240 thrpt 3 53222.462 ± 1847.024 ops/s
Packed64Benchmark.packed64_Consecutive 16 10240 thrpt 3 39897.330 ± 1012.499 ops/s
Packed64Benchmark.packed64_Consecutive 23 10240 thrpt 3 50924.631 ± 2607.771 ops/s
Packed64Benchmark.packed64_Consecutive 25 10240 thrpt 3 51179.396 ± 4118.732 ops/s
Packed64Benchmark.packed64_Consecutive 31 10240 thrpt 3 49350.911 ± 2652.142 ops/s
Packed64Benchmark.packed64_Consecutive 32 10240 thrpt 3 40245.046 ± 1506.183 ops/s
Packed64Benchmark.packed64_Consecutive 47 10240 thrpt 3 43794.173 ± 4789.862 ops/s
Packed64Benchmark.packed64_Consecutive 59 10240 thrpt 3 41248.048 ± 2533.885 ops/s
Packed64Benchmark.packed64_Consecutive 61 10240 thrpt 3 42538.675 ± 3462.201 ops/s
Packed64Benchmark.packed64_Consecutive 64 10240 thrpt 3 40091.319 ± 962.752 ops/s
Packed64Benchmark.packed64_Sparse 1 10240 thrpt 3 60621.693 ± 6289.038 ops/s
Packed64Benchmark.packed64_Sparse 4 10240 thrpt 3 63121.896 ± 5265.890 ops/s
Packed64Benchmark.packed64_Sparse 5 10240 thrpt 3 49445.705 ± 1348.639 ops/s
Packed64Benchmark.packed64_Sparse 8 10240 thrpt 3 63533.078 ± 5166.244 ops/s
Packed64Benchmark.packed64_Sparse 11 10240 thrpt 3 47624.192 ± 5238.701 ops/s
Packed64Benchmark.packed64_Sparse 16 10240 thrpt 3 64148.964 ± 820.543 ops/s
Packed64Benchmark.packed64_Sparse 23 10240 thrpt 3 44013.707 ± 3850.643 ops/s
Packed64Benchmark.packed64_Sparse 25 10240 thrpt 3 44837.906 ± 2231.909 ops/s
Packed64Benchmark.packed64_Sparse 31 10240 thrpt 3 44638.184 ± 1887.982 ops/s
Packed64Benchmark.packed64_Sparse 32 10240 thrpt 3 63831.488 ± 1325.146 ops/s
Packed64Benchmark.packed64_Sparse 47 10240 thrpt 3 41508.192 ± 3339.770 ops/s
Packed64Benchmark.packed64_Sparse 59 10240 thrpt 3 40849.655 ± 2445.001 ops/s
Packed64Benchmark.packed64_Sparse 61 10240 thrpt 3 36888.602 ± 1342.537 ops/s
Packed64Benchmark.packed64_Sparse 64 10240 thrpt 3 64386.395 ± 1694.784 ops/s
```
Also reran the macro test with similar results to yesterday.
[p64vhlb.log](https://github.com/apache/lucene/files/7569608/p64vhlb.log)
As a last item, adding the compile info of the improved methods:
```
============================= C1-compiled nmethod ==============================
----------------------------------- Assembly -----------------------------------
Compiled method (c1) 214 821 3 org.apache.lucene.util.packed.Packed64VHLongAndByte::get (94 bytes)
total in heap [0x0000028932208490,0x0000028932209218] = 3464
relocation [0x00000289322085e8,0x00000289322086b0] = 200
main code [0x00000289322086c0,0x0000028932208e80] = 1984
stub code [0x0000028932208e80,0x0000028932208ed8] = 88
oops [0x0000028932208ed8,0x0000028932208f08] = 48
metadata [0x0000028932208f08,0x0000028932208f50] = 72
scopes data [0x0000028932208f50,0x00000289322090c8] = 376
scopes pcs [0x00000289322090c8,0x00000289322091e8] = 288
dependencies [0x00000289322091e8,0x00000289322091f0] = 8
nul chk table [0x00000289322091f0,0x0000028932209218] = 40
============================= C2-compiled nmethod ==============================
----------------------------------- Assembly -----------------------------------
Compiled method (c2) 221 829 4 org.apache.lucene.util.packed.Packed64VHLongAndByte::set (127 bytes)
total in heap [0x0000028939553110,0x0000028939553668] = 1368
relocation [0x0000028939553268,0x0000028939553288] = 32
main code [0x00000289395532a0,0x0000028939553440] = 416
stub code [0x0000028939553440,0x0000028939553458] = 24
oops [0x0000028939553458,0x0000028939553478] = 32
metadata [0x0000028939553478,0x00000289395534c0] = 72
scopes data [0x00000289395534c0,0x00000289395535d8] = 280
scopes pcs [0x00000289395535d8,0x0000028939553648] = 112
dependencies [0x0000028939553648,0x0000028939553658] = 16
nul chk table [0x0000028939553658,0x0000028939553668] = 16
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] costin commented on pull request #453: Investigate whether should Packed64 use a byte[] plus VarHandles instead of a long[]
Posted by GitBox <gi...@apache.org>.
costin commented on pull request #453:
URL: https://github.com/apache/lucene/pull/453#issuecomment-974257210
To see how VHLB behaves without the branch conditional (for bpv < 56) I've created removed the endBits field and related codepath from `get` and `set`. With it the benchmark becomes:
```
Benchmark (bpv) (size) Mode Cnt Score Error Units
Packed64Benchmark.packed64VarHandleLongByte_Consecutive 47 10240 thrpt 3 16746.842 ± 2978.916 ops/s
Packed64Benchmark.packed64VarHandleLongByte_Sparse 47 10240 thrpt 3 31318.413 ± 639.686 ops/s
Packed64Benchmark.packed64VarHandleLongLong_Consecutive 47 10240 thrpt 3 40157.645 ± 612.249 ops/s
Packed64Benchmark.packed64VarHandleLongLong_Sparse 47 10240 thrpt 3 37980.091 ± 373.676 ops/s
Packed64Benchmark.packed64_Consecutive 47 10240 thrpt 3 43703.192 ± 993.999 ops/s
Packed64Benchmark.packed64_Sparse 47 10240 thrpt 3 41365.544 ± 1254.313 ops/s
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] costin closed pull request #453: Investigate whether should Packed64 use a byte[] plus VarHandles instead of a long[]
Posted by GitBox <gi...@apache.org>.
costin closed pull request #453:
URL: https://github.com/apache/lucene/pull/453
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] costin commented on pull request #453: Investigate whether should Packed64 use a byte[] plus VarHandles instead of a long[]
Posted by GitBox <gi...@apache.org>.
costin commented on pull request #453:
URL: https://github.com/apache/lucene/pull/453#issuecomment-973999178
I have tighten the implementation a bit, removing an extra field adding some constants and following more the style of Packed64 with regards to the conditionals.
In addition updated the benchmark to differentiate between consecutive get/set and spare (get/set) where different parts of memory are being read.
This has a big impact on the VHLB performance (almost double) while the other implementations don't exhibit much difference, making them more consistent, for example:
```
Packed64Benchmark.packed64VarHandleLongByte_Consecutive 23 10240 thrpt 3 16121.142 ± 836.602 ops/s
Packed64Benchmark.packed64VarHandleLongByte_Sparse 23 10240 thrpt 3 28436.567 ± 771.609 ops/s
Packed64Benchmark.packed64VarHandleLongLong_Consecutive 23 10240 thrpt 3 42751.522 ± 2367.785 ops/s
Packed64Benchmark.packed64VarHandleLongLong_Sparse 23 10240 thrpt 3 40369.377 ± 1737.285 ops/s
Packed64Benchmark.packed64_Consecutive 23 10240 thrpt 3 52004.882 ± 2006.942 ops/s
Packed64Benchmark.packed64_Sparse 23 10240 thrpt 3 44671.486 ± 1567.467 ops/s
```
It might be that the sparse benchmark is not adequate enough (the operations happen from the outside in, which should give slight advantage towards the middle due to data locality).
Below is the full benchmark:
```
Benchmark (bpv) (size) Mode Cnt Score Error Units
Packed64Benchmark.packed64VarHandleLongByte_Consecutive 1 10240 thrpt 3 38301.741 ± 655.443 ops/s
Packed64Benchmark.packed64VarHandleLongByte_Consecutive 4 10240 thrpt 3 25817.622 ± 16839.449 ops/s
Packed64Benchmark.packed64VarHandleLongByte_Consecutive 5 10240 thrpt 3 21086.108 ± 1738.288 ops/s
Packed64Benchmark.packed64VarHandleLongByte_Consecutive 8 10240 thrpt 3 16114.364 ± 1042.073 ops/s
Packed64Benchmark.packed64VarHandleLongByte_Consecutive 11 10240 thrpt 3 16056.271 ± 2397.599 ops/s
Packed64Benchmark.packed64VarHandleLongByte_Consecutive 16 10240 thrpt 3 17125.401 ± 1574.413 ops/s
Packed64Benchmark.packed64VarHandleLongByte_Consecutive 23 10240 thrpt 3 16063.316 ± 453.476 ops/s
Packed64Benchmark.packed64VarHandleLongByte_Consecutive 25 10240 thrpt 3 16046.605 ± 315.952 ops/s
Packed64Benchmark.packed64VarHandleLongByte_Consecutive 31 10240 thrpt 3 16017.969 ± 894.789 ops/s
Packed64Benchmark.packed64VarHandleLongByte_Consecutive 32 10240 thrpt 3 19653.331 ± 1231.635 ops/s
Packed64Benchmark.packed64VarHandleLongByte_Consecutive 47 10240 thrpt 3 15992.127 ± 1909.132 ops/s
Packed64Benchmark.packed64VarHandleLongByte_Consecutive 59 10240 thrpt 3 17114.009 ± 6263.882 ops/s
Packed64Benchmark.packed64VarHandleLongByte_Consecutive 61 10240 thrpt 3 15822.847 ± 8200.733 ops/s
Packed64Benchmark.packed64VarHandleLongByte_Consecutive 64 10240 thrpt 3 40686.026 ± 2245.477 ops/s
Packed64Benchmark.packed64VarHandleLongByte_Sparse 1 10240 thrpt 3 49795.721 ± 1016.448 ops/s
Packed64Benchmark.packed64VarHandleLongByte_Sparse 4 10240 thrpt 3 37455.051 ± 1059.990 ops/s
Packed64Benchmark.packed64VarHandleLongByte_Sparse 5 10240 thrpt 3 34629.635 ± 730.416 ops/s
Packed64Benchmark.packed64VarHandleLongByte_Sparse 8 10240 thrpt 3 28438.560 ± 364.251 ops/s
Packed64Benchmark.packed64VarHandleLongByte_Sparse 11 10240 thrpt 3 28240.196 ± 752.703 ops/s
Packed64Benchmark.packed64VarHandleLongByte_Sparse 16 10240 thrpt 3 29998.199 ± 1275.620 ops/s
Packed64Benchmark.packed64VarHandleLongByte_Sparse 23 10240 thrpt 3 28481.596 ± 1537.821 ops/s
Packed64Benchmark.packed64VarHandleLongByte_Sparse 25 10240 thrpt 3 28585.727 ± 948.670 ops/s
Packed64Benchmark.packed64VarHandleLongByte_Sparse 31 10240 thrpt 3 28002.335 ± 1701.436 ops/s
Packed64Benchmark.packed64VarHandleLongByte_Sparse 32 10240 thrpt 3 34116.362 ± 421.667 ops/s
Packed64Benchmark.packed64VarHandleLongByte_Sparse 47 10240 thrpt 3 28258.341 ± 1065.642 ops/s
Packed64Benchmark.packed64VarHandleLongByte_Sparse 59 10240 thrpt 3 29776.379 ± 469.763 ops/s
Packed64Benchmark.packed64VarHandleLongByte_Sparse 61 10240 thrpt 3 28820.101 ± 3651.373 ops/s
Packed64Benchmark.packed64VarHandleLongByte_Sparse 64 10240 thrpt 3 57477.947 ± 3698.974 ops/s
Packed64Benchmark.packed64VarHandleLongLong_Consecutive 1 10240 thrpt 3 32689.162 ± 1387.629 ops/s
Packed64Benchmark.packed64VarHandleLongLong_Consecutive 4 10240 thrpt 3 35393.931 ± 1447.491 ops/s
Packed64Benchmark.packed64VarHandleLongLong_Consecutive 5 10240 thrpt 3 40258.860 ± 2152.352 ops/s
Packed64Benchmark.packed64VarHandleLongLong_Consecutive 8 10240 thrpt 3 35385.111 ± 347.894 ops/s
Packed64Benchmark.packed64VarHandleLongLong_Consecutive 11 10240 thrpt 3 45596.088 ± 990.686 ops/s
Packed64Benchmark.packed64VarHandleLongLong_Consecutive 16 10240 thrpt 3 35012.112 ± 9437.142 ops/s
Packed64Benchmark.packed64VarHandleLongLong_Consecutive 23 10240 thrpt 3 47095.570 ± 905.039 ops/s
Packed64Benchmark.packed64VarHandleLongLong_Consecutive 25 10240 thrpt 3 31985.949 ± 590.707 ops/s
Packed64Benchmark.packed64VarHandleLongLong_Consecutive 31 10240 thrpt 3 42513.815 ± 1896.300 ops/s
Packed64Benchmark.packed64VarHandleLongLong_Consecutive 32 10240 thrpt 3 35779.268 ± 956.749 ops/s
Packed64Benchmark.packed64VarHandleLongLong_Consecutive 47 10240 thrpt 3 30137.376 ± 516.086 ops/s
Packed64Benchmark.packed64VarHandleLongLong_Consecutive 59 10240 thrpt 3 25869.023 ± 1372.035 ops/s
Packed64Benchmark.packed64VarHandleLongLong_Consecutive 61 10240 thrpt 3 24951.079 ± 345.293 ops/s
Packed64Benchmark.packed64VarHandleLongLong_Consecutive 64 10240 thrpt 3 35496.562 ± 2744.344 ops/s
Packed64Benchmark.packed64VarHandleLongLong_Sparse 1 10240 thrpt 3 53103.799 ± 564.971 ops/s
Packed64Benchmark.packed64VarHandleLongLong_Sparse 4 10240 thrpt 3 53134.834 ± 763.798 ops/s
Packed64Benchmark.packed64VarHandleLongLong_Sparse 5 10240 thrpt 3 42840.502 ± 547.945 ops/s
Packed64Benchmark.packed64VarHandleLongLong_Sparse 8 10240 thrpt 3 53661.284 ± 2760.469 ops/s
Packed64Benchmark.packed64VarHandleLongLong_Sparse 11 10240 thrpt 3 42976.618 ± 4369.925 ops/s
Packed64Benchmark.packed64VarHandleLongLong_Sparse 16 10240 thrpt 3 53319.723 ± 8308.326 ops/s
Packed64Benchmark.packed64VarHandleLongLong_Sparse 23 10240 thrpt 3 40894.483 ± 772.369 ops/s
Packed64Benchmark.packed64VarHandleLongLong_Sparse 25 10240 thrpt 3 40646.463 ± 1482.981 ops/s
Packed64Benchmark.packed64VarHandleLongLong_Sparse 31 10240 thrpt 3 40995.711 ± 1436.172 ops/s
Packed64Benchmark.packed64VarHandleLongLong_Sparse 32 10240 thrpt 3 53884.020 ± 2006.964 ops/s
Packed64Benchmark.packed64VarHandleLongLong_Sparse 47 10240 thrpt 3 29708.915 ± 657.222 ops/s
Packed64Benchmark.packed64VarHandleLongLong_Sparse 59 10240 thrpt 3 35063.658 ± 1556.863 ops/s
Packed64Benchmark.packed64VarHandleLongLong_Sparse 61 10240 thrpt 3 28871.592 ± 366.623 ops/s
Packed64Benchmark.packed64VarHandleLongLong_Sparse 64 10240 thrpt 3 53333.751 ± 15511.237 ops/s
Packed64Benchmark.packed64_Consecutive 1 10240 thrpt 3 36595.316 ± 1213.323 ops/s
Packed64Benchmark.packed64_Consecutive 4 10240 thrpt 3 39777.028 ± 953.027 ops/s
Packed64Benchmark.packed64_Consecutive 5 10240 thrpt 3 46119.465 ± 211.767 ops/s
Packed64Benchmark.packed64_Consecutive 8 10240 thrpt 3 39886.892 ± 1574.036 ops/s
Packed64Benchmark.packed64_Consecutive 11 10240 thrpt 3 53222.462 ± 1847.024 ops/s
Packed64Benchmark.packed64_Consecutive 16 10240 thrpt 3 39897.330 ± 1012.499 ops/s
Packed64Benchmark.packed64_Consecutive 23 10240 thrpt 3 50924.631 ± 2607.771 ops/s
Packed64Benchmark.packed64_Consecutive 25 10240 thrpt 3 51179.396 ± 4118.732 ops/s
Packed64Benchmark.packed64_Consecutive 31 10240 thrpt 3 49350.911 ± 2652.142 ops/s
Packed64Benchmark.packed64_Consecutive 32 10240 thrpt 3 40245.046 ± 1506.183 ops/s
Packed64Benchmark.packed64_Consecutive 47 10240 thrpt 3 43794.173 ± 4789.862 ops/s
Packed64Benchmark.packed64_Consecutive 59 10240 thrpt 3 41248.048 ± 2533.885 ops/s
Packed64Benchmark.packed64_Consecutive 61 10240 thrpt 3 42538.675 ± 3462.201 ops/s
Packed64Benchmark.packed64_Consecutive 64 10240 thrpt 3 40091.319 ± 962.752 ops/s
Packed64Benchmark.packed64_Sparse 1 10240 thrpt 3 60621.693 ± 6289.038 ops/s
Packed64Benchmark.packed64_Sparse 4 10240 thrpt 3 63121.896 ± 5265.890 ops/s
Packed64Benchmark.packed64_Sparse 5 10240 thrpt 3 49445.705 ± 1348.639 ops/s
Packed64Benchmark.packed64_Sparse 8 10240 thrpt 3 63533.078 ± 5166.244 ops/s
Packed64Benchmark.packed64_Sparse 11 10240 thrpt 3 47624.192 ± 5238.701 ops/s
Packed64Benchmark.packed64_Sparse 16 10240 thrpt 3 64148.964 ± 820.543 ops/s
Packed64Benchmark.packed64_Sparse 23 10240 thrpt 3 44013.707 ± 3850.643 ops/s
Packed64Benchmark.packed64_Sparse 25 10240 thrpt 3 44837.906 ± 2231.909 ops/s
Packed64Benchmark.packed64_Sparse 31 10240 thrpt 3 44638.184 ± 1887.982 ops/s
Packed64Benchmark.packed64_Sparse 32 10240 thrpt 3 63831.488 ± 1325.146 ops/s
Packed64Benchmark.packed64_Sparse 47 10240 thrpt 3 41508.192 ± 3339.770 ops/s
Packed64Benchmark.packed64_Sparse 59 10240 thrpt 3 40849.655 ± 2445.001 ops/s
Packed64Benchmark.packed64_Sparse 61 10240 thrpt 3 36888.602 ± 1342.537 ops/s
Packed64Benchmark.packed64_Sparse 64 10240 thrpt 3 64386.395 ± 1694.784 ops/s
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] costin edited a comment on pull request #453: Investigate whether should Packed64 use a byte[] plus VarHandles instead of a long[]
Posted by GitBox <gi...@apache.org>.
costin edited a comment on pull request #453:
URL: https://github.com/apache/lucene/pull/453#issuecomment-973193877
Thanks to @jpountz I've added two more tests beside JMH.
`PackedRandomTest` which does a brute force for testing performance of random locations instead of consecutive ones. Surprisingly the VHLongByte implementation comes ahead, followed by Packed64 and finally Packed64VHLongLong - the results on my machine are roughly:
```
1. VHLongAndByte 939100
2. Packed64 1358500
3. VHLongLong 1510400
```
Adrien suggested another workload, more realistic `PackedMacroTest` which does `SortedDocValues` merging, which creates an `OrdinalMap` backed by `PackedInts` and then recorded the merge times.
In all cases I manually updated the PackedInt class to returned the desired implementation. Based on these set of results, Packed64 is on the first place, followed by VHLongAndByte with VHLongLong coming in third.
I'm attaching the merging times found for reference:
[p64vhll.log](https://github.com/apache/lucene/files/7565523/p64vhll.log)
[p64.log](https://github.com/apache/lucene/files/7565524/p64.log)
[p64vhlb.log](https://github.com/apache/lucene/files/7565525/p64vhlb.log)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] sonatype-lift[bot] commented on a change in pull request #453: Investigate whether should Packed64 use a byte[] plus VarHandles instead of a long[]
Posted by GitBox <gi...@apache.org>.
sonatype-lift[bot] commented on a change in pull request #453:
URL: https://github.com/apache/lucene/pull/453#discussion_r761874066
##########
File path: lucene/core/src/java/org/apache/lucene/util/packed/Packed64VHLongAndByte.java
##########
@@ -0,0 +1,154 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.util.packed;
+
+import org.apache.lucene.util.ArrayUtil;
+import org.apache.lucene.util.BitUtil;
+import org.apache.lucene.util.RamUsageEstimator;
+
+import java.util.Arrays;
+
+/**
+ * Space optimized random access capable array of values with a fixed number of bits/value. Values
+ * are packed contiguously.
+ *
+ * <p>The implementation strives to perform as fast as possible under the constraint of contiguous
+ * bits, by avoiding expensive operations. This comes at the cost of code clarity.
+ *
+ * <p>Technical details: This implementation is a refinement of a non-branching version. The
+ * non-branching get and set methods meant that 2 or 4 atomics in the underlying array were always
+ * accessed, even for the cases where only 1 or 2 were needed. Even with caching, this had a
+ * detrimental effect on performance. Related to this issue, the old implementation used lookup
+ * tables for shifts and masks, which also proved to be a bit slower than calculating the shifts and
+ * masks on the fly. See https://issues.apache.org/jira/browse/LUCENE-4062 for details.
+ */
+class Packed64VHLongAndByte extends PackedInts.MutableImpl {
+ static final int BLOCK_SIZE = 64; // 32 = int, 64 = long
+ static final int BYTE_BITS = 3;
+ static final int BYTE_BLOCK_SIZE = 8;
+ static final int BYTE_BLOCK_MOD_MASK = BYTE_BLOCK_SIZE - 1; // x % BYTE_BLOCK
+
+ /** Values are stores contiguously in the blocks array. */
+ private final byte[] blocks;
+ /** A right-aligned mask of width BitsPerValue used by {@link #get(int)}. */
+ private final long maskRight;
+ /** Optimization: Saves one lookup in {@link #get(int)}. */
+ private final int bpvMinusBlockSize;
Review comment:
*UnusedVariable:* The field 'bpvMinusBlockSize' is never read. [(details)](https://errorprone.info/bugpattern/UnusedVariable)
(at-me [in a reply](https://help.sonatype.com/lift/talking-to-lift) with `help` or `ignore`)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] jpountz commented on pull request #453: Investigate whether should Packed64 use a byte[] plus VarHandles instead of a long[]
Posted by GitBox <gi...@apache.org>.
jpountz commented on pull request #453:
URL: https://github.com/apache/lucene/pull/453#issuecomment-985539567
Thanks a lot @costin for running these benchmarks. I'm a bit disappointed by the results, I was hoping that the removal of the condition would yield better results than that. I would suggest that we close this exploration now. Sorry that this didn't result in a code change in Lucene, but I want to say that it's good that we did these experiments because at least now we know thet VarHandles are not a silver bullet to make Packed64 faster.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] costin commented on pull request #453: Investigate whether should Packed64 use a byte[] plus VarHandles instead of a long[]
Posted by GitBox <gi...@apache.org>.
costin commented on pull request #453:
URL: https://github.com/apache/lucene/pull/453#issuecomment-972988872
I've tested `Packed64LongByte` multiple times against `TestPackedInts` and it seems to pass. Note that I've focused mainly on the `get` and `set` methods and removed the specialized fill and bulk get/set methods due to slightly different encoding.
tl;dr - `Packed64` has best performance, followed by `Packed64VarHandleLongLong` and lastly by `Packed64LongByte` with a significant gap. My guess is the long[] array is more cache friendly while the byte[] isn't - by sliding across the byte[] to read longs, the caches are thrashed which is much more expensive than doing bitwise operations.
The data in case of byte[] support this - performance is great for the 1 and 64 case (where the sliding is minimal since new positions belong to the same byte block), while for the rest it falls to the ground.
I've included the benchmark in the commit; the tests are running on latest JDK 17.0.1 on an Ryzen 5900x.
```
openjdk version "17.0.1" 2021-10-19
OpenJDK Runtime Environment Temurin-17.0.1+12 (build 17.0.1+12)
OpenJDK 64-Bit Server VM Temurin-17.0.1+12 (build 17.0.1+12, mixed mode, sharing)
```
```
# JMH version: 1.33
# VM version: JDK 17.0.1, OpenJDK 64-Bit Server VM, 17.0.1+12
Benchmark (bpv) (size) Mode Cnt Score Error Units
Packed64Benchmark.packed64 1 10240 thrpt 3 80129.482 ± 2603.161 ops/s
Packed64Benchmark.packed64 4 10240 thrpt 3 87680.030 ± 2725.245 ops/s
Packed64Benchmark.packed64 5 10240 thrpt 3 56768.849 ± 633.855 ops/s
Packed64Benchmark.packed64 8 10240 thrpt 3 91869.977 ± 1579.788 ops/s
Packed64Benchmark.packed64 11 10240 thrpt 3 71802.824 ± 2124.577 ops/s
Packed64Benchmark.packed64 16 10240 thrpt 3 94892.804 ± 1876.298 ops/s
Packed64Benchmark.packed64 23 10240 thrpt 3 65401.053 ± 690.302 ops/s
Packed64Benchmark.packed64 25 10240 thrpt 3 69623.043 ± 679.354 ops/s
Packed64Benchmark.packed64 31 10240 thrpt 3 66732.829 ± 1788.781 ops/s
Packed64Benchmark.packed64 32 10240 thrpt 3 99938.841 ± 197.846 ops/s
Packed64Benchmark.packed64 47 10240 thrpt 3 52174.874 ± 2369.976 ops/s
Packed64Benchmark.packed64 59 10240 thrpt 3 55240.166 ± 165.920 ops/s
Packed64Benchmark.packed64 61 10240 thrpt 3 54909.784 ± 5064.028 ops/s
Packed64Benchmark.packed64 64 10240 thrpt 3 99261.182 ± 2746.589 ops/s
Packed64Benchmark.packed64VarHandleLongAndByte 1 10240 thrpt 3 56909.631 ± 1986.897 ops/s
Packed64Benchmark.packed64VarHandleLongAndByte 4 10240 thrpt 3 31584.303 ± 1063.071 ops/s
Packed64Benchmark.packed64VarHandleLongAndByte 5 10240 thrpt 3 23703.179 ± 2127.908 ops/s
Packed64Benchmark.packed64VarHandleLongAndByte 8 10240 thrpt 3 17958.177 ± 825.106 ops/s
Packed64Benchmark.packed64VarHandleLongAndByte 11 10240 thrpt 3 17810.271 ± 379.489 ops/s
Packed64Benchmark.packed64VarHandleLongAndByte 16 10240 thrpt 3 19043.959 ± 2185.749 ops/s
Packed64Benchmark.packed64VarHandleLongAndByte 23 10240 thrpt 3 17784.221 ± 824.539 ops/s
Packed64Benchmark.packed64VarHandleLongAndByte 25 10240 thrpt 3 17728.510 ± 1063.605 ops/s
Packed64Benchmark.packed64VarHandleLongAndByte 31 10240 thrpt 3 17742.034 ± 198.159 ops/s
Packed64Benchmark.packed64VarHandleLongAndByte 32 10240 thrpt 3 21951.514 ± 7648.112 ops/s
Packed64Benchmark.packed64VarHandleLongAndByte 47 10240 thrpt 3 17741.987 ± 866.598 ops/s
Packed64Benchmark.packed64VarHandleLongAndByte 59 10240 thrpt 3 19481.548 ± 446.086 ops/s
Packed64Benchmark.packed64VarHandleLongAndByte 61 10240 thrpt 3 18999.020 ± 949.164 ops/s
Packed64Benchmark.packed64VarHandleLongAndByte 64 10240 thrpt 3 74178.395 ± 1219.876 ops/s
Packed64Benchmark.packed64VarHandleLongLong 1 10240 thrpt 3 73824.221 ± 3210.970 ops/s
Packed64Benchmark.packed64VarHandleLongLong 4 10240 thrpt 3 82908.298 ± 8099.224 ops/s
Packed64Benchmark.packed64VarHandleLongLong 5 10240 thrpt 3 54176.175 ± 2005.872 ops/s
Packed64Benchmark.packed64VarHandleLongLong 8 10240 thrpt 3 83556.558 ± 1658.929 ops/s
Packed64Benchmark.packed64VarHandleLongLong 11 10240 thrpt 3 56512.522 ± 380.595 ops/s
Packed64Benchmark.packed64VarHandleLongLong 16 10240 thrpt 3 83820.820 ± 2000.138 ops/s
Packed64Benchmark.packed64VarHandleLongLong 23 10240 thrpt 3 62570.823 ± 604.178 ops/s
Packed64Benchmark.packed64VarHandleLongLong 25 10240 thrpt 3 63497.723 ± 2172.188 ops/s
Packed64Benchmark.packed64VarHandleLongLong 31 10240 thrpt 3 62699.661 ± 3982.879 ops/s
Packed64Benchmark.packed64VarHandleLongLong 32 10240 thrpt 3 84795.726 ± 7409.939 ops/s
Packed64Benchmark.packed64VarHandleLongLong 47 10240 thrpt 3 55820.366 ± 1247.217 ops/s
Packed64Benchmark.packed64VarHandleLongLong 59 10240 thrpt 3 49562.673 ± 1131.955 ops/s
Packed64Benchmark.packed64VarHandleLongLong 61 10240 thrpt 3 53540.899 ± 1268.365 ops/s
Packed64Benchmark.packed64VarHandleLongLong 64 10240 thrpt 3 85549.020 ± 3293.070 ops/s
```
I'm sure the `Packed64LongByte` (p64lb) implementation could be improved however the gap in performance is significant.
p64lb is somewhat reasonable only when dealing with pbv of 64 and 1 - in all other cases performance simply drops.
Packed64LongLong (p64ll) tracks the performance of `Packed64` (p64) pretty well (which is expected).
To understand where the gap cap comes from, I've done more tests for the best bpv of 64 and worse case 23 and it boils down to this line
https://github.com/apache/lucene/blob/6bd5c14bf3c6113244773d2e1eb39233dc33ad71/lucene/core/src/java/org/apache/lucene/util/packed/Packed64.java#L73
The bigger the shift, the better the performance - not because of the shift itself but rather because the higher the chances data falls in the same data block which for consecutive operations is going to benefit from caching.
To test this out, I've changed the code in `Packed64LongAndByte` from :
```
final int elementPos = (int) (majorBitPos >>> 3); // /8
```
to
```
final int elementPos = (int) (majorBitPos >>> 6); // /64
```
which almost doubled the performance of pb64lb (which is still half of packed64):
```
Benchmark (bpv) (size) Mode Cnt Score Error Units
Packed64Benchmark.packed64 23 10240 thrpt 3 65188.193 ± 3034.465 ops/s
Packed64Benchmark.packed64VarHandleLongAndByte 23 10240 thrpt 3 30850.017 ± 422.149 ops/s
Packed64Benchmark.packed64VarHandleLongLong 23 10240 thrpt 3 56759.947 ± 1055.780 ops/s
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] costin commented on pull request #453: Investigate whether should Packed64 use a byte[] plus VarHandles instead of a long[]
Posted by GitBox <gi...@apache.org>.
costin commented on pull request #453:
URL: https://github.com/apache/lucene/pull/453#issuecomment-972999769
Adding a few more info to this ticket regarding the underlying assembly.
The p64lb method is the shortest, followed by p64 and p64ll.
```
============================= C1-compiled nmethod ==============================
----------------------------------- Assembly -----------------------------------
Compiled method (c1) 194 774 3 org.apache.lucene.util.packed.Packed64::set (145 bytes)
total in heap [0x000001de36e94990,0x000001de36e94f20] = 1424
relocation [0x000001de36e94ae8,0x000001de36e94b30] = 72
main code [0x000001de36e94b40,0x000001de36e94dc0] = 640
stub code [0x000001de36e94dc0,0x000001de36e94df8] = 56
oops [0x000001de36e94df8,0x000001de36e94e00] = 8
metadata [0x000001de36e94e00,0x000001de36e94e08] = 8
scopes data [0x000001de36e94e08,0x000001de36e94e50] = 72
scopes pcs [0x000001de36e94e50,0x000001de36e94ef0] = 160
dependencies [0x000001de36e94ef0,0x000001de36e94ef8] = 8
nul chk table [0x000001de36e94ef8,0x000001de36e94f20] = 40
----------------------------------- Assembly -----------------------------------
Compiled method (c1) 173 799 3 org.apache.lucene.util.packed.Packed64VHLongAndByte::set (144 bytes)
total in heap [0x000001c7d511be10,0x000001c7d511d720] = 6416
relocation [0x000001c7d511bf68,0x000001c7d511c0c8] = 352
main code [0x000001c7d511c0e0,0x000001c7d511d0e0] = 4096
stub code [0x000001c7d511d0e0,0x000001c7d511d158] = 120
oops [0x000001c7d511d158,0x000001c7d511d190] = 56
metadata [0x000001c7d511d190,0x000001c7d511d1e8] = 88
scopes data [0x000001c7d511d1e8,0x000001c7d511d500] = 792
scopes pcs [0x000001c7d511d500,0x000001c7d511d6e0] = 480
dependencies [0x000001c7d511d6e0,0x000001c7d511d6e8] = 8
nul chk table [0x000001c7d511d6e8,0x000001c7d511d720] = 56
----------------------------------- Assembly -----------------------------------
Compiled method (c1) 177 797 3 org.apache.lucene.util.packed.Packed64VHLongLong::set (204 bytes)
total in heap [0x0000018a00893b90,0x0000018a00897390] = 14336
relocation [0x0000018a00893ce8,0x0000018a00894060] = 888
main code [0x0000018a00894060,0x0000018a008965a0] = 9536
stub code [0x0000018a008965a0,0x0000018a00896698] = 248
oops [0x0000018a00896698,0x0000018a008966d0] = 56
metadata [0x0000018a008966d0,0x0000018a00896728] = 88
scopes data [0x0000018a00896728,0x0000018a00896e70] = 1864
scopes pcs [0x0000018a00896e70,0x0000018a00897320] = 1200
dependencies [0x0000018a00897320,0x0000018a00897328] = 8
nul chk table [0x0000018a00897328,0x0000018a00897390] = 104
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] costin commented on pull request #453: Investigate whether should Packed64 use a byte[] plus VarHandles instead of a long[]
Posted by GitBox <gi...@apache.org>.
costin commented on pull request #453:
URL: https://github.com/apache/lucene/pull/453#issuecomment-985547790
Thanks for the feedback @jpountz. Closing the issue.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] sonatype-lift[bot] commented on a change in pull request #453: Investigate whether should Packed64 use a byte[] plus VarHandles instead of a long[]
Posted by GitBox <gi...@apache.org>.
sonatype-lift[bot] commented on a change in pull request #453:
URL: https://github.com/apache/lucene/pull/453#discussion_r761116584
##########
File path: lucene/core/src/java/org/apache/lucene/util/packed/Packed64VHLongAndByte.java
##########
@@ -0,0 +1,154 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.util.packed;
+
+import org.apache.lucene.util.ArrayUtil;
+import org.apache.lucene.util.BitUtil;
+import org.apache.lucene.util.RamUsageEstimator;
+
+import java.util.Arrays;
+
+/**
+ * Space optimized random access capable array of values with a fixed number of bits/value. Values
+ * are packed contiguously.
+ *
+ * <p>The implementation strives to perform as fast as possible under the constraint of contiguous
+ * bits, by avoiding expensive operations. This comes at the cost of code clarity.
+ *
+ * <p>Technical details: This implementation is a refinement of a non-branching version. The
+ * non-branching get and set methods meant that 2 or 4 atomics in the underlying array were always
+ * accessed, even for the cases where only 1 or 2 were needed. Even with caching, this had a
+ * detrimental effect on performance. Related to this issue, the old implementation used lookup
+ * tables for shifts and masks, which also proved to be a bit slower than calculating the shifts and
+ * masks on the fly. See https://issues.apache.org/jira/browse/LUCENE-4062 for details.
+ */
+class Packed64VHLongAndByte extends PackedInts.MutableImpl {
+ static final int BLOCK_SIZE = 64; // 32 = int, 64 = long
+ static final int BYTE_BITS = 3;
+ static final int BYTE_BLOCK_SIZE = 8;
+ static final int BYTE_BLOCK_MOD_MASK = BYTE_BLOCK_SIZE - 1; // x % BYTE_BLOCK
+
+ /** Values are stores contiguously in the blocks array. */
+ private final byte[] blocks;
+ /** A right-aligned mask of width BitsPerValue used by {@link #get(int)}. */
+ private final long maskRight;
+ /** Optimization: Saves one lookup in {@link #get(int)}. */
+ private final int bpvMinusBlockSize;
+
+ /**
+ * Creates an array with the internal structures adjusted for the given limits and initialized to
+ * 0.
+ *
+ * @param valueCount the number of elements.
+ * @param bitsPerValue the number of bits available for any given value.
+ */
+ public Packed64VHLongAndByte(int valueCount, int bitsPerValue) {
+ super(valueCount, bitsPerValue);
+ final PackedInts.Format format = PackedInts.Format.PACKED;
+ final long byteCount = format.byteCount(PackedInts.VERSION_CURRENT, valueCount, bitsPerValue);
+ if (bitsPerValue > 56) {
+ throw new IllegalArgumentException("Only bitsPerValue up to 56 allowed");
+ }
+ if (byteCount > ArrayUtil.MAX_ARRAY_LENGTH) {
+ throw new IllegalArgumentException("Too many values/bits to store");
+ }
+ this.blocks = new byte[(int) byteCount + Long.BYTES];
+ maskRight = ~0L << (BLOCK_SIZE - bitsPerValue) >>> (BLOCK_SIZE - bitsPerValue);
+ bpvMinusBlockSize = bitsPerValue - BLOCK_SIZE;
+ }
+
+ /**
+ * @param index the position of the value.
+ * @return the value at the given index.
+ */
+ @Override
+ public long get(final int index) {
+ // The abstract index in a bit stream
+ final long majorBitPos = (long) index * bitsPerValue;
+ // The index in the backing byte-array
+ final int elementPos = (int) (majorBitPos >>> BYTE_BITS); // BYTE_BLOCK
+ // The number of bits already occupied from the previous position
+ final int maskOffset = (int) majorBitPos & BYTE_BLOCK_MOD_MASK;
+// final int endBits = maskOffset + bpvMinusBlockSize;
+ // Read the main long
+ //final long packed = ((long) BitUtil.VH_LE_LONG.get(blocks, elementPos) >>> maskOffset) & maskRight;
+ return ((long) BitUtil.VH_LE_LONG.get(blocks, elementPos) >>> maskOffset) & maskRight;
+
+// if (endBits <= 0) { // Single block
+// return packed;
+// }
+// // compute next mask
+// return packed | (blocks[elementPos + BYTE_BLOCK_SIZE] & ~(~0L << endBits)) << (BLOCK_SIZE - maskOffset);
+ }
+
+ @Override
+ public void set(final int index, final long value) {
+ // The abstract index in a contiguous bit stream
+ final long majorBitPos = (long) index * bitsPerValue;
+ // The index in the backing byte-array
+ final int elementPos = (int) (majorBitPos >>> BYTE_BITS); // / BYTE_BLOCK
+ // The number of bits already occupied from the previous position
+ final int maskOffset = (int) majorBitPos & BYTE_BLOCK_MOD_MASK;
+ final int endBits = maskOffset + bpvMinusBlockSize;
+ // Read the main long
+ final long packed = (long) BitUtil.VH_LE_LONG.get(blocks, elementPos);
+ BitUtil.VH_LE_LONG.set(blocks, elementPos, packed & ~(maskRight << maskOffset) | (value << maskOffset));
Review comment:
*OperatorPrecedence:* Use grouping parenthesis to make the operator precedence explicit [(details)](https://errorprone.info/bugpattern/OperatorPrecedence)
(at-me [in a reply](https://help.sonatype.com/lift/talking-to-lift) with `help` or `ignore`)
##########
File path: lucene/core/src/java/org/apache/lucene/util/packed/Packed64VHLongLong.java
##########
@@ -0,0 +1,159 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.util.packed;
+
+import org.apache.lucene.util.ArrayUtil;
+import org.apache.lucene.util.BitUtil;
+import org.apache.lucene.util.RamUsageEstimator;
+
+import java.sql.Array;
+import java.util.Arrays;
+
+/**
+ * Space optimized random access capable array of values with a fixed number of bits/value. Values
+ * are packed contiguously.
+ *
+ * <p>The implementation strives to perform as fast as possible under the constraint of contiguous
+ * bits, by avoiding expensive operations. This comes at the cost of code clarity.
+ *
+ * <p>Technical details: This implementation is a refinement of a non-branching version. The
+ * non-branching get and set methods meant that 2 or 4 atomics in the underlying array were always
+ * accessed, even for the cases where only 1 or 2 were needed. Even with caching, this had a
+ * detrimental effect on performance. Related to this issue, the old implementation used lookup
+ * tables for shifts and masks, which also proved to be a bit slower than calculating the shifts and
+ * masks on the fly. See https://issues.apache.org/jira/browse/LUCENE-4062 for details.
+ */
+class Packed64VHLongLong extends PackedInts.MutableImpl {
+ static final int BLOCK_SIZE = 64; // 32 = int, 64 = long
+ static final int BLOCK_BITS = 6; // The #bits representing BLOCK_SIZE
+ static final int MOD_MASK = BLOCK_SIZE - 1; // x % BLOCK_SIZE
+
+ /** Values are stores contiguously in the blocks array. */
+ private final byte[] blocks;
+ /** A right-aligned mask of width BitsPerValue used by {@link #get(int)}. */
+ private final long maskRight;
+ /** Optimization: Saves one lookup in {@link #get(int)}. */
+ private final int bpvMinusBlockSize;
+
+ /**
+ * Creates an array with the internal structures adjusted for the given limits and initialized to
+ * 0.
+ *
+ * @param valueCount the number of elements.
+ * @param bitsPerValue the number of bits available for any given value.
+ */
+ public Packed64VHLongLong(int valueCount, int bitsPerValue) {
+ super(valueCount, bitsPerValue);
+ final PackedInts.Format format = PackedInts.Format.PACKED;
+ final long byteCount = format.byteCount(PackedInts.VERSION_CURRENT, valueCount, bitsPerValue);
+ if (byteCount > ArrayUtil.MAX_ARRAY_LENGTH) {
+ throw new IllegalArgumentException("Not yet supported");
+ }
+ this.blocks = new byte[(int) byteCount + 8];
+ maskRight = ~0L << (BLOCK_SIZE - bitsPerValue) >>> (BLOCK_SIZE - bitsPerValue);
+ bpvMinusBlockSize = bitsPerValue - BLOCK_SIZE;
+ }
+
+ /**
+ * @param index the position of the value.
+ * @return the value at the given index.
+ */
+ @Override
+ public long get(final int index) {
+ // The abstract index in a bit stream
+ final long majorBitPos = (long) index * bitsPerValue;
+ // The index in the backing byte-array
+ final int elementPos = (int) (majorBitPos >>> BLOCK_BITS) * 8;
+ // read entry as a long
+ long packed;
+
+ // The number of value-bits in the second long
+ final long endBits = (majorBitPos & MOD_MASK) + bpvMinusBlockSize;
+
+ if (endBits <= 0) { // Single block
+ packed = (long) BitUtil.VH_LE_LONG.get(blocks, elementPos);
+ return (packed >>> -endBits) & maskRight;
+ }
+ // Two blocks
+ packed = (long) BitUtil.VH_LE_LONG.get(blocks, elementPos);
+ long next = (long) BitUtil.VH_LE_LONG.get(blocks, elementPos + 8);
+ return ((packed << endBits) | (next >>> (BLOCK_SIZE - endBits)))
+ & maskRight;
+ }
+
+ @Override
+ public void set(final int index, final long value) {
+ // The abstract index in a contiguous bit stream
+ final long majorBitPos = (long) index * bitsPerValue;
+ // The index in the backing byte-array
+ final int elementPos = (int) (majorBitPos >>> BLOCK_BITS) * 8; // / BLOCK_SIZE
+ // The number of value-bits in the last block
+ final long endBits = (majorBitPos & MOD_MASK) + bpvMinusBlockSize;
+
+ long packed;
+ if (endBits <= 0) { // Single block
+ packed = (long) BitUtil.VH_LE_LONG.get(blocks, elementPos);
+ packed = packed & ~(maskRight << -endBits) | (value << -endBits);
Review comment:
*OperatorPrecedence:* Use grouping parenthesis to make the operator precedence explicit [(details)](https://errorprone.info/bugpattern/OperatorPrecedence)
(at-me [in a reply](https://help.sonatype.com/lift/talking-to-lift) with `help` or `ignore`)
##########
File path: lucene/core/src/java/org/apache/lucene/util/packed/Packed64VHLongLong.java
##########
@@ -0,0 +1,159 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.util.packed;
+
+import org.apache.lucene.util.ArrayUtil;
+import org.apache.lucene.util.BitUtil;
+import org.apache.lucene.util.RamUsageEstimator;
+
+import java.sql.Array;
+import java.util.Arrays;
+
+/**
+ * Space optimized random access capable array of values with a fixed number of bits/value. Values
+ * are packed contiguously.
+ *
+ * <p>The implementation strives to perform as fast as possible under the constraint of contiguous
+ * bits, by avoiding expensive operations. This comes at the cost of code clarity.
+ *
+ * <p>Technical details: This implementation is a refinement of a non-branching version. The
+ * non-branching get and set methods meant that 2 or 4 atomics in the underlying array were always
+ * accessed, even for the cases where only 1 or 2 were needed. Even with caching, this had a
+ * detrimental effect on performance. Related to this issue, the old implementation used lookup
+ * tables for shifts and masks, which also proved to be a bit slower than calculating the shifts and
+ * masks on the fly. See https://issues.apache.org/jira/browse/LUCENE-4062 for details.
+ */
+class Packed64VHLongLong extends PackedInts.MutableImpl {
+ static final int BLOCK_SIZE = 64; // 32 = int, 64 = long
+ static final int BLOCK_BITS = 6; // The #bits representing BLOCK_SIZE
+ static final int MOD_MASK = BLOCK_SIZE - 1; // x % BLOCK_SIZE
+
+ /** Values are stores contiguously in the blocks array. */
+ private final byte[] blocks;
+ /** A right-aligned mask of width BitsPerValue used by {@link #get(int)}. */
+ private final long maskRight;
+ /** Optimization: Saves one lookup in {@link #get(int)}. */
+ private final int bpvMinusBlockSize;
+
+ /**
+ * Creates an array with the internal structures adjusted for the given limits and initialized to
+ * 0.
+ *
+ * @param valueCount the number of elements.
+ * @param bitsPerValue the number of bits available for any given value.
+ */
+ public Packed64VHLongLong(int valueCount, int bitsPerValue) {
+ super(valueCount, bitsPerValue);
+ final PackedInts.Format format = PackedInts.Format.PACKED;
+ final long byteCount = format.byteCount(PackedInts.VERSION_CURRENT, valueCount, bitsPerValue);
+ if (byteCount > ArrayUtil.MAX_ARRAY_LENGTH) {
+ throw new IllegalArgumentException("Not yet supported");
+ }
+ this.blocks = new byte[(int) byteCount + 8];
+ maskRight = ~0L << (BLOCK_SIZE - bitsPerValue) >>> (BLOCK_SIZE - bitsPerValue);
+ bpvMinusBlockSize = bitsPerValue - BLOCK_SIZE;
+ }
+
+ /**
+ * @param index the position of the value.
+ * @return the value at the given index.
+ */
+ @Override
+ public long get(final int index) {
+ // The abstract index in a bit stream
+ final long majorBitPos = (long) index * bitsPerValue;
+ // The index in the backing byte-array
+ final int elementPos = (int) (majorBitPos >>> BLOCK_BITS) * 8;
+ // read entry as a long
+ long packed;
+
+ // The number of value-bits in the second long
+ final long endBits = (majorBitPos & MOD_MASK) + bpvMinusBlockSize;
+
+ if (endBits <= 0) { // Single block
+ packed = (long) BitUtil.VH_LE_LONG.get(blocks, elementPos);
+ return (packed >>> -endBits) & maskRight;
+ }
+ // Two blocks
+ packed = (long) BitUtil.VH_LE_LONG.get(blocks, elementPos);
+ long next = (long) BitUtil.VH_LE_LONG.get(blocks, elementPos + 8);
+ return ((packed << endBits) | (next >>> (BLOCK_SIZE - endBits)))
+ & maskRight;
+ }
+
+ @Override
+ public void set(final int index, final long value) {
+ // The abstract index in a contiguous bit stream
+ final long majorBitPos = (long) index * bitsPerValue;
+ // The index in the backing byte-array
+ final int elementPos = (int) (majorBitPos >>> BLOCK_BITS) * 8; // / BLOCK_SIZE
+ // The number of value-bits in the last block
+ final long endBits = (majorBitPos & MOD_MASK) + bpvMinusBlockSize;
+
+ long packed;
+ if (endBits <= 0) { // Single block
+ packed = (long) BitUtil.VH_LE_LONG.get(blocks, elementPos);
+ packed = packed & ~(maskRight << -endBits) | (value << -endBits);
+ BitUtil.VH_LE_LONG.set(blocks, elementPos, packed);
+ return;
+ }
+ // Two blocks
+ packed = (long) BitUtil.VH_LE_LONG.get(blocks, elementPos);
+ long newPacked = packed & ~(maskRight >>> endBits) | (value >>> endBits);
Review comment:
*OperatorPrecedence:* Use grouping parenthesis to make the operator precedence explicit [(details)](https://errorprone.info/bugpattern/OperatorPrecedence)
(at-me [in a reply](https://help.sonatype.com/lift/talking-to-lift) with `help` or `ignore`)
##########
File path: lucene/core/src/java/org/apache/lucene/util/packed/Packed64VHLongLong.java
##########
@@ -0,0 +1,159 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.util.packed;
+
+import org.apache.lucene.util.ArrayUtil;
+import org.apache.lucene.util.BitUtil;
+import org.apache.lucene.util.RamUsageEstimator;
+
+import java.sql.Array;
+import java.util.Arrays;
+
+/**
+ * Space optimized random access capable array of values with a fixed number of bits/value. Values
+ * are packed contiguously.
+ *
+ * <p>The implementation strives to perform as fast as possible under the constraint of contiguous
+ * bits, by avoiding expensive operations. This comes at the cost of code clarity.
+ *
+ * <p>Technical details: This implementation is a refinement of a non-branching version. The
+ * non-branching get and set methods meant that 2 or 4 atomics in the underlying array were always
+ * accessed, even for the cases where only 1 or 2 were needed. Even with caching, this had a
+ * detrimental effect on performance. Related to this issue, the old implementation used lookup
+ * tables for shifts and masks, which also proved to be a bit slower than calculating the shifts and
+ * masks on the fly. See https://issues.apache.org/jira/browse/LUCENE-4062 for details.
+ */
+class Packed64VHLongLong extends PackedInts.MutableImpl {
+ static final int BLOCK_SIZE = 64; // 32 = int, 64 = long
+ static final int BLOCK_BITS = 6; // The #bits representing BLOCK_SIZE
+ static final int MOD_MASK = BLOCK_SIZE - 1; // x % BLOCK_SIZE
+
+ /** Values are stores contiguously in the blocks array. */
+ private final byte[] blocks;
+ /** A right-aligned mask of width BitsPerValue used by {@link #get(int)}. */
+ private final long maskRight;
+ /** Optimization: Saves one lookup in {@link #get(int)}. */
+ private final int bpvMinusBlockSize;
+
+ /**
+ * Creates an array with the internal structures adjusted for the given limits and initialized to
+ * 0.
+ *
+ * @param valueCount the number of elements.
+ * @param bitsPerValue the number of bits available for any given value.
+ */
+ public Packed64VHLongLong(int valueCount, int bitsPerValue) {
+ super(valueCount, bitsPerValue);
+ final PackedInts.Format format = PackedInts.Format.PACKED;
+ final long byteCount = format.byteCount(PackedInts.VERSION_CURRENT, valueCount, bitsPerValue);
+ if (byteCount > ArrayUtil.MAX_ARRAY_LENGTH) {
+ throw new IllegalArgumentException("Not yet supported");
+ }
+ this.blocks = new byte[(int) byteCount + 8];
+ maskRight = ~0L << (BLOCK_SIZE - bitsPerValue) >>> (BLOCK_SIZE - bitsPerValue);
+ bpvMinusBlockSize = bitsPerValue - BLOCK_SIZE;
+ }
+
+ /**
+ * @param index the position of the value.
+ * @return the value at the given index.
+ */
+ @Override
+ public long get(final int index) {
+ // The abstract index in a bit stream
+ final long majorBitPos = (long) index * bitsPerValue;
+ // The index in the backing byte-array
+ final int elementPos = (int) (majorBitPos >>> BLOCK_BITS) * 8;
+ // read entry as a long
+ long packed;
+
+ // The number of value-bits in the second long
+ final long endBits = (majorBitPos & MOD_MASK) + bpvMinusBlockSize;
+
+ if (endBits <= 0) { // Single block
+ packed = (long) BitUtil.VH_LE_LONG.get(blocks, elementPos);
+ return (packed >>> -endBits) & maskRight;
+ }
+ // Two blocks
+ packed = (long) BitUtil.VH_LE_LONG.get(blocks, elementPos);
+ long next = (long) BitUtil.VH_LE_LONG.get(blocks, elementPos + 8);
+ return ((packed << endBits) | (next >>> (BLOCK_SIZE - endBits)))
+ & maskRight;
+ }
+
+ @Override
+ public void set(final int index, final long value) {
+ // The abstract index in a contiguous bit stream
+ final long majorBitPos = (long) index * bitsPerValue;
+ // The index in the backing byte-array
+ final int elementPos = (int) (majorBitPos >>> BLOCK_BITS) * 8; // / BLOCK_SIZE
+ // The number of value-bits in the last block
+ final long endBits = (majorBitPos & MOD_MASK) + bpvMinusBlockSize;
+
+ long packed;
+ if (endBits <= 0) { // Single block
+ packed = (long) BitUtil.VH_LE_LONG.get(blocks, elementPos);
+ packed = packed & ~(maskRight << -endBits) | (value << -endBits);
+ BitUtil.VH_LE_LONG.set(blocks, elementPos, packed);
+ return;
+ }
+ // Two blocks
+ packed = (long) BitUtil.VH_LE_LONG.get(blocks, elementPos);
+ long newPacked = packed & ~(maskRight >>> endBits) | (value >>> endBits);
+ BitUtil.VH_LE_LONG.set(blocks, elementPos, newPacked);
+
+ long packed2 = (long) BitUtil.VH_LE_LONG.get(blocks, elementPos + 8);
+ long newPacked2 = packed2 & (~0L >>> endBits) | (value << (BLOCK_SIZE - endBits));
Review comment:
*OperatorPrecedence:* Use grouping parenthesis to make the operator precedence explicit [(details)](https://errorprone.info/bugpattern/OperatorPrecedence)
(at-me [in a reply](https://help.sonatype.com/lift/talking-to-lift) with `help` or `ignore`)
##########
File path: lucene/core/src/java/org/apache/lucene/util/packed/Packed64VHLongAndByte.java
##########
@@ -0,0 +1,154 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.util.packed;
+
+import org.apache.lucene.util.ArrayUtil;
+import org.apache.lucene.util.BitUtil;
+import org.apache.lucene.util.RamUsageEstimator;
+
+import java.util.Arrays;
+
+/**
+ * Space optimized random access capable array of values with a fixed number of bits/value. Values
+ * are packed contiguously.
+ *
+ * <p>The implementation strives to perform as fast as possible under the constraint of contiguous
+ * bits, by avoiding expensive operations. This comes at the cost of code clarity.
+ *
+ * <p>Technical details: This implementation is a refinement of a non-branching version. The
+ * non-branching get and set methods meant that 2 or 4 atomics in the underlying array were always
+ * accessed, even for the cases where only 1 or 2 were needed. Even with caching, this had a
+ * detrimental effect on performance. Related to this issue, the old implementation used lookup
+ * tables for shifts and masks, which also proved to be a bit slower than calculating the shifts and
+ * masks on the fly. See https://issues.apache.org/jira/browse/LUCENE-4062 for details.
+ */
+class Packed64VHLongAndByte extends PackedInts.MutableImpl {
+ static final int BLOCK_SIZE = 64; // 32 = int, 64 = long
+ static final int BYTE_BITS = 3;
+ static final int BYTE_BLOCK_SIZE = 8;
+ static final int BYTE_BLOCK_MOD_MASK = BYTE_BLOCK_SIZE - 1; // x % BYTE_BLOCK
+
+ /** Values are stores contiguously in the blocks array. */
+ private final byte[] blocks;
+ /** A right-aligned mask of width BitsPerValue used by {@link #get(int)}. */
+ private final long maskRight;
+ /** Optimization: Saves one lookup in {@link #get(int)}. */
+ private final int bpvMinusBlockSize;
+
+ /**
+ * Creates an array with the internal structures adjusted for the given limits and initialized to
+ * 0.
+ *
+ * @param valueCount the number of elements.
+ * @param bitsPerValue the number of bits available for any given value.
+ */
+ public Packed64VHLongAndByte(int valueCount, int bitsPerValue) {
+ super(valueCount, bitsPerValue);
+ final PackedInts.Format format = PackedInts.Format.PACKED;
+ final long byteCount = format.byteCount(PackedInts.VERSION_CURRENT, valueCount, bitsPerValue);
+ if (bitsPerValue > 56) {
+ throw new IllegalArgumentException("Only bitsPerValue up to 56 allowed");
+ }
+ if (byteCount > ArrayUtil.MAX_ARRAY_LENGTH) {
+ throw new IllegalArgumentException("Too many values/bits to store");
+ }
+ this.blocks = new byte[(int) byteCount + Long.BYTES];
+ maskRight = ~0L << (BLOCK_SIZE - bitsPerValue) >>> (BLOCK_SIZE - bitsPerValue);
+ bpvMinusBlockSize = bitsPerValue - BLOCK_SIZE;
+ }
+
+ /**
+ * @param index the position of the value.
+ * @return the value at the given index.
+ */
+ @Override
+ public long get(final int index) {
+ // The abstract index in a bit stream
+ final long majorBitPos = (long) index * bitsPerValue;
+ // The index in the backing byte-array
+ final int elementPos = (int) (majorBitPos >>> BYTE_BITS); // BYTE_BLOCK
+ // The number of bits already occupied from the previous position
+ final int maskOffset = (int) majorBitPos & BYTE_BLOCK_MOD_MASK;
+// final int endBits = maskOffset + bpvMinusBlockSize;
+ // Read the main long
+ //final long packed = ((long) BitUtil.VH_LE_LONG.get(blocks, elementPos) >>> maskOffset) & maskRight;
+ return ((long) BitUtil.VH_LE_LONG.get(blocks, elementPos) >>> maskOffset) & maskRight;
+
+// if (endBits <= 0) { // Single block
+// return packed;
+// }
+// // compute next mask
+// return packed | (blocks[elementPos + BYTE_BLOCK_SIZE] & ~(~0L << endBits)) << (BLOCK_SIZE - maskOffset);
+ }
+
+ @Override
+ public void set(final int index, final long value) {
+ // The abstract index in a contiguous bit stream
+ final long majorBitPos = (long) index * bitsPerValue;
+ // The index in the backing byte-array
+ final int elementPos = (int) (majorBitPos >>> BYTE_BITS); // / BYTE_BLOCK
+ // The number of bits already occupied from the previous position
+ final int maskOffset = (int) majorBitPos & BYTE_BLOCK_MOD_MASK;
+ final int endBits = maskOffset + bpvMinusBlockSize;
Review comment:
*UnusedVariable:* The local variable 'endBits' is never read. [(details)](https://errorprone.info/bugpattern/UnusedVariable)
(at-me [in a reply](https://help.sonatype.com/lift/talking-to-lift) with `help` or `ignore`)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] costin commented on pull request #453: Investigate whether should Packed64 use a byte[] plus VarHandles instead of a long[]
Posted by GitBox <gi...@apache.org>.
costin commented on pull request #453:
URL: https://github.com/apache/lucene/pull/453#issuecomment-973879755
I've done more benchmarking with a variant of Packed64 that uses the `VarHandle` `int` and `short` variant to read memory when dealing with a smaller bpv (<=32 for `int`, <=16 for `short`) instead of reading a full long.
However performance takes a huge hit, an order of magnitude slower:
```
# VarHandle int (set):
Packed64Benchmark.packed64 16 10240 thrpt 3 5653670.172 ± 2376991.321 ops/s
Packed64Benchmark.packed64VarHandleDynamic 16 10240 thrpt 3 349186.319 ± 57043.802 ops/s
Packed64Benchmark.packed64VarHandleLongAndByte 16 10240 thrpt 3 4856165.100 ± 265007.574 ops/s
Packed64Benchmark.packed64VarHandleLongLong 16 10240 thrpt 3 5151665.941 ± 567170.748 ops/s
# VarHandle short (set):
Packed64Benchmark.packed64 15 10240 thrpt 3 5708794.814 ± 196465.456 ops/s
Packed64Benchmark.packed64VarHandleDynamic 15 10240 thrpt 3 370572.965 ± 17435.808 ops/s
Packed64Benchmark.packed64VarHandleLongAndByte 15 10240 thrpt 3 4980979.402 ± 305740.478 ops/s
Packed64Benchmark.packed64VarHandleLongLong 15 10240 thrpt 3 4913621.842 ± 248031.886 ops/s
```
Changing to a direct array assignment (when dealing with bpv <= 8) improves performance a bit but still far from enough:
```
# Direct array assignment
Benchmark (bpv) (size) Mode Cnt Score Error Units
Packed64Benchmark.packed64 7 10240 thrpt 3 5711092.727 ± 373045.792 ops/s
Packed64Benchmark.packed64VarHandleDynamic 7 10240 thrpt 3 459770.147 ± 10022.225 ops/s
Packed64Benchmark.packed64VarHandleLongAndByte 7 10240 thrpt 3 5147663.602 ± 98164.084 ops/s
Packed64Benchmark.packed64VarHandleLongLong 7 10240 thrpt 3 5161701.054 ± 182228.458 ops/s
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org