You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2021/11/18 15:46:56 UTC
[GitHub] [lucene] costin commented on pull request #453: Investigate whether should Packed64 use a byte[] plus VarHandles instead of a long[]
costin commented on pull request #453:
URL: https://github.com/apache/lucene/pull/453#issuecomment-972988872
I've tested `Packed64LongByte` multiple times against `TestPackedInts` and it seems to pass. Note that I've focused mainly on the `get` and `set` methods and removed the specialized fill and bulk get/set methods due to slightly different encoding.
tl;dr - `Packed64` has best performance, followed by `Packed64VarHandleLongLong` and lastly by `Packed64LongByte` with a significant gap. My guess is the long[] array is more cache friendly while the byte[] isn't - by sliding across the byte[] to read longs, the caches are thrashed which is much more expensive than doing bitwise operations.
The data in case of byte[] support this - performance is great for the 1 and 64 case (where the sliding is minimal since new positions belong to the same byte block), while for the rest it falls to the ground.
I've included the benchmark in the commit; the tests are running on latest JDK 17.0.1 on an Ryzen 5900x.
```
openjdk version "17.0.1" 2021-10-19
OpenJDK Runtime Environment Temurin-17.0.1+12 (build 17.0.1+12)
OpenJDK 64-Bit Server VM Temurin-17.0.1+12 (build 17.0.1+12, mixed mode, sharing)
```
```
# JMH version: 1.33
# VM version: JDK 17.0.1, OpenJDK 64-Bit Server VM, 17.0.1+12
Benchmark (bpv) (size) Mode Cnt Score Error Units
Packed64Benchmark.packed64 1 10240 thrpt 3 80129.482 ± 2603.161 ops/s
Packed64Benchmark.packed64 4 10240 thrpt 3 87680.030 ± 2725.245 ops/s
Packed64Benchmark.packed64 5 10240 thrpt 3 56768.849 ± 633.855 ops/s
Packed64Benchmark.packed64 8 10240 thrpt 3 91869.977 ± 1579.788 ops/s
Packed64Benchmark.packed64 11 10240 thrpt 3 71802.824 ± 2124.577 ops/s
Packed64Benchmark.packed64 16 10240 thrpt 3 94892.804 ± 1876.298 ops/s
Packed64Benchmark.packed64 23 10240 thrpt 3 65401.053 ± 690.302 ops/s
Packed64Benchmark.packed64 25 10240 thrpt 3 69623.043 ± 679.354 ops/s
Packed64Benchmark.packed64 31 10240 thrpt 3 66732.829 ± 1788.781 ops/s
Packed64Benchmark.packed64 32 10240 thrpt 3 99938.841 ± 197.846 ops/s
Packed64Benchmark.packed64 47 10240 thrpt 3 52174.874 ± 2369.976 ops/s
Packed64Benchmark.packed64 59 10240 thrpt 3 55240.166 ± 165.920 ops/s
Packed64Benchmark.packed64 61 10240 thrpt 3 54909.784 ± 5064.028 ops/s
Packed64Benchmark.packed64 64 10240 thrpt 3 99261.182 ± 2746.589 ops/s
Packed64Benchmark.packed64VarHandleLongAndByte 1 10240 thrpt 3 56909.631 ± 1986.897 ops/s
Packed64Benchmark.packed64VarHandleLongAndByte 4 10240 thrpt 3 31584.303 ± 1063.071 ops/s
Packed64Benchmark.packed64VarHandleLongAndByte 5 10240 thrpt 3 23703.179 ± 2127.908 ops/s
Packed64Benchmark.packed64VarHandleLongAndByte 8 10240 thrpt 3 17958.177 ± 825.106 ops/s
Packed64Benchmark.packed64VarHandleLongAndByte 11 10240 thrpt 3 17810.271 ± 379.489 ops/s
Packed64Benchmark.packed64VarHandleLongAndByte 16 10240 thrpt 3 19043.959 ± 2185.749 ops/s
Packed64Benchmark.packed64VarHandleLongAndByte 23 10240 thrpt 3 17784.221 ± 824.539 ops/s
Packed64Benchmark.packed64VarHandleLongAndByte 25 10240 thrpt 3 17728.510 ± 1063.605 ops/s
Packed64Benchmark.packed64VarHandleLongAndByte 31 10240 thrpt 3 17742.034 ± 198.159 ops/s
Packed64Benchmark.packed64VarHandleLongAndByte 32 10240 thrpt 3 21951.514 ± 7648.112 ops/s
Packed64Benchmark.packed64VarHandleLongAndByte 47 10240 thrpt 3 17741.987 ± 866.598 ops/s
Packed64Benchmark.packed64VarHandleLongAndByte 59 10240 thrpt 3 19481.548 ± 446.086 ops/s
Packed64Benchmark.packed64VarHandleLongAndByte 61 10240 thrpt 3 18999.020 ± 949.164 ops/s
Packed64Benchmark.packed64VarHandleLongAndByte 64 10240 thrpt 3 74178.395 ± 1219.876 ops/s
Packed64Benchmark.packed64VarHandleLongLong 1 10240 thrpt 3 73824.221 ± 3210.970 ops/s
Packed64Benchmark.packed64VarHandleLongLong 4 10240 thrpt 3 82908.298 ± 8099.224 ops/s
Packed64Benchmark.packed64VarHandleLongLong 5 10240 thrpt 3 54176.175 ± 2005.872 ops/s
Packed64Benchmark.packed64VarHandleLongLong 8 10240 thrpt 3 83556.558 ± 1658.929 ops/s
Packed64Benchmark.packed64VarHandleLongLong 11 10240 thrpt 3 56512.522 ± 380.595 ops/s
Packed64Benchmark.packed64VarHandleLongLong 16 10240 thrpt 3 83820.820 ± 2000.138 ops/s
Packed64Benchmark.packed64VarHandleLongLong 23 10240 thrpt 3 62570.823 ± 604.178 ops/s
Packed64Benchmark.packed64VarHandleLongLong 25 10240 thrpt 3 63497.723 ± 2172.188 ops/s
Packed64Benchmark.packed64VarHandleLongLong 31 10240 thrpt 3 62699.661 ± 3982.879 ops/s
Packed64Benchmark.packed64VarHandleLongLong 32 10240 thrpt 3 84795.726 ± 7409.939 ops/s
Packed64Benchmark.packed64VarHandleLongLong 47 10240 thrpt 3 55820.366 ± 1247.217 ops/s
Packed64Benchmark.packed64VarHandleLongLong 59 10240 thrpt 3 49562.673 ± 1131.955 ops/s
Packed64Benchmark.packed64VarHandleLongLong 61 10240 thrpt 3 53540.899 ± 1268.365 ops/s
Packed64Benchmark.packed64VarHandleLongLong 64 10240 thrpt 3 85549.020 ± 3293.070 ops/s
```
I'm sure the `Packed64LongByte` (p64lb) implementation could be improved however the gap in performance is significant.
p64lb is somewhat reasonable only when dealing with pbv of 64 and 1 - in all other cases performance simply drops.
Packed64LongLong (p64ll) tracks the performance of `Packed64` (p64) pretty well (which is expected).
To understand where the gap cap comes from, I've done more tests for the best bpv of 64 and worse case 23 and it boils down to this line
https://github.com/apache/lucene/blob/6bd5c14bf3c6113244773d2e1eb39233dc33ad71/lucene/core/src/java/org/apache/lucene/util/packed/Packed64.java#L73
The bigger the shift, the better the performance - not because of the shift itself but rather because the higher the chances data falls in the same data block which for consecutive operations is going to benefit from caching.
To test this out, I've changed the code in `Packed64LongAndByte` from :
```
final int elementPos = (int) (majorBitPos >>> 3); // /8
```
to
```
final int elementPos = (int) (majorBitPos >>> 6); // /64
```
which almost doubled the performance of pb64lb (which is still half of packed64):
```
Benchmark (bpv) (size) Mode Cnt Score Error Units
Packed64Benchmark.packed64 23 10240 thrpt 3 65188.193 ± 3034.465 ops/s
Packed64Benchmark.packed64VarHandleLongAndByte 23 10240 thrpt 3 30850.017 ± 422.149 ops/s
Packed64Benchmark.packed64VarHandleLongLong 23 10240 thrpt 3 56759.947 ± 1055.780 ops/s
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org