You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2021/11/18 14:38:02 UTC

[GitHub] [lucene] costin opened a new pull request #453: Investigate whether should Packed64 use a byte[] plus VarHandles instead of a long[]

costin opened a new pull request #453:
URL: https://github.com/apache/lucene/pull/453


   Add VarHandle variants for `Packed64` to investigate whether using VarHandles 
   improved performance or not.
   
   This PR introduces two new Packed64 variants:
   ~ Packed64LongLong
   Same as Packed64 however instead of using direct array access, it
   relies on `VarHandle`. It's main purpose benchmarking.
   ~ Packed64LongByte
   A different implementation of Packed64 backed by a byte[]. It reads the data as a
    long and in case of overflow reads an extra byte. The algorithm is slightly different
   since the overflowing bits need to be computed on a different block size (8 vs 64)
   which requires extra cycles.
   
   Related LUCENE-10205


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] costin commented on pull request #453: Investigate whether should Packed64 use a byte[] plus VarHandles instead of a long[]

Posted by GitBox <gi...@apache.org>.
costin commented on pull request #453:
URL: https://github.com/apache/lucene/pull/453#issuecomment-973193877


   Thanks to @jpountz I've added two more tests beside JMH.
   `PackedRandomTest` which does a brute force for testing performance of random locations instead of consecutive ones. Surprisingly the VHLongByte implementation comes ahead, followed by Packed64 and finally Packed64VHLongLong - the results on my machine are roughly:
   ```
   1. VHLongAndByte     939100
   2. Packed64         1358500
   3. VHLongLong       1510400
   ```
   
   Adrien suggested another workload, more realistic `PackedMacroTest` which does `SortedDocValues` merging, which creates an `OrdinalMap` backed by `PackedInts` and then recorded the merge times.
   In all cases I manually updated the PackedInt class to returned the desired implementation.  Based on these set of results, Packed64 is on the first place, followed by VHLongAndByte with VHLongLong coming in third.
   
   I'm attaching the merging times found for reference:
   [p64vhll.log](https://github.com/apache/lucene/files/7565523/p64vhll.log)
   [p64.log](https://github.com/apache/lucene/files/7565524/p64.log)
   [p64vhlb.log](https://github.com/apache/lucene/files/7565525/p64vhlb.log)
   .


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] costin commented on pull request #453: Investigate whether should Packed64 use a byte[] plus VarHandles instead of a long[]

Posted by GitBox <gi...@apache.org>.
costin commented on pull request #453:
URL: https://github.com/apache/lucene/pull/453#issuecomment-984534767


   Updated the benchmark to use JMH `blackhole` and tested it with/without the blackhole flag enabled on the Java 17.0.1 compiler and got the following results:
   ```
   Benchmark                                                (bpv)  (size)   Mode  Cnt      Score       Error  Units
   Packed64Benchmark.packed64VarHandleLongByte_Consecutive     47   10240  thrpt    3  12381.811 ±  6265.438  ops/s
   Packed64Benchmark.packed64VarHandleLongByte_Sparse          47   10240  thrpt    3  19846.767 ±  1419.612  ops/s
   Packed64Benchmark.packed64VarHandleLongLong_Consecutive     47   10240  thrpt    3  19840.317 ± 11542.155  ops/s
   Packed64Benchmark.packed64VarHandleLongLong_Sparse          47   10240  thrpt    3  21213.913 ±   592.893  ops/s
   Packed64Benchmark.packed64_Consecutive                      47   10240  thrpt    3  22211.687 ±  2118.587  ops/s
   Packed64Benchmark.packed64_Sparse                           47   10240  thrpt    3  22870.002 ±   418.572  ops/s
   
   Compiler blackhole:
   Packed64Benchmark.packed64VarHandleLongByte_Consecutive     47   10240  thrpt    3  15021.054 ±  2538.483  ops/s
   Packed64Benchmark.packed64VarHandleLongByte_Sparse          47   10240  thrpt    3  25409.229 ±  5009.585  ops/s
   Packed64Benchmark.packed64VarHandleLongLong_Consecutive     47   10240  thrpt    3  25273.456 ± 21212.884  ops/s
   Packed64Benchmark.packed64VarHandleLongLong_Sparse          47   10240  thrpt    3  28486.621 ±  1019.290  ops/s
   Packed64Benchmark.packed64_Consecutive                      47   10240  thrpt    3  28826.916 ±  7247.116  ops/s
   Packed64Benchmark.packed64_Sparse                           47   10240  thrpt    3  29529.579 ± 12940.266  ops/s
   ```
   
   Also plugged Packed64VHLB into PackedInts so it gets used for bpv up to 56 (included) and ran the macro test to check performance and got the following results 
   [dynamic.log](https://github.com/apache/lucene/files/7641422/dynamic.log)
   
   Overall it's a mixed bag - with some results better especially at the beginning of the test and then mixed results in terms of performance.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] costin edited a comment on pull request #453: Investigate whether should Packed64 use a byte[] plus VarHandles instead of a long[]

Posted by GitBox <gi...@apache.org>.
costin edited a comment on pull request #453:
URL: https://github.com/apache/lucene/pull/453#issuecomment-973999178


   I have tighten the implementation a bit, removing an extra field adding some constants and following more the style of Packed64 with regards to the conditionals.
   In addition updated the benchmark to differentiate between consecutive get/set and spare (get/set) where different parts of memory are being read.
   This has a big impact on the VHLB performance (almost double) while the other implementations don't exhibit much difference, making them more consistent, for example:
   
   ```
   Packed64Benchmark.packed64VarHandleLongByte_Consecutive     23   10240  thrpt    3  16121.142 ±  836.602  ops/s
   Packed64Benchmark.packed64VarHandleLongByte_Sparse          23   10240  thrpt    3  28436.567 ±  771.609  ops/s
   Packed64Benchmark.packed64VarHandleLongLong_Consecutive     23   10240  thrpt    3  42751.522 ± 2367.785  ops/s
   Packed64Benchmark.packed64VarHandleLongLong_Sparse          23   10240  thrpt    3  40369.377 ± 1737.285  ops/s
   Packed64Benchmark.packed64_Consecutive                      23   10240  thrpt    3  52004.882 ± 2006.942  ops/s
   Packed64Benchmark.packed64_Sparse                           23   10240  thrpt    3  44671.486 ± 1567.467  ops/s
   ```
   
   It might be that the sparse benchmark is not adequate enough (the operations happen from the outside in, which should give slight advantage towards the middle due to data locality).
   
   Below is the full benchmark:
   
   ```
   Benchmark                                                (bpv)  (size)   Mode  Cnt      Score       Error  Units
   Packed64Benchmark.packed64VarHandleLongByte_Consecutive      1   10240  thrpt    3  38301.741 ±   655.443  ops/s
   Packed64Benchmark.packed64VarHandleLongByte_Consecutive      4   10240  thrpt    3  25817.622 ± 16839.449  ops/s
   Packed64Benchmark.packed64VarHandleLongByte_Consecutive      5   10240  thrpt    3  21086.108 ±  1738.288  ops/s
   Packed64Benchmark.packed64VarHandleLongByte_Consecutive      8   10240  thrpt    3  16114.364 ±  1042.073  ops/s
   Packed64Benchmark.packed64VarHandleLongByte_Consecutive     11   10240  thrpt    3  16056.271 ±  2397.599  ops/s
   Packed64Benchmark.packed64VarHandleLongByte_Consecutive     16   10240  thrpt    3  17125.401 ±  1574.413  ops/s
   Packed64Benchmark.packed64VarHandleLongByte_Consecutive     23   10240  thrpt    3  16063.316 ±   453.476  ops/s
   Packed64Benchmark.packed64VarHandleLongByte_Consecutive     25   10240  thrpt    3  16046.605 ±   315.952  ops/s
   Packed64Benchmark.packed64VarHandleLongByte_Consecutive     31   10240  thrpt    3  16017.969 ±   894.789  ops/s
   Packed64Benchmark.packed64VarHandleLongByte_Consecutive     32   10240  thrpt    3  19653.331 ±  1231.635  ops/s
   Packed64Benchmark.packed64VarHandleLongByte_Consecutive     47   10240  thrpt    3  15992.127 ±  1909.132  ops/s
   Packed64Benchmark.packed64VarHandleLongByte_Consecutive     59   10240  thrpt    3  17114.009 ±  6263.882  ops/s
   Packed64Benchmark.packed64VarHandleLongByte_Consecutive     61   10240  thrpt    3  15822.847 ±  8200.733  ops/s
   Packed64Benchmark.packed64VarHandleLongByte_Consecutive     64   10240  thrpt    3  40686.026 ±  2245.477  ops/s
   Packed64Benchmark.packed64VarHandleLongByte_Sparse           1   10240  thrpt    3  49795.721 ±  1016.448  ops/s
   Packed64Benchmark.packed64VarHandleLongByte_Sparse           4   10240  thrpt    3  37455.051 ±  1059.990  ops/s
   Packed64Benchmark.packed64VarHandleLongByte_Sparse           5   10240  thrpt    3  34629.635 ±   730.416  ops/s
   Packed64Benchmark.packed64VarHandleLongByte_Sparse           8   10240  thrpt    3  28438.560 ±   364.251  ops/s
   Packed64Benchmark.packed64VarHandleLongByte_Sparse          11   10240  thrpt    3  28240.196 ±   752.703  ops/s
   Packed64Benchmark.packed64VarHandleLongByte_Sparse          16   10240  thrpt    3  29998.199 ±  1275.620  ops/s
   Packed64Benchmark.packed64VarHandleLongByte_Sparse          23   10240  thrpt    3  28481.596 ±  1537.821  ops/s
   Packed64Benchmark.packed64VarHandleLongByte_Sparse          25   10240  thrpt    3  28585.727 ±   948.670  ops/s
   Packed64Benchmark.packed64VarHandleLongByte_Sparse          31   10240  thrpt    3  28002.335 ±  1701.436  ops/s
   Packed64Benchmark.packed64VarHandleLongByte_Sparse          32   10240  thrpt    3  34116.362 ±   421.667  ops/s
   Packed64Benchmark.packed64VarHandleLongByte_Sparse          47   10240  thrpt    3  28258.341 ±  1065.642  ops/s
   Packed64Benchmark.packed64VarHandleLongByte_Sparse          59   10240  thrpt    3  29776.379 ±   469.763  ops/s
   Packed64Benchmark.packed64VarHandleLongByte_Sparse          61   10240  thrpt    3  28820.101 ±  3651.373  ops/s
   Packed64Benchmark.packed64VarHandleLongByte_Sparse          64   10240  thrpt    3  57477.947 ±  3698.974  ops/s
   Packed64Benchmark.packed64VarHandleLongLong_Consecutive      1   10240  thrpt    3  32689.162 ±  1387.629  ops/s
   Packed64Benchmark.packed64VarHandleLongLong_Consecutive      4   10240  thrpt    3  35393.931 ±  1447.491  ops/s
   Packed64Benchmark.packed64VarHandleLongLong_Consecutive      5   10240  thrpt    3  40258.860 ±  2152.352  ops/s
   Packed64Benchmark.packed64VarHandleLongLong_Consecutive      8   10240  thrpt    3  35385.111 ±   347.894  ops/s
   Packed64Benchmark.packed64VarHandleLongLong_Consecutive     11   10240  thrpt    3  45596.088 ±   990.686  ops/s
   Packed64Benchmark.packed64VarHandleLongLong_Consecutive     16   10240  thrpt    3  35012.112 ±  9437.142  ops/s
   Packed64Benchmark.packed64VarHandleLongLong_Consecutive     23   10240  thrpt    3  47095.570 ±   905.039  ops/s
   Packed64Benchmark.packed64VarHandleLongLong_Consecutive     25   10240  thrpt    3  31985.949 ±   590.707  ops/s
   Packed64Benchmark.packed64VarHandleLongLong_Consecutive     31   10240  thrpt    3  42513.815 ±  1896.300  ops/s
   Packed64Benchmark.packed64VarHandleLongLong_Consecutive     32   10240  thrpt    3  35779.268 ±   956.749  ops/s
   Packed64Benchmark.packed64VarHandleLongLong_Consecutive     47   10240  thrpt    3  30137.376 ±   516.086  ops/s
   Packed64Benchmark.packed64VarHandleLongLong_Consecutive     59   10240  thrpt    3  25869.023 ±  1372.035  ops/s
   Packed64Benchmark.packed64VarHandleLongLong_Consecutive     61   10240  thrpt    3  24951.079 ±   345.293  ops/s
   Packed64Benchmark.packed64VarHandleLongLong_Consecutive     64   10240  thrpt    3  35496.562 ±  2744.344  ops/s
   Packed64Benchmark.packed64VarHandleLongLong_Sparse           1   10240  thrpt    3  53103.799 ±   564.971  ops/s
   Packed64Benchmark.packed64VarHandleLongLong_Sparse           4   10240  thrpt    3  53134.834 ±   763.798  ops/s
   Packed64Benchmark.packed64VarHandleLongLong_Sparse           5   10240  thrpt    3  42840.502 ±   547.945  ops/s
   Packed64Benchmark.packed64VarHandleLongLong_Sparse           8   10240  thrpt    3  53661.284 ±  2760.469  ops/s
   Packed64Benchmark.packed64VarHandleLongLong_Sparse          11   10240  thrpt    3  42976.618 ±  4369.925  ops/s
   Packed64Benchmark.packed64VarHandleLongLong_Sparse          16   10240  thrpt    3  53319.723 ±  8308.326  ops/s
   Packed64Benchmark.packed64VarHandleLongLong_Sparse          23   10240  thrpt    3  40894.483 ±   772.369  ops/s
   Packed64Benchmark.packed64VarHandleLongLong_Sparse          25   10240  thrpt    3  40646.463 ±  1482.981  ops/s
   Packed64Benchmark.packed64VarHandleLongLong_Sparse          31   10240  thrpt    3  40995.711 ±  1436.172  ops/s
   Packed64Benchmark.packed64VarHandleLongLong_Sparse          32   10240  thrpt    3  53884.020 ±  2006.964  ops/s
   Packed64Benchmark.packed64VarHandleLongLong_Sparse          47   10240  thrpt    3  29708.915 ±   657.222  ops/s
   Packed64Benchmark.packed64VarHandleLongLong_Sparse          59   10240  thrpt    3  35063.658 ±  1556.863  ops/s
   Packed64Benchmark.packed64VarHandleLongLong_Sparse          61   10240  thrpt    3  28871.592 ±   366.623  ops/s
   Packed64Benchmark.packed64VarHandleLongLong_Sparse          64   10240  thrpt    3  53333.751 ± 15511.237  ops/s
   Packed64Benchmark.packed64_Consecutive                       1   10240  thrpt    3  36595.316 ±  1213.323  ops/s
   Packed64Benchmark.packed64_Consecutive                       4   10240  thrpt    3  39777.028 ±   953.027  ops/s
   Packed64Benchmark.packed64_Consecutive                       5   10240  thrpt    3  46119.465 ±   211.767  ops/s
   Packed64Benchmark.packed64_Consecutive                       8   10240  thrpt    3  39886.892 ±  1574.036  ops/s
   Packed64Benchmark.packed64_Consecutive                      11   10240  thrpt    3  53222.462 ±  1847.024  ops/s
   Packed64Benchmark.packed64_Consecutive                      16   10240  thrpt    3  39897.330 ±  1012.499  ops/s
   Packed64Benchmark.packed64_Consecutive                      23   10240  thrpt    3  50924.631 ±  2607.771  ops/s
   Packed64Benchmark.packed64_Consecutive                      25   10240  thrpt    3  51179.396 ±  4118.732  ops/s
   Packed64Benchmark.packed64_Consecutive                      31   10240  thrpt    3  49350.911 ±  2652.142  ops/s
   Packed64Benchmark.packed64_Consecutive                      32   10240  thrpt    3  40245.046 ±  1506.183  ops/s
   Packed64Benchmark.packed64_Consecutive                      47   10240  thrpt    3  43794.173 ±  4789.862  ops/s
   Packed64Benchmark.packed64_Consecutive                      59   10240  thrpt    3  41248.048 ±  2533.885  ops/s
   Packed64Benchmark.packed64_Consecutive                      61   10240  thrpt    3  42538.675 ±  3462.201  ops/s
   Packed64Benchmark.packed64_Consecutive                      64   10240  thrpt    3  40091.319 ±   962.752  ops/s
   Packed64Benchmark.packed64_Sparse                            1   10240  thrpt    3  60621.693 ±  6289.038  ops/s
   Packed64Benchmark.packed64_Sparse                            4   10240  thrpt    3  63121.896 ±  5265.890  ops/s
   Packed64Benchmark.packed64_Sparse                            5   10240  thrpt    3  49445.705 ±  1348.639  ops/s
   Packed64Benchmark.packed64_Sparse                            8   10240  thrpt    3  63533.078 ±  5166.244  ops/s
   Packed64Benchmark.packed64_Sparse                           11   10240  thrpt    3  47624.192 ±  5238.701  ops/s
   Packed64Benchmark.packed64_Sparse                           16   10240  thrpt    3  64148.964 ±   820.543  ops/s
   Packed64Benchmark.packed64_Sparse                           23   10240  thrpt    3  44013.707 ±  3850.643  ops/s
   Packed64Benchmark.packed64_Sparse                           25   10240  thrpt    3  44837.906 ±  2231.909  ops/s
   Packed64Benchmark.packed64_Sparse                           31   10240  thrpt    3  44638.184 ±  1887.982  ops/s
   Packed64Benchmark.packed64_Sparse                           32   10240  thrpt    3  63831.488 ±  1325.146  ops/s
   Packed64Benchmark.packed64_Sparse                           47   10240  thrpt    3  41508.192 ±  3339.770  ops/s
   Packed64Benchmark.packed64_Sparse                           59   10240  thrpt    3  40849.655 ±  2445.001  ops/s
   Packed64Benchmark.packed64_Sparse                           61   10240  thrpt    3  36888.602 ±  1342.537  ops/s
   Packed64Benchmark.packed64_Sparse                           64   10240  thrpt    3  64386.395 ±  1694.784  ops/s
   ```
   
   Also reran the macro test with similar results to yesterday.
   [p64vhlb.log](https://github.com/apache/lucene/files/7569608/p64vhlb.log)
   
   As a last item, adding the compile info of the improved methods:
   
   ```
   ============================= C1-compiled nmethod ==============================
   ----------------------------------- Assembly -----------------------------------
   
   Compiled method (c1)     214  821       3       org.apache.lucene.util.packed.Packed64VHLongAndByte::get (94 bytes)
    total in heap  [0x0000028932208490,0x0000028932209218] = 3464
    relocation     [0x00000289322085e8,0x00000289322086b0] = 200
    main code      [0x00000289322086c0,0x0000028932208e80] = 1984
    stub code      [0x0000028932208e80,0x0000028932208ed8] = 88
    oops           [0x0000028932208ed8,0x0000028932208f08] = 48
    metadata       [0x0000028932208f08,0x0000028932208f50] = 72
    scopes data    [0x0000028932208f50,0x00000289322090c8] = 376
    scopes pcs     [0x00000289322090c8,0x00000289322091e8] = 288
    dependencies   [0x00000289322091e8,0x00000289322091f0] = 8
    nul chk table  [0x00000289322091f0,0x0000028932209218] = 40
   
   ============================= C2-compiled nmethod ==============================
   ----------------------------------- Assembly -----------------------------------
   
   Compiled method (c2)     221  829       4       org.apache.lucene.util.packed.Packed64VHLongAndByte::set (127 bytes)
    total in heap  [0x0000028939553110,0x0000028939553668] = 1368
    relocation     [0x0000028939553268,0x0000028939553288] = 32
    main code      [0x00000289395532a0,0x0000028939553440] = 416
    stub code      [0x0000028939553440,0x0000028939553458] = 24
    oops           [0x0000028939553458,0x0000028939553478] = 32
    metadata       [0x0000028939553478,0x00000289395534c0] = 72
    scopes data    [0x00000289395534c0,0x00000289395535d8] = 280
    scopes pcs     [0x00000289395535d8,0x0000028939553648] = 112
    dependencies   [0x0000028939553648,0x0000028939553658] = 16
    nul chk table  [0x0000028939553658,0x0000028939553668] = 16
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] costin commented on pull request #453: Investigate whether should Packed64 use a byte[] plus VarHandles instead of a long[]

Posted by GitBox <gi...@apache.org>.
costin commented on pull request #453:
URL: https://github.com/apache/lucene/pull/453#issuecomment-974257210


   To see how VHLB behaves without the branch conditional (for bpv < 56) I've created removed the endBits field and related codepath from `get` and `set`. With it the benchmark becomes:
   
   ```
   Benchmark                                                (bpv)  (size)   Mode  Cnt      Score      Error  Units
   Packed64Benchmark.packed64VarHandleLongByte_Consecutive     47   10240  thrpt    3  16746.842 ± 2978.916  ops/s
   Packed64Benchmark.packed64VarHandleLongByte_Sparse          47   10240  thrpt    3  31318.413 ±  639.686  ops/s
   Packed64Benchmark.packed64VarHandleLongLong_Consecutive     47   10240  thrpt    3  40157.645 ±  612.249  ops/s
   Packed64Benchmark.packed64VarHandleLongLong_Sparse          47   10240  thrpt    3  37980.091 ±  373.676  ops/s
   Packed64Benchmark.packed64_Consecutive                      47   10240  thrpt    3  43703.192 ±  993.999  ops/s
   Packed64Benchmark.packed64_Sparse                           47   10240  thrpt    3  41365.544 ± 1254.313  ops/s
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] costin closed pull request #453: Investigate whether should Packed64 use a byte[] plus VarHandles instead of a long[]

Posted by GitBox <gi...@apache.org>.
costin closed pull request #453:
URL: https://github.com/apache/lucene/pull/453


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] costin commented on pull request #453: Investigate whether should Packed64 use a byte[] plus VarHandles instead of a long[]

Posted by GitBox <gi...@apache.org>.
costin commented on pull request #453:
URL: https://github.com/apache/lucene/pull/453#issuecomment-973999178


   I have tighten the implementation a bit, removing an extra field adding some constants and following more the style of Packed64 with regards to the conditionals.
   In addition updated the benchmark to differentiate between consecutive get/set and spare (get/set) where different parts of memory are being read.
   This has a big impact on the VHLB performance (almost double) while the other implementations don't exhibit much difference, making them more consistent, for example:
   
   ```
   Packed64Benchmark.packed64VarHandleLongByte_Consecutive     23   10240  thrpt    3  16121.142 ±  836.602  ops/s
   Packed64Benchmark.packed64VarHandleLongByte_Sparse          23   10240  thrpt    3  28436.567 ±  771.609  ops/s
   Packed64Benchmark.packed64VarHandleLongLong_Consecutive     23   10240  thrpt    3  42751.522 ± 2367.785  ops/s
   Packed64Benchmark.packed64VarHandleLongLong_Sparse          23   10240  thrpt    3  40369.377 ± 1737.285  ops/s
   Packed64Benchmark.packed64_Consecutive                      23   10240  thrpt    3  52004.882 ± 2006.942  ops/s
   Packed64Benchmark.packed64_Sparse                           23   10240  thrpt    3  44671.486 ± 1567.467  ops/s
   ```
   
   It might be that the sparse benchmark is not adequate enough (the operations happen from the outside in, which should give slight advantage towards the middle due to data locality).
   
   Below is the full benchmark:
   
   ```
   Benchmark                                                (bpv)  (size)   Mode  Cnt      Score       Error  Units
   Packed64Benchmark.packed64VarHandleLongByte_Consecutive      1   10240  thrpt    3  38301.741 ±   655.443  ops/s
   Packed64Benchmark.packed64VarHandleLongByte_Consecutive      4   10240  thrpt    3  25817.622 ± 16839.449  ops/s
   Packed64Benchmark.packed64VarHandleLongByte_Consecutive      5   10240  thrpt    3  21086.108 ±  1738.288  ops/s
   Packed64Benchmark.packed64VarHandleLongByte_Consecutive      8   10240  thrpt    3  16114.364 ±  1042.073  ops/s
   Packed64Benchmark.packed64VarHandleLongByte_Consecutive     11   10240  thrpt    3  16056.271 ±  2397.599  ops/s
   Packed64Benchmark.packed64VarHandleLongByte_Consecutive     16   10240  thrpt    3  17125.401 ±  1574.413  ops/s
   Packed64Benchmark.packed64VarHandleLongByte_Consecutive     23   10240  thrpt    3  16063.316 ±   453.476  ops/s
   Packed64Benchmark.packed64VarHandleLongByte_Consecutive     25   10240  thrpt    3  16046.605 ±   315.952  ops/s
   Packed64Benchmark.packed64VarHandleLongByte_Consecutive     31   10240  thrpt    3  16017.969 ±   894.789  ops/s
   Packed64Benchmark.packed64VarHandleLongByte_Consecutive     32   10240  thrpt    3  19653.331 ±  1231.635  ops/s
   Packed64Benchmark.packed64VarHandleLongByte_Consecutive     47   10240  thrpt    3  15992.127 ±  1909.132  ops/s
   Packed64Benchmark.packed64VarHandleLongByte_Consecutive     59   10240  thrpt    3  17114.009 ±  6263.882  ops/s
   Packed64Benchmark.packed64VarHandleLongByte_Consecutive     61   10240  thrpt    3  15822.847 ±  8200.733  ops/s
   Packed64Benchmark.packed64VarHandleLongByte_Consecutive     64   10240  thrpt    3  40686.026 ±  2245.477  ops/s
   Packed64Benchmark.packed64VarHandleLongByte_Sparse           1   10240  thrpt    3  49795.721 ±  1016.448  ops/s
   Packed64Benchmark.packed64VarHandleLongByte_Sparse           4   10240  thrpt    3  37455.051 ±  1059.990  ops/s
   Packed64Benchmark.packed64VarHandleLongByte_Sparse           5   10240  thrpt    3  34629.635 ±   730.416  ops/s
   Packed64Benchmark.packed64VarHandleLongByte_Sparse           8   10240  thrpt    3  28438.560 ±   364.251  ops/s
   Packed64Benchmark.packed64VarHandleLongByte_Sparse          11   10240  thrpt    3  28240.196 ±   752.703  ops/s
   Packed64Benchmark.packed64VarHandleLongByte_Sparse          16   10240  thrpt    3  29998.199 ±  1275.620  ops/s
   Packed64Benchmark.packed64VarHandleLongByte_Sparse          23   10240  thrpt    3  28481.596 ±  1537.821  ops/s
   Packed64Benchmark.packed64VarHandleLongByte_Sparse          25   10240  thrpt    3  28585.727 ±   948.670  ops/s
   Packed64Benchmark.packed64VarHandleLongByte_Sparse          31   10240  thrpt    3  28002.335 ±  1701.436  ops/s
   Packed64Benchmark.packed64VarHandleLongByte_Sparse          32   10240  thrpt    3  34116.362 ±   421.667  ops/s
   Packed64Benchmark.packed64VarHandleLongByte_Sparse          47   10240  thrpt    3  28258.341 ±  1065.642  ops/s
   Packed64Benchmark.packed64VarHandleLongByte_Sparse          59   10240  thrpt    3  29776.379 ±   469.763  ops/s
   Packed64Benchmark.packed64VarHandleLongByte_Sparse          61   10240  thrpt    3  28820.101 ±  3651.373  ops/s
   Packed64Benchmark.packed64VarHandleLongByte_Sparse          64   10240  thrpt    3  57477.947 ±  3698.974  ops/s
   Packed64Benchmark.packed64VarHandleLongLong_Consecutive      1   10240  thrpt    3  32689.162 ±  1387.629  ops/s
   Packed64Benchmark.packed64VarHandleLongLong_Consecutive      4   10240  thrpt    3  35393.931 ±  1447.491  ops/s
   Packed64Benchmark.packed64VarHandleLongLong_Consecutive      5   10240  thrpt    3  40258.860 ±  2152.352  ops/s
   Packed64Benchmark.packed64VarHandleLongLong_Consecutive      8   10240  thrpt    3  35385.111 ±   347.894  ops/s
   Packed64Benchmark.packed64VarHandleLongLong_Consecutive     11   10240  thrpt    3  45596.088 ±   990.686  ops/s
   Packed64Benchmark.packed64VarHandleLongLong_Consecutive     16   10240  thrpt    3  35012.112 ±  9437.142  ops/s
   Packed64Benchmark.packed64VarHandleLongLong_Consecutive     23   10240  thrpt    3  47095.570 ±   905.039  ops/s
   Packed64Benchmark.packed64VarHandleLongLong_Consecutive     25   10240  thrpt    3  31985.949 ±   590.707  ops/s
   Packed64Benchmark.packed64VarHandleLongLong_Consecutive     31   10240  thrpt    3  42513.815 ±  1896.300  ops/s
   Packed64Benchmark.packed64VarHandleLongLong_Consecutive     32   10240  thrpt    3  35779.268 ±   956.749  ops/s
   Packed64Benchmark.packed64VarHandleLongLong_Consecutive     47   10240  thrpt    3  30137.376 ±   516.086  ops/s
   Packed64Benchmark.packed64VarHandleLongLong_Consecutive     59   10240  thrpt    3  25869.023 ±  1372.035  ops/s
   Packed64Benchmark.packed64VarHandleLongLong_Consecutive     61   10240  thrpt    3  24951.079 ±   345.293  ops/s
   Packed64Benchmark.packed64VarHandleLongLong_Consecutive     64   10240  thrpt    3  35496.562 ±  2744.344  ops/s
   Packed64Benchmark.packed64VarHandleLongLong_Sparse           1   10240  thrpt    3  53103.799 ±   564.971  ops/s
   Packed64Benchmark.packed64VarHandleLongLong_Sparse           4   10240  thrpt    3  53134.834 ±   763.798  ops/s
   Packed64Benchmark.packed64VarHandleLongLong_Sparse           5   10240  thrpt    3  42840.502 ±   547.945  ops/s
   Packed64Benchmark.packed64VarHandleLongLong_Sparse           8   10240  thrpt    3  53661.284 ±  2760.469  ops/s
   Packed64Benchmark.packed64VarHandleLongLong_Sparse          11   10240  thrpt    3  42976.618 ±  4369.925  ops/s
   Packed64Benchmark.packed64VarHandleLongLong_Sparse          16   10240  thrpt    3  53319.723 ±  8308.326  ops/s
   Packed64Benchmark.packed64VarHandleLongLong_Sparse          23   10240  thrpt    3  40894.483 ±   772.369  ops/s
   Packed64Benchmark.packed64VarHandleLongLong_Sparse          25   10240  thrpt    3  40646.463 ±  1482.981  ops/s
   Packed64Benchmark.packed64VarHandleLongLong_Sparse          31   10240  thrpt    3  40995.711 ±  1436.172  ops/s
   Packed64Benchmark.packed64VarHandleLongLong_Sparse          32   10240  thrpt    3  53884.020 ±  2006.964  ops/s
   Packed64Benchmark.packed64VarHandleLongLong_Sparse          47   10240  thrpt    3  29708.915 ±   657.222  ops/s
   Packed64Benchmark.packed64VarHandleLongLong_Sparse          59   10240  thrpt    3  35063.658 ±  1556.863  ops/s
   Packed64Benchmark.packed64VarHandleLongLong_Sparse          61   10240  thrpt    3  28871.592 ±   366.623  ops/s
   Packed64Benchmark.packed64VarHandleLongLong_Sparse          64   10240  thrpt    3  53333.751 ± 15511.237  ops/s
   Packed64Benchmark.packed64_Consecutive                       1   10240  thrpt    3  36595.316 ±  1213.323  ops/s
   Packed64Benchmark.packed64_Consecutive                       4   10240  thrpt    3  39777.028 ±   953.027  ops/s
   Packed64Benchmark.packed64_Consecutive                       5   10240  thrpt    3  46119.465 ±   211.767  ops/s
   Packed64Benchmark.packed64_Consecutive                       8   10240  thrpt    3  39886.892 ±  1574.036  ops/s
   Packed64Benchmark.packed64_Consecutive                      11   10240  thrpt    3  53222.462 ±  1847.024  ops/s
   Packed64Benchmark.packed64_Consecutive                      16   10240  thrpt    3  39897.330 ±  1012.499  ops/s
   Packed64Benchmark.packed64_Consecutive                      23   10240  thrpt    3  50924.631 ±  2607.771  ops/s
   Packed64Benchmark.packed64_Consecutive                      25   10240  thrpt    3  51179.396 ±  4118.732  ops/s
   Packed64Benchmark.packed64_Consecutive                      31   10240  thrpt    3  49350.911 ±  2652.142  ops/s
   Packed64Benchmark.packed64_Consecutive                      32   10240  thrpt    3  40245.046 ±  1506.183  ops/s
   Packed64Benchmark.packed64_Consecutive                      47   10240  thrpt    3  43794.173 ±  4789.862  ops/s
   Packed64Benchmark.packed64_Consecutive                      59   10240  thrpt    3  41248.048 ±  2533.885  ops/s
   Packed64Benchmark.packed64_Consecutive                      61   10240  thrpt    3  42538.675 ±  3462.201  ops/s
   Packed64Benchmark.packed64_Consecutive                      64   10240  thrpt    3  40091.319 ±   962.752  ops/s
   Packed64Benchmark.packed64_Sparse                            1   10240  thrpt    3  60621.693 ±  6289.038  ops/s
   Packed64Benchmark.packed64_Sparse                            4   10240  thrpt    3  63121.896 ±  5265.890  ops/s
   Packed64Benchmark.packed64_Sparse                            5   10240  thrpt    3  49445.705 ±  1348.639  ops/s
   Packed64Benchmark.packed64_Sparse                            8   10240  thrpt    3  63533.078 ±  5166.244  ops/s
   Packed64Benchmark.packed64_Sparse                           11   10240  thrpt    3  47624.192 ±  5238.701  ops/s
   Packed64Benchmark.packed64_Sparse                           16   10240  thrpt    3  64148.964 ±   820.543  ops/s
   Packed64Benchmark.packed64_Sparse                           23   10240  thrpt    3  44013.707 ±  3850.643  ops/s
   Packed64Benchmark.packed64_Sparse                           25   10240  thrpt    3  44837.906 ±  2231.909  ops/s
   Packed64Benchmark.packed64_Sparse                           31   10240  thrpt    3  44638.184 ±  1887.982  ops/s
   Packed64Benchmark.packed64_Sparse                           32   10240  thrpt    3  63831.488 ±  1325.146  ops/s
   Packed64Benchmark.packed64_Sparse                           47   10240  thrpt    3  41508.192 ±  3339.770  ops/s
   Packed64Benchmark.packed64_Sparse                           59   10240  thrpt    3  40849.655 ±  2445.001  ops/s
   Packed64Benchmark.packed64_Sparse                           61   10240  thrpt    3  36888.602 ±  1342.537  ops/s
   Packed64Benchmark.packed64_Sparse                           64   10240  thrpt    3  64386.395 ±  1694.784  ops/s
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] costin edited a comment on pull request #453: Investigate whether should Packed64 use a byte[] plus VarHandles instead of a long[]

Posted by GitBox <gi...@apache.org>.
costin edited a comment on pull request #453:
URL: https://github.com/apache/lucene/pull/453#issuecomment-973193877


   Thanks to @jpountz I've added two more tests beside JMH.
   `PackedRandomTest` which does a brute force for testing performance of random locations instead of consecutive ones. Surprisingly the VHLongByte implementation comes ahead, followed by Packed64 and finally Packed64VHLongLong - the results on my machine are roughly:
   ```
   1. VHLongAndByte     939100
   2. Packed64         1358500
   3. VHLongLong       1510400
   ```
   
   Adrien suggested another workload, more realistic `PackedMacroTest` which does `SortedDocValues` merging, which creates an `OrdinalMap` backed by `PackedInts` and then recorded the merge times.
   In all cases I manually updated the PackedInt class to returned the desired implementation.  Based on these set of results, Packed64 is on the first place, followed by VHLongAndByte with VHLongLong coming in third.
   
   I'm attaching the merging times found for reference:
   [p64vhll.log](https://github.com/apache/lucene/files/7565523/p64vhll.log)
   [p64.log](https://github.com/apache/lucene/files/7565524/p64.log)
   [p64vhlb.log](https://github.com/apache/lucene/files/7565525/p64vhlb.log)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] sonatype-lift[bot] commented on a change in pull request #453: Investigate whether should Packed64 use a byte[] plus VarHandles instead of a long[]

Posted by GitBox <gi...@apache.org>.
sonatype-lift[bot] commented on a change in pull request #453:
URL: https://github.com/apache/lucene/pull/453#discussion_r761874066



##########
File path: lucene/core/src/java/org/apache/lucene/util/packed/Packed64VHLongAndByte.java
##########
@@ -0,0 +1,154 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.util.packed;
+
+import org.apache.lucene.util.ArrayUtil;
+import org.apache.lucene.util.BitUtil;
+import org.apache.lucene.util.RamUsageEstimator;
+
+import java.util.Arrays;
+
+/**
+ * Space optimized random access capable array of values with a fixed number of bits/value. Values
+ * are packed contiguously.
+ *
+ * <p>The implementation strives to perform as fast as possible under the constraint of contiguous
+ * bits, by avoiding expensive operations. This comes at the cost of code clarity.
+ *
+ * <p>Technical details: This implementation is a refinement of a non-branching version. The
+ * non-branching get and set methods meant that 2 or 4 atomics in the underlying array were always
+ * accessed, even for the cases where only 1 or 2 were needed. Even with caching, this had a
+ * detrimental effect on performance. Related to this issue, the old implementation used lookup
+ * tables for shifts and masks, which also proved to be a bit slower than calculating the shifts and
+ * masks on the fly. See https://issues.apache.org/jira/browse/LUCENE-4062 for details.
+ */
+class Packed64VHLongAndByte extends PackedInts.MutableImpl {
+  static final int BLOCK_SIZE = 64; // 32 = int, 64 = long
+  static final int BYTE_BITS = 3;
+  static final int BYTE_BLOCK_SIZE = 8;
+  static final int BYTE_BLOCK_MOD_MASK = BYTE_BLOCK_SIZE - 1; // x % BYTE_BLOCK
+
+  /** Values are stores contiguously in the blocks array. */
+  private final byte[] blocks;
+  /** A right-aligned mask of width BitsPerValue used by {@link #get(int)}. */
+  private final long maskRight;
+  /** Optimization: Saves one lookup in {@link #get(int)}. */
+  private final int bpvMinusBlockSize;

Review comment:
       *UnusedVariable:*  The field 'bpvMinusBlockSize' is never read. [(details)](https://errorprone.info/bugpattern/UnusedVariable)
   (at-me [in a reply](https://help.sonatype.com/lift/talking-to-lift) with `help` or `ignore`)




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] jpountz commented on pull request #453: Investigate whether should Packed64 use a byte[] plus VarHandles instead of a long[]

Posted by GitBox <gi...@apache.org>.
jpountz commented on pull request #453:
URL: https://github.com/apache/lucene/pull/453#issuecomment-985539567


   Thanks a lot @costin for running these benchmarks. I'm a bit disappointed by the results, I was hoping that the removal of the condition would yield better results than that. I would suggest that we close this exploration now. Sorry that this didn't result in a code change in Lucene, but I want to say that it's good that we did these experiments because at least now we know thet VarHandles are not a silver bullet to make Packed64 faster.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] costin commented on pull request #453: Investigate whether should Packed64 use a byte[] plus VarHandles instead of a long[]

Posted by GitBox <gi...@apache.org>.
costin commented on pull request #453:
URL: https://github.com/apache/lucene/pull/453#issuecomment-972988872


   I've tested `Packed64LongByte` multiple times against `TestPackedInts` and it seems to pass. Note that I've focused mainly on the `get` and `set` methods and removed the specialized fill and bulk get/set methods due to slightly different encoding.
   
   tl;dr - `Packed64` has best performance, followed by `Packed64VarHandleLongLong` and lastly by `Packed64LongByte` with a significant gap. My guess is the long[] array is more cache friendly while the byte[] isn't - by sliding across the byte[] to read longs, the caches are thrashed which is much more expensive than doing bitwise operations.
   The data in case of byte[] support this - performance is great for the 1 and 64 case (where the sliding is minimal since new positions belong to the same byte block), while for the rest it falls to the ground.
   
   I've included the benchmark in the commit; the tests are running on latest JDK 17.0.1 on an Ryzen 5900x.
   
   ```
   openjdk version "17.0.1" 2021-10-19
   OpenJDK Runtime Environment Temurin-17.0.1+12 (build 17.0.1+12)
   OpenJDK 64-Bit Server VM Temurin-17.0.1+12 (build 17.0.1+12, mixed mode, sharing)
   ```
   
   ```
   # JMH version: 1.33
   # VM version: JDK 17.0.1, OpenJDK 64-Bit Server VM, 17.0.1+12
   
   Benchmark                                       (bpv)  (size)   Mode  Cnt      Score      Error  Units
   Packed64Benchmark.packed64                          1   10240  thrpt    3  80129.482 ± 2603.161  ops/s
   Packed64Benchmark.packed64                          4   10240  thrpt    3  87680.030 ± 2725.245  ops/s
   Packed64Benchmark.packed64                          5   10240  thrpt    3  56768.849 ±  633.855  ops/s
   Packed64Benchmark.packed64                          8   10240  thrpt    3  91869.977 ± 1579.788  ops/s
   Packed64Benchmark.packed64                         11   10240  thrpt    3  71802.824 ± 2124.577  ops/s
   Packed64Benchmark.packed64                         16   10240  thrpt    3  94892.804 ± 1876.298  ops/s
   Packed64Benchmark.packed64                         23   10240  thrpt    3  65401.053 ±  690.302  ops/s
   Packed64Benchmark.packed64                         25   10240  thrpt    3  69623.043 ±  679.354  ops/s
   Packed64Benchmark.packed64                         31   10240  thrpt    3  66732.829 ± 1788.781  ops/s
   Packed64Benchmark.packed64                         32   10240  thrpt    3  99938.841 ±  197.846  ops/s
   Packed64Benchmark.packed64                         47   10240  thrpt    3  52174.874 ± 2369.976  ops/s
   Packed64Benchmark.packed64                         59   10240  thrpt    3  55240.166 ±  165.920  ops/s
   Packed64Benchmark.packed64                         61   10240  thrpt    3  54909.784 ± 5064.028  ops/s
   Packed64Benchmark.packed64                         64   10240  thrpt    3  99261.182 ± 2746.589  ops/s
   Packed64Benchmark.packed64VarHandleLongAndByte      1   10240  thrpt    3  56909.631 ± 1986.897  ops/s
   Packed64Benchmark.packed64VarHandleLongAndByte      4   10240  thrpt    3  31584.303 ± 1063.071  ops/s
   Packed64Benchmark.packed64VarHandleLongAndByte      5   10240  thrpt    3  23703.179 ± 2127.908  ops/s
   Packed64Benchmark.packed64VarHandleLongAndByte      8   10240  thrpt    3  17958.177 ±  825.106  ops/s
   Packed64Benchmark.packed64VarHandleLongAndByte     11   10240  thrpt    3  17810.271 ±  379.489  ops/s
   Packed64Benchmark.packed64VarHandleLongAndByte     16   10240  thrpt    3  19043.959 ± 2185.749  ops/s
   Packed64Benchmark.packed64VarHandleLongAndByte     23   10240  thrpt    3  17784.221 ±  824.539  ops/s
   Packed64Benchmark.packed64VarHandleLongAndByte     25   10240  thrpt    3  17728.510 ± 1063.605  ops/s
   Packed64Benchmark.packed64VarHandleLongAndByte     31   10240  thrpt    3  17742.034 ±  198.159  ops/s
   Packed64Benchmark.packed64VarHandleLongAndByte     32   10240  thrpt    3  21951.514 ± 7648.112  ops/s
   Packed64Benchmark.packed64VarHandleLongAndByte     47   10240  thrpt    3  17741.987 ±  866.598  ops/s
   Packed64Benchmark.packed64VarHandleLongAndByte     59   10240  thrpt    3  19481.548 ±  446.086  ops/s
   Packed64Benchmark.packed64VarHandleLongAndByte     61   10240  thrpt    3  18999.020 ±  949.164  ops/s
   Packed64Benchmark.packed64VarHandleLongAndByte     64   10240  thrpt    3  74178.395 ± 1219.876  ops/s
   Packed64Benchmark.packed64VarHandleLongLong         1   10240  thrpt    3  73824.221 ± 3210.970  ops/s
   Packed64Benchmark.packed64VarHandleLongLong         4   10240  thrpt    3  82908.298 ± 8099.224  ops/s
   Packed64Benchmark.packed64VarHandleLongLong         5   10240  thrpt    3  54176.175 ± 2005.872  ops/s
   Packed64Benchmark.packed64VarHandleLongLong         8   10240  thrpt    3  83556.558 ± 1658.929  ops/s
   Packed64Benchmark.packed64VarHandleLongLong        11   10240  thrpt    3  56512.522 ±  380.595  ops/s
   Packed64Benchmark.packed64VarHandleLongLong        16   10240  thrpt    3  83820.820 ± 2000.138  ops/s
   Packed64Benchmark.packed64VarHandleLongLong        23   10240  thrpt    3  62570.823 ±  604.178  ops/s
   Packed64Benchmark.packed64VarHandleLongLong        25   10240  thrpt    3  63497.723 ± 2172.188  ops/s
   Packed64Benchmark.packed64VarHandleLongLong        31   10240  thrpt    3  62699.661 ± 3982.879  ops/s
   Packed64Benchmark.packed64VarHandleLongLong        32   10240  thrpt    3  84795.726 ± 7409.939  ops/s
   Packed64Benchmark.packed64VarHandleLongLong        47   10240  thrpt    3  55820.366 ± 1247.217  ops/s
   Packed64Benchmark.packed64VarHandleLongLong        59   10240  thrpt    3  49562.673 ± 1131.955  ops/s
   Packed64Benchmark.packed64VarHandleLongLong        61   10240  thrpt    3  53540.899 ± 1268.365  ops/s
   Packed64Benchmark.packed64VarHandleLongLong        64   10240  thrpt    3  85549.020 ± 3293.070  ops/s
   ```
   
   I'm sure the `Packed64LongByte` (p64lb) implementation could be improved however the gap in performance is significant.
   p64lb is somewhat reasonable only when dealing with pbv of 64 and 1 - in all other cases performance simply drops.
   Packed64LongLong (p64ll) tracks the performance of `Packed64` (p64) pretty well (which is expected).
   
   To understand where the gap cap comes from, I've done more tests for the best bpv of 64 and worse case 23 and it boils down to this line
   https://github.com/apache/lucene/blob/6bd5c14bf3c6113244773d2e1eb39233dc33ad71/lucene/core/src/java/org/apache/lucene/util/packed/Packed64.java#L73
   
   The bigger the shift, the better the performance - not because of the shift itself but rather because the higher the chances data falls in the same data block which for consecutive operations is going to benefit from caching.
   To test this out, I've changed the code in `Packed64LongAndByte` from :
   ```
   final int elementPos = (int) (majorBitPos >>> 3);     // /8
   ```
   to 
   
   ```
   final int elementPos = (int) (majorBitPos >>> 6);     // /64
   ```
   
   which almost doubled the performance of pb64lb (which is still half of packed64):
   
   ```
   Benchmark                                       (bpv)  (size)   Mode  Cnt      Score      Error  Units
   Packed64Benchmark.packed64                         23   10240  thrpt    3  65188.193 ± 3034.465  ops/s
   Packed64Benchmark.packed64VarHandleLongAndByte     23   10240  thrpt    3  30850.017 ±  422.149  ops/s
   Packed64Benchmark.packed64VarHandleLongLong        23   10240  thrpt    3  56759.947 ± 1055.780  ops/s
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] costin commented on pull request #453: Investigate whether should Packed64 use a byte[] plus VarHandles instead of a long[]

Posted by GitBox <gi...@apache.org>.
costin commented on pull request #453:
URL: https://github.com/apache/lucene/pull/453#issuecomment-972999769


   Adding a few more info to this ticket regarding the underlying assembly.
   The p64lb method is the shortest, followed by p64 and p64ll.
   
   ```
   ============================= C1-compiled nmethod ==============================
   ----------------------------------- Assembly -----------------------------------
   
   Compiled method (c1)     194  774       3       org.apache.lucene.util.packed.Packed64::set (145 bytes)
    total in heap  [0x000001de36e94990,0x000001de36e94f20] = 1424
    relocation     [0x000001de36e94ae8,0x000001de36e94b30] = 72
    main code      [0x000001de36e94b40,0x000001de36e94dc0] = 640
    stub code      [0x000001de36e94dc0,0x000001de36e94df8] = 56
    oops           [0x000001de36e94df8,0x000001de36e94e00] = 8
    metadata       [0x000001de36e94e00,0x000001de36e94e08] = 8
    scopes data    [0x000001de36e94e08,0x000001de36e94e50] = 72
    scopes pcs     [0x000001de36e94e50,0x000001de36e94ef0] = 160
    dependencies   [0x000001de36e94ef0,0x000001de36e94ef8] = 8
    nul chk table  [0x000001de36e94ef8,0x000001de36e94f20] = 40
   
   ----------------------------------- Assembly -----------------------------------
   
   Compiled method (c1)     173  799       3       org.apache.lucene.util.packed.Packed64VHLongAndByte::set (144 bytes)
    total in heap  [0x000001c7d511be10,0x000001c7d511d720] = 6416
    relocation     [0x000001c7d511bf68,0x000001c7d511c0c8] = 352
    main code      [0x000001c7d511c0e0,0x000001c7d511d0e0] = 4096
    stub code      [0x000001c7d511d0e0,0x000001c7d511d158] = 120
    oops           [0x000001c7d511d158,0x000001c7d511d190] = 56
    metadata       [0x000001c7d511d190,0x000001c7d511d1e8] = 88
    scopes data    [0x000001c7d511d1e8,0x000001c7d511d500] = 792
    scopes pcs     [0x000001c7d511d500,0x000001c7d511d6e0] = 480
    dependencies   [0x000001c7d511d6e0,0x000001c7d511d6e8] = 8
    nul chk table  [0x000001c7d511d6e8,0x000001c7d511d720] = 56
   
   ----------------------------------- Assembly -----------------------------------
   
   Compiled method (c1)     177  797       3       org.apache.lucene.util.packed.Packed64VHLongLong::set (204 bytes)
    total in heap  [0x0000018a00893b90,0x0000018a00897390] = 14336
    relocation     [0x0000018a00893ce8,0x0000018a00894060] = 888
    main code      [0x0000018a00894060,0x0000018a008965a0] = 9536
    stub code      [0x0000018a008965a0,0x0000018a00896698] = 248
    oops           [0x0000018a00896698,0x0000018a008966d0] = 56
    metadata       [0x0000018a008966d0,0x0000018a00896728] = 88
    scopes data    [0x0000018a00896728,0x0000018a00896e70] = 1864
    scopes pcs     [0x0000018a00896e70,0x0000018a00897320] = 1200
    dependencies   [0x0000018a00897320,0x0000018a00897328] = 8
    nul chk table  [0x0000018a00897328,0x0000018a00897390] = 104
   
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] costin commented on pull request #453: Investigate whether should Packed64 use a byte[] plus VarHandles instead of a long[]

Posted by GitBox <gi...@apache.org>.
costin commented on pull request #453:
URL: https://github.com/apache/lucene/pull/453#issuecomment-985547790


   Thanks for the feedback @jpountz. Closing the issue.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] sonatype-lift[bot] commented on a change in pull request #453: Investigate whether should Packed64 use a byte[] plus VarHandles instead of a long[]

Posted by GitBox <gi...@apache.org>.
sonatype-lift[bot] commented on a change in pull request #453:
URL: https://github.com/apache/lucene/pull/453#discussion_r761116584



##########
File path: lucene/core/src/java/org/apache/lucene/util/packed/Packed64VHLongAndByte.java
##########
@@ -0,0 +1,154 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.util.packed;
+
+import org.apache.lucene.util.ArrayUtil;
+import org.apache.lucene.util.BitUtil;
+import org.apache.lucene.util.RamUsageEstimator;
+
+import java.util.Arrays;
+
+/**
+ * Space optimized random access capable array of values with a fixed number of bits/value. Values
+ * are packed contiguously.
+ *
+ * <p>The implementation strives to perform as fast as possible under the constraint of contiguous
+ * bits, by avoiding expensive operations. This comes at the cost of code clarity.
+ *
+ * <p>Technical details: This implementation is a refinement of a non-branching version. The
+ * non-branching get and set methods meant that 2 or 4 atomics in the underlying array were always
+ * accessed, even for the cases where only 1 or 2 were needed. Even with caching, this had a
+ * detrimental effect on performance. Related to this issue, the old implementation used lookup
+ * tables for shifts and masks, which also proved to be a bit slower than calculating the shifts and
+ * masks on the fly. See https://issues.apache.org/jira/browse/LUCENE-4062 for details.
+ */
+class Packed64VHLongAndByte extends PackedInts.MutableImpl {
+  static final int BLOCK_SIZE = 64; // 32 = int, 64 = long
+  static final int BYTE_BITS = 3;
+  static final int BYTE_BLOCK_SIZE = 8;
+  static final int BYTE_BLOCK_MOD_MASK = BYTE_BLOCK_SIZE - 1; // x % BYTE_BLOCK
+
+  /** Values are stores contiguously in the blocks array. */
+  private final byte[] blocks;
+  /** A right-aligned mask of width BitsPerValue used by {@link #get(int)}. */
+  private final long maskRight;
+  /** Optimization: Saves one lookup in {@link #get(int)}. */
+  private final int bpvMinusBlockSize;
+
+  /**
+   * Creates an array with the internal structures adjusted for the given limits and initialized to
+   * 0.
+   *
+   * @param valueCount the number of elements.
+   * @param bitsPerValue the number of bits available for any given value.
+   */
+  public Packed64VHLongAndByte(int valueCount, int bitsPerValue) {
+    super(valueCount, bitsPerValue);
+    final PackedInts.Format format = PackedInts.Format.PACKED;
+    final long byteCount = format.byteCount(PackedInts.VERSION_CURRENT, valueCount, bitsPerValue);
+    if (bitsPerValue > 56) {
+      throw new IllegalArgumentException("Only bitsPerValue up to 56 allowed");
+    }
+    if (byteCount > ArrayUtil.MAX_ARRAY_LENGTH) {
+      throw new IllegalArgumentException("Too many values/bits to store");
+    }
+    this.blocks = new byte[(int) byteCount + Long.BYTES];
+    maskRight = ~0L << (BLOCK_SIZE - bitsPerValue) >>> (BLOCK_SIZE - bitsPerValue);
+    bpvMinusBlockSize = bitsPerValue - BLOCK_SIZE;
+  }
+
+  /**
+   * @param index the position of the value.
+   * @return the value at the given index.
+   */
+  @Override
+  public long get(final int index) {
+    // The abstract index in a bit stream
+    final long majorBitPos = (long) index * bitsPerValue;
+    // The index in the backing byte-array
+    final int elementPos = (int) (majorBitPos >>> BYTE_BITS); // BYTE_BLOCK
+    // The number of bits already occupied from the previous position
+    final int maskOffset = (int) majorBitPos & BYTE_BLOCK_MOD_MASK;
+//    final int endBits = maskOffset + bpvMinusBlockSize;
+    // Read the main long
+    //final long packed = ((long) BitUtil.VH_LE_LONG.get(blocks, elementPos) >>> maskOffset) & maskRight;
+    return ((long) BitUtil.VH_LE_LONG.get(blocks, elementPos) >>> maskOffset) & maskRight;
+
+//    if (endBits <= 0) { // Single block
+//      return packed;
+//    }
+//    // compute next mask
+//    return packed | (blocks[elementPos + BYTE_BLOCK_SIZE] & ~(~0L << endBits)) << (BLOCK_SIZE - maskOffset);
+  }
+
+  @Override
+  public void set(final int index, final long value) {
+    // The abstract index in a contiguous bit stream
+    final long majorBitPos = (long) index * bitsPerValue;
+    // The index in the backing byte-array
+    final int elementPos = (int) (majorBitPos >>> BYTE_BITS); // / BYTE_BLOCK
+    // The number of bits already occupied from the previous position
+    final int maskOffset = (int) majorBitPos & BYTE_BLOCK_MOD_MASK;
+    final int endBits = maskOffset + bpvMinusBlockSize;
+    // Read the main long
+    final long packed = (long) BitUtil.VH_LE_LONG.get(blocks, elementPos);
+    BitUtil.VH_LE_LONG.set(blocks, elementPos, packed & ~(maskRight << maskOffset) | (value << maskOffset));

Review comment:
       *OperatorPrecedence:*  Use grouping parenthesis to make the operator precedence explicit [(details)](https://errorprone.info/bugpattern/OperatorPrecedence)
   (at-me [in a reply](https://help.sonatype.com/lift/talking-to-lift) with `help` or `ignore`)

##########
File path: lucene/core/src/java/org/apache/lucene/util/packed/Packed64VHLongLong.java
##########
@@ -0,0 +1,159 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.util.packed;
+
+import org.apache.lucene.util.ArrayUtil;
+import org.apache.lucene.util.BitUtil;
+import org.apache.lucene.util.RamUsageEstimator;
+
+import java.sql.Array;
+import java.util.Arrays;
+
+/**
+ * Space optimized random access capable array of values with a fixed number of bits/value. Values
+ * are packed contiguously.
+ *
+ * <p>The implementation strives to perform as fast as possible under the constraint of contiguous
+ * bits, by avoiding expensive operations. This comes at the cost of code clarity.
+ *
+ * <p>Technical details: This implementation is a refinement of a non-branching version. The
+ * non-branching get and set methods meant that 2 or 4 atomics in the underlying array were always
+ * accessed, even for the cases where only 1 or 2 were needed. Even with caching, this had a
+ * detrimental effect on performance. Related to this issue, the old implementation used lookup
+ * tables for shifts and masks, which also proved to be a bit slower than calculating the shifts and
+ * masks on the fly. See https://issues.apache.org/jira/browse/LUCENE-4062 for details.
+ */
+class Packed64VHLongLong extends PackedInts.MutableImpl {
+    static final int BLOCK_SIZE = 64; // 32 = int, 64 = long
+    static final int BLOCK_BITS = 6; // The #bits representing BLOCK_SIZE
+    static final int MOD_MASK = BLOCK_SIZE - 1; // x % BLOCK_SIZE
+
+    /** Values are stores contiguously in the blocks array. */
+    private final byte[] blocks;
+    /** A right-aligned mask of width BitsPerValue used by {@link #get(int)}. */
+    private final long maskRight;
+    /** Optimization: Saves one lookup in {@link #get(int)}. */
+    private final int bpvMinusBlockSize;
+
+    /**
+     * Creates an array with the internal structures adjusted for the given limits and initialized to
+     * 0.
+     *
+     * @param valueCount the number of elements.
+     * @param bitsPerValue the number of bits available for any given value.
+     */
+    public Packed64VHLongLong(int valueCount, int bitsPerValue) {
+        super(valueCount, bitsPerValue);
+        final PackedInts.Format format = PackedInts.Format.PACKED;
+        final long byteCount = format.byteCount(PackedInts.VERSION_CURRENT, valueCount, bitsPerValue);
+        if (byteCount > ArrayUtil.MAX_ARRAY_LENGTH) {
+            throw new IllegalArgumentException("Not yet supported");
+        }
+        this.blocks = new byte[(int) byteCount + 8];
+        maskRight = ~0L << (BLOCK_SIZE - bitsPerValue) >>> (BLOCK_SIZE - bitsPerValue);
+        bpvMinusBlockSize = bitsPerValue - BLOCK_SIZE;
+    }
+
+    /**
+     * @param index the position of the value.
+     * @return the value at the given index.
+     */
+    @Override
+    public long get(final int index) {
+        // The abstract index in a bit stream
+        final long majorBitPos = (long) index * bitsPerValue;
+        // The index in the backing byte-array
+        final int elementPos = (int) (majorBitPos >>> BLOCK_BITS) * 8;
+        // read entry as a long
+        long packed;
+
+        // The number of value-bits in the second long
+        final long endBits = (majorBitPos & MOD_MASK) + bpvMinusBlockSize;
+
+        if (endBits <= 0) { // Single block
+            packed = (long) BitUtil.VH_LE_LONG.get(blocks, elementPos);
+            return (packed >>> -endBits) & maskRight;
+        }
+        // Two blocks
+        packed = (long) BitUtil.VH_LE_LONG.get(blocks, elementPos);
+        long next = (long) BitUtil.VH_LE_LONG.get(blocks, elementPos + 8);
+        return ((packed << endBits) | (next >>> (BLOCK_SIZE - endBits)))
+                & maskRight;
+    }
+
+    @Override
+    public void set(final int index, final long value) {
+        // The abstract index in a contiguous bit stream
+        final long majorBitPos = (long) index * bitsPerValue;
+        // The index in the backing byte-array
+        final int elementPos = (int) (majorBitPos >>> BLOCK_BITS) * 8; // / BLOCK_SIZE
+        // The number of value-bits in the last block
+        final long endBits = (majorBitPos & MOD_MASK) + bpvMinusBlockSize;
+
+        long packed;
+        if (endBits <= 0) { // Single block
+            packed = (long) BitUtil.VH_LE_LONG.get(blocks, elementPos);
+            packed = packed & ~(maskRight << -endBits) | (value << -endBits);

Review comment:
       *OperatorPrecedence:*  Use grouping parenthesis to make the operator precedence explicit [(details)](https://errorprone.info/bugpattern/OperatorPrecedence)
   (at-me [in a reply](https://help.sonatype.com/lift/talking-to-lift) with `help` or `ignore`)

##########
File path: lucene/core/src/java/org/apache/lucene/util/packed/Packed64VHLongLong.java
##########
@@ -0,0 +1,159 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.util.packed;
+
+import org.apache.lucene.util.ArrayUtil;
+import org.apache.lucene.util.BitUtil;
+import org.apache.lucene.util.RamUsageEstimator;
+
+import java.sql.Array;
+import java.util.Arrays;
+
+/**
+ * Space optimized random access capable array of values with a fixed number of bits/value. Values
+ * are packed contiguously.
+ *
+ * <p>The implementation strives to perform as fast as possible under the constraint of contiguous
+ * bits, by avoiding expensive operations. This comes at the cost of code clarity.
+ *
+ * <p>Technical details: This implementation is a refinement of a non-branching version. The
+ * non-branching get and set methods meant that 2 or 4 atomics in the underlying array were always
+ * accessed, even for the cases where only 1 or 2 were needed. Even with caching, this had a
+ * detrimental effect on performance. Related to this issue, the old implementation used lookup
+ * tables for shifts and masks, which also proved to be a bit slower than calculating the shifts and
+ * masks on the fly. See https://issues.apache.org/jira/browse/LUCENE-4062 for details.
+ */
+class Packed64VHLongLong extends PackedInts.MutableImpl {
+    static final int BLOCK_SIZE = 64; // 32 = int, 64 = long
+    static final int BLOCK_BITS = 6; // The #bits representing BLOCK_SIZE
+    static final int MOD_MASK = BLOCK_SIZE - 1; // x % BLOCK_SIZE
+
+    /** Values are stores contiguously in the blocks array. */
+    private final byte[] blocks;
+    /** A right-aligned mask of width BitsPerValue used by {@link #get(int)}. */
+    private final long maskRight;
+    /** Optimization: Saves one lookup in {@link #get(int)}. */
+    private final int bpvMinusBlockSize;
+
+    /**
+     * Creates an array with the internal structures adjusted for the given limits and initialized to
+     * 0.
+     *
+     * @param valueCount the number of elements.
+     * @param bitsPerValue the number of bits available for any given value.
+     */
+    public Packed64VHLongLong(int valueCount, int bitsPerValue) {
+        super(valueCount, bitsPerValue);
+        final PackedInts.Format format = PackedInts.Format.PACKED;
+        final long byteCount = format.byteCount(PackedInts.VERSION_CURRENT, valueCount, bitsPerValue);
+        if (byteCount > ArrayUtil.MAX_ARRAY_LENGTH) {
+            throw new IllegalArgumentException("Not yet supported");
+        }
+        this.blocks = new byte[(int) byteCount + 8];
+        maskRight = ~0L << (BLOCK_SIZE - bitsPerValue) >>> (BLOCK_SIZE - bitsPerValue);
+        bpvMinusBlockSize = bitsPerValue - BLOCK_SIZE;
+    }
+
+    /**
+     * @param index the position of the value.
+     * @return the value at the given index.
+     */
+    @Override
+    public long get(final int index) {
+        // The abstract index in a bit stream
+        final long majorBitPos = (long) index * bitsPerValue;
+        // The index in the backing byte-array
+        final int elementPos = (int) (majorBitPos >>> BLOCK_BITS) * 8;
+        // read entry as a long
+        long packed;
+
+        // The number of value-bits in the second long
+        final long endBits = (majorBitPos & MOD_MASK) + bpvMinusBlockSize;
+
+        if (endBits <= 0) { // Single block
+            packed = (long) BitUtil.VH_LE_LONG.get(blocks, elementPos);
+            return (packed >>> -endBits) & maskRight;
+        }
+        // Two blocks
+        packed = (long) BitUtil.VH_LE_LONG.get(blocks, elementPos);
+        long next = (long) BitUtil.VH_LE_LONG.get(blocks, elementPos + 8);
+        return ((packed << endBits) | (next >>> (BLOCK_SIZE - endBits)))
+                & maskRight;
+    }
+
+    @Override
+    public void set(final int index, final long value) {
+        // The abstract index in a contiguous bit stream
+        final long majorBitPos = (long) index * bitsPerValue;
+        // The index in the backing byte-array
+        final int elementPos = (int) (majorBitPos >>> BLOCK_BITS) * 8; // / BLOCK_SIZE
+        // The number of value-bits in the last block
+        final long endBits = (majorBitPos & MOD_MASK) + bpvMinusBlockSize;
+
+        long packed;
+        if (endBits <= 0) { // Single block
+            packed = (long) BitUtil.VH_LE_LONG.get(blocks, elementPos);
+            packed = packed & ~(maskRight << -endBits) | (value << -endBits);
+            BitUtil.VH_LE_LONG.set(blocks, elementPos, packed);
+            return;
+        }
+        // Two blocks
+        packed = (long) BitUtil.VH_LE_LONG.get(blocks, elementPos);
+        long newPacked = packed & ~(maskRight >>> endBits) | (value >>> endBits);

Review comment:
       *OperatorPrecedence:*  Use grouping parenthesis to make the operator precedence explicit [(details)](https://errorprone.info/bugpattern/OperatorPrecedence)
   (at-me [in a reply](https://help.sonatype.com/lift/talking-to-lift) with `help` or `ignore`)

##########
File path: lucene/core/src/java/org/apache/lucene/util/packed/Packed64VHLongLong.java
##########
@@ -0,0 +1,159 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.util.packed;
+
+import org.apache.lucene.util.ArrayUtil;
+import org.apache.lucene.util.BitUtil;
+import org.apache.lucene.util.RamUsageEstimator;
+
+import java.sql.Array;
+import java.util.Arrays;
+
+/**
+ * Space optimized random access capable array of values with a fixed number of bits/value. Values
+ * are packed contiguously.
+ *
+ * <p>The implementation strives to perform as fast as possible under the constraint of contiguous
+ * bits, by avoiding expensive operations. This comes at the cost of code clarity.
+ *
+ * <p>Technical details: This implementation is a refinement of a non-branching version. The
+ * non-branching get and set methods meant that 2 or 4 atomics in the underlying array were always
+ * accessed, even for the cases where only 1 or 2 were needed. Even with caching, this had a
+ * detrimental effect on performance. Related to this issue, the old implementation used lookup
+ * tables for shifts and masks, which also proved to be a bit slower than calculating the shifts and
+ * masks on the fly. See https://issues.apache.org/jira/browse/LUCENE-4062 for details.
+ */
+class Packed64VHLongLong extends PackedInts.MutableImpl {
+    static final int BLOCK_SIZE = 64; // 32 = int, 64 = long
+    static final int BLOCK_BITS = 6; // The #bits representing BLOCK_SIZE
+    static final int MOD_MASK = BLOCK_SIZE - 1; // x % BLOCK_SIZE
+
+    /** Values are stores contiguously in the blocks array. */
+    private final byte[] blocks;
+    /** A right-aligned mask of width BitsPerValue used by {@link #get(int)}. */
+    private final long maskRight;
+    /** Optimization: Saves one lookup in {@link #get(int)}. */
+    private final int bpvMinusBlockSize;
+
+    /**
+     * Creates an array with the internal structures adjusted for the given limits and initialized to
+     * 0.
+     *
+     * @param valueCount the number of elements.
+     * @param bitsPerValue the number of bits available for any given value.
+     */
+    public Packed64VHLongLong(int valueCount, int bitsPerValue) {
+        super(valueCount, bitsPerValue);
+        final PackedInts.Format format = PackedInts.Format.PACKED;
+        final long byteCount = format.byteCount(PackedInts.VERSION_CURRENT, valueCount, bitsPerValue);
+        if (byteCount > ArrayUtil.MAX_ARRAY_LENGTH) {
+            throw new IllegalArgumentException("Not yet supported");
+        }
+        this.blocks = new byte[(int) byteCount + 8];
+        maskRight = ~0L << (BLOCK_SIZE - bitsPerValue) >>> (BLOCK_SIZE - bitsPerValue);
+        bpvMinusBlockSize = bitsPerValue - BLOCK_SIZE;
+    }
+
+    /**
+     * @param index the position of the value.
+     * @return the value at the given index.
+     */
+    @Override
+    public long get(final int index) {
+        // The abstract index in a bit stream
+        final long majorBitPos = (long) index * bitsPerValue;
+        // The index in the backing byte-array
+        final int elementPos = (int) (majorBitPos >>> BLOCK_BITS) * 8;
+        // read entry as a long
+        long packed;
+
+        // The number of value-bits in the second long
+        final long endBits = (majorBitPos & MOD_MASK) + bpvMinusBlockSize;
+
+        if (endBits <= 0) { // Single block
+            packed = (long) BitUtil.VH_LE_LONG.get(blocks, elementPos);
+            return (packed >>> -endBits) & maskRight;
+        }
+        // Two blocks
+        packed = (long) BitUtil.VH_LE_LONG.get(blocks, elementPos);
+        long next = (long) BitUtil.VH_LE_LONG.get(blocks, elementPos + 8);
+        return ((packed << endBits) | (next >>> (BLOCK_SIZE - endBits)))
+                & maskRight;
+    }
+
+    @Override
+    public void set(final int index, final long value) {
+        // The abstract index in a contiguous bit stream
+        final long majorBitPos = (long) index * bitsPerValue;
+        // The index in the backing byte-array
+        final int elementPos = (int) (majorBitPos >>> BLOCK_BITS) * 8; // / BLOCK_SIZE
+        // The number of value-bits in the last block
+        final long endBits = (majorBitPos & MOD_MASK) + bpvMinusBlockSize;
+
+        long packed;
+        if (endBits <= 0) { // Single block
+            packed = (long) BitUtil.VH_LE_LONG.get(blocks, elementPos);
+            packed = packed & ~(maskRight << -endBits) | (value << -endBits);
+            BitUtil.VH_LE_LONG.set(blocks, elementPos, packed);
+            return;
+        }
+        // Two blocks
+        packed = (long) BitUtil.VH_LE_LONG.get(blocks, elementPos);
+        long newPacked = packed & ~(maskRight >>> endBits) | (value >>> endBits);
+        BitUtil.VH_LE_LONG.set(blocks, elementPos, newPacked);
+
+        long packed2 = (long) BitUtil.VH_LE_LONG.get(blocks, elementPos + 8);
+        long newPacked2 = packed2 & (~0L >>> endBits) | (value << (BLOCK_SIZE - endBits));

Review comment:
       *OperatorPrecedence:*  Use grouping parenthesis to make the operator precedence explicit [(details)](https://errorprone.info/bugpattern/OperatorPrecedence)
   (at-me [in a reply](https://help.sonatype.com/lift/talking-to-lift) with `help` or `ignore`)

##########
File path: lucene/core/src/java/org/apache/lucene/util/packed/Packed64VHLongAndByte.java
##########
@@ -0,0 +1,154 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.util.packed;
+
+import org.apache.lucene.util.ArrayUtil;
+import org.apache.lucene.util.BitUtil;
+import org.apache.lucene.util.RamUsageEstimator;
+
+import java.util.Arrays;
+
+/**
+ * Space optimized random access capable array of values with a fixed number of bits/value. Values
+ * are packed contiguously.
+ *
+ * <p>The implementation strives to perform as fast as possible under the constraint of contiguous
+ * bits, by avoiding expensive operations. This comes at the cost of code clarity.
+ *
+ * <p>Technical details: This implementation is a refinement of a non-branching version. The
+ * non-branching get and set methods meant that 2 or 4 atomics in the underlying array were always
+ * accessed, even for the cases where only 1 or 2 were needed. Even with caching, this had a
+ * detrimental effect on performance. Related to this issue, the old implementation used lookup
+ * tables for shifts and masks, which also proved to be a bit slower than calculating the shifts and
+ * masks on the fly. See https://issues.apache.org/jira/browse/LUCENE-4062 for details.
+ */
+class Packed64VHLongAndByte extends PackedInts.MutableImpl {
+  static final int BLOCK_SIZE = 64; // 32 = int, 64 = long
+  static final int BYTE_BITS = 3;
+  static final int BYTE_BLOCK_SIZE = 8;
+  static final int BYTE_BLOCK_MOD_MASK = BYTE_BLOCK_SIZE - 1; // x % BYTE_BLOCK
+
+  /** Values are stores contiguously in the blocks array. */
+  private final byte[] blocks;
+  /** A right-aligned mask of width BitsPerValue used by {@link #get(int)}. */
+  private final long maskRight;
+  /** Optimization: Saves one lookup in {@link #get(int)}. */
+  private final int bpvMinusBlockSize;
+
+  /**
+   * Creates an array with the internal structures adjusted for the given limits and initialized to
+   * 0.
+   *
+   * @param valueCount the number of elements.
+   * @param bitsPerValue the number of bits available for any given value.
+   */
+  public Packed64VHLongAndByte(int valueCount, int bitsPerValue) {
+    super(valueCount, bitsPerValue);
+    final PackedInts.Format format = PackedInts.Format.PACKED;
+    final long byteCount = format.byteCount(PackedInts.VERSION_CURRENT, valueCount, bitsPerValue);
+    if (bitsPerValue > 56) {
+      throw new IllegalArgumentException("Only bitsPerValue up to 56 allowed");
+    }
+    if (byteCount > ArrayUtil.MAX_ARRAY_LENGTH) {
+      throw new IllegalArgumentException("Too many values/bits to store");
+    }
+    this.blocks = new byte[(int) byteCount + Long.BYTES];
+    maskRight = ~0L << (BLOCK_SIZE - bitsPerValue) >>> (BLOCK_SIZE - bitsPerValue);
+    bpvMinusBlockSize = bitsPerValue - BLOCK_SIZE;
+  }
+
+  /**
+   * @param index the position of the value.
+   * @return the value at the given index.
+   */
+  @Override
+  public long get(final int index) {
+    // The abstract index in a bit stream
+    final long majorBitPos = (long) index * bitsPerValue;
+    // The index in the backing byte-array
+    final int elementPos = (int) (majorBitPos >>> BYTE_BITS); // BYTE_BLOCK
+    // The number of bits already occupied from the previous position
+    final int maskOffset = (int) majorBitPos & BYTE_BLOCK_MOD_MASK;
+//    final int endBits = maskOffset + bpvMinusBlockSize;
+    // Read the main long
+    //final long packed = ((long) BitUtil.VH_LE_LONG.get(blocks, elementPos) >>> maskOffset) & maskRight;
+    return ((long) BitUtil.VH_LE_LONG.get(blocks, elementPos) >>> maskOffset) & maskRight;
+
+//    if (endBits <= 0) { // Single block
+//      return packed;
+//    }
+//    // compute next mask
+//    return packed | (blocks[elementPos + BYTE_BLOCK_SIZE] & ~(~0L << endBits)) << (BLOCK_SIZE - maskOffset);
+  }
+
+  @Override
+  public void set(final int index, final long value) {
+    // The abstract index in a contiguous bit stream
+    final long majorBitPos = (long) index * bitsPerValue;
+    // The index in the backing byte-array
+    final int elementPos = (int) (majorBitPos >>> BYTE_BITS); // / BYTE_BLOCK
+    // The number of bits already occupied from the previous position
+    final int maskOffset = (int) majorBitPos & BYTE_BLOCK_MOD_MASK;
+    final int endBits = maskOffset + bpvMinusBlockSize;

Review comment:
       *UnusedVariable:*  The local variable 'endBits' is never read. [(details)](https://errorprone.info/bugpattern/UnusedVariable)
   (at-me [in a reply](https://help.sonatype.com/lift/talking-to-lift) with `help` or `ignore`)




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] costin commented on pull request #453: Investigate whether should Packed64 use a byte[] plus VarHandles instead of a long[]

Posted by GitBox <gi...@apache.org>.
costin commented on pull request #453:
URL: https://github.com/apache/lucene/pull/453#issuecomment-973879755


   I've done more benchmarking with a variant of Packed64 that uses the `VarHandle` `int` and `short` variant to read memory when dealing with a smaller bpv (<=32 for `int`, <=16 for `short`) instead of reading a full long.
   However performance takes a huge hit, an order of magnitude slower:
   
   ```
   # VarHandle int (set):
   
   Packed64Benchmark.packed64                         16   10240  thrpt    3  5653670.172 ± 2376991.321  ops/s
   Packed64Benchmark.packed64VarHandleDynamic         16   10240  thrpt    3   349186.319 ±   57043.802  ops/s
   Packed64Benchmark.packed64VarHandleLongAndByte     16   10240  thrpt    3  4856165.100 ±  265007.574  ops/s
   Packed64Benchmark.packed64VarHandleLongLong        16   10240  thrpt    3  5151665.941 ±  567170.748  ops/s
   
   # VarHandle short (set):
   Packed64Benchmark.packed64                         15   10240  thrpt    3  5708794.814 ± 196465.456  ops/s
   Packed64Benchmark.packed64VarHandleDynamic         15   10240  thrpt    3   370572.965 ±  17435.808  ops/s
   Packed64Benchmark.packed64VarHandleLongAndByte     15   10240  thrpt    3  4980979.402 ± 305740.478  ops/s
   Packed64Benchmark.packed64VarHandleLongLong        15   10240  thrpt    3  4913621.842 ± 248031.886  ops/s
   ```
   Changing to a direct array assignment (when dealing with bpv <= 8) improves performance a bit but still far from enough:
   
   ```
   # Direct array assignment
   Benchmark                                       (bpv)  (size)   Mode  Cnt        Score        Error  Units
   Packed64Benchmark.packed64                          7   10240  thrpt    3  5711092.727 ± 373045.792  ops/s
   Packed64Benchmark.packed64VarHandleDynamic          7   10240  thrpt    3   459770.147 ±  10022.225  ops/s
   Packed64Benchmark.packed64VarHandleLongAndByte      7   10240  thrpt    3  5147663.602 ±  98164.084  ops/s
   Packed64Benchmark.packed64VarHandleLongLong         7   10240  thrpt    3  5161701.054 ± 182228.458  ops/s
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org