You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by "Greg Miller (Jira)" <ji...@apache.org> on 2021/03/25 22:57:00 UTC
[jira] [Commented] (LUCENE-9877) Explore increasing the allowable exceptions in PForUtil

    [ https://issues.apache.org/jira/browse/LUCENE-9877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17309021#comment-17309021 ] 

Greg Miller commented on LUCENE-9877:
-------------------------------------

Here are my results from comparing 7 allowable exceptions in PFOR to the baseline 3:

Similar to my approach in [LUCENE-9850|https://issues.apache.org/jira/browse/LUCENE-9850], I threw together a tool that visualizes the bits-per-value used across all PFOR encoded blocks. I also added in a visualization of the exception count used across all PFOR blocks. Here's what I found by building the Wikipedia EN index with the "wikimedium10m" source in luceneutil (all zero-valued histogram entries were truncated for readability):


{code:java}
BASELINE (3 exceptions)

PFOR BPV Histogram
 0 ***                                                 [4.02 %] (239453 of 5959788)
 1 **                                                  [2.78 %] (165409 of 5959788)
 2 **************************                          [51.98 %] (3097737 of 5959788)
 3 *****************                                   [32.61 %] (1943685 of 5959788)
 4 ****                                                [7.06 %] (420552 of 5959788)
 5 *                                                   [1.49 %] (88700 of 5959788)
 6 *                                                   [0.07 %] (4097 of 5959788)
 7 *                                                   [0.00 %] (153 of 5959788)
 8 *                                                   [0.00 %] (2 of 5959788)
Total bytes used: 234757569
PFOR Exceptions Used Histogram
 0 **************************                          [51.83 %] (3088921 of 5959788)
 1 ************                                        [22.78 %] (1357732 of 5959788)
 2 ********                                            [14.63 %] (872061 of 5959788)
 3 ******                                              [10.76 %] (641074 of 5959788)

CANDIDATE (7 exceptions)

PFOR BPV Histogram
 0 ***                                                 [4.02 %] (239583 of 5959788)
 1 *****                                               [8.75 %] (521487 of 5959788)
 2 ********************************                    [62.40 %] (3718819 of 5959788)
 3 ***********                                         [20.26 %] (1207623 of 5959788)
 4 **                                                  [3.71 %] (221391 of 5959788)
 5 *                                                   [0.82 %] (48939 of 5959788)
 6 *                                                   [0.03 %] (1868 of 5959788)
 7 *                                                   [0.00 %] (76 of 5959788)
 8 *                                                   [0.00 %] (2 of 5959788)
Total bytes used: 216891765
PFOR Exceptions Used Histogram
 0 ****************                                    [30.93 %] (1843145 of 5959788)
 1 **********                                          [18.46 %] (1100315 of 5959788)
 2 *******                                             [13.38 %] (797217 of 5959788)
 3 ******                                              [10.34 %] (615991 of 5959788)
 4 *****                                               [8.36 %] (498042 of 5959788)
 5 ****                                                [7.03 %] (419005 of 5959788)
 6 ****                                                [6.11 %] (364388 of 5959788)
 7 ***                                                 [5.40 %] (321685 of 5959788)
{code}

So we can observe the additional exceptions getting utilized and the bits-per-value generally shifting lower. This seems to amount to a ~7.6% decrease in space used for all of the PFOR blocks in aggregate.

Looking at the overall index size though, I only see a ~0.5% decrease in the index size. :

{code:java}
3816536 wikimedium10m.baseline.facets.taxonomy:Date.taxonomy:Month.taxonomy:DayOfYear.sortedset:Month.sortedset:DayOfYear.Lucene90.Lucene90.nd10M
3795532 wikimedium10m.candidate.facets.taxonomy:Date.taxonomy:Month.taxonomy:DayOfYear.sortedset:Month.sortedset:DayOfYear.Lucene90.Lucene90.nd10M
{code}

I suspect the PFOR data in this index (frequencies, positions, offsets and payloads) just aren't a big enough portion of the overall index to move the needle much in overall size. If my tool is accurate, it looks like only ~0.2 GB of the ~3.8 GB index is PFOR data.

As for performance implications, running a luceneutil benchmark (using "-source wikimedium10m") shows the following:

{code:java}
                    TaskQPS baseline      StdDevQPS pfor 7 exceptions      StdDev                Pct diff p-value
   HighTermDayOfYearSort      139.37     (10.8%)      135.30     (10.2%)   -2.9% ( -21% -   20%) 0.379
       HighTermMonthSort      134.56     (13.4%)      130.65     (11.9%)   -2.9% ( -24% -   25%) 0.470
              TermDTSort      100.66     (12.9%)       97.96     (13.5%)   -2.7% ( -25% -   27%) 0.521
    HighTermTitleBDVSort      130.20      (9.6%)      127.06      (8.4%)   -2.4% ( -18% -   17%) 0.398
         MedSloppyPhrase       45.83      (3.7%)       44.82      (3.8%)   -2.2% (  -9% -    5%) 0.063
           OrHighNotHigh      505.30      (3.7%)      498.15      (4.6%)   -1.4% (  -9% -    7%) 0.288
               OrHighLow      413.80      (5.0%)      408.04      (4.9%)   -1.4% ( -10% -    8%) 0.371
              AndHighLow      575.78      (3.5%)      568.21      (3.6%)   -1.3% (  -8% -    5%) 0.241
               MedPhrase       19.20      (2.5%)       18.98      (2.4%)   -1.1% (  -5% -    3%) 0.138
            OrHighNotLow      504.91      (4.7%)      499.19      (4.2%)   -1.1% (  -9% -    8%) 0.423
         LowSloppyPhrase       44.42      (4.0%)       44.01      (3.1%)   -0.9% (  -7% -    6%) 0.413
            OrNotHighLow      789.01      (3.7%)      781.98      (4.4%)   -0.9% (  -8% -    7%) 0.488
             LowSpanNear      100.30      (2.1%)       99.42      (2.0%)   -0.9% (  -4% -    3%) 0.178
    HighIntervalsOrdered       13.86      (2.0%)       13.75      (1.9%)   -0.8% (  -4% -    3%) 0.174
                 Prefix3      230.15      (2.8%)      228.24      (3.3%)   -0.8% (  -6% -    5%) 0.394
        HighSloppyPhrase        1.77      (3.8%)        1.76      (3.4%)   -0.8% (  -7% -    6%) 0.477
   BrowseMonthTaxoFacets        5.01      (2.2%)        4.97      (3.5%)   -0.8% (  -6% -    4%) 0.396
             AndHighHigh       35.38      (2.8%)       35.12      (3.1%)   -0.7% (  -6% -    5%) 0.428
                Wildcard       63.19      (1.9%)       62.83      (1.9%)   -0.6% (  -4% -    3%) 0.337
            HighSpanNear       11.63      (2.8%)       11.57      (2.2%)   -0.6% (  -5% -    4%) 0.492
            OrHighNotMed      516.53      (4.0%)      513.71      (4.1%)   -0.5% (  -8% -    7%) 0.670
BrowseDayOfYearTaxoFacets        4.20      (2.5%)        4.18      (2.9%)   -0.5% (  -5% -    5%) 0.541
               LowPhrase       51.17      (3.0%)       50.97      (3.0%)   -0.4% (  -6% -    5%) 0.674
                 MedTerm     1080.72      (5.4%)     1076.53      (4.9%)   -0.4% ( -10% -   10%) 0.812
    BrowseDateTaxoFacets        4.19      (2.5%)        4.18      (2.8%)   -0.4% (  -5% -    5%) 0.657
                HighTerm      923.10      (3.8%)      920.08      (4.1%)   -0.3% (  -7% -    7%) 0.795
                 Respell       40.97      (1.8%)       40.84      (2.2%)   -0.3% (  -4% -    3%) 0.620
               OrHighMed       88.93      (2.7%)       88.72      (1.7%)   -0.2% (  -4% -    4%) 0.739
              HighPhrase      314.90      (2.9%)      314.29      (4.1%)   -0.2% (  -6% -    6%) 0.863
BrowseDayOfYearSSDVFacets       11.39      (5.8%)       11.38      (5.8%)   -0.1% ( -11% -   12%) 0.957
   BrowseMonthSSDVFacets       12.55      (3.3%)       12.54      (3.3%)   -0.1% (  -6% -    6%) 0.924
                PKLookup      146.67      (4.3%)      146.58      (3.0%)   -0.1% (  -7% -    7%) 0.960
                  IntNRQ       53.49      (0.7%)       53.50      (0.7%)    0.0% (  -1% -    1%) 0.935
           OrNotHighHigh      480.86      (3.8%)      481.91      (3.9%)    0.2% (  -7% -    8%) 0.857
             MedSpanNear      130.25      (2.5%)      130.55      (2.5%)    0.2% (  -4% -    5%) 0.775
              AndHighMed      159.94      (2.4%)      160.41      (2.6%)    0.3% (  -4% -    5%) 0.704
                 LowTerm     1030.42      (3.7%)     1034.09      (5.3%)    0.4% (  -8% -    9%) 0.804
              OrHighHigh       14.59      (2.2%)       14.65      (1.3%)    0.4% (  -3% -    4%) 0.514
                  Fuzzy2       51.29      (7.7%)       51.74      (9.7%)    0.9% ( -15% -   19%) 0.755
            OrNotHighMed      480.49      (2.9%)      485.66      (3.9%)    1.1% (  -5% -    8%) 0.326
                  Fuzzy1       35.35      (7.6%)       35.84      (7.5%)    1.4% ( -12% -   17%) 0.562
{code}

So while the index size reduction is relatively small, I don't see a real significant performance impact either. Curious if anyone else sees something interesting in these numbers.

Finally, I'll mention that I observed pretty similar behavior when running our internal (Amazon product search) benchmarking tools. Pretty small impact to index size but no significant performance hit either.

> Explore increasing the allowable exceptions in PForUtil
> -------------------------------------------------------
>
>                 Key: LUCENE-9877
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9877
>             Project: Lucene - Core
>          Issue Type: Task
>          Components: core/codecs
>    Affects Versions: main (9.0)
>            Reporter: Greg Miller
>            Priority: Minor
>
> Piggybacking a little off of the investigation I was doing over in LUCENE-9850 I thought it might also be worth-while exploring the impact of increasing the number of allowable exceptions in PForUtil. The aim of this investigation is to see if we could reduce index size by allowing for more exceptions without significant negative impact to performance.
> PForUtil currently allows for up to 3 exceptions, and it only uses 3 bits to encode the number of exceptions (using the remaining 3 bits of the byte used to also encode the number of bits-per-value, which requires 5 bits). Each exception used is encoded with a full byte, using a maximum of 3 bytes per block.
> It seems to me like 14 might be a more ideal number of exceptions if index size is the driving motivation. My thought process is that, in the worst-case, 14 exceptions would be used to save only a single bit-per-value in the corresponding block. With 128 entries per block, this would save 16 bytes. To encode the exception count, we'd need to use an extra byte per block since we need more than the current 3 bits. So with that extra byte in addition to the 14 bytes used to encode the exception values, we would save a single byte in total (just slightly better than breaking even). If we need fewer than the 14 exceptions, or if we're able to save more than 1 bit-per-value, it's all additional savings.
> As a first experiment, I think testing 7 exceptions might be a good middle-ground. Seven exceptions is the most we can encode with the 3 bits we have allocated today. So for 7 exceptions, we don't need the extra byte for encoding the exception count, and in the worst case of needing all 7 to save 1 bit-per-value, we'd save a total of 9 bytes in a block. I suppose the question is what kind of performance hit we might observe due to decoding more exceptions.
> I'll post some results here for discussion or at least for public record of my work for future reference.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org