Posted to dev@lucene.apache.org by "Luca Cavanna (JIRA)" <ji...@apache.org> on 2019/05/09 15:22:00 UTC
[jira] [Comment Edited] (LUCENE-8796) Use exponential search in IntArrayDocIdSet advance method

    [ https://issues.apache.org/jira/browse/LUCENE-8796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16836467#comment-16836467 ] 

Luca Cavanna edited comment on LUCENE-8796 at 5/9/19 3:21 PM:
--------------------------------------------------------------

I have updated the PR after applying Yonik's suggestion and re-ran the benchmarks a few times. The run with the least noise produced these results (note that I disabled the bitset optimization on both sides):
{noformat}
Report after iter 19:
                    TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff
                HighTerm     1575.07      (5.9%)     1541.27      (6.9%)   -2.1% ( -14% -   11%)
                 MedTerm     1363.22      (6.5%)     1337.03      (7.0%)   -1.9% ( -14% -   12%)
                 LowTerm     1441.86      (4.2%)     1420.77      (5.2%)   -1.5% ( -10% -    8%)
       IntNRQConjMedTerm      280.55      (4.0%)      277.64      (4.1%)   -1.0% (  -8% -    7%)
               MedPhrase      153.84      (3.5%)      152.44      (3.3%)   -0.9% (  -7% -    6%)
                 Prefix3      224.92      (4.0%)      223.13      (3.7%)   -0.8% (  -8% -    7%)
        HighSloppyPhrase       19.70      (3.7%)       19.56      (4.5%)   -0.7% (  -8% -    7%)
         MedSloppyPhrase       18.23      (4.3%)       18.11      (4.7%)   -0.7% (  -9% -    8%)
            OrNotHighMed      586.33      (3.4%)      582.47      (4.9%)   -0.7% (  -8% -    7%)
         LowSloppyPhrase       18.56      (3.6%)       18.46      (3.9%)   -0.5% (  -7% -    7%)
              HighPhrase       22.64      (2.7%)       22.54      (3.0%)   -0.4% (  -6% -    5%)
               LowPhrase      144.10      (3.8%)      143.55      (3.3%)   -0.4% (  -7% -    6%)
              AndHighLow      539.26      (3.7%)      537.25      (3.2%)   -0.4% (  -7% -    6%)
                PKLookup      132.96      (3.0%)      132.48      (4.6%)   -0.4% (  -7% -    7%)
               OrHighMed      115.79      (2.7%)      115.49      (3.5%)   -0.3% (  -6% -    6%)
      PrefixConjHighTerm       36.98      (2.8%)       36.93      (3.4%)   -0.1% (  -6% -    6%)
    WildcardConjHighTerm       45.79      (3.0%)       45.73      (3.1%)   -0.1% (  -6% -    6%)
               OrHighLow      448.91      (3.7%)      448.70      (6.3%)   -0.0% (  -9% -   10%)
                Wildcard       78.89      (3.2%)       78.95      (3.6%)    0.1% (  -6% -    7%)
      IntNRQConjHighTerm       78.35      (2.3%)       78.48      (2.4%)    0.2% (  -4% -    4%)
                  IntNRQ      100.56      (2.7%)      100.84      (2.8%)    0.3% (  -5% -    5%)
            OrHighNotLow      732.45      (2.8%)      734.56      (5.3%)    0.3% (  -7% -    8%)
           OrHighNotHigh      544.87      (2.8%)      546.47      (4.6%)    0.3% (  -6% -    7%)
       IntNRQConjLowTerm      249.20      (4.2%)      249.99      (3.8%)    0.3% (  -7% -    8%)
                 Respell       73.05      (3.1%)       73.28      (3.4%)    0.3% (  -6% -    7%)
              OrHighHigh       35.56      (3.0%)       35.68      (4.2%)    0.3% (  -6% -    7%)
            OrNotHighLow      695.41      (4.8%)      697.88      (6.5%)    0.4% ( -10% -   12%)
             MedSpanNear       59.99      (3.8%)       60.30      (4.0%)    0.5% (  -7% -    8%)
              AndHighMed      190.02      (3.1%)      191.04      (3.6%)    0.5% (  -5% -    7%)
             LowSpanNear       12.73      (3.9%)       12.81      (4.2%)    0.6% (  -7% -    8%)
   HighTermDayOfYearSort       88.42      (7.0%)       89.09      (7.1%)    0.8% ( -12% -   15%)
       PrefixConjLowTerm       54.95      (3.7%)       55.43      (3.8%)    0.9% (  -6% -    8%)
            OrHighNotMed      628.44      (3.4%)      634.02      (6.1%)    0.9% (  -8% -   10%)
            HighSpanNear       28.86      (3.2%)       29.11      (3.5%)    0.9% (  -5% -    7%)
     WildcardConjMedTerm       72.48      (3.4%)       73.19      (4.8%)    1.0% (  -7% -    9%)
                  Fuzzy2       49.17      (9.9%)       49.68     (11.7%)    1.0% ( -18% -   25%)
             AndHighHigh       63.44      (3.8%)       64.11      (3.8%)    1.1% (  -6% -    9%)
                  Fuzzy1       79.43      (9.9%)       80.55      (9.7%)    1.4% ( -16% -   23%)
           OrNotHighHigh      574.89      (3.6%)      584.43      (5.5%)    1.7% (  -7% -   11%)
       PrefixConjMedTerm       79.00      (3.2%)       80.50      (3.6%)    1.9% (  -4% -    8%)
     WildcardConjLowTerm       90.67      (2.9%)       92.49      (3.7%)    2.0% (  -4% -    8%)
       HighTermMonthSort       86.13     (11.8%)       88.79     (12.4%)    3.1% ( -18% -   30%)
{noformat}
I also ran benchmarks with the bitset optimization in place on both ends:

{noformat}
Report after iter 19:
                    TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff
                  IntNRQ       63.46     (24.6%)       62.28     (24.2%)   -1.9% ( -40% -   62%)
            OrNotHighMed      596.89      (3.5%)      589.18      (4.6%)   -1.3% (  -9% -    7%)
           OrNotHighHigh      769.65      (3.1%)      760.89      (2.9%)   -1.1% (  -6% -    4%)
                  Fuzzy1       76.45      (6.8%)       75.62      (7.4%)   -1.1% ( -14% -   14%)
           OrHighNotHigh      626.48      (3.4%)      619.67      (2.6%)   -1.1% (  -6% -    5%)
                 MedTerm     1345.24      (3.6%)     1332.07      (3.7%)   -1.0% (  -7% -    6%)
                PKLookup      136.70      (4.0%)      135.57      (3.9%)   -0.8% (  -8% -    7%)
                HighTerm     1103.39      (3.5%)     1095.23      (3.7%)   -0.7% (  -7% -    6%)
            OrNotHighLow      780.85      (3.8%)      775.40      (2.7%)   -0.7% (  -6% -    5%)
            OrHighNotLow      815.32      (4.7%)      810.72      (5.7%)   -0.6% ( -10% -   10%)
                 Prefix3      310.00      (4.9%)      308.31      (3.7%)   -0.5% (  -8% -    8%)
                 LowTerm     1462.30      (3.7%)     1455.27      (4.4%)   -0.5% (  -8% -    7%)
               OrHighLow      446.56      (4.6%)      445.11      (3.2%)   -0.3% (  -7% -    7%)
              AndHighLow      594.90      (2.9%)      593.39      (3.4%)   -0.3% (  -6% -    6%)
                 Respell       64.46      (2.5%)       64.36      (2.6%)   -0.2% (  -5% -    5%)
            OrHighNotMed      685.98      (4.5%)      685.69      (3.7%)   -0.0% (  -7% -    8%)
               OrHighMed       67.90      (5.1%)       67.91      (3.4%)    0.0% (  -8% -    8%)
                  Fuzzy2       50.18      (4.5%)       50.21      (5.8%)    0.1% (  -9% -   10%)
             LowSpanNear       59.27      (3.9%)       59.34      (4.0%)    0.1% (  -7% -    8%)
              OrHighHigh       30.89      (5.2%)       30.94      (3.3%)    0.2% (  -7% -    9%)
               LowPhrase      114.67      (3.1%)      114.87      (2.5%)    0.2% (  -5% -    5%)
              HighPhrase       22.34      (2.7%)       22.42      (2.2%)    0.4% (  -4% -    5%)
             AndHighHigh       59.53      (3.8%)       59.89      (4.4%)    0.6% (  -7% -    9%)
               MedPhrase       29.99      (2.9%)       30.19      (2.3%)    0.7% (  -4% -    6%)
         MedSloppyPhrase       71.57      (3.1%)       72.10      (3.0%)    0.7% (  -5% -    7%)
      IntNRQConjHighTerm      113.74      (7.3%)      114.66      (7.1%)    0.8% ( -12% -   16%)
         LowSloppyPhrase       14.18      (3.4%)       14.30      (2.6%)    0.8% (  -4% -    6%)
       PrefixConjLowTerm       89.05      (4.6%)       89.80      (5.1%)    0.8% (  -8% -   11%)
              AndHighMed      166.34      (3.1%)      167.76      (3.8%)    0.9% (  -5% -    7%)
     WildcardConjMedTerm       51.44      (2.6%)       51.88      (3.0%)    0.9% (  -4% -    6%)
       PrefixConjMedTerm       68.16      (4.8%)       68.80      (4.6%)    0.9% (  -8% -   10%)
      PrefixConjHighTerm       42.34      (6.1%)       42.81      (5.0%)    1.1% (  -9% -   13%)
             MedSpanNear       15.57      (5.5%)       15.74      (5.4%)    1.1% (  -9% -   12%)
     WildcardConjLowTerm       51.56      (3.7%)       52.15      (4.2%)    1.1% (  -6% -    9%)
            HighSpanNear        5.66      (5.8%)        5.73      (5.9%)    1.2% (  -9% -   13%)
       IntNRQConjLowTerm      120.28      (8.5%)      121.67      (8.8%)    1.2% ( -14% -   20%)
    WildcardConjHighTerm       55.43      (3.2%)       56.10      (3.4%)    1.2% (  -5% -    8%)
       IntNRQConjMedTerm       97.79      (8.3%)       98.98      (8.6%)    1.2% ( -14% -   19%)
                Wildcard      106.37      (2.9%)      107.75      (3.6%)    1.3% (  -5% -    7%)
        HighSloppyPhrase       18.21      (4.9%)       18.48      (4.4%)    1.5% (  -7% -   11%)
       HighTermMonthSort      146.10     (11.0%)      148.89     (10.5%)    1.9% ( -17% -   26%)
   HighTermDayOfYearSort       68.62      (6.1%)       70.08      (3.9%)    2.1% (  -7% -   12%)
{noformat}

I will next have a look at what Atri is suggesting.



> Use exponential search in IntArrayDocIdSet advance method
> ---------------------------------------------------------
>
>                 Key: LUCENE-8796
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8796
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Luca Cavanna
>            Priority: Minor
>
> In a chat, [~jpountz] suggested improving IntArrayDocIdSet by making its advance method use exponential search instead of binary search. This should help the performance of queries that include conjunctions: because ConjunctionDISI uses leapfrogging, it advances through doc IDs in small steps, so exponential search should on average be faster than binary search when advancing.
>  
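
For readers unfamiliar with the technique, here is a minimal sketch of what an exponential-search advance could look like. It is not the code from the PR and not Lucene's actual IntArrayDocIdSet: the class name, fields, and the firstIndexAtLeast helper are all made up for illustration, and the sketch assumes a sorted array of unique doc IDs. The idea is to probe at doubling distances from the current position until a doc >= target is found (or the array ends), then binary search only inside the last doubled window.

```java
import java.util.Arrays;

/**
 * Sketch only: a hypothetical iterator over a sorted int[] of doc IDs.
 * This is NOT Lucene's IntArrayDocIdSet; all names here are illustrative.
 */
final class ExponentialAdvanceSketch {
  static final int NO_MORE_DOCS = Integer.MAX_VALUE;

  private final int[] docs; // assumed sorted ascending, no duplicates
  private final int length; // number of valid entries in docs
  private int i = -1;       // index of the current doc, -1 before iteration starts

  ExponentialAdvanceSketch(int[] docs, int length) {
    this.docs = docs;
    this.length = length;
  }

  /** Moves to the first doc >= target located after the current one, or NO_MORE_DOCS. */
  int advance(int target) {
    i = firstIndexAtLeast(docs, length, i + 1, target);
    return i < length ? docs[i] : NO_MORE_DOCS;
  }

  /** Index of the first entry >= target in docs[from, length), or length if there is none. */
  static int firstIndexAtLeast(int[] docs, int length, int from, int target) {
    if (from >= length) {
      return length;
    }
    // Gallop: probe from+1, from+2, from+4, ... until a doc >= target or the end is hit.
    // A long is used so that from + bound cannot overflow.
    long bound = 1;
    while (from + bound < length && docs[(int) (from + bound)] < target) {
      bound <<= 1;
    }
    // The answer lies between the last probe that was < target and the current probe.
    int lo = (int) (from + bound / 2);
    int hi = (int) Math.min(from + bound + 1, length); // exclusive upper bound
    int idx = Arrays.binarySearch(docs, lo, hi, target);
    return idx >= 0 ? idx : -1 - idx; // a negative result encodes the insertion point
  }

  public static void main(String[] args) {
    int[] docs = {2, 5, 9, 14, 20, 33, 47, 60};
    ExponentialAdvanceSketch it = new ExponentialAdvanceSketch(docs, docs.length);
    System.out.println(it.advance(3));
    System.out.println(it.advance(47));
  }
}
```

When the target is close to the current position, as is expected with leapfrogging, the galloping loop stops after a handful of probes, whereas a plain binary search over the whole remaining array always costs about log2(remaining) comparisons.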


