Posted to issues@lucene.apache.org by "Adrien Grand (Jira)" <ji...@apache.org> on 2021/07/27 16:49:00 UTC
[jira] [Commented] (LUCENE-10033) Encode doc values in smaller blocks of values, like postings

    [ https://issues.apache.org/jira/browse/LUCENE-10033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17388178#comment-17388178 ] 

Adrien Grand commented on LUCENE-10033:
---------------------------------------

I opened a PR with this idea. Queries that consume most values, like the Browse* faceting tasks, become faster, but queries that only consume a small subset of values, like some sorting tasks, become slower (not all of them: one of the sorting tasks is faster).

{noformat}
                   Task QPS baseline      StdDev   QPS patch      StdDev                Pct diff p-value
       HighTermMonthSort      101.33      (9.7%)       51.93      (2.8%)  -48.7% ( -55% -  -40%) 0.000
              TermDTSort      587.24      (6.1%)      404.20      (2.9%)  -31.2% ( -37% -  -23%) 0.000
                  IntNRQ       85.55     (14.7%)       73.16      (1.6%)  -14.5% ( -26% -    2%) 0.000
            OrHighNotMed     1301.37      (3.7%)     1218.64      (2.3%)   -6.4% ( -11% -    0%) 0.000
           OrNotHighHigh     1121.91      (4.1%)     1089.27      (2.7%)   -2.9% (  -9% -    4%) 0.008
                 MedTerm     2156.71      (3.3%)     2103.32      (3.6%)   -2.5% (  -9% -    4%) 0.022
                  Fuzzy2       67.41      (4.6%)       65.74      (4.9%)   -2.5% ( -11% -    7%) 0.098
            OrNotHighLow     1099.66      (3.7%)     1078.60      (3.0%)   -1.9% (  -8% -    4%) 0.073
     MedIntervalsOrdered       79.39      (3.0%)       77.94      (3.7%)   -1.8% (  -8% -    5%) 0.088
               MedPhrase      403.62      (2.8%)      397.19      (2.3%)   -1.6% (  -6% -    3%) 0.050
               OrHighMed      130.57      (3.0%)      128.64      (2.6%)   -1.5% (  -6% -    4%) 0.099
     LowIntervalsOrdered       20.82      (2.5%)       20.55      (3.4%)   -1.3% (  -6% -    4%) 0.167
    HighIntervalsOrdered        2.95      (5.1%)        2.91      (5.8%)   -1.1% ( -11% -   10%) 0.530
               OrHighLow      579.45      (2.9%)      574.45      (2.4%)   -0.9% (  -5% -    4%) 0.306
             LowSpanNear       33.20      (2.9%)       33.06      (3.5%)   -0.4% (  -6% -    6%) 0.668
            HighSpanNear        9.79      (3.5%)        9.79      (3.7%)   -0.0% (  -7% -    7%) 0.996
                 Respell      221.47      (2.1%)      221.62      (2.8%)    0.1% (  -4% -    4%) 0.931
        HighSloppyPhrase       36.64      (3.4%)       36.69      (4.0%)    0.1% (  -7% -    7%) 0.915
                Wildcard      283.85      (6.5%)      285.06      (7.2%)    0.4% ( -12% -   15%) 0.845
         LowSloppyPhrase      175.77      (4.3%)      176.56      (4.4%)    0.5% (  -7% -    9%) 0.740
             AndHighHigh       64.34      (2.5%)       64.84      (3.4%)    0.8% (  -5% -    6%) 0.410
                HighTerm     2146.56      (3.3%)     2164.26      (4.5%)    0.8% (  -6% -    8%) 0.505
    HighTermTitleBDVSort       27.18      (4.6%)       27.41      (2.1%)    0.8% (  -5% -    7%) 0.461
            OrHighNotLow     1261.38      (2.3%)     1274.89      (3.0%)    1.1% (  -4% -    6%) 0.210
             MedSpanNear       26.96      (4.1%)       27.28      (3.5%)    1.2% (  -6% -    9%) 0.336
         MedSloppyPhrase      102.18      (4.7%)      103.51      (5.1%)    1.3% (  -8% -   11%) 0.399
    BrowseDateTaxoFacets        3.15      (4.0%)        3.19      (4.0%)    1.4% (  -6% -    9%) 0.281
BrowseDayOfYearTaxoFacets        3.15      (4.0%)        3.20      (4.0%)    1.5% (  -6% -    9%) 0.250
              AndHighLow     1295.59      (3.3%)     1318.11      (3.4%)    1.7% (  -4% -    8%) 0.105
                 Prefix3       63.21     (15.4%)       64.49     (17.1%)    2.0% ( -26% -   40%) 0.694
              OrHighHigh       35.41      (3.1%)       36.24      (3.1%)    2.4% (  -3% -    8%) 0.015
                  Fuzzy1      253.74      (6.1%)      260.89      (7.1%)    2.8% (  -9% -   16%) 0.175
   BrowseMonthTaxoFacets        3.42      (7.7%)        3.52      (4.1%)    2.9% (  -8% -   15%) 0.135
              AndHighMed      164.48      (2.6%)      169.43      (3.3%)    3.0% (  -2% -    9%) 0.001
                 LowTerm     2645.26      (4.9%)     2752.43      (5.6%)    4.1% (  -6% -   15%) 0.015
           OrHighNotHigh     1286.12      (3.7%)     1349.66      (4.6%)    4.9% (  -3% -   13%) 0.000
              HighPhrase      105.61      (3.7%)      111.65      (4.8%)    5.7% (  -2% -   14%) 0.000
               LowPhrase       35.85      (2.6%)       38.76      (3.3%)    8.1% (   2% -   14%) 0.000
            OrNotHighMed     1241.35      (3.1%)     1368.49      (3.6%)   10.2% (   3% -   17%) 0.000
   HighTermDayOfYearSort      573.92      (9.5%)      687.19      (7.9%)   19.7% (   2% -   40%) 0.000
   BrowseMonthSSDVFacets       11.52      (5.1%)       17.81     (23.5%)   54.6% (  24% -   87%) 0.000
BrowseDayOfYearSSDVFacets       11.24      (3.9%)       18.15     (23.1%)   61.4% (  33% -   91%) 0.000
{noformat}

> Encode doc values in smaller blocks of values, like postings
> ------------------------------------------------------------
>
>                 Key: LUCENE-10033
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10033
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Priority: Minor
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> This is a follow-up to the discussion on this thread: https://lists.apache.org/thread.html/r7b757074d5f02874ce3a295b0007dff486bc10d08fb0b5e5a4ba72c5%40%3Cdev.lucene.apache.org%3E.
> Our current approach for doc values uses large blocks of 16k values where values can be decompressed independently, using DirectWriter/DirectReader. This is a bit inefficient in some cases, e.g. a single outlier can grow the number of bits per value for the entire block, we can't easily use run-length compression, etc. Plus, it encourages using a different sub-class for every compression technique, which puts pressure on the JVM.
> We'd like to move to an approach that would be more similar to postings, with smaller blocks (e.g. 128 values) whose values all get decompressed at once (using SIMD instructions), and with skip data within blocks in order to efficiently skip to arbitrary doc IDs (or maybe still use jump tables, as today's doc values do, and as discussed here for postings: https://lists.apache.org/thread.html/r7c3cb7ab143fd4ecbc05c04064d10ef9fb50c5b4d6479b0f35732677%40%3Cdev.lucene.apache.org%3E).
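
The outlier effect mentioned in the issue description can be illustrated with a rough sketch. This is not Lucene's DirectWriter/DirectReader code: the class BlockBitsSketch and its methods are hypothetical, values are assumed to be non-negative, every block is assumed to use a single bit width for all of its values, and per-block metadata is ignored.

```java
import java.util.Arrays;

/** Hypothetical sketch: cost of one outlier with large vs. small blocks. */
public class BlockBitsSketch {

    /** Bits needed for the largest value in values[from, to), at least 1. Assumes non-negative values. */
    static int bitsRequired(long[] values, int from, int to) {
        long max = 0;
        for (int i = from; i < to; i++) {
            max = Math.max(max, values[i]);
        }
        return Math.max(1, 64 - Long.numberOfLeadingZeros(max));
    }

    /** Total payload bits when each block stores all of its values with one shared bit width. */
    static long totalBits(long[] values, int blockSize) {
        long total = 0;
        for (int from = 0; from < values.length; from += blockSize) {
            int to = Math.min(values.length, from + blockSize);
            total += (long) bitsRequired(values, from, to) * (to - from);
        }
        return total;
    }

    public static void main(String[] args) {
        long[] values = new long[16384];
        Arrays.fill(values, 3L);   // every value fits in 2 bits
        values[5000] = 1L << 40;   // a single outlier that needs 41 bits

        // One block of 16384 values: the outlier forces 41 bits on every value.
        System.out.println(totalBits(values, 16384)); // 671744
        // Blocks of 128 values: only the block holding the outlier pays 41 bits per value.
        System.out.println(totalBits(values, 128));   // 37760
    }
}
```

In this toy data set the outlier only inflates the one 128-value block that contains it, while with a single 16k block it inflates all values. A real format would also pay for per-block headers and skip data, which this sketch leaves out.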


