Posted to dev@lucene.apache.org by "Adrien Grand (JIRA)" <ji...@apache.org> on 2016/10/19 09:31:58 UTC
[jira] [Updated] (LUCENE-7462) Faster search APIs for doc values

     [ https://issues.apache.org/jira/browse/LUCENE-7462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adrien Grand updated LUCENE-7462:
---------------------------------
    Attachment: LUCENE-7462-advanceExact.patch

I have been playing with the idea of adding an advanceExact method (which I guess is the alternative to adding a second search API for doc values). It takes pressure off consumers, since the method can be called blindly: it never advances beyond the target document. It also takes pressure off the codec, since it no longer has to find the next document that has a value.
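
A minimal sketch of the contract being explored, assuming that advanceExact(target) returns whether the target document has a value and never positions the iterator past the target. The semantics are not settled yet, and ToySparseValues, sumWithAdvance and sumWithAdvanceExact are hypothetical names used for illustration, not Lucene classes or code from the patch.

```java
import java.util.Arrays;

/** Toy, array-backed stand-in for a sparse numeric doc-values iterator (not Lucene code). */
final class ToySparseValues {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    private final int[] docs;    // sorted doc IDs that have a value
    private final long[] values; // values[i] belongs to docs[i]
    private int idx = -1;        // index into docs of the current position

    ToySparseValues(int[] docs, long[] values) {
        this.docs = docs;
        this.values = values;
    }

    /** Moves to the first doc >= target that has a value and returns its ID. */
    int advance(int target) {
        int i = Arrays.binarySearch(docs, target);
        if (i < 0) {
            i = -i - 1; // insertion point: first doc with a value after target
        }
        idx = i;
        return i < docs.length ? docs[i] : NO_MORE_DOCS;
    }

    /** Assumed contract: true iff target has a value; never moves past target. */
    boolean advanceExact(int target) {
        int i = Arrays.binarySearch(docs, target);
        if (i >= 0) {
            idx = i;
            return true;
        }
        return false;
    }

    /** Only valid right after a successful positioning call. */
    long longValue() {
        return values[idx];
    }

    /** Consumer using advance: must remember the position and compare doc IDs. */
    static long sumWithAdvance(ToySparseValues dv, int[] hits) {
        long sum = 0;
        int cur = -1;
        for (int doc : hits) { // hits must be in increasing order
            if (cur < doc) {
                cur = dv.advance(doc);
            }
            if (cur == doc) {
                sum += dv.longValue();
            }
        }
        return sum;
    }

    /** Consumer using advanceExact: a single blind call per hit. */
    static long sumWithAdvanceExact(ToySparseValues dv, int[] hits) {
        long sum = 0;
        for (int doc : hits) {
            if (dv.advanceExact(doc)) {
                sum += dv.longValue();
            }
        }
        return sum;
    }
}
```

With values 10, 40 and 70 on documents 1, 4 and 7, and hits 0, 1, 2, 4, 5, 9, both consumers return 50. The second one needs no bookkeeping of the current position.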

I ran the wikimedium10m benchmark, to which I added the sorting tasks from the nightly benchmark to check the impact. There seems to be a consistent speedup for queries for which norms are the bottleneck (term queries and simple conjunctions/disjunctions) and for sorted queries (TermTitleSort and TermDTSort).

{noformat}
                    Task QPS baseline      StdDev   QPS patch      StdDev                Pct diff
                  Fuzzy2       55.31     (20.1%)       54.45     (18.5%)   -1.6% ( -33% -   46%)
            OrNotHighLow      875.16      (3.3%)      870.60      (2.9%)   -0.5% (  -6% -    5%)
         MedSloppyPhrase      210.38      (3.9%)      209.40      (3.8%)   -0.5% (  -7% -    7%)
         LowSloppyPhrase      126.86      (2.5%)      126.74      (2.1%)   -0.1% (  -4% -    4%)
              AndHighMed      151.22      (1.7%)      151.30      (2.3%)    0.0% (  -3% -    4%)
             LowSpanNear       20.08      (2.6%)       20.10      (2.9%)    0.1% (  -5% -    5%)
                 Respell       77.27      (3.8%)       77.36      (3.5%)    0.1% (  -6% -    7%)
               LowPhrase       42.32      (2.1%)       42.40      (1.9%)    0.2% (  -3% -    4%)
              HighPhrase       20.01      (4.1%)       20.06      (3.7%)    0.3% (  -7% -    8%)
                Wildcard       46.20      (3.5%)       46.32      (3.9%)    0.3% (  -6% -    7%)
        HighSloppyPhrase       15.99      (5.1%)       16.04      (4.9%)    0.3% (  -9% -   10%)
                 Prefix3       43.21      (2.9%)       43.39      (3.1%)    0.4% (  -5% -    6%)
               MedPhrase      151.07      (3.4%)      151.69      (3.7%)    0.4% (  -6% -    7%)
            OrNotHighMed      151.21      (2.3%)      151.98      (2.6%)    0.5% (  -4% -    5%)
             AndHighHigh       58.73      (1.4%)       59.05      (1.4%)    0.5% (  -2% -    3%)
             MedSpanNear       22.36      (1.6%)       22.48      (1.6%)    0.6% (  -2% -    3%)
                  IntNRQ       13.75     (12.5%)       13.83     (13.1%)    0.6% ( -22% -   29%)
            OrHighNotMed       62.26      (2.7%)       62.70      (3.2%)    0.7% (  -5% -    6%)
           OrNotHighHigh       58.38      (2.6%)       58.82      (2.4%)    0.7% (  -4% -    5%)
            HighSpanNear       39.78      (2.2%)       40.09      (3.0%)    0.8% (  -4% -    6%)
           OrHighNotHigh       44.88      (2.8%)       45.29      (2.7%)    0.9% (  -4% -    6%)
              AndHighLow      694.25      (4.8%)      703.66      (3.8%)    1.4% (  -6% -   10%)
               OrHighLow       91.20      (3.4%)       92.54      (3.7%)    1.5% (  -5% -    8%)
            OrHighNotLow      105.90      (3.0%)      107.79      (4.4%)    1.8% (  -5% -    9%)
                  Fuzzy1       79.92     (12.3%)       81.61     (12.1%)    2.1% ( -19% -   30%)
              OrHighHigh       29.18      (7.2%)       29.83      (7.3%)    2.2% ( -11% -   18%)
               OrHighMed       19.44      (7.2%)       19.89      (7.3%)    2.3% ( -11% -   18%)
           TermTitleSort       81.70      (5.6%)       83.67      (5.8%)    2.4% (  -8% -   14%)
                 LowTerm      682.24      (4.5%)      704.58      (4.1%)    3.3% (  -5% -   12%)
              TermDTSort      103.25      (5.7%)      106.77      (4.0%)    3.4% (  -5% -   13%)
                 MedTerm      249.00      (2.5%)      260.56      (3.2%)    4.6% (  -1% -   10%)
                HighTerm      103.70      (3.2%)      109.27      (3.6%)    5.4% (  -1% -   12%)
{noformat}

Note that the patch has barely any tests, so it's really just for playing. :) We'd also still need to define the semantics of this method.

> Faster search APIs for doc values
> ---------------------------------
>
>                 Key: LUCENE-7462
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7462
>             Project: Lucene - Core
>          Issue Type: Improvement
>    Affects Versions: master (7.0)
>            Reporter: Adrien Grand
>            Priority: Minor
>         Attachments: LUCENE-7462-advanceExact.patch
>
>
> While the iterator API helps deal with sparse doc values more efficiently, it also makes search-time operations more costly. For instance, the old random-access API made it possible to compute facets on a given segment without any conditionals, by just incrementing the counter at index {{ordinal+1}}, while the new API requires advancing the iterator if necessary and then checking whether it is positioned exactly on the right document.
> Since it is very common for fields to exist across most documents, I suspect codecs will keep an internal structure that is similar to the current codec in the dense case, by having a dense representation of the data and just making the iterator skip over the minority of documents that do not have a value.
> I suggest that we add APIs that make things cheaper at search time. For instance in the case of SORTED doc values, it could look like {{LegacySortedDocValues}} with the additional restriction that documents can only be consumed in order. Codecs that can implement this API efficiently would hide it behind a {{SortedDocValues}} adapter, and then at search time facets and comparators (which liked the {{LegacySortedDocValues}} API better) would either unwrap or hide the SortedDocValues they got behind a more random-access API (which would only happen in the truly sparse case if the codec optimizes the dense case).
> One challenge is that we already use the same idea for hiding single-valued impls behind multi-valued impls, so we would need to enforce the order in which the wrapping needs to happen. At first sight, it seems that it would be best to do the single-value-behind-multi-value-API wrapping above the random-access-behind-iterator-API wrapping. The complexity of wrapping/unwrapping in the right order could be contained in the {{DocValues}} helper class.
> I think this change would also simplify search-time consumption of doc values, which currently needs to spend several lines of code positioning the iterator every time it needs to do something interesting with doc values.
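
To make the facet-counting comparison in the description concrete, here is a rough, self-contained sketch. It uses plain arrays rather than Lucene classes, and FacetCountSketch, countRandomAccess and countIterator are hypothetical names. It assumes a missing value is reported as ordinal -1 in the random-access case, which is what makes the {{ordinal+1}} trick branch-free.

```java
/** Toy comparison of the two access patterns for per-segment facet counting (not Lucene code). */
final class FacetCountSketch {

    /**
     * Random-access style: ordByDoc[doc] is the ordinal of the doc's value,
     * or -1 when the doc has no value (assumption for this sketch).
     */
    static int[] countRandomAccess(int[] ordByDoc, int[] hits, int valueCount) {
        int[] counts = new int[valueCount + 1]; // slot 0 collects docs without a value
        for (int doc : hits) {
            counts[ordByDoc[doc] + 1]++; // no conditional needed
        }
        return counts;
    }

    /**
     * Iterator style: docsWithValue is sorted and ords[i] belongs to docsWithValue[i].
     * Hits must be visited in increasing doc ID order.
     */
    static int[] countIterator(int[] docsWithValue, int[] ords, int[] hits, int valueCount) {
        int[] counts = new int[valueCount + 1];
        int i = 0;
        for (int doc : hits) {
            // Advance the iterator if it is behind the hit.
            while (i < docsWithValue.length && docsWithValue[i] < doc) {
                i++;
            }
            // Then check whether it landed exactly on the hit.
            if (i < docsWithValue.length && docsWithValue[i] == doc) {
                counts[ords[i] + 1]++;
            } else {
                counts[0]++;
            }
        }
        return counts;
    }
}
```

Both methods produce the same counts for the same data. The difference is the extra positioning and comparison work per hit in the iterator variant.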
