Posted to dev@lucene.apache.org by Michael Sokolov <ms...@gmail.com> on 2021/09/24 10:58:24 UTC

Re: [jira] [Commented] (LUCENE-10062) Explore using SORTED_NUMERIC doc values to encode taxonomy ordinals for faceting

Hard to read on the phone, but is that a 482% speed up I saw??!

On Thu, Sep 23, 2021, 1:28 PM Greg Miller (Jira) <ji...@apache.org> wrote:

>
>     [
> https://issues.apache.org/jira/browse/LUCENE-10062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17419349#comment-17419349
> ]
>
> Greg Miller commented on LUCENE-10062:
> --------------------------------------
>
> I re-ran the {{luceneutil}} benchmarks ({{wikimedium10m}}) now that
> [~mikemccand] has added new faceting tasks (thanks Mike!). This change
> shows a nice improvement on the new faceting tasks as well, with no
> regressions anywhere else that I can see.
>
> I was waiting to iterate on my PR until I was able to run these new
> benchmarking tasks, but it seems like there's enough benefit to this change
> to pick it back up.
>
>
> {noformat}
>                             Task  QPS baseline    StdDev  QPS candidate    StdDev                  Pct diff p-value
>            HighTermDayOfYearSort         70.02   (13.7%)          68.45    (9.7%)     -2.2% ( -22% -   24%) 0.551
>                          MedTerm       1300.90    (5.5%)        1275.97    (6.7%)     -1.9% ( -13% -   10%) 0.324
>                         HighTerm       1953.46    (5.8%)        1925.79    (7.9%)     -1.4% ( -14% -   13%) 0.518
>             HighTermTitleBDVSort        122.35   (15.6%)         120.86   (14.9%)     -1.2% ( -27% -   34%) 0.801
>                       TermDTSort        133.47    (8.7%)         131.86    (7.4%)     -1.2% ( -15% -   16%) 0.637
>                          LowTerm       1636.13    (5.5%)        1622.34    (7.4%)     -0.8% ( -12% -   12%) 0.682
>                          Prefix3         25.69    (6.0%)          25.48    (6.3%)     -0.8% ( -12% -   12%) 0.676
>                      LowSpanNear        118.02    (2.1%)         117.31    (1.8%)     -0.6% (  -4% -    3%) 0.326
>                HighTermMonthSort        140.17    (9.8%)         139.47    (9.9%)     -0.5% ( -18% -   21%) 0.872
>                      AndHighHigh         49.17    (3.1%)          48.92    (2.7%)     -0.5% (  -6% -    5%) 0.584
>                     HighSpanNear         25.54    (2.7%)          25.41    (2.2%)     -0.5% (  -5% -    4%) 0.529
>                       AndHighLow        556.68    (5.8%)         554.80    (5.4%)     -0.3% ( -10% -   11%) 0.848
>        BrowseDayOfYearSSDVFacets         16.53    (2.5%)          16.47    (2.4%)     -0.3% (  -5% -    4%) 0.674
>                           IntNRQ         87.76    (2.0%)          87.49    (2.1%)     -0.3% (  -4% -    3%) 0.634
>                      MedSpanNear         31.11    (2.2%)          31.04    (1.6%)     -0.2% (  -3% -    3%) 0.714
>                     OrNotHighLow        765.10    (4.5%)         763.60    (5.4%)     -0.2% (  -9% -   10%) 0.901
>                        MedPhrase        160.05    (3.1%)         159.83    (2.9%)     -0.1% (  -5% -    6%) 0.885
>                 HighSloppyPhrase         27.67    (3.1%)          27.64    (3.0%)     -0.1% (  -6% -    6%) 0.915
>                        LowPhrase         61.12    (3.2%)          61.05    (3.2%)     -0.1% (  -6% -    6%) 0.921
>                        OrHighMed         71.85    (2.9%)          71.82    (2.1%)     -0.0% (  -4% -    5%) 0.963
>                       HighPhrase         29.40    (2.3%)          29.39    (2.8%)     -0.0% (  -5% -    5%) 0.971
>                           Fuzzy2         32.58    (4.3%)          32.57    (6.1%)     -0.0% (  -9% -   10%) 0.992
>              LowIntervalsOrdered        150.30    (1.9%)         150.28    (1.9%)     -0.0% (  -3% -    3%) 0.986
>                       AndHighMed        151.32    (3.9%)         151.31    (4.1%)     -0.0% (  -7% -    8%) 0.993
>                       OrHighHigh         23.90    (2.3%)          23.91    (1.9%)      0.0% (  -4% -    4%) 0.970
>                     OrHighNotLow        579.17    (5.1%)         579.35    (6.4%)      0.0% ( -10% -   12%) 0.986
>              MedIntervalsOrdered         86.93    (1.7%)          86.98    (1.9%)      0.1% (  -3% -    3%) 0.913
>                    OrHighNotHigh        536.17    (5.6%)         536.57    (6.6%)      0.1% ( -11% -   12%) 0.969
>                    OrNotHighHigh        787.07    (6.5%)         787.96    (8.1%)      0.1% ( -13% -   15%) 0.961
>                     OrNotHighMed        687.97    (4.7%)         688.77    (6.9%)      0.1% ( -10% -   12%) 0.950
>                  MedSloppyPhrase         68.62    (2.8%)          68.74    (2.7%)      0.2% (  -5% -    5%) 0.838
>                  LowSloppyPhrase        130.37    (2.6%)         130.62    (2.2%)      0.2% (  -4% -    5%) 0.797
>                        OrHighLow        440.44    (4.1%)         441.33    (4.1%)      0.2% (  -7% -    8%) 0.877
>                         Wildcard        122.01    (5.2%)         122.35    (5.3%)      0.3% (  -9% -   11%) 0.867
>             HighIntervalsOrdered         14.24    (2.2%)          14.34    (2.1%)      0.6% (  -3% -    5%) 0.350
>                          Respell         52.04    (2.2%)          52.48    (2.0%)      0.8% (  -3% -    5%) 0.209
>                     OrHighNotMed        674.76    (4.8%)         680.97    (8.0%)      0.9% ( -11% -   14%) 0.659
>                         PKLookup        153.45    (4.3%)         155.13    (3.8%)      1.1% (  -6% -    9%) 0.394
>                           Fuzzy1         56.57    (9.1%)          57.76    (6.7%)      2.1% ( -12% -   19%) 0.406
>            BrowseMonthSSDVFacets         19.59   (10.4%)          20.03    (6.7%)      2.3% ( -13% -   21%) 0.413
>         AndHighHighDayTaxoFacets         19.22    (1.6%)          22.13    (2.2%)     15.1% (  11% -   19%) 0.000
>          AndHighMedDayTaxoFacets         25.62    (1.5%)          29.93    (2.2%)     16.8% (  12% -   20%) 0.000
>             MedTermDayTaxoFacets         12.96    (2.2%)          18.99    (3.4%)     46.5% (  39% -   53%) 0.000
>           OrHighMedDayTaxoFacets          3.97    (2.0%)           5.81    (4.3%)     46.5% (  39% -   53%) 0.000
>            BrowseMonthTaxoFacets          2.59   (10.9%)          11.16   (35.8%)    330.4% ( 255% -  423%) 0.000
>             BrowseDateTaxoFacets          2.44    (9.7%)          13.12   (51.8%)    438.1% ( 343% -  553%) 0.000
>        BrowseDayOfYearTaxoFacets          2.44    (9.7%)          13.13   (51.7%)    438.2% ( 343% -  552%) 0.000
> {noformat}
>
>
> > Explore using SORTED_NUMERIC doc values to encode taxonomy ordinals for faceting
> > --------------------------------------------------------------------------------
> >
> >                 Key: LUCENE-10062
> >                 URL: https://issues.apache.org/jira/browse/LUCENE-10062
> >             Project: Lucene - Core
> >          Issue Type: Improvement
> >          Components: modules/facet
> >            Reporter: Greg Miller
> >            Assignee: Greg Miller
> >            Priority: Minor
> >          Time Spent: 1h 40m
> >  Remaining Estimate: 0h
> >
> > We currently encode taxonomy ordinals using varint style packing in a
> > binary doc values field. I suspect there have been a number of
> > improvements to SortedNumericDocValues since taxonomy faceting was first
> > introduced, and I plan to explore replacing the custom binary format we
> > have today with a SORTED_NUMERIC type dv field instead.
> > I'll report benchmark results and index size impact here.
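
For readers unfamiliar with the "varint style packing" mentioned in the
issue description, here is a minimal, self-contained Java sketch of the
general idea: a document's ordinals are sorted, each one is stored as the
difference from the previous one, and each difference is written 7 bits at
a time. This is not Lucene's actual code. The class name VarintOrdinals,
the method names, and the delta-coding step are assumptions made for
illustration only.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

/** Illustrative only (hypothetical class): delta + varint packing of sorted ordinals. */
final class VarintOrdinals {

  /** Packs ascending ordinals as varint-encoded deltas, 7 payload bits per byte. */
  static byte[] encode(int[] sortedOrdinals) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    int previous = 0;
    for (int ordinal : sortedOrdinals) {
      int delta = ordinal - previous; // non-negative when the input is ascending
      while ((delta & ~0x7F) != 0) {
        out.write((delta & 0x7F) | 0x80); // high bit set: more bytes follow
        delta >>>= 7;
      }
      out.write(delta); // last byte of this value has the high bit clear
      previous = ordinal;
    }
    return out.toByteArray();
  }

  /** Reverses encode(): walks the bytes in order and rebuilds the ordinals. */
  static int[] decode(byte[] packed) {
    int[] ordinals = new int[packed.length]; // upper bound: one ordinal per byte
    int count = 0;
    int previous = 0;
    int pos = 0;
    while (pos < packed.length) {
      int delta = 0;
      int shift = 0;
      byte b;
      do {
        b = packed[pos++];
        delta |= (b & 0x7F) << shift;
        shift += 7;
      } while ((b & 0x80) != 0);
      previous += delta; // undo the delta coding
      ordinals[count++] = previous;
    }
    return Arrays.copyOf(ordinals, count);
  }
}
```

With the ordinals {3, 130, 131, 20000} the deltas are 3, 127, 1 and 19869,
which pack into 1 + 1 + 1 + 3 = 6 bytes. Note that a reader has to walk the
bytes sequentially to recover any ordinal; that per-document decoding is the
kind of custom work the issue proposes to replace with a SORTED_NUMERIC
field.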