Posted to issues@lucene.apache.org by "Haoyu Zhai (Jira)" <ji...@apache.org> on 2021/11/03 22:14:00 UTC

[jira] [Commented] (LUCENE-10122) Explore using NumericDocValue to store taxonomy parent array

    [ https://issues.apache.org/jira/browse/LUCENE-10122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17438317#comment-17438317 ] 

Haoyu Zhai commented on LUCENE-10122:
-------------------------------------

The luceneutil benchmark shows a mostly neutral result (every difference is within about ±1.2%, and no p-value is below 0.35):
{code:java}
                    Task    QPS base      StdDev    QPS cand      StdDev                Pct diff p-value
                  Fuzzy2       58.39      (5.6%)       57.70      (6.1%)   -1.2% ( -12% -   11%) 0.518
    BrowseDateTaxoFacets        2.40      (6.6%)        2.38      (5.8%)   -0.7% ( -12% -   12%) 0.709
BrowseDayOfYearTaxoFacets        2.40      (6.5%)        2.38      (5.8%)   -0.7% ( -12% -   12%) 0.721
   BrowseMonthTaxoFacets        2.49      (6.8%)        2.47      (6.1%)   -0.7% ( -12% -   13%) 0.738
   BrowseMonthSSDVFacets       16.44     (36.1%)       16.38     (35.1%)   -0.4% ( -52% -  110%) 0.974
     LowIntervalsOrdered       30.70      (2.8%)       30.61      (3.0%)   -0.3% (  -5% -    5%) 0.763
               LowPhrase      516.96      (1.7%)      515.67      (1.6%)   -0.3% (  -3% -    3%) 0.626
           OrNotHighHigh      580.07      (2.1%)      578.61      (2.8%)   -0.3% (  -5% -    4%) 0.747
BrowseDayOfYearSSDVFacets       15.22     (24.2%)       15.19     (24.2%)   -0.2% ( -39% -   63%) 0.976
   HighTermDayOfYearSort      766.98      (1.7%)      765.20      (1.7%)   -0.2% (  -3% -    3%) 0.665
    HighIntervalsOrdered        2.46      (2.0%)        2.45      (2.3%)   -0.2% (  -4% -    4%) 0.795
     MedIntervalsOrdered       27.55      (2.8%)       27.51      (2.8%)   -0.1% (  -5% -    5%) 0.878
                  IntNRQ       28.96      (0.3%)       28.92      (0.6%)   -0.1% (   0% -    0%) 0.358
              OrHighHigh       36.05      (2.2%)       36.02      (1.7%)   -0.1% (  -3% -    3%) 0.870
               MedPhrase      119.18      (1.7%)      119.08      (2.0%)   -0.1% (  -3% -    3%) 0.884
             MedSpanNear       99.96      (1.1%)       99.88      (1.2%)   -0.1% (  -2% -    2%) 0.818
                 MedTerm     1211.34      (2.4%)     1210.46      (2.2%)   -0.1% (  -4% -    4%) 0.919
                 Respell       42.08      (1.9%)       42.06      (2.3%)   -0.1% (  -4% -    4%) 0.931
            OrNotHighLow      608.56      (2.1%)      608.41      (2.4%)   -0.0% (  -4% -    4%) 0.971
            HighSpanNear       38.01      (2.2%)       38.01      (2.9%)   -0.0% (  -5% -    5%) 0.994
             LowSpanNear       94.41      (1.5%)       94.42      (2.1%)    0.0% (  -3% -    3%) 0.975
               OrHighLow      228.92      (2.4%)      228.98      (1.6%)    0.0% (  -3% -    4%) 0.971
               OrHighMed       76.23      (2.3%)       76.26      (2.2%)    0.0% (  -4% -    4%) 0.951
    HighTermTitleBDVSort       19.07      (2.6%)       19.08      (2.5%)    0.0% (  -4% -    5%) 0.952
              TermDTSort      312.90      (2.0%)      313.18      (2.5%)    0.1% (  -4% -    4%) 0.901
                PKLookup      153.21      (2.6%)      153.35      (2.5%)    0.1% (  -4% -    5%) 0.910
            OrHighNotMed      798.03      (2.0%)      798.83      (2.3%)    0.1% (  -4% -    4%) 0.883
       HighTermMonthSort      103.99      (9.9%)      104.10      (9.7%)    0.1% ( -17% -   21%) 0.971
                Wildcard      107.61      (2.1%)      107.74      (2.4%)    0.1% (  -4% -    4%) 0.859
                 Prefix3       82.74     (12.0%)       82.84     (12.1%)    0.1% ( -21% -   27%) 0.973
              HighPhrase       67.96      (2.0%)       68.07      (2.0%)    0.2% (  -3% -    4%) 0.792
                HighTerm     1058.76      (1.8%)     1060.59      (2.7%)    0.2% (  -4% -    4%) 0.812
           OrHighNotHigh      528.01      (1.8%)      529.17      (2.5%)    0.2% (  -4% -    4%) 0.751
                  Fuzzy1       42.70      (3.0%)       42.80      (3.3%)    0.2% (  -5% -    6%) 0.814
            OrNotHighMed      613.17      (2.6%)      614.97      (2.6%)    0.3% (  -4% -    5%) 0.722
         MedSloppyPhrase       15.29      (1.8%)       15.34      (2.2%)    0.3% (  -3% -    4%) 0.601
            OrHighNotLow      590.46      (2.5%)      592.57      (2.9%)    0.4% (  -4% -    5%) 0.677
              AndHighLow      518.23      (2.5%)      520.65      (2.9%)    0.5% (  -4% -    6%) 0.585
                 LowTerm     1137.40      (2.9%)     1143.47      (2.8%)    0.5% (  -5% -    6%) 0.556
        HighSloppyPhrase       10.76      (3.2%)       10.82      (3.6%)    0.6% (  -6% -    7%) 0.602
         LowSloppyPhrase      152.21      (2.1%)      153.24      (2.4%)    0.7% (  -3% -    5%) 0.350
              AndHighMed      170.44      (2.5%)      171.76      (3.6%)    0.8% (  -5% -    7%) 0.426
             AndHighHigh       64.45      (3.2%)       65.07      (4.4%)    1.0% (  -6% -    8%) 0.424
{code}
The size of the taxonomy index also does not change in this benchmark.
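
As a reading aid for the table above, here is a minimal sketch of how the "Pct diff" column relates to the two QPS columns, and how a p-value can be compared against a significance threshold. The class name, the method names, and the 0.05 threshold are assumptions made for illustration; they are not taken from luceneutil. Because the printed QPS values are rounded, recomputed percentages can differ slightly from the table.

```java
import java.util.Locale;

public class BenchmarkDiff {
    /** Relative change of the candidate QPS against the baseline, in percent. */
    public static double pctDiff(double qpsBase, double qpsCand) {
        return (qpsCand - qpsBase) / qpsBase * 100.0;
    }

    /** True when the p-value falls below the chosen significance threshold. */
    public static boolean isSignificant(double pValue, double alpha) {
        return pValue < alpha;
    }

    public static void main(String[] args) {
        // Three rows copied from the table: QPS base, QPS cand, p-value.
        String[] tasks = {"Fuzzy2", "BrowseDateTaxoFacets", "AndHighHigh"};
        double[][] rows = {
            {58.39, 57.70, 0.518},
            {2.40, 2.38, 0.709},
            {64.45, 65.07, 0.424},
        };
        double alpha = 0.05; // assumed threshold, not stated in the comment

        for (int i = 0; i < tasks.length; i++) {
            double diff = pctDiff(rows[i][0], rows[i][1]);
            boolean significant = isSignificant(rows[i][2], alpha);
            System.out.printf(Locale.ROOT, "%-22s %+.2f%% significant=%b%n",
                tasks[i], diff, significant);
        }
    }
}
```

With the assumed 0.05 threshold, none of these rows counts as a significant change, which is consistent with calling the result neutral.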

I've also run the internal benchmark we use at Amazon; it shows a 10% larger taxonomy index.

The array is not decoded very often: we load the parent array into memory once and ideally never decode the same category again. So I think decoding speed matters less here than index size, and we probably should not merge the change?
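
To illustrate why the decoding cost is paid only once, here is a minimal sketch that models the parent array as a plain `int[]` rather than using Lucene's taxonomy classes. The class name, the example taxonomy, and the helper method are hypothetical; the choice of ordinal 0 as the root and -1 as the "no parent" sentinel is an assumption intended to mirror the taxonomy's conventions.

```java
import java.util.ArrayList;
import java.util.List;

public class ParentArraySketch {
    static final int ROOT_ORDINAL = 0;     // assumed root ordinal
    static final int INVALID_ORDINAL = -1; // assumed "no parent" sentinel

    /** Returns the ordinals from the given category up to the root, inclusive. */
    public static List<Integer> pathToRoot(int[] parents, int ordinal) {
        List<Integer> path = new ArrayList<>();
        // Each step is a plain array read; nothing is decoded from the index here.
        for (int o = ordinal; o != INVALID_ORDINAL; o = parents[o]) {
            path.add(o);
        }
        return path;
    }

    public static void main(String[] args) {
        // Hypothetical taxonomy:
        // 0 = root, 1 = Date, 2 = Date/2021, 3 = Date/2021/11, 4 = Author
        // parents[i] holds the ordinal of the parent of category i.
        int[] parents = {INVALID_ORDINAL, ROOT_ORDINAL, 1, 2, ROOT_ORDINAL};

        System.out.println(pathToRoot(parents, 3)); // prints [3, 2, 1, 0]
    }
}
```

Once the array is in memory, lookups do not touch the on-disk encoding, so the storage format mainly affects the one-time load and the index size.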

 

> Explore using NumericDocValue to store taxonomy parent array
> ------------------------------------------------------------
>
>                 Key: LUCENE-10122
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10122
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/facet
>    Affects Versions: main (10.0)
>            Reporter: Haoyu Zhai
>            Priority: Minor
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
> We currently use the term position of a hardcoded term in a hardcoded field to represent the parent ordinal of each taxonomy label. This is an old approach that probably dates back to a time before doc values existed.
> We would probably want to use NumericDocValues instead, given how much effort has gone into optimizing them.


