Posted to issues@lucene.apache.org by "Haoyu Zhai (Jira)" <ji...@apache.org> on 2021/11/03 22:14:00 UTC
[jira] [Commented] (LUCENE-10122) Explore using NumericDocValue to store taxonomy parent array
[ https://issues.apache.org/jira/browse/LUCENE-10122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17438317#comment-17438317 ]
Haoyu Zhai commented on LUCENE-10122:
-------------------------------------
The luceneutil benchmark shows a mostly neutral result:
{code:java}
Task                       QPS base   StdDev  QPS cand   StdDev              Pct diff  p-value
Fuzzy2                        58.39   (5.6%)     57.70   (6.1%)  -1.2% ( -12% -  11%)  0.518
BrowseDateTaxoFacets           2.40   (6.6%)      2.38   (5.8%)  -0.7% ( -12% -  12%)  0.709
BrowseDayOfYearTaxoFacets      2.40   (6.5%)      2.38   (5.8%)  -0.7% ( -12% -  12%)  0.721
BrowseMonthTaxoFacets          2.49   (6.8%)      2.47   (6.1%)  -0.7% ( -12% -  13%)  0.738
BrowseMonthSSDVFacets         16.44  (36.1%)     16.38  (35.1%)  -0.4% ( -52% - 110%)  0.974
LowIntervalsOrdered           30.70   (2.8%)     30.61   (3.0%)  -0.3% (  -5% -   5%)  0.763
LowPhrase                    516.96   (1.7%)    515.67   (1.6%)  -0.3% (  -3% -   3%)  0.626
OrNotHighHigh                580.07   (2.1%)    578.61   (2.8%)  -0.3% (  -5% -   4%)  0.747
BrowseDayOfYearSSDVFacets     15.22  (24.2%)     15.19  (24.2%)  -0.2% ( -39% -  63%)  0.976
HighTermDayOfYearSort        766.98   (1.7%)    765.20   (1.7%)  -0.2% (  -3% -   3%)  0.665
HighIntervalsOrdered           2.46   (2.0%)      2.45   (2.3%)  -0.2% (  -4% -   4%)  0.795
MedIntervalsOrdered           27.55   (2.8%)     27.51   (2.8%)  -0.1% (  -5% -   5%)  0.878
IntNRQ                        28.96   (0.3%)     28.92   (0.6%)  -0.1% (   0% -   0%)  0.358
OrHighHigh                    36.05   (2.2%)     36.02   (1.7%)  -0.1% (  -3% -   3%)  0.870
MedPhrase                    119.18   (1.7%)    119.08   (2.0%)  -0.1% (  -3% -   3%)  0.884
MedSpanNear                   99.96   (1.1%)     99.88   (1.2%)  -0.1% (  -2% -   2%)  0.818
MedTerm                     1211.34   (2.4%)   1210.46   (2.2%)  -0.1% (  -4% -   4%)  0.919
Respell                       42.08   (1.9%)     42.06   (2.3%)  -0.1% (  -4% -   4%)  0.931
OrNotHighLow                 608.56   (2.1%)    608.41   (2.4%)  -0.0% (  -4% -   4%)  0.971
HighSpanNear                  38.01   (2.2%)     38.01   (2.9%)  -0.0% (  -5% -   5%)  0.994
LowSpanNear                   94.41   (1.5%)     94.42   (2.1%)   0.0% (  -3% -   3%)  0.975
OrHighLow                    228.92   (2.4%)    228.98   (1.6%)   0.0% (  -3% -   4%)  0.971
OrHighMed                     76.23   (2.3%)     76.26   (2.2%)   0.0% (  -4% -   4%)  0.951
HighTermTitleBDVSort          19.07   (2.6%)     19.08   (2.5%)   0.0% (  -4% -   5%)  0.952
TermDTSort                   312.90   (2.0%)    313.18   (2.5%)   0.1% (  -4% -   4%)  0.901
PKLookup                     153.21   (2.6%)    153.35   (2.5%)   0.1% (  -4% -   5%)  0.910
OrHighNotMed                 798.03   (2.0%)    798.83   (2.3%)   0.1% (  -4% -   4%)  0.883
HighTermMonthSort            103.99   (9.9%)    104.10   (9.7%)   0.1% ( -17% -  21%)  0.971
Wildcard                     107.61   (2.1%)    107.74   (2.4%)   0.1% (  -4% -   4%)  0.859
Prefix3                       82.74  (12.0%)     82.84  (12.1%)   0.1% ( -21% -  27%)  0.973
HighPhrase                    67.96   (2.0%)     68.07   (2.0%)   0.2% (  -3% -   4%)  0.792
HighTerm                    1058.76   (1.8%)   1060.59   (2.7%)   0.2% (  -4% -   4%)  0.812
OrHighNotHigh                528.01   (1.8%)    529.17   (2.5%)   0.2% (  -4% -   4%)  0.751
Fuzzy1                        42.70   (3.0%)     42.80   (3.3%)   0.2% (  -5% -   6%)  0.814
OrNotHighMed                 613.17   (2.6%)    614.97   (2.6%)   0.3% (  -4% -   5%)  0.722
MedSloppyPhrase               15.29   (1.8%)     15.34   (2.2%)   0.3% (  -3% -   4%)  0.601
OrHighNotLow                 590.46   (2.5%)    592.57   (2.9%)   0.4% (  -4% -   5%)  0.677
AndHighLow                   518.23   (2.5%)    520.65   (2.9%)   0.5% (  -4% -   6%)  0.585
LowTerm                     1137.40   (2.9%)   1143.47   (2.8%)   0.5% (  -5% -   6%)  0.556
HighSloppyPhrase              10.76   (3.2%)     10.82   (3.6%)   0.6% (  -6% -   7%)  0.602
LowSloppyPhrase              152.21   (2.1%)    153.24   (2.4%)   0.7% (  -3% -   5%)  0.350
AndHighMed                   170.44   (2.5%)    171.76   (3.6%)   0.8% (  -5% -   7%)  0.426
AndHighHigh                   64.45   (3.2%)     65.07   (4.4%)   1.0% (  -6% -   8%)  0.424
{code}
The size of the taxonomy index does not change in this benchmark.
I've also run the internal benchmark we use at Amazon; it shows a 10% larger taxonomy index.
Decoding the parent array happens infrequently: we load the array into memory and ideally never load the same category again. So I think speed matters less than size here, and we probably should not merge the change?
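To illustrate why decode speed matters little once the parent array is in memory, here is a minimal sketch. It is not Lucene code: the class name, the NO_PARENT constant and the sample taxonomy are all made up for illustration. It assumes the array maps each category ordinal to its parent's ordinal, with the root having no parent. Under that assumption, resolving a category's ancestors is plain array indexing, so the on-disk encoding only matters while the array is being loaded.

```java
import java.util.ArrayList;
import java.util.List;

public class ParentArraySketch {
    // Hypothetical "no parent" marker for the root; an assumption for this sketch.
    static final int NO_PARENT = -1;

    /** Returns the ordinals on the path from {@code ordinal} up to the root, inclusive. */
    static List<Integer> pathToRoot(int[] parents, int ordinal) {
        List<Integer> path = new ArrayList<>();
        // Each step is a single in-memory array read; nothing is decoded from the index.
        for (int ord = ordinal; ord != NO_PARENT; ord = parents[ord]) {
            path.add(ord);
        }
        return path;
    }

    public static void main(String[] args) {
        // Made-up taxonomy: 0 = root, 1 = "Date", 2 = "Date/2021",
        // 3 = "Date/2021/11", 4 = "Author".
        int[] parents = {NO_PARENT, 0, 1, 2, 0};
        System.out.println(pathToRoot(parents, 3)); // [3, 2, 1, 0]
        System.out.println(pathToRoot(parents, 4)); // [4, 0]
    }
}
```

Whichever encoding is used on disk, it only has to produce this int array, which is why the size of the stored data may weigh more than the speed of decoding it.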
> Explore using NumericDocValue to store taxonomy parent array
> ------------------------------------------------------------
>
> Key: LUCENE-10122
> URL: https://issues.apache.org/jira/browse/LUCENE-10122
> Project: Lucene - Core
> Issue Type: Improvement
> Components: modules/facet
> Affects Versions: main (10.0)
> Reporter: Haoyu Zhai
> Priority: Minor
> Time Spent: 40m
> Remaining Estimate: 0h
>
> We currently use the term position of a hardcoded term in a hardcoded field to represent the parent ordinal of each taxonomy label. This is an old approach that probably dates back to the time before doc values existed.
> We should probably use NumericDocValues instead, given how much effort has gone into optimizing them.