Posted to issues@lucene.apache.org by "Adrien Grand (Jira)" <ji...@apache.org> on 2021/07/27 16:49:00 UTC
[jira] [Commented] (LUCENE-10033) Encode doc values in smaller
blocks of values, like postings
[ https://issues.apache.org/jira/browse/LUCENE-10033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17388178#comment-17388178 ]
Adrien Grand commented on LUCENE-10033:
---------------------------------------
I opened a PR with this idea. Queries that consume most values, like the Browse* faceting tasks, become faster, but queries that only consume a small subset of values, like some sorting tasks (not all, one of them is faster), become slower.
{noformat}
Task QPS baseline StdDev QPS patch StdDev Pct diff p-value
HighTermMonthSort 101.33 (9.7%) 51.93 (2.8%) -48.7% ( -55% - -40%) 0.000
TermDTSort 587.24 (6.1%) 404.20 (2.9%) -31.2% ( -37% - -23%) 0.000
IntNRQ 85.55 (14.7%) 73.16 (1.6%) -14.5% ( -26% - 2%) 0.000
OrHighNotMed 1301.37 (3.7%) 1218.64 (2.3%) -6.4% ( -11% - 0%) 0.000
OrNotHighHigh 1121.91 (4.1%) 1089.27 (2.7%) -2.9% ( -9% - 4%) 0.008
MedTerm 2156.71 (3.3%) 2103.32 (3.6%) -2.5% ( -9% - 4%) 0.022
Fuzzy2 67.41 (4.6%) 65.74 (4.9%) -2.5% ( -11% - 7%) 0.098
OrNotHighLow 1099.66 (3.7%) 1078.60 (3.0%) -1.9% ( -8% - 4%) 0.073
MedIntervalsOrdered 79.39 (3.0%) 77.94 (3.7%) -1.8% ( -8% - 5%) 0.088
MedPhrase 403.62 (2.8%) 397.19 (2.3%) -1.6% ( -6% - 3%) 0.050
OrHighMed 130.57 (3.0%) 128.64 (2.6%) -1.5% ( -6% - 4%) 0.099
LowIntervalsOrdered 20.82 (2.5%) 20.55 (3.4%) -1.3% ( -6% - 4%) 0.167
HighIntervalsOrdered 2.95 (5.1%) 2.91 (5.8%) -1.1% ( -11% - 10%) 0.530
OrHighLow 579.45 (2.9%) 574.45 (2.4%) -0.9% ( -5% - 4%) 0.306
LowSpanNear 33.20 (2.9%) 33.06 (3.5%) -0.4% ( -6% - 6%) 0.668
HighSpanNear 9.79 (3.5%) 9.79 (3.7%) -0.0% ( -7% - 7%) 0.996
Respell 221.47 (2.1%) 221.62 (2.8%) 0.1% ( -4% - 4%) 0.931
HighSloppyPhrase 36.64 (3.4%) 36.69 (4.0%) 0.1% ( -7% - 7%) 0.915
Wildcard 283.85 (6.5%) 285.06 (7.2%) 0.4% ( -12% - 15%) 0.845
LowSloppyPhrase 175.77 (4.3%) 176.56 (4.4%) 0.5% ( -7% - 9%) 0.740
AndHighHigh 64.34 (2.5%) 64.84 (3.4%) 0.8% ( -5% - 6%) 0.410
HighTerm 2146.56 (3.3%) 2164.26 (4.5%) 0.8% ( -6% - 8%) 0.505
HighTermTitleBDVSort 27.18 (4.6%) 27.41 (2.1%) 0.8% ( -5% - 7%) 0.461
OrHighNotLow 1261.38 (2.3%) 1274.89 (3.0%) 1.1% ( -4% - 6%) 0.210
MedSpanNear 26.96 (4.1%) 27.28 (3.5%) 1.2% ( -6% - 9%) 0.336
MedSloppyPhrase 102.18 (4.7%) 103.51 (5.1%) 1.3% ( -8% - 11%) 0.399
BrowseDateTaxoFacets 3.15 (4.0%) 3.19 (4.0%) 1.4% ( -6% - 9%) 0.281
BrowseDayOfYearTaxoFacets 3.15 (4.0%) 3.20 (4.0%) 1.5% ( -6% - 9%) 0.250
AndHighLow 1295.59 (3.3%) 1318.11 (3.4%) 1.7% ( -4% - 8%) 0.105
Prefix3 63.21 (15.4%) 64.49 (17.1%) 2.0% ( -26% - 40%) 0.694
OrHighHigh 35.41 (3.1%) 36.24 (3.1%) 2.4% ( -3% - 8%) 0.015
Fuzzy1 253.74 (6.1%) 260.89 (7.1%) 2.8% ( -9% - 16%) 0.175
BrowseMonthTaxoFacets 3.42 (7.7%) 3.52 (4.1%) 2.9% ( -8% - 15%) 0.135
AndHighMed 164.48 (2.6%) 169.43 (3.3%) 3.0% ( -2% - 9%) 0.001
LowTerm 2645.26 (4.9%) 2752.43 (5.6%) 4.1% ( -6% - 15%) 0.015
OrHighNotHigh 1286.12 (3.7%) 1349.66 (4.6%) 4.9% ( -3% - 13%) 0.000
HighPhrase 105.61 (3.7%) 111.65 (4.8%) 5.7% ( -2% - 14%) 0.000
LowPhrase 35.85 (2.6%) 38.76 (3.3%) 8.1% ( 2% - 14%) 0.000
OrNotHighMed 1241.35 (3.1%) 1368.49 (3.6%) 10.2% ( 3% - 17%) 0.000
HighTermDayOfYearSort 573.92 (9.5%) 687.19 (7.9%) 19.7% ( 2% - 40%) 0.000
BrowseMonthSSDVFacets 11.52 (5.1%) 17.81 (23.5%) 54.6% ( 24% - 87%) 0.000
BrowseDayOfYearSSDVFacets 11.24 (3.9%) 18.15 (23.1%) 61.4% ( 33% - 91%) 0.000
{noformat}
> Encode doc values in smaller blocks of values, like postings
> ------------------------------------------------------------
>
> Key: LUCENE-10033
> URL: https://issues.apache.org/jira/browse/LUCENE-10033
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Adrien Grand
> Priority: Minor
> Time Spent: 10m
> Remaining Estimate: 0h
>
> This is a follow-up to the discussion on this thread: https://lists.apache.org/thread.html/r7b757074d5f02874ce3a295b0007dff486bc10d08fb0b5e5a4ba72c5%40%3Cdev.lucene.apache.org%3E.
> Our current approach for doc values uses large blocks of 16k values where values can be decompressed independently, using DirectWriter/DirectReader. This is a bit inefficient in some cases, e.g. a single outlier can grow the number of bits per value for the entire block, we can't easily use run-length compression, etc. Plus, it encourages using a different sub-class for every compression technique, which puts pressure on the JVM.
> We'd like to move to an approach that would be more similar to postings with smaller blocks (e.g. 128 values) whose values get all decompressed at once (using SIMD instructions), with skip data within blocks in order to efficiently skip to arbitrary doc IDs (or maybe still use jump tables as today's doc values, and as discussed here for postings: https://lists.apache.org/thread.html/r7c3cb7ab143fd4ecbc05c04064d10ef9fb50c5b4d6479b0f35732677%40%3Cdev.lucene.apache.org%3E).
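To make the outlier argument in the description concrete, here is a minimal sketch that compares the payload size of one 16k-value block against 128-value blocks when a single value is much larger than the rest. It is not Lucene code: the class `BlockBitsSketch` and its methods are hypothetical, the encoding is assumed to be plain bit-packing of deltas from the block minimum, and per-block headers and skip data are ignored.

```java
public class BlockBitsSketch {

    /**
     * Bits needed to store each value in [from, to) as an unsigned delta
     * from the block minimum. Assumption: max - min does not overflow a long.
     */
    static int bitsPerValue(long[] values, int from, int to) {
        long min = Long.MAX_VALUE;
        long max = Long.MIN_VALUE;
        for (int i = from; i < to; i++) {
            min = Math.min(min, values[i]);
            max = Math.max(max, values[i]);
        }
        long range = max - min;
        return range == 0 ? 0 : 64 - Long.numberOfLeadingZeros(range);
    }

    /** Total payload bits when values are bit-packed in blocks of blockSize. */
    static long totalBits(long[] values, int blockSize) {
        long total = 0;
        for (int start = 0; start < values.length; start += blockSize) {
            int end = Math.min(values.length, start + blockSize);
            total += (long) bitsPerValue(values, start, end) * (end - start);
        }
        return total;
    }

    public static void main(String[] args) {
        long[] values = new long[16384];
        for (int i = 0; i < values.length; i++) {
            values[i] = i % 16; // small values that fit in 4 bits
        }
        values[5000] = 1L << 30; // a single outlier that needs 31 bits

        System.out.println(totalBits(values, 16384)); // prints 507904
        System.out.println(totalBits(values, 128));   // prints 68992
    }
}
```

With one large block, the outlier forces all 16384 values to 31 bits each. With 128-value blocks, only the block containing the outlier pays that cost and the other 127 blocks stay at 4 bits per value. A real format would also spend some bytes per block on metadata, which this sketch does not count.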