You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by "Greg Miller (Jira)" <ji...@apache.org> on 2021/03/25 22:57:00 UTC
[jira] [Commented] (LUCENE-9877) Explore increasing the allowable
exceptions in PForUtil
[ https://issues.apache.org/jira/browse/LUCENE-9877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17309021#comment-17309021 ]
Greg Miller commented on LUCENE-9877:
-------------------------------------
Here are my results from comparing 7 allowable exceptions in PFOR to the baseline 3:
Similar to my approach in [LUCENE-9850|https://issues.apache.org/jira/browse/LUCENE-9850], I threw together a tool that visualizes the bits-per-value used across all PFOR encoded blocks. I also added in a visualization of the exception count used across all PFOR blocks. Here's what I found by building the Wikipedia EN index with the "wikimedium10m" source in luceneutil (all zero-valued histogram entries were truncated for readability):
{code:java}
BASELINE (3 exceptions)
PFOR BPV Histogram
0 *** [4.02 %] (239453 of 5959788)
1 ** [2.78 %] (165409 of 5959788)
2 ************************** [51.98 %] (3097737 of 5959788)
3 ***************** [32.61 %] (1943685 of 5959788)
4 **** [7.06 %] (420552 of 5959788)
5 * [1.49 %] (88700 of 5959788)
6 * [0.07 %] (4097 of 5959788)
7 * [0.00 %] (153 of 5959788)
8 * [0.00 %] (2 of 5959788)
Total bytes used: 234757569
PFOR Exceptions Used Histogram
0 ************************** [51.83 %] (3088921 of 5959788)
1 ************ [22.78 %] (1357732 of 5959788)
2 ******** [14.63 %] (872061 of 5959788)
3 ****** [10.76 %] (641074 of 5959788)
CANDIDATE (7 exceptions)
PFOR BPV Histogram
0 *** [4.02 %] (239583 of 5959788)
1 ***** [8.75 %] (521487 of 5959788)
2 ******************************** [62.40 %] (3718819 of 5959788)
3 *********** [20.26 %] (1207623 of 5959788)
4 ** [3.71 %] (221391 of 5959788)
5 * [0.82 %] (48939 of 5959788)
6 * [0.03 %] (1868 of 5959788)
7 * [0.00 %] (76 of 5959788)
8 * [0.00 %] (2 of 5959788)
Total bytes used: 216891765
PFOR Exceptions Used Histogram
0 **************** [30.93 %] (1843145 of 5959788)
1 ********** [18.46 %] (1100315 of 5959788)
2 ******* [13.38 %] (797217 of 5959788)
3 ****** [10.34 %] (615991 of 5959788)
4 ***** [8.36 %] (498042 of 5959788)
5 **** [7.03 %] (419005 of 5959788)
6 **** [6.11 %] (364388 of 5959788)
7 *** [5.40 %] (321685 of 5959788)
{code}
So we can observe the additional exceptions getting utilized and the bits-per-value generally shifting lower. This seems to amount to a ~7.6% decrease in space used for all of the PFOR blocks in aggregate.
Looking at the overall index size though, I only see a ~0.5% decrease in the index size. :
{code:java}
3816536 wikimedium10m.baseline.facets.taxonomy:Date.taxonomy:Month.taxonomy:DayOfYear.sortedset:Month.sortedset:DayOfYear.Lucene90.Lucene90.nd10M
3795532 wikimedium10m.candidate.facets.taxonomy:Date.taxonomy:Month.taxonomy:DayOfYear.sortedset:Month.sortedset:DayOfYear.Lucene90.Lucene90.nd10M
{code}
I suspect the PFOR data in this index (frequencies, positions, offsets and payloads) just aren't a big enough portion of the overall index to move the needle much in overall size. If my tool is accurate, it looks like only ~0.2 GB of the ~3.8 GB index is PFOR data.
As for performance implications, running a luceneutil benchmark (using "-source wikimedium10m") shows the following:
{code:java}
TaskQPS baseline StdDevQPS pfor 7 exceptions StdDev Pct diff p-value
HighTermDayOfYearSort 139.37 (10.8%) 135.30 (10.2%) -2.9% ( -21% - 20%) 0.379
HighTermMonthSort 134.56 (13.4%) 130.65 (11.9%) -2.9% ( -24% - 25%) 0.470
TermDTSort 100.66 (12.9%) 97.96 (13.5%) -2.7% ( -25% - 27%) 0.521
HighTermTitleBDVSort 130.20 (9.6%) 127.06 (8.4%) -2.4% ( -18% - 17%) 0.398
MedSloppyPhrase 45.83 (3.7%) 44.82 (3.8%) -2.2% ( -9% - 5%) 0.063
OrHighNotHigh 505.30 (3.7%) 498.15 (4.6%) -1.4% ( -9% - 7%) 0.288
OrHighLow 413.80 (5.0%) 408.04 (4.9%) -1.4% ( -10% - 8%) 0.371
AndHighLow 575.78 (3.5%) 568.21 (3.6%) -1.3% ( -8% - 5%) 0.241
MedPhrase 19.20 (2.5%) 18.98 (2.4%) -1.1% ( -5% - 3%) 0.138
OrHighNotLow 504.91 (4.7%) 499.19 (4.2%) -1.1% ( -9% - 8%) 0.423
LowSloppyPhrase 44.42 (4.0%) 44.01 (3.1%) -0.9% ( -7% - 6%) 0.413
OrNotHighLow 789.01 (3.7%) 781.98 (4.4%) -0.9% ( -8% - 7%) 0.488
LowSpanNear 100.30 (2.1%) 99.42 (2.0%) -0.9% ( -4% - 3%) 0.178
HighIntervalsOrdered 13.86 (2.0%) 13.75 (1.9%) -0.8% ( -4% - 3%) 0.174
Prefix3 230.15 (2.8%) 228.24 (3.3%) -0.8% ( -6% - 5%) 0.394
HighSloppyPhrase 1.77 (3.8%) 1.76 (3.4%) -0.8% ( -7% - 6%) 0.477
BrowseMonthTaxoFacets 5.01 (2.2%) 4.97 (3.5%) -0.8% ( -6% - 4%) 0.396
AndHighHigh 35.38 (2.8%) 35.12 (3.1%) -0.7% ( -6% - 5%) 0.428
Wildcard 63.19 (1.9%) 62.83 (1.9%) -0.6% ( -4% - 3%) 0.337
HighSpanNear 11.63 (2.8%) 11.57 (2.2%) -0.6% ( -5% - 4%) 0.492
OrHighNotMed 516.53 (4.0%) 513.71 (4.1%) -0.5% ( -8% - 7%) 0.670
BrowseDayOfYearTaxoFacets 4.20 (2.5%) 4.18 (2.9%) -0.5% ( -5% - 5%) 0.541
LowPhrase 51.17 (3.0%) 50.97 (3.0%) -0.4% ( -6% - 5%) 0.674
MedTerm 1080.72 (5.4%) 1076.53 (4.9%) -0.4% ( -10% - 10%) 0.812
BrowseDateTaxoFacets 4.19 (2.5%) 4.18 (2.8%) -0.4% ( -5% - 5%) 0.657
HighTerm 923.10 (3.8%) 920.08 (4.1%) -0.3% ( -7% - 7%) 0.795
Respell 40.97 (1.8%) 40.84 (2.2%) -0.3% ( -4% - 3%) 0.620
OrHighMed 88.93 (2.7%) 88.72 (1.7%) -0.2% ( -4% - 4%) 0.739
HighPhrase 314.90 (2.9%) 314.29 (4.1%) -0.2% ( -6% - 6%) 0.863
BrowseDayOfYearSSDVFacets 11.39 (5.8%) 11.38 (5.8%) -0.1% ( -11% - 12%) 0.957
BrowseMonthSSDVFacets 12.55 (3.3%) 12.54 (3.3%) -0.1% ( -6% - 6%) 0.924
PKLookup 146.67 (4.3%) 146.58 (3.0%) -0.1% ( -7% - 7%) 0.960
IntNRQ 53.49 (0.7%) 53.50 (0.7%) 0.0% ( -1% - 1%) 0.935
OrNotHighHigh 480.86 (3.8%) 481.91 (3.9%) 0.2% ( -7% - 8%) 0.857
MedSpanNear 130.25 (2.5%) 130.55 (2.5%) 0.2% ( -4% - 5%) 0.775
AndHighMed 159.94 (2.4%) 160.41 (2.6%) 0.3% ( -4% - 5%) 0.704
LowTerm 1030.42 (3.7%) 1034.09 (5.3%) 0.4% ( -8% - 9%) 0.804
OrHighHigh 14.59 (2.2%) 14.65 (1.3%) 0.4% ( -3% - 4%) 0.514
Fuzzy2 51.29 (7.7%) 51.74 (9.7%) 0.9% ( -15% - 19%) 0.755
OrNotHighMed 480.49 (2.9%) 485.66 (3.9%) 1.1% ( -5% - 8%) 0.326
Fuzzy1 35.35 (7.6%) 35.84 (7.5%) 1.4% ( -12% - 17%) 0.562
{code}
So while the index size reduction is relatively small, I don't see a real significant performance impact either. Curious if anyone else sees something interesting in these numbers.
Finally, I'll mention that I observed pretty similar behavior when running our internal (Amazon product search) benchmarking tools. Pretty small impact to index size but no significant performance hit either.
> Explore increasing the allowable exceptions in PForUtil
> -------------------------------------------------------
>
> Key: LUCENE-9877
> URL: https://issues.apache.org/jira/browse/LUCENE-9877
> Project: Lucene - Core
> Issue Type: Task
> Components: core/codecs
> Affects Versions: main (9.0)
> Reporter: Greg Miller
> Priority: Minor
>
> Piggybacking a little off of the investigation I was doing over in LUCENE-9850 I thought it might also be worth-while exploring the impact of increasing the number of allowable exceptions in PForUtil. The aim of this investigation is to see if we could reduce index size by allowing for more exceptions without significant negative impact to performance.
> PForUtil currently allows for up to 3 exceptions, and it only uses 3 bits to encode the number of exceptions (using the remaining 3 bits of the byte used to also encode the number of bits-per-value, which requires 5 bits). Each exception used is encoded with a full byte, using a maximum of 3 bytes per block.
> It seems to me like 14 might be a more ideal number of exceptions if index size is the driving motivation. My thought process is that, in the worst-case, 14 exceptions would be used to save only a single bit-per-value in the corresponding block. With 128 entries per block, this would save 16 bytes. To encode the exception count, we'd need to use an extra byte per block since we need more than the current 3 bits. So with that extra byte in addition to the 14 bytes used to encode the exception values, we would save a single byte in total (just slightly better than breaking even). If we need fewer than the 14 exceptions, or if we're able to save more than 1 bit-per-value, it's all additional savings.
> As a first experiment, I think testing 7 exceptions might be a good middle-ground. Seven exceptions is the most we can encode with the 3 bits we have allocated today. So for 7 exceptions, we don't need the extra byte for encoding the exception count, and in the worst case of needing all 7 to save 1 bit-per-value, we'd save a total of 9 bytes in a block. I suppose the question is what kind of performance hit we might observe due to decoding more exceptions.
> I'll post some results here for discussion or at least for public record of my work for future reference.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org