You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2022/06/10 16:01:26 UTC
[GitHub] [lucene] gsmiller opened a new pull request, #954: LUCENE-10603: Change iteration methodology for SSDV ordinals in the f…
gsmiller opened a new pull request, #954:
URL: https://github.com/apache/lucene/pull/954
This PR is to migrate the facets module to using the newly-added `SortedSetDocValues#docValueCount()` for iteration, as described in LUCENE-10603. It doesn't attempt to move all `SSDV` iteration, just the facets module.
Benchmark results show a potentially small win, and no regressions, so I think we should move forward with this.
```
TaskQPS baseline StdDevQPS candidate StdDev Pct diff p-value
TermDTSort 99.36 (13.9%) 93.27 (12.9%) -6.1% ( -28% - 24%) 0.149
HighTermDayOfYearSort 99.67 (13.9%) 96.80 (12.2%) -2.9% ( -25% - 27%) 0.487
BrowseDayOfYearSSDVFacets 14.07 (18.4%) 13.82 (13.8%) -1.8% ( -28% - 37%) 0.729
HighTermTitleBDVSort 126.78 (21.9%) 124.79 (25.4%) -1.6% ( -40% - 58%) 0.835
IntNRQ 73.42 (4.7%) 72.57 (5.1%) -1.2% ( -10% - 9%) 0.454
OrHighMed 99.39 (3.5%) 98.39 (3.2%) -1.0% ( -7% - 5%) 0.345
OrHighNotMed 1295.86 (3.2%) 1285.26 (5.0%) -0.8% ( -8% - 7%) 0.535
OrHighHigh 46.15 (3.1%) 45.78 (3.0%) -0.8% ( -6% - 5%) 0.400
BrowseMonthSSDVFacets 16.26 (15.1%) 16.13 (9.7%) -0.8% ( -22% - 28%) 0.848
OrHighLow 970.97 (2.8%) 964.35 (1.9%) -0.7% ( -5% - 4%) 0.375
OrNotHighMed 945.22 (3.4%) 939.06 (4.1%) -0.7% ( -7% - 7%) 0.582
MedTerm 2116.21 (5.1%) 2103.04 (4.5%) -0.6% ( -9% - 9%) 0.684
PKLookup 169.76 (3.4%) 168.71 (3.8%) -0.6% ( -7% - 6%) 0.588
AndHighHigh 44.04 (3.0%) 43.78 (5.5%) -0.6% ( -8% - 8%) 0.677
MedIntervalsOrdered 10.35 (5.8%) 10.31 (5.2%) -0.4% ( -10% - 11%) 0.820
OrNotHighHigh 1077.87 (4.1%) 1074.45 (5.0%) -0.3% ( -9% - 9%) 0.827
HighTerm 2923.10 (4.7%) 2914.62 (4.3%) -0.3% ( -8% - 9%) 0.838
LowTerm 1969.85 (4.9%) 1964.63 (5.6%) -0.3% ( -10% - 10%) 0.873
MedSpanNear 59.53 (2.6%) 59.38 (3.1%) -0.2% ( -5% - 5%) 0.784
HighIntervalsOrdered 12.23 (8.2%) 12.20 (7.6%) -0.2% ( -14% - 16%) 0.920
HighSpanNear 5.30 (2.7%) 5.29 (3.2%) -0.1% ( -5% - 5%) 0.902
OrNotHighLow 1213.60 (2.8%) 1212.64 (2.8%) -0.1% ( -5% - 5%) 0.928
LowSloppyPhrase 24.51 (3.3%) 24.49 (3.3%) -0.1% ( -6% - 6%) 0.953
OrHighMedDayTaxoFacets 12.99 (4.9%) 12.98 (6.2%) -0.1% ( -10% - 11%) 0.974
MedTermDayTaxoFacets 23.69 (4.8%) 23.68 (4.1%) -0.1% ( -8% - 9%) 0.971
LowIntervalsOrdered 107.55 (5.3%) 107.51 (3.8%) -0.0% ( -8% - 9%) 0.980
OrHighNotLow 1064.18 (4.6%) 1064.82 (5.6%) 0.1% ( -9% - 10%) 0.970
LowSpanNear 190.49 (3.1%) 190.62 (3.9%) 0.1% ( -6% - 7%) 0.951
AndHighMedDayTaxoFacets 39.56 (2.1%) 39.60 (1.6%) 0.1% ( -3% - 3%) 0.868
MedPhrase 379.28 (2.1%) 379.69 (2.5%) 0.1% ( -4% - 4%) 0.883
HighPhrase 223.12 (2.5%) 223.61 (2.8%) 0.2% ( -4% - 5%) 0.795
HighTermMonthSort 121.98 (16.5%) 122.28 (13.8%) 0.2% ( -25% - 36%) 0.959
LowPhrase 66.70 (2.8%) 66.89 (3.9%) 0.3% ( -6% - 7%) 0.792
Fuzzy1 93.42 (1.8%) 93.69 (1.2%) 0.3% ( -2% - 3%) 0.556
Respell 53.91 (1.8%) 54.07 (1.4%) 0.3% ( -2% - 3%) 0.552
MedSloppyPhrase 16.33 (3.2%) 16.38 (3.3%) 0.3% ( -6% - 7%) 0.763
AndHighMed 90.72 (3.2%) 91.05 (4.0%) 0.4% ( -6% - 7%) 0.753
HighSloppyPhrase 32.08 (4.7%) 32.21 (4.4%) 0.4% ( -8% - 9%) 0.774
OrHighNotHigh 895.72 (4.9%) 899.79 (5.1%) 0.5% ( -9% - 10%) 0.773
AndHighLow 588.99 (2.5%) 591.85 (2.0%) 0.5% ( -3% - 5%) 0.497
Fuzzy2 19.51 (1.8%) 19.61 (1.1%) 0.5% ( -2% - 3%) 0.285
AndHighHighDayTaxoFacets 13.45 (2.7%) 13.53 (2.3%) 0.5% ( -4% - 5%) 0.508
BrowseRandomLabelSSDVFacets 10.00 (5.2%) 10.08 (4.9%) 0.8% ( -8% - 11%) 0.608
BrowseRandomLabelTaxoFacets 17.77 (16.3%) 18.16 (13.7%) 2.2% ( -23% - 38%) 0.638
BrowseMonthTaxoFacets 27.62 (23.3%) 28.42 (18.9%) 2.9% ( -31% - 58%) 0.665
BrowseDateTaxoFacets 20.98 (18.5%) 21.63 (17.7%) 3.1% ( -27% - 48%) 0.590
BrowseDayOfYearTaxoFacets 20.96 (18.7%) 21.63 (18.1%) 3.2% ( -28% - 49%) 0.584
Wildcard 56.56 (8.9%) 58.60 (6.0%) 3.6% ( -10% - 20%) 0.133
Prefix3 521.60 (12.1%) 548.29 (11.9%) 5.1% ( -16% - 33%) 0.178
BrowseDateSSDVFacets 2.63 (11.6%) 2.82 (12.1%) 7.2% ( -14% - 35%) 0.054
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] gsmiller commented on pull request #954: LUCENE-10603: Change iteration methodology for SSDV ordinals in the f…
Posted by GitBox <gi...@apache.org>.
gsmiller commented on PR #954:
URL: https://github.com/apache/lucene/pull/954#issuecomment-1154550479
@jpountz:
> plus there may be lots of users doing old-style iteration so we'll need to deprecate and maybe only clean this up in 10.0 or 11.0
Right, makes sense.
> But I'm curious of the sort of speedup that this would yield
Me too. I hacked up a version of this change on another branch ([here](https://github.com/gsmiller/lucene/commit/3768665ae6c173014e8288c46c27afa517c90ede)) that let calling code explicitly ask for a "fast" version of SSDV that doesn't do the ordinal check, and then relied on this new code path for loading SSDV within the faceting module. I didn't observe a significant change with our benchmark tooling, but I wonder how much we actually exercise these multi-value cases within our faceting benchmarks. I think it will be more interesting to test once we've migrated more use-cases to this new SSDV iteration style.
You may have had a completely different thought in mind for testing this, so please let me know if I'm missing the mark here. Thanks!
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] gsmiller merged pull request #954: LUCENE-10603: Change iteration methodology for SSDV ordinals in the f…
Posted by GitBox <gi...@apache.org>.
gsmiller merged PR #954:
URL: https://github.com/apache/lucene/pull/954
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] jpountz commented on pull request #954: LUCENE-10603: Change iteration methodology for SSDV ordinals in the f…
Posted by GitBox <gi...@apache.org>.
jpountz commented on PR #954:
URL: https://github.com/apache/lucene/pull/954#issuecomment-1154822015
This is exactly the testing that I had in mind, thanks for running these benchmarks!
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] jpountz commented on pull request #954: LUCENE-10603: Change iteration methodology for SSDV ordinals in the f…
Posted by GitBox <gi...@apache.org>.
jpountz commented on PR #954:
URL: https://github.com/apache/lucene/pull/954#issuecomment-1154152474
We'll have to wait to clean up indeed, plus there may be lots of users doing old-style iteration so we'll need to deprecate and maybe only clean this up in 10.0 or 11.0. But I'm curious of the sort of speedup that this would yield. :)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] gsmiller commented on pull request #954: LUCENE-10603: Change iteration methodology for SSDV ordinals in the f…
Posted by GitBox <gi...@apache.org>.
gsmiller commented on PR #954:
URL: https://github.com/apache/lucene/pull/954#issuecomment-1156721533
Sure, no problem @jpountz. Thanks for the suggestion. Any concerns with merging this PR as it is?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] gsmiller commented on pull request #954: LUCENE-10603: Change iteration methodology for SSDV ordinals in the f…
Posted by GitBox <gi...@apache.org>.
gsmiller commented on PR #954:
URL: https://github.com/apache/lucene/pull/954#issuecomment-1159099868
@jpountz no problem. I wasn't in any rush with this, and since you'd had a look, I just wanted to make sure you didn't have additional feedback. Thanks!
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] jpountz commented on pull request #954: LUCENE-10603: Change iteration methodology for SSDV ordinals in the f…
Posted by GitBox <gi...@apache.org>.
jpountz commented on PR #954:
URL: https://github.com/apache/lucene/pull/954#issuecomment-1153675730
@gsmiller I wonder if you could also test if there is a speed up if we remove checks that the codec has to do in order to make sure to return `NO_MORE_ORDS` when values for a doc are exhausted. E.g. `Lucene90DocValuesProducer#getSortedSet` looks like this today
```java
@Override
public SortedSetDocValues getSortedSet(FieldInfo field) throws IOException {
SortedSetEntry entry = sortedSets.get(field.name);
if (entry.singleValueEntry != null) {
return DocValues.singleton(getSorted(entry.singleValueEntry));
}
final SortedNumericDocValues ords = getSortedNumeric(entry.ordsEntry);
return new BaseSortedSetDocValues(entry, data) {
int i = 0;
int count = 0;
boolean set = false;
@Override
public long nextOrd() throws IOException {
if (set == false) {
set = true;
i = 0;
count = ords.docValueCount();
}
if (i++ == count) {
return NO_MORE_ORDS;
}
return ords.nextValue();
}
@Override
public long docValueCount() {
return ords.docValueCount();
}
@Override
public boolean advanceExact(int target) throws IOException {
set = false;
return ords.advanceExact(target);
}
@Override
public int docID() {
return ords.docID();
}
@Override
public int nextDoc() throws IOException {
set = false;
return ords.nextDoc();
}
@Override
public int advance(int target) throws IOException {
set = false;
return ords.advance(target);
}
@Override
public long cost() {
return ords.cost();
}
};
}
```
but if we moved everything to this new iteration model, we wouldn't have to check if the caller is visiting more values than expected, it would just lead to an undefined behavior and we could remove `i`, `count`, `set` and `nextOrd()` could delegate directly to `ords.nextValue()`.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] gsmiller commented on pull request #954: LUCENE-10603: Change iteration methodology for SSDV ordinals in the f…
Posted by GitBox <gi...@apache.org>.
gsmiller commented on PR #954:
URL: https://github.com/apache/lucene/pull/954#issuecomment-1154150491
@jpountz +1 to testing this. Good call! Since I only tackled a subset of code accessing `NO_MORE_DOCS`, I think we'll have to wait to clean this up and test though right?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org