You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2022/06/10 16:01:26 UTC

[GitHub] [lucene] gsmiller opened a new pull request, #954: LUCENE-10603: Change iteration methodology for SSDV ordinals in the f…

gsmiller opened a new pull request, #954:
URL: https://github.com/apache/lucene/pull/954

   This PR is to migrate the facets module to using the newly-added `SortedSetDocValues#docValueCount()` for iteration, as described in LUCENE-10603. It doesn't attempt to move all `SSDV` iteration, just the facets module.
   
   Benchmark results show a potentially small win, and no regressions, so I think we should move forward with this.
   
   ```
                               TaskQPS baseline      StdDevQPS candidate      StdDev                Pct diff p-value
                         TermDTSort       99.36     (13.9%)       93.27     (12.9%)   -6.1% ( -28% -   24%) 0.149
              HighTermDayOfYearSort       99.67     (13.9%)       96.80     (12.2%)   -2.9% ( -25% -   27%) 0.487
          BrowseDayOfYearSSDVFacets       14.07     (18.4%)       13.82     (13.8%)   -1.8% ( -28% -   37%) 0.729
               HighTermTitleBDVSort      126.78     (21.9%)      124.79     (25.4%)   -1.6% ( -40% -   58%) 0.835
                             IntNRQ       73.42      (4.7%)       72.57      (5.1%)   -1.2% ( -10% -    9%) 0.454
                          OrHighMed       99.39      (3.5%)       98.39      (3.2%)   -1.0% (  -7% -    5%) 0.345
                       OrHighNotMed     1295.86      (3.2%)     1285.26      (5.0%)   -0.8% (  -8% -    7%) 0.535
                         OrHighHigh       46.15      (3.1%)       45.78      (3.0%)   -0.8% (  -6% -    5%) 0.400
              BrowseMonthSSDVFacets       16.26     (15.1%)       16.13      (9.7%)   -0.8% ( -22% -   28%) 0.848
                          OrHighLow      970.97      (2.8%)      964.35      (1.9%)   -0.7% (  -5% -    4%) 0.375
                       OrNotHighMed      945.22      (3.4%)      939.06      (4.1%)   -0.7% (  -7% -    7%) 0.582
                            MedTerm     2116.21      (5.1%)     2103.04      (4.5%)   -0.6% (  -9% -    9%) 0.684
                           PKLookup      169.76      (3.4%)      168.71      (3.8%)   -0.6% (  -7% -    6%) 0.588
                        AndHighHigh       44.04      (3.0%)       43.78      (5.5%)   -0.6% (  -8% -    8%) 0.677
                MedIntervalsOrdered       10.35      (5.8%)       10.31      (5.2%)   -0.4% ( -10% -   11%) 0.820
                      OrNotHighHigh     1077.87      (4.1%)     1074.45      (5.0%)   -0.3% (  -9% -    9%) 0.827
                           HighTerm     2923.10      (4.7%)     2914.62      (4.3%)   -0.3% (  -8% -    9%) 0.838
                            LowTerm     1969.85      (4.9%)     1964.63      (5.6%)   -0.3% ( -10% -   10%) 0.873
                        MedSpanNear       59.53      (2.6%)       59.38      (3.1%)   -0.2% (  -5% -    5%) 0.784
               HighIntervalsOrdered       12.23      (8.2%)       12.20      (7.6%)   -0.2% ( -14% -   16%) 0.920
                       HighSpanNear        5.30      (2.7%)        5.29      (3.2%)   -0.1% (  -5% -    5%) 0.902
                       OrNotHighLow     1213.60      (2.8%)     1212.64      (2.8%)   -0.1% (  -5% -    5%) 0.928
                    LowSloppyPhrase       24.51      (3.3%)       24.49      (3.3%)   -0.1% (  -6% -    6%) 0.953
             OrHighMedDayTaxoFacets       12.99      (4.9%)       12.98      (6.2%)   -0.1% ( -10% -   11%) 0.974
               MedTermDayTaxoFacets       23.69      (4.8%)       23.68      (4.1%)   -0.1% (  -8% -    9%) 0.971
                LowIntervalsOrdered      107.55      (5.3%)      107.51      (3.8%)   -0.0% (  -8% -    9%) 0.980
                       OrHighNotLow     1064.18      (4.6%)     1064.82      (5.6%)    0.1% (  -9% -   10%) 0.970
                        LowSpanNear      190.49      (3.1%)      190.62      (3.9%)    0.1% (  -6% -    7%) 0.951
            AndHighMedDayTaxoFacets       39.56      (2.1%)       39.60      (1.6%)    0.1% (  -3% -    3%) 0.868
                          MedPhrase      379.28      (2.1%)      379.69      (2.5%)    0.1% (  -4% -    4%) 0.883
                         HighPhrase      223.12      (2.5%)      223.61      (2.8%)    0.2% (  -4% -    5%) 0.795
                  HighTermMonthSort      121.98     (16.5%)      122.28     (13.8%)    0.2% ( -25% -   36%) 0.959
                          LowPhrase       66.70      (2.8%)       66.89      (3.9%)    0.3% (  -6% -    7%) 0.792
                             Fuzzy1       93.42      (1.8%)       93.69      (1.2%)    0.3% (  -2% -    3%) 0.556
                            Respell       53.91      (1.8%)       54.07      (1.4%)    0.3% (  -2% -    3%) 0.552
                    MedSloppyPhrase       16.33      (3.2%)       16.38      (3.3%)    0.3% (  -6% -    7%) 0.763
                         AndHighMed       90.72      (3.2%)       91.05      (4.0%)    0.4% (  -6% -    7%) 0.753
                   HighSloppyPhrase       32.08      (4.7%)       32.21      (4.4%)    0.4% (  -8% -    9%) 0.774
                      OrHighNotHigh      895.72      (4.9%)      899.79      (5.1%)    0.5% (  -9% -   10%) 0.773
                         AndHighLow      588.99      (2.5%)      591.85      (2.0%)    0.5% (  -3% -    5%) 0.497
                             Fuzzy2       19.51      (1.8%)       19.61      (1.1%)    0.5% (  -2% -    3%) 0.285
           AndHighHighDayTaxoFacets       13.45      (2.7%)       13.53      (2.3%)    0.5% (  -4% -    5%) 0.508
        BrowseRandomLabelSSDVFacets       10.00      (5.2%)       10.08      (4.9%)    0.8% (  -8% -   11%) 0.608
        BrowseRandomLabelTaxoFacets       17.77     (16.3%)       18.16     (13.7%)    2.2% ( -23% -   38%) 0.638
              BrowseMonthTaxoFacets       27.62     (23.3%)       28.42     (18.9%)    2.9% ( -31% -   58%) 0.665
               BrowseDateTaxoFacets       20.98     (18.5%)       21.63     (17.7%)    3.1% ( -27% -   48%) 0.590
          BrowseDayOfYearTaxoFacets       20.96     (18.7%)       21.63     (18.1%)    3.2% ( -28% -   49%) 0.584
                           Wildcard       56.56      (8.9%)       58.60      (6.0%)    3.6% ( -10% -   20%) 0.133
                            Prefix3      521.60     (12.1%)      548.29     (11.9%)    5.1% ( -16% -   33%) 0.178
               BrowseDateSSDVFacets        2.63     (11.6%)        2.82     (12.1%)    7.2% ( -14% -   35%) 0.054
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] gsmiller commented on pull request #954: LUCENE-10603: Change iteration methodology for SSDV ordinals in the f…

Posted by GitBox <gi...@apache.org>.
gsmiller commented on PR #954:
URL: https://github.com/apache/lucene/pull/954#issuecomment-1154550479

   @jpountz:
   > plus there may be lots of users doing old-style iteration so we'll need to deprecate and maybe only clean this up in 10.0 or 11.0
   
   Right, makes sense.
   
   > But I'm curious of the sort of speedup that this would yield
   
   Me too. I hacked up a version of this change on another branch ([here](https://github.com/gsmiller/lucene/commit/3768665ae6c173014e8288c46c27afa517c90ede)) that let calling code explicitly ask for a "fast" version of SSDV that doesn't do the ordinal check, and then relied on this new code path for loading SSDV within the faceting module. I didn't observe a significant change with our benchmark tooling, but I wonder how much we actually exercise these multi-value cases within our faceting benchmarks. I think it will be more interesting to test once we've migrated more use-cases to this new SSDV iteration style.
   
   You may have had a completely different thought in mind for testing this, so please let me know if I'm missing the mark here. Thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] gsmiller merged pull request #954: LUCENE-10603: Change iteration methodology for SSDV ordinals in the f…

Posted by GitBox <gi...@apache.org>.
gsmiller merged PR #954:
URL: https://github.com/apache/lucene/pull/954


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] jpountz commented on pull request #954: LUCENE-10603: Change iteration methodology for SSDV ordinals in the f…

Posted by GitBox <gi...@apache.org>.
jpountz commented on PR #954:
URL: https://github.com/apache/lucene/pull/954#issuecomment-1154822015

   This is exactly the testing that I had in mind, thanks for running these benchmarks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] jpountz commented on pull request #954: LUCENE-10603: Change iteration methodology for SSDV ordinals in the f…

Posted by GitBox <gi...@apache.org>.
jpountz commented on PR #954:
URL: https://github.com/apache/lucene/pull/954#issuecomment-1154152474

   We'll have to wait to clean up indeed, plus there may be lots of users doing old-style iteration so we'll need to deprecate and maybe only clean this up in 10.0 or 11.0. But I'm curious of the sort of speedup that this would yield. :)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] gsmiller commented on pull request #954: LUCENE-10603: Change iteration methodology for SSDV ordinals in the f…

Posted by GitBox <gi...@apache.org>.
gsmiller commented on PR #954:
URL: https://github.com/apache/lucene/pull/954#issuecomment-1156721533

   Sure, no problem @jpountz. Thanks for the suggestion. Any concerns with merging this PR as it is?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] gsmiller commented on pull request #954: LUCENE-10603: Change iteration methodology for SSDV ordinals in the f…

Posted by GitBox <gi...@apache.org>.
gsmiller commented on PR #954:
URL: https://github.com/apache/lucene/pull/954#issuecomment-1159099868

   @jpountz no problem. I wasn't in any rush with this, and since you'd had a look, I just wanted to make sure you didn't have additional feedback. Thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] jpountz commented on pull request #954: LUCENE-10603: Change iteration methodology for SSDV ordinals in the f…

Posted by GitBox <gi...@apache.org>.
jpountz commented on PR #954:
URL: https://github.com/apache/lucene/pull/954#issuecomment-1153675730

   @gsmiller I wonder if you could also test if there is a speed up if we remove checks that the codec has to do in order to make sure to return `NO_MORE_ORDS` when values for a doc are exhausted. E.g. `Lucene90DocValuesProducer#getSortedSet` looks like this today
   
   ```java
     @Override
     public SortedSetDocValues getSortedSet(FieldInfo field) throws IOException {
       SortedSetEntry entry = sortedSets.get(field.name);
       if (entry.singleValueEntry != null) {
         return DocValues.singleton(getSorted(entry.singleValueEntry));
       }
   
       final SortedNumericDocValues ords = getSortedNumeric(entry.ordsEntry);
       return new BaseSortedSetDocValues(entry, data) {
   
         int i = 0;
         int count = 0;
         boolean set = false;
   
         @Override
         public long nextOrd() throws IOException {
           if (set == false) {
             set = true;
             i = 0;
             count = ords.docValueCount();
           }
           if (i++ == count) {
             return NO_MORE_ORDS;
           }
           return ords.nextValue();
         }
   
         @Override
         public long docValueCount() {
           return ords.docValueCount();
         }
   
         @Override
         public boolean advanceExact(int target) throws IOException {
           set = false;
           return ords.advanceExact(target);
         }
   
         @Override
         public int docID() {
           return ords.docID();
         }
   
         @Override
         public int nextDoc() throws IOException {
           set = false;
           return ords.nextDoc();
         }
   
         @Override
         public int advance(int target) throws IOException {
           set = false;
           return ords.advance(target);
         }
   
         @Override
         public long cost() {
           return ords.cost();
         }
       };
     }
   ```
   
   but if we moved everything to this new iteration model, we wouldn't have to check if the caller is visiting more values than expected, it would just lead to an undefined behavior and we could remove `i`, `count`, `set` and `nextOrd()` could delegate directly to `ords.nextValue()`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] gsmiller commented on pull request #954: LUCENE-10603: Change iteration methodology for SSDV ordinals in the f…

Posted by GitBox <gi...@apache.org>.
gsmiller commented on PR #954:
URL: https://github.com/apache/lucene/pull/954#issuecomment-1154150491

   @jpountz +1 to testing this. Good call! Since I only tackled a subset of code accessing `NO_MORE_DOCS`, I think we'll have to wait to clean this up and test though right?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org