You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2023/01/01 11:38:50 UTC

[GitHub] [lucene] jpountz opened a new pull request, #12055: Better skipping for multi-term queries with a FILTER rewrite.

jpountz opened a new pull request, #12055:
URL: https://github.com/apache/lucene/pull/12055

   Currently multi-term queries with a filter rewrite internally rewrite to a disjunction if 16 terms or less match the query. Otherwise postings lists of matching terms are collected into a `DocIdSetBuilder`. This change replaces the latter with a mixed approach where a disjunction is created between the 16 terms that have the highest document frequency and an iterator produced from the `DocIdSetBuilder` that collects all other terms. On fields that have a zipfian distribution, it's quite likely that no high-frequency terms make it to the `DocIdSetBuilder`. This provides two main benefits:
    - Queries are less likely to allocate a FixedBitSet of size `maxDoc`.
    - Queries are better at skipping or early terminating. On the other hand, queries that need to consume most or all matching documents may get a slowdown.
   
   The slowdown is unfortunate, but my gut feeling is that this change still has more pros than cons.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] rmuir commented on pull request #12055: Better skipping for multi-term queries with a FILTER rewrite.

Posted by GitBox <gi...@apache.org>.
rmuir commented on PR #12055:
URL: https://github.com/apache/lucene/pull/12055#issuecomment-1369222376

   there's no reason to duplicate a bunch of code just because of minor changes to a rewrite method. we can have more than one or two of these rewritemethods, and use different ones for different queries: this fact seems to have been forgotten here. and maybe we should have e.g. simple FILTER rewrite that just does that, this one with lots of magic could have a different name.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] gsmiller commented on pull request #12055: Better skipping for multi-term queries with a FILTER rewrite.

Posted by "gsmiller (via GitHub)" <gi...@apache.org>.
gsmiller commented on PR #12055:
URL: https://github.com/apache/lucene/pull/12055#issuecomment-1435332248

   OK, I think I've addressed the previous feedback and also brought in the same changes to `TermInSetQuery`. This should be ready for feedback @jpountz (whenever you have a free moment).
   
   On some internal benchmarks (Amazon product search), we see throughput increases ranging from ~7 - 63% (depending on a number of factors). We have some situations where pre-processing of long postings (in the existing TiS implementation) takes up a large share of CPU time (these tend to be cases where the actual matches are sparse but the TiS terms match a large number of docs). Being able to "hold back" these long postings into a `DisjunctionDisiAppox` while still pre-processing the shorter postings is a big win in these cases. Here are some flame charts showing the impact (heavily redacted of course):
   
   "Normal" TiS:
   <img width="1469" alt="Screen Shot 2023-02-15 at 7 51 01 AM" src="https://user-images.githubusercontent.com/16479560/219803198-c28ce8f5-1807-4c86-b3ec-7d6e33096533.png">
   
   TiS with this PR:
   <img width="1415" alt="Screen Shot 2023-02-15 at 10 38 38 AM" src="https://user-images.githubusercontent.com/16479560/219803253-b785d652-cf37-42c6-bf7c-be6d18e69caf.png">
   
   I also re-ran `luceneutil` benchmarks (`wikimedium10m`) and see consistent results with the initial PR:
   ```
                               TaskQPS baseline      StdDevQPS candidate      StdDev                Pct diff p-value
              BrowseMonthTaxoFacets       32.05     (14.3%)       30.06     (23.2%)   -6.2% ( -38% -   36%) 0.309
              BrowseMonthSSDVFacets       14.85     (17.6%)       13.95      (2.3%)   -6.0% ( -22% -   16%) 0.127
        BrowseRandomLabelTaxoFacets       23.18     (12.4%)       21.93     (19.8%)   -5.4% ( -33% -   30%) 0.303
               BrowseDateTaxoFacets       31.04     (14.2%)       29.42     (22.7%)   -5.2% ( -36% -   37%) 0.383
          BrowseDayOfYearTaxoFacets       31.29     (14.4%)       29.66     (22.9%)   -5.2% ( -37% -   37%) 0.389
          BrowseDayOfYearSSDVFacets       14.17     (17.7%)       13.85     (14.0%)   -2.2% ( -28% -   35%) 0.659
                             IntNRQ      156.59      (5.3%)      154.50      (7.1%)   -1.3% ( -12% -   11%) 0.498
                  HighTermTitleSort      100.27      (2.6%)       99.43      (2.8%)   -0.8% (  -6% -    4%) 0.329
                  HighTermMonthSort     2633.69      (3.3%)     2614.59      (3.1%)   -0.7% (  -6% -    5%) 0.477
                         AndHighLow     1085.57      (2.9%)     1080.12      (2.5%)   -0.5% (  -5% -    5%) 0.559
                      OrNotHighHigh      795.28      (3.1%)      791.61      (3.4%)   -0.5% (  -6% -    6%) 0.656
                         HighPhrase      145.14      (2.7%)      144.70      (2.8%)   -0.3% (  -5% -    5%) 0.725
               BrowseDateSSDVFacets        3.81      (7.7%)        3.80      (7.9%)   -0.3% ( -14% -   16%) 0.905
                             Fuzzy2       57.08      (1.3%)       56.95      (1.2%)   -0.2% (  -2% -    2%) 0.555
                          LowPhrase      379.61      (2.4%)      378.74      (2.8%)   -0.2% (  -5% -    5%) 0.783
                             Fuzzy1       76.62      (1.5%)       76.45      (1.2%)   -0.2% (  -2% -    2%) 0.621
                          MedPhrase       19.20      (2.3%)       19.19      (2.2%)   -0.1% (  -4% -    4%) 0.898
             OrHighMedDayTaxoFacets       15.40      (3.4%)       15.40      (3.3%)   -0.0% (  -6% -    6%) 0.978
                       OrNotHighLow     1186.45      (3.0%)     1186.35      (2.2%)   -0.0% (  -5% -    5%) 0.992
                        MedSpanNear        8.28      (2.7%)        8.28      (2.6%)    0.0% (  -5% -    5%) 0.996
                         TermDTSort      104.78      (1.2%)      104.79      (1.3%)    0.0% (  -2% -    2%) 0.984
                         AndHighMed      204.19      (3.2%)      204.21      (3.7%)    0.0% (  -6% -    7%) 0.992
               MedTermDayTaxoFacets       51.30      (2.9%)       51.32      (3.0%)    0.0% (  -5% -    6%) 0.970
               HighTermTitleBDVSort       21.28      (4.7%)       21.28      (4.2%)    0.0% (  -8% -    9%) 0.975
                        LowSpanNear      179.68      (1.3%)      179.77      (1.6%)    0.0% (  -2% -    3%) 0.919
            AndHighMedDayTaxoFacets       25.84      (1.6%)       25.85      (1.7%)    0.0% (  -3% -    3%) 0.923
                          OrHighMed      106.73      (3.0%)      106.83      (3.4%)    0.1% (  -6% -    6%) 0.932
                       OrNotHighMed      329.39      (3.1%)      329.71      (3.4%)    0.1% (  -6% -    6%) 0.925
                            Respell       49.02      (0.9%)       49.09      (0.6%)    0.1% (  -1% -    1%) 0.546
                MedIntervalsOrdered        4.40      (5.8%)        4.40      (5.5%)    0.1% ( -10% -   12%) 0.933
           AndHighHighDayTaxoFacets       13.00      (1.7%)       13.02      (1.5%)    0.2% (  -2% -    3%) 0.760
                           PKLookup      181.72      (2.5%)      182.00      (2.2%)    0.2% (  -4% -    4%) 0.833
                       HighSpanNear       23.16      (1.5%)       23.20      (1.9%)    0.2% (  -3% -    3%) 0.763
                       OrHighNotLow      420.49      (3.3%)      421.43      (5.0%)    0.2% (  -7% -    8%) 0.867
                LowIntervalsOrdered       32.01      (3.6%)       32.08      (3.6%)    0.2% (  -6% -    7%) 0.838
                    MedSloppyPhrase      136.79      (2.9%)      137.27      (2.7%)    0.3% (  -5% -    6%) 0.696
              HighTermDayOfYearSort      233.88      (2.3%)      234.72      (2.9%)    0.4% (  -4% -    5%) 0.663
                    LowSloppyPhrase       24.70      (2.6%)       24.80      (2.5%)    0.4% (  -4% -    5%) 0.604
                       OrHighNotMed      477.53      (3.0%)      479.58      (5.0%)    0.4% (  -7% -    8%) 0.742
                      OrHighNotHigh      274.34      (3.4%)      275.55      (4.8%)    0.4% (  -7% -    8%) 0.737
               HighIntervalsOrdered        1.86      (3.4%)        1.87      (3.3%)    0.5% (  -6% -    7%) 0.652
        BrowseRandomLabelSSDVFacets       10.16      (8.9%)       10.22      (9.0%)    0.5% ( -15% -   20%) 0.850
                          OrHighLow      426.87      (2.9%)      429.35      (4.0%)    0.6% (  -6% -    7%) 0.599
                            MedTerm      520.48      (3.7%)      524.12      (5.6%)    0.7% (  -8% -   10%) 0.643
                           HighTerm      522.07      (3.3%)      526.23      (5.2%)    0.8% (  -7% -    9%) 0.562
                   HighSloppyPhrase        9.98      (4.5%)       10.06      (4.2%)    0.8% (  -7% -    9%) 0.562
                         OrHighHigh       29.97      (4.5%)       30.24      (5.5%)    0.9% (  -8% -   11%) 0.570
                            LowTerm      732.69      (3.2%)      740.93      (4.5%)    1.1% (  -6% -    9%) 0.362
                        AndHighHigh       29.62      (5.1%)       29.99      (5.9%)    1.2% (  -9% -   12%) 0.481
                            Prefix3      138.32      (1.3%)      178.19      (1.9%)   28.8% (  25% -   32%) 0.000
                           Wildcard      393.93      (1.5%)      759.76      (3.8%)   92.9% (  86% -   99%) 0.000
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] gsmiller commented on pull request #12055: Better skipping for multi-term queries with a FILTER rewrite.

Posted by "gsmiller (via GitHub)" <gi...@apache.org>.
gsmiller commented on PR #12055:
URL: https://github.com/apache/lucene/pull/12055#issuecomment-1435477300

   Found the issue where the "constant score auto rewrite" implementation was removed: [LUCENE-5938](https://issues.apache.org/jira/browse/LUCENE-5938). If I'm understanding the history, it seems like the auto rewrite logic was trying to balance two things:
   1. Number of terms. If there are "few" terms (< 350) it would favor using a boolean query. When it passed the term threshold, it would use a filter rewrite.
   2. Sparsity of docs (total docFreq over visited terms relative to total docs in segment). If the "density" of the docs passed a certain point (0.1%), it would favor a filter approach instead of a boolean query. This point seems to have been in place to account for the fact that rewriting required using a fixed bitset, which wasn't efficient when very sparse.
   
   It looks like this rewrite method was removed in LUCENE-5938 since it introduced the idea of a sparse bitset, which removed the issue with #2 above.
   
   In my opinion, it seems like #1 is still a very valid trade-off (many terms are inefficient to manage in a boolean query due to the associated PQ). This, of course, is what the current rewrite method takes into consideration (with a threshold of 16 terms). What I still _don't_ like about the existing implementation is how it completely changes behavior to a full bitset rewrite after passing 16 terms. I _do_ think it's a nice win overall to rewrite in the way proposed by this PR. As far as I can tell, the former implementation never "incrementally" pre-processed postings into a filter bitset. It was an "all or nothing" approach. I think the key benefit of this PR is to allow for "incremental" processing.
   
   But, I also recognize it might not be applicable in all cases, for all users, and/or for all file systems. I think it's really good feedback to introduce this idea as a new rewrite option instead of modifying the existing one in-place. I'll look into that as a next step.
   
   @rmuir / @jpountz / @mikemccand - since you all were involved in LUCENE-5938 and the earlier implementation of the "auto rewrite," please let me know if I'm missing anything. I the best "digital archeology" I could, but it's very possible I'm missing something. Thanks again for the feedback!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] jpountz commented on pull request #12055: Better skipping for multi-term queries with a FILTER rewrite.

Posted by GitBox <gi...@apache.org>.
jpountz commented on PR #12055:
URL: https://github.com/apache/lucene/pull/12055#issuecomment-1368422269

   Here is what luceneutil gives on wikimedium10m:
   
   ```
                               TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
               BrowseDateTaxoFacets       43.81      (3.9%)       42.63     (12.5%)   -2.7% ( -18% -   14%) 0.359
                       OrHighNotLow      397.64      (9.3%)      387.31      (7.0%)   -2.6% ( -17% -   15%) 0.320
          BrowseDayOfYearTaxoFacets       44.21      (4.4%)       43.10     (12.8%)   -2.5% ( -18% -   15%) 0.406
                       OrHighNotMed      439.76      (8.8%)      431.26      (6.6%)   -1.9% ( -15% -   14%) 0.432
                      OrHighNotHigh      349.59      (8.0%)      342.97      (5.8%)   -1.9% ( -14% -   12%) 0.391
              BrowseMonthTaxoFacets       29.26      (8.8%)       28.75     (12.3%)   -1.7% ( -21% -   21%) 0.609
                      OrNotHighHigh      359.69      (6.8%)      353.47      (5.4%)   -1.7% ( -13% -   11%) 0.374
                            MedTerm      741.89      (6.7%)      729.74      (6.8%)   -1.6% ( -14% -   12%) 0.442
                        AndHighHigh      104.97      (5.6%)      103.30      (5.7%)   -1.6% ( -12% -   10%) 0.373
                           HighTerm      509.66      (7.1%)      501.79      (7.3%)   -1.5% ( -14% -   13%) 0.498
                         OrHighHigh       46.45      (4.3%)       45.79      (3.3%)   -1.4% (  -8% -    6%) 0.240
                            LowTerm      972.50      (7.7%)      959.89      (6.7%)   -1.3% ( -14% -   14%) 0.570
                  HighTermTitleSort      174.75      (6.6%)      172.71      (5.5%)   -1.2% ( -12% -   11%) 0.544
                         AndHighLow     1288.70      (2.7%)     1274.94      (3.1%)   -1.1% (  -6% -    4%) 0.247
                       OrNotHighMed      456.13      (4.5%)      452.07      (3.9%)   -0.9% (  -8% -    7%) 0.504
                  HighTermMonthSort     3799.69      (6.2%)     3765.99      (4.8%)   -0.9% ( -11% -   10%) 0.613
              BrowseMonthSSDVFacets       21.87      (9.8%)       21.67     (10.7%)   -0.9% ( -19% -   21%) 0.786
                         HighPhrase       93.80      (7.4%)       92.97      (6.1%)   -0.9% ( -13% -   13%) 0.680
                    LowSloppyPhrase       59.38      (3.5%)       58.90      (4.2%)   -0.8% (  -8% -    7%) 0.513
                             Fuzzy2       49.64      (1.6%)       49.26      (2.7%)   -0.8% (  -4% -    3%) 0.268
                             Fuzzy1      108.72      (1.5%)      107.94      (1.6%)   -0.7% (  -3% -    2%) 0.148
                        LowSpanNear      157.46      (4.0%)      156.35      (4.0%)   -0.7% (  -8% -    7%) 0.577
        BrowseRandomLabelSSDVFacets       14.99      (5.9%)       14.88      (5.9%)   -0.7% ( -11% -   11%) 0.712
           AndHighHighDayTaxoFacets        6.15      (6.0%)        6.11      (5.6%)   -0.6% ( -11% -   11%) 0.743
                         AndHighMed      206.86      (5.1%)      205.71      (5.7%)   -0.6% ( -10% -   10%) 0.745
                          OrHighMed      178.33      (3.7%)      177.55      (3.7%)   -0.4% (  -7% -    7%) 0.709
                        MedSpanNear       55.68      (3.1%)       55.48      (3.3%)   -0.4% (  -6% -    6%) 0.713
                       HighSpanNear       14.27      (3.5%)       14.23      (3.1%)   -0.3% (  -6% -    6%) 0.780
                            Respell      106.00      (1.8%)      105.77      (1.6%)   -0.2% (  -3% -    3%) 0.695
               HighTermTitleBDVSort       18.32      (3.7%)       18.30      (5.7%)   -0.1% (  -9% -    9%) 0.927
                           PKLookup      235.19      (3.3%)      234.99      (3.8%)   -0.1% (  -6% -    7%) 0.939
                    MedSloppyPhrase       19.51      (3.8%)       19.49      (4.0%)   -0.1% (  -7% -    8%) 0.957
                MedIntervalsOrdered       72.80      (5.6%)       72.76      (5.1%)   -0.1% ( -10% -   11%) 0.971
                          OrHighLow      711.52      (2.3%)      712.77      (2.6%)    0.2% (  -4% -    5%) 0.823
                          LowPhrase       37.03      (5.3%)       37.10      (4.7%)    0.2% (  -9% -   10%) 0.908
                          MedPhrase      147.57      (5.0%)      147.86      (4.2%)    0.2% (  -8% -    9%) 0.893
        BrowseRandomLabelTaxoFacets       35.14     (10.5%)       35.21     (12.7%)    0.2% ( -20% -   26%) 0.955
              HighTermDayOfYearSort      394.78      (5.3%)      395.67      (2.7%)    0.2% (  -7% -    8%) 0.865
                         TermDTSort      129.27      (3.6%)      129.82      (3.7%)    0.4% (  -6% -    7%) 0.711
            AndHighMedDayTaxoFacets      157.89      (2.1%)      158.59      (1.8%)    0.4% (  -3% -    4%) 0.475
                       OrNotHighLow     1014.83      (4.5%)     1020.03      (4.1%)    0.5% (  -7% -    9%) 0.709
             OrHighMedDayTaxoFacets       14.78      (4.6%)       14.87      (4.5%)    0.6% (  -8% -   10%) 0.677
                LowIntervalsOrdered      107.62      (4.9%)      108.44      (3.5%)    0.8% (  -7% -    9%) 0.571
                   HighSloppyPhrase        5.46      (5.3%)        5.51      (3.3%)    0.8% (  -7% -    9%) 0.552
               HighIntervalsOrdered       39.16      (5.5%)       39.52      (4.3%)    0.9% (  -8% -   11%) 0.558
               MedTermDayTaxoFacets       82.95      (3.8%)       83.72      (3.5%)    0.9% (  -6% -    8%) 0.427
               BrowseDateSSDVFacets        5.51      (4.8%)        5.56      (9.9%)    1.0% ( -13% -   16%) 0.692
                             IntNRQ      141.10      (9.5%)      142.93      (7.7%)    1.3% ( -14% -   20%) 0.635
          BrowseDayOfYearSSDVFacets       21.40      (9.8%)       21.89     (12.6%)    2.3% ( -18% -   27%) 0.516
                           Wildcard       49.91      (3.2%)       64.56      (5.8%)   29.4% (  19% -   39%) 0.000
                            Prefix3      149.66      (2.3%)      316.36      (7.6%)  111.4% (  99% -  124%) 0.000
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] jpountz commented on pull request #12055: Better skipping for multi-term queries with a FILTER rewrite.

Posted by "jpountz (via GitHub)" <gi...@apache.org>.
jpountz commented on PR #12055:
URL: https://github.com/apache/lucene/pull/12055#issuecomment-1451952360

   Woohoo!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] jpountz commented on a diff in pull request #12055: Better skipping for multi-term queries with a FILTER rewrite.

Posted by "jpountz (via GitHub)" <gi...@apache.org>.
jpountz commented on code in PR #12055:
URL: https://github.com/apache/lucene/pull/12055#discussion_r1118982580


##########
lucene/core/src/java/org/apache/lucene/search/DisiWrapper.java:
##########
@@ -57,4 +56,13 @@ public DisiWrapper(Scorer scorer) {
       matchCost = 0f;
     }
   }
+
+  public DisiWrapper(DocIdSetIterator iterator) {
+    this.scorer = null;
+    this.iterator = iterator;
+    this.cost = iterator.cost();
+    this.twoPhaseView = null;
+    this.approximation = iterator;
+    this.matchCost = 0;
+  }

Review Comment:
   I wonder if we can keep the API a bit cleaner and change callers to do `new DisiWrapper(new ConstantScoreScorer(itiretar))` instead.



##########
lucene/CHANGES.txt:
##########
@@ -277,6 +277,9 @@ Improvements
 * GITHUB#12070: Compound file creation is no longer subject to merge throttling.
   (Adrien Grand)
 
+* GITHUB#12055: Better skipping support for multi-term queries that have a
+  FILTER rewrite. (Adrien Grand, Greg Miller)

Review Comment:
   Maybe update this change log to mention that a new rewrite method was introduced and what queries specifically switched to this new rewrite method as a default?



##########
lucene/queryparser/src/java/org/apache/lucene/queryparser/flexible/standard/CommonQueryParserConfiguration.java:
##########
@@ -55,12 +58,14 @@ public interface CommonQueryParserConfiguration {
   public boolean getEnablePositionIncrements();
 
   /**
-   * By default, it uses {@link MultiTermQuery#CONSTANT_SCORE_REWRITE} when creating a prefix,
-   * wildcard and range queries. This implementation is generally preferable because it a) Runs
-   * faster b) Does not have the scarcity of terms unduly influence score c) avoids any {@link
-   * TooManyListenersException} exception. However, if your application really needs to use the
-   * old-fashioned boolean queries expansion rewriting and the above points are not relevant then
-   * use this change the rewrite method.
+   * By default QueryParser uses {@link
+   * org.apache.lucene.search.MultiTermQuery#CONSTANT_SCORE_BLENDED_REWRITE} when creating a {@link
+   * PrefixQuery}, {@link WildcardQuery} or {@link TermRangeQuery}. This implementation is generally
+   * preferable because it a) Runs faster b) Does not have the scarcity of terms unduly influence
+   * score c) avoids any {@link org.apache.lucene.search.IndexSearcher.TooManyClauses} exception.
+   * However, if your application really needs to use the old-fashioned {@link BooleanQuery}

Review Comment:
   maybe mention the non-blended constant-score rewrite as another alternative?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] jpountz commented on pull request #12055: Better skipping for multi-term queries with a FILTER rewrite.

Posted by "jpountz (via GitHub)" <gi...@apache.org>.
jpountz commented on PR #12055:
URL: https://github.com/apache/lucene/pull/12055#issuecomment-1434808901

   My attention has moved to a few other things, feel free to do whatever you want with this PR, I'll be happy to review.
   
   +1 on the nice property of gradually moving from a lazy disjunction to eager evaluation, this is what I was after with this change!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] rmuir commented on pull request #12055: Better skipping for multi-term queries with a FILTER rewrite.

Posted by "rmuir (via GitHub)" <gi...@apache.org>.
rmuir commented on PR #12055:
URL: https://github.com/apache/lucene/pull/12055#issuecomment-1435401885

   I dont see any of my feedback addressed.  I'll repeat what i said before:
   * We shouldn't be forming booleanqueries from a `FILTER` rewrite, this is wrong to do and it causes some slowdowns in some cases. We need more rewrites instead of one 'wonder-do-it-all'.
   * We are still reusing postingsenum (now only some of the time, grr) and tossing them into priority queues. this is unsafe.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] rmuir commented on pull request #12055: Better skipping for multi-term queries with a FILTER rewrite.

Posted by "rmuir (via GitHub)" <gi...@apache.org>.
rmuir commented on PR #12055:
URL: https://github.com/apache/lucene/pull/12055#issuecomment-1435403774

   I think it would help, a lot, to look thru history and see how "constant score auto rewrite" was implemented years ago, and then its removal, before adding it back again.
   
   but I'm firmly against doing this crap inside filter rewrite. If we really have good reason to "add  back" "auto rewrite", it would be good to add it as a separate thing like before. Also probably good to look at how it was done before.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] rmuir commented on a diff in pull request #12055: Better skipping for multi-term queries with a FILTER rewrite.

Posted by GitBox <gi...@apache.org>.
rmuir commented on code in PR #12055:
URL: https://github.com/apache/lucene/pull/12055#discussion_r1059807649


##########
lucene/core/src/java/org/apache/lucene/search/MultiTermQueryConstantScoreWrapper.java:
##########
@@ -183,23 +182,31 @@ private WeightOrDocIdSet rewrite(LeafReaderContext context) throws IOException {
           }
           Query q = new ConstantScoreQuery(bq.build());
           final Weight weight = searcher.rewrite(q).createWeight(searcher, scoreMode, score());
-          return new WeightOrDocIdSet(weight);
+          return new WeightOrDocIdSetIterator(weight);
         }
 
         // Too many terms: go back to the terms we already collected and start building the bit set
-        DocIdSetBuilder builder = new DocIdSetBuilder(context.reader().maxDoc(), terms);
+        PriorityQueue<PostingsEnum> highFrequencyTerms =
+            new PriorityQueue<PostingsEnum>(collectedTerms.size()) {
+              @Override
+              protected boolean lessThan(PostingsEnum a, PostingsEnum b) {
+                return a.cost() < b.cost();
+              }
+            };
+        DocIdSetBuilder otherTerms = new DocIdSetBuilder(context.reader().maxDoc(), terms);
         if (collectedTerms.isEmpty() == false) {
           TermsEnum termsEnum2 = terms.iterator();
           for (TermAndState t : collectedTerms) {
             termsEnum2.seekExact(t.term, t.state);
-            docs = termsEnum2.postings(docs, PostingsEnum.NONE);
-            builder.add(docs);
+            PostingsEnum postings = termsEnum2.postings(null, PostingsEnum.NONE);
+            highFrequencyTerms.add(postings);
           }
         }
 
-        // Then keep filling the bit set with remaining terms
+        // Then collect remaining terms
+        PostingsEnum postings = null;
         do {
-          docs = termsEnum.postings(docs, PostingsEnum.NONE);
+          postings = termsEnum.postings(postings, PostingsEnum.NONE);

Review Comment:
   i don't understand how this is safe at all, we are reusing PostingsEnum instances yet also stuffing them into a priority queue.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] gsmiller commented on a diff in pull request #12055: Better skipping for multi-term queries with a FILTER rewrite.

Posted by GitBox <gi...@apache.org>.
gsmiller commented on code in PR #12055:
URL: https://github.com/apache/lucene/pull/12055#discussion_r1060869958


##########
lucene/core/src/java/org/apache/lucene/search/MultiTermQueryConstantScoreWrapper.java:
##########
@@ -183,23 +182,31 @@ private WeightOrDocIdSet rewrite(LeafReaderContext context) throws IOException {
           }
           Query q = new ConstantScoreQuery(bq.build());
           final Weight weight = searcher.rewrite(q).createWeight(searcher, scoreMode, score());
-          return new WeightOrDocIdSet(weight);
+          return new WeightOrDocIdSetIterator(weight);
         }
 
         // Too many terms: go back to the terms we already collected and start building the bit set
-        DocIdSetBuilder builder = new DocIdSetBuilder(context.reader().maxDoc(), terms);
+        PriorityQueue<PostingsEnum> highFrequencyTerms =
+            new PriorityQueue<PostingsEnum>(collectedTerms.size()) {
+              @Override
+              protected boolean lessThan(PostingsEnum a, PostingsEnum b) {
+                return a.cost() < b.cost();
+              }
+            };
+        DocIdSetBuilder otherTerms = new DocIdSetBuilder(context.reader().maxDoc(), terms);

Review Comment:
   minor: Could we define `otherTerms` closer to where it first gets used? (e.g., L:207)



##########
lucene/core/src/java/org/apache/lucene/search/MultiTermQueryConstantScoreWrapper.java:
##########
@@ -211,32 +218,39 @@ private WeightOrDocIdSet rewrite(LeafReaderContext context) throws IOException {
                 new ConstantScoreQuery(
                     new TermQuery(new Term(query.field, termsEnum.term()), termStates));
             Weight weight = searcher.rewrite(q).createWeight(searcher, scoreMode, score());
-            return new WeightOrDocIdSet(weight);
+            return new WeightOrDocIdSetIterator(weight);
           }
-          builder.add(docs);
+          PostingsEnum dropped = highFrequencyTerms.insertWithOverflow(postings);
+          otherTerms.add(dropped);
+          postings = dropped;
         } while (termsEnum.next() != null);
 
-        return new WeightOrDocIdSet(builder.build());
+        List<DocIdSetIterator> disis = new ArrayList<>(highFrequencyTerms.size() + 1);
+        for (PostingsEnum pe : highFrequencyTerms) {
+          disis.add(pe);
+        }
+        disis.add(otherTerms.build().iterator());
+        DisiPriorityQueue subs = new DisiPriorityQueue(disis.size());
+        for (DocIdSetIterator disi : disis) {
+          subs.add(new DisiWrapper(disi));
+        }

Review Comment:
   Maybe I'm overlooking something silly, but can't we just do one pass like this?
   
   ```suggestion
   DisiPriorityQueue subs = new DisiPriorityQueue(highFrequencyTerms.size() + 1);
           for (DocIdSetIterator disi : highFrequencyTerms) {
             subs.add(new DisiWrapper(disi));
           }
           subs.add(new DisiWrapper(otherTerms.build().iterator()));
   ```



##########
lucene/core/src/java/org/apache/lucene/search/MultiTermQueryConstantScoreWrapper.java:
##########
@@ -183,23 +182,31 @@ private WeightOrDocIdSet rewrite(LeafReaderContext context) throws IOException {
           }
           Query q = new ConstantScoreQuery(bq.build());
           final Weight weight = searcher.rewrite(q).createWeight(searcher, scoreMode, score());
-          return new WeightOrDocIdSet(weight);
+          return new WeightOrDocIdSetIterator(weight);
         }
 
         // Too many terms: go back to the terms we already collected and start building the bit set

Review Comment:
   Can we update the comments to more accurately reflect the new logic? We don't really start building the bit set here.



##########
lucene/core/src/java/org/apache/lucene/search/MultiTermQueryConstantScoreWrapper.java:
##########
@@ -211,32 +218,39 @@ private WeightOrDocIdSet rewrite(LeafReaderContext context) throws IOException {
                 new ConstantScoreQuery(
                     new TermQuery(new Term(query.field, termsEnum.term()), termStates));
             Weight weight = searcher.rewrite(q).createWeight(searcher, scoreMode, score());
-            return new WeightOrDocIdSet(weight);
+            return new WeightOrDocIdSetIterator(weight);
           }
-          builder.add(docs);
+          PostingsEnum dropped = highFrequencyTerms.insertWithOverflow(postings);
+          otherTerms.add(dropped);
+          postings = dropped;
         } while (termsEnum.next() != null);
 
-        return new WeightOrDocIdSet(builder.build());
+        List<DocIdSetIterator> disis = new ArrayList<>(highFrequencyTerms.size() + 1);
+        for (PostingsEnum pe : highFrequencyTerms) {
+          disis.add(pe);
+        }
+        disis.add(otherTerms.build().iterator());
+        DisiPriorityQueue subs = new DisiPriorityQueue(disis.size());
+        for (DocIdSetIterator disi : disis) {
+          subs.add(new DisiWrapper(disi));
+        }

Review Comment:
   Also, it would be nice if we could get direct access to the underlying array backing `highFrequencyTerms`, then we could leverage `DisiPriorityQueue#addAll` to heapify everything at once.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] gsmiller merged pull request #12055: Better skipping for multi-term queries with a FILTER rewrite.

Posted by "gsmiller (via GitHub)" <gi...@apache.org>.
gsmiller merged PR #12055:
URL: https://github.com/apache/lucene/pull/12055


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] rmuir commented on pull request #12055: Better skipping for multi-term queries with a FILTER rewrite.

Posted by GitBox <gi...@apache.org>.
rmuir commented on PR #12055:
URL: https://github.com/apache/lucene/pull/12055#issuecomment-1370211309

   > Thanks @rmuir, you're right. The way rewrite methods are decoupled from the core protocol term intersection logic in MultiTermQuery did slip my mind. The curse of having a poor memory for things I'm not actively working on. But yes, agreed and I take back that part of my comment.
   
   I didn't really mean you specifically. I meant in the file itself. The `FILTER` rewrite is in danger of becoming a "wonder-do-it-all-rewrite-method" trying to solve all possible situations.
   
   But it doesn't need to be the default for all queries. I think the skipping logic proposed here makes total sense for stuff like WildcardQuery, PrefixQuery, RegexQuery, etc.
   
   But for TermInSetQuery? I think the rewritemethod there need not optimize for "dense" terms at all, but could focus on "PK lookups" (e.g. lots of terms with one posting) since we all know, people use it as a `JOIN` :( It could just be a default for the query, someone could always setRewriteMethod to a different one.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] gsmiller commented on pull request #12055: Better skipping for multi-term queries with a FILTER rewrite.

Posted by "gsmiller (via GitHub)" <gi...@apache.org>.
gsmiller commented on PR #12055:
URL: https://github.com/apache/lucene/pull/12055#issuecomment-1435469343

   @rmuir thanks for the feedback. Let me see if I can respond to all of it here:
   
   > postings reuse problems
   
   Can you help me with where you see this as a problem? I went through the code and didn't see any obvious mistakes. I'm sure I'm missing it? The only tricky one is in the do/while loop in MTQCSW starting on line 214. But I think this is safe since we reassign `postings` to `dropped` at the end of the loop, so `postings` should be safe to reuse at that point. Again, not trying to challenge this feedback, just trying to understand what I'm overlooking.
   
   > objection to modifying MTQCSW rewrite in general / learn from "constant score auto rewrite" history
   
   Thanks for raising this. I wasn't aware of the history here so I'll have to go do some digging. I'll take that on as a next step.
   
   > have TermInSetQuery extend MultiTermQuery
   
   I'd looked at this in the past and had found performance regressions with this approach, which I believe were caused by the different term intersection approach (seekCeil vs. seekExact). I don't recall the amount of regression though. I'll see if I can dig that up. Hopefully I documented it somewhere. But if not, I'll try to re-benchmark. Maybe the regressions don't justify the separate implementation.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] gsmiller commented on pull request #12055: Better skipping for multi-term queries with a FILTER rewrite.

Posted by "gsmiller (via GitHub)" <gi...@apache.org>.
gsmiller commented on PR #12055:
URL: https://github.com/apache/lucene/pull/12055#issuecomment-1435802423

   As one more update, I just opened a new PR to have `TermInSetQuery` extend `MultiTermQuery` instead of having its own custom implementation (#12156). I was able to do some benchmarking and it looks reasonable to me. There are probably a couple of rough-edges to sort out on the PR, but that would enable us to remove the `TermInSetQuery` changes present in this PR.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] gsmiller commented on pull request #12055: Better skipping for multi-term queries with a FILTER rewrite.

Posted by "gsmiller (via GitHub)" <gi...@apache.org>.
gsmiller commented on PR #12055:
URL: https://github.com/apache/lucene/pull/12055#issuecomment-1442363428

   Updated this PR in the following ways:
   1. I've removed the changes to `TermInSetQuery` with the plan being to have `TermInSetQuery` extend from `MultiTermQuery` to get these benefits.
   2. I've added the proposed logic as a _new_ `RewriteMethod` and left the existing ones unchanged.
   3. I've made the new `RewriteMethod` ("blended") the default for non-scoring cases. Users could override this behavior with the "old" constant scoring rewrite method if necessary.
   4. I've added/updated tests.
   
   I believe this addresses all the previous feedback and is ready for another look.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] rmuir commented on a diff in pull request #12055: Better skipping for multi-term queries with a FILTER rewrite.

Posted by GitBox <gi...@apache.org>.
rmuir commented on code in PR #12055:
URL: https://github.com/apache/lucene/pull/12055#discussion_r1059804843


##########
lucene/core/src/java/org/apache/lucene/search/MultiTermQueryConstantScoreWrapper.java:
##########
@@ -183,23 +182,31 @@ private WeightOrDocIdSet rewrite(LeafReaderContext context) throws IOException {
           }
           Query q = new ConstantScoreQuery(bq.build());
           final Weight weight = searcher.rewrite(q).createWeight(searcher, scoreMode, score());
-          return new WeightOrDocIdSet(weight);
+          return new WeightOrDocIdSetIterator(weight);
         }
 
         // Too many terms: go back to the terms we already collected and start building the bit set
-        DocIdSetBuilder builder = new DocIdSetBuilder(context.reader().maxDoc(), terms);
+        PriorityQueue<PostingsEnum> highFrequencyTerms =
+            new PriorityQueue<PostingsEnum>(collectedTerms.size()) {
+              @Override
+              protected boolean lessThan(PostingsEnum a, PostingsEnum b) {
+                return a.cost() < b.cost();

Review Comment:
   `pq.insertWithOverflow` uses `!lessThan()` in its code. So I'm worried about this PQ behaving stupidly on ties with the same `docFreq`.
   
   Is there a simple tiebreaker we can use (even synthetic such as `int termId`) so that such ties don't enter the PQ? I'm just concerned about "collect remaining terms" piece for cases where there are jazillions of terms. should also allow the IO to be a bit more sequential in such cases, rather than constantly replacing top of PQ with more ties?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] gsmiller commented on pull request #12055: Better skipping for multi-term queries with a FILTER rewrite.

Posted by GitBox <gi...@apache.org>.
gsmiller commented on PR #12055:
URL: https://github.com/apache/lucene/pull/12055#issuecomment-1370090964

   > we can have more than one or two of these rewritemethods, and use different ones for different queries: this fact seems to have been forgotten here
   
   Thanks @rmuir, you're right. The way rewrite methods are decoupled from the core protocol term intersection logic in `MultiTermQuery` did slip my mind. The curse of having a poor memory for things I'm not actively working on. But yes, agreed and I take back that part of my comment.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] rmuir commented on pull request #12055: Better skipping for multi-term queries with a FILTER rewrite.

Posted by "rmuir (via GitHub)" <gi...@apache.org>.
rmuir commented on PR #12055:
URL: https://github.com/apache/lucene/pull/12055#issuecomment-1435411850

   also dropping the postings reuse is going to cause big performance degradation for many situations. For example with NIOFSDirectory, new postings reader means indexinput.clone() calls, buffer refills, etc etc. It translates into real I/O
   
   But the reuse isn't dropped in all cases, and we're still putting "reused enums" into things like PQs. Sorry, none of this looks even half-way baked at all.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] gsmiller commented on pull request #12055: Better skipping for multi-term queries with a FILTER rewrite.

Posted by "gsmiller (via GitHub)" <gi...@apache.org>.
gsmiller commented on PR #12055:
URL: https://github.com/apache/lucene/pull/12055#issuecomment-1441957885

   @jpountz:
   
   > Do your queries early terminate somehow?
   
   Yes, our queries do terminate early. +1 to this explaining most of the large difference between the approaches in our case. 
   
   > OK I think I better understand the concern around the slowness with NIOFSDirectory now [...]
   
   Interesting. Thanks for describing the issue a bit more. I'll try to spend a little time digging into this as well to make sure I understand the concern a bit better as well.
   
   I'll post another revision here soon that introduces this rewrite logic as a new rewrite method (as discussed) and will also remove the changes to `TermInSetQuery` for now in favor of having `TermInSetQuery` extend `MultiTermQuery` (#12156).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] gsmiller commented on a diff in pull request #12055: Better skipping for multi-term queries with a FILTER rewrite.

Posted by "gsmiller (via GitHub)" <gi...@apache.org>.
gsmiller commented on code in PR #12055:
URL: https://github.com/apache/lucene/pull/12055#discussion_r1119345563


##########
lucene/core/src/java/org/apache/lucene/search/DisiWrapper.java:
##########
@@ -57,4 +56,13 @@ public DisiWrapper(Scorer scorer) {
       matchCost = 0f;
     }
   }
+
+  public DisiWrapper(DocIdSetIterator iterator) {
+    this.scorer = null;
+    this.iterator = iterator;
+    this.cost = iterator.cost();
+    this.twoPhaseView = null;
+    this.approximation = iterator;
+    this.matchCost = 0;
+  }

Review Comment:
   Yeah, that's reasonable I think. We'd be doing a little unnecessary object creation with the "dummy" scorer wrapper, but that's alright. I'll work this suggestion in. Honestly, I'm not super happy with either option, but I think this is a good suggestion for now.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] rmuir commented on a diff in pull request #12055: Better skipping for multi-term queries with a FILTER rewrite.

Posted by "rmuir (via GitHub)" <gi...@apache.org>.
rmuir commented on code in PR #12055:
URL: https://github.com/apache/lucene/pull/12055#discussion_r1110449800


##########
lucene/core/src/java/org/apache/lucene/search/MultiTermQueryConstantScoreWrapper.java:
##########
@@ -183,23 +182,31 @@ private WeightOrDocIdSet rewrite(LeafReaderContext context) throws IOException {
           }
           Query q = new ConstantScoreQuery(bq.build());
           final Weight weight = searcher.rewrite(q).createWeight(searcher, scoreMode, score());
-          return new WeightOrDocIdSet(weight);
+          return new WeightOrDocIdSetIterator(weight);
         }
 
         // Too many terms: go back to the terms we already collected and start building the bit set
-        DocIdSetBuilder builder = new DocIdSetBuilder(context.reader().maxDoc(), terms);
+        PriorityQueue<PostingsEnum> highFrequencyTerms =
+            new PriorityQueue<PostingsEnum>(collectedTerms.size()) {
+              @Override
+              protected boolean lessThan(PostingsEnum a, PostingsEnum b) {
+                return a.cost() < b.cost();

Review Comment:
   dawid fixed the issue: `!lessThan()` is no longer used in the logic, and the "churn" should be avoided I think. https://github.com/apache/lucene/commit/ba9fee502b0c18fda312da55c6304ac8a463f509
   
   I was looking at outdated version of this file.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] jpountz commented on pull request #12055: Better skipping for multi-term queries with a FILTER rewrite.

Posted by "jpountz (via GitHub)" <gi...@apache.org>.
jpountz commented on PR #12055:
URL: https://github.com/apache/lucene/pull/12055#issuecomment-1438404538

   OK I think I better understand the concern around the slowness with NIOFSDirectory now. With a single PostingsEnum getting reused, a single BufferedIndexInput refill would buffer postings lists for multiple terms at once, so next calls to `Terms#postings` and `PostingsEnum#nextDoc` would read from the buffer.
   
   We're losing this property by reusing across up to 16+1 PostingsEnums, though the fact that the top 16 postings enums that have the highest doc freq should get relatively stable after some time, ie. most terms would have a lower doc freq, plus the threshold on docFreq=512 should help too, as we're always reusing the last postings enum in that case, and postings lists that have more matches would likely need to refill their buffer multiple times to read all matching docs anyway.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] gsmiller commented on pull request #12055: Better skipping for multi-term queries with a FILTER rewrite.

Posted by "gsmiller (via GitHub)" <gi...@apache.org>.
gsmiller commented on PR #12055:
URL: https://github.com/apache/lucene/pull/12055#issuecomment-1434825645

   Thanks @jpountz. I pushed my `TermInSetQuery` changes, but still need to address a couple of Robert's comments on the original implementation. I'll update here when I think it's ready for a look. Thanks for the offer to review!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] rmuir commented on a diff in pull request #12055: Better skipping for multi-term queries with a FILTER rewrite.

Posted by GitBox <gi...@apache.org>.
rmuir commented on code in PR #12055:
URL: https://github.com/apache/lucene/pull/12055#discussion_r1059807197


##########
lucene/core/src/java/org/apache/lucene/search/MultiTermQueryConstantScoreWrapper.java:
##########
@@ -183,23 +182,31 @@ private WeightOrDocIdSet rewrite(LeafReaderContext context) throws IOException {
           }
           Query q = new ConstantScoreQuery(bq.build());
           final Weight weight = searcher.rewrite(q).createWeight(searcher, scoreMode, score());
-          return new WeightOrDocIdSet(weight);
+          return new WeightOrDocIdSetIterator(weight);
         }
 
         // Too many terms: go back to the terms we already collected and start building the bit set
-        DocIdSetBuilder builder = new DocIdSetBuilder(context.reader().maxDoc(), terms);
+        PriorityQueue<PostingsEnum> highFrequencyTerms =
+            new PriorityQueue<PostingsEnum>(collectedTerms.size()) {
+              @Override
+              protected boolean lessThan(PostingsEnum a, PostingsEnum b) {
+                return a.cost() < b.cost();
+              }
+            };
+        DocIdSetBuilder otherTerms = new DocIdSetBuilder(context.reader().maxDoc(), terms);
         if (collectedTerms.isEmpty() == false) {
           TermsEnum termsEnum2 = terms.iterator();
           for (TermAndState t : collectedTerms) {
             termsEnum2.seekExact(t.term, t.state);
-            docs = termsEnum2.postings(docs, PostingsEnum.NONE);
-            builder.add(docs);
+            PostingsEnum postings = termsEnum2.postings(null, PostingsEnum.NONE);
+            highFrequencyTerms.add(postings);

Review Comment:
   Rather than just blindly add terms to the PQ, should we just have a constant mininum `cost` threshold (e.g. 256, 1024, whatever) to even consider it? otherwise go directly to `otherTerms`. The skipping stuff isn't going to be useful for the long-tail of low-cost terms (the majority, if we are thinking zipf). Ideally we wouldnt waste our time unless it has skipdata? And we want to be careful about the performance of these queries when there are jazillions of jazillions of matching low-frequency terms.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] jpountz commented on pull request #12055: Better skipping for multi-term queries with a FILTER rewrite.

Posted by GitBox <gi...@apache.org>.
jpountz commented on PR #12055:
URL: https://github.com/apache/lucene/pull/12055#issuecomment-1368423502

   For the record, the reason why we're seeing a speedup here is because prefix and wildcard queries produce constant scores, so the query can early terminate once 1,000 hits have been collected. Before the change, we would always create a bitset of all matches, and that would force evaluating the query against the entire doc ID space up-front. Evaluation is more lazy now, with only low-frequency postings being evaluated up-front and high-frequency postings being evaulated lazily.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] gsmiller commented on pull request #12055: Better skipping for multi-term queries with a FILTER rewrite.

Posted by GitBox <gi...@apache.org>.
gsmiller commented on PR #12055:
URL: https://github.com/apache/lucene/pull/12055#issuecomment-1369212139

   +1 to this approach in general.
   
   I do wonder if the distribution assumptions generally hold if we start looking at "term in set" queries though. That's sort of irrelevant right now since that implementation is still separate (`TermInSetQuery`), but this may add another reason to keep that implementation separate going forward. I think the difference with "term in set" is that it may not follow natural language distributions in general, while the current MultiTermQuery implementations most likely do.
   
   I also wonder if we could be more aggressive with the number of clauses we build into a `BooleanQuery` if we leverage the short-circuiting idea in #11928. Might be a nice fit for this "filtering" case.
   
   Just a couple thoughts but certainly nothing blocking or anything that needs to be included as part of this PR. Just wanted to toss them out there.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] gsmiller commented on pull request #12055: Better skipping for multi-term queries with a FILTER rewrite.

Posted by "gsmiller (via GitHub)" <gi...@apache.org>.
gsmiller commented on PR #12055:
URL: https://github.com/apache/lucene/pull/12055#issuecomment-1447250930

   > Apparently I can't approve since I opened the PR initially, but the change looks good to me!
   
   Ha! Thanks for having another look. This expanded a little in scope by keeping the "old" behavior along with the new, so thanks for having patience with it. I made a couple minor changes based on your feedback and am going to go ahead and merge given your "approval." If (or anyone else) have additional comments, I'm happy to continue addressing them. Thanks again.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] gsmiller commented on a diff in pull request #12055: Better skipping for multi-term queries with a FILTER rewrite.

Posted by "gsmiller (via GitHub)" <gi...@apache.org>.
gsmiller commented on code in PR #12055:
URL: https://github.com/apache/lucene/pull/12055#discussion_r1109995482


##########
lucene/core/src/java/org/apache/lucene/search/MultiTermQueryConstantScoreWrapper.java:
##########
@@ -183,23 +182,31 @@ private WeightOrDocIdSet rewrite(LeafReaderContext context) throws IOException {
           }
           Query q = new ConstantScoreQuery(bq.build());
           final Weight weight = searcher.rewrite(q).createWeight(searcher, scoreMode, score());
-          return new WeightOrDocIdSet(weight);
+          return new WeightOrDocIdSetIterator(weight);
         }
 
         // Too many terms: go back to the terms we already collected and start building the bit set
-        DocIdSetBuilder builder = new DocIdSetBuilder(context.reader().maxDoc(), terms);
+        PriorityQueue<PostingsEnum> highFrequencyTerms =
+            new PriorityQueue<PostingsEnum>(collectedTerms.size()) {
+              @Override
+              protected boolean lessThan(PostingsEnum a, PostingsEnum b) {
+                return a.cost() < b.cost();

Review Comment:
   Looking at the internals of `insertWithOverflow`, I don't think we'll churn the smallest term here. If a later-visited term has the same cost as the top of the PQ, it will be rejected / immediately returned by `insertWithOverflow` and subsequently built into the bitset, which I think is what we want (to avoid unnecessary seeking later). Does that sound right? It's possible I'm missing something or misunderstanding the concern.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] gsmiller commented on a diff in pull request #12055: Better skipping for multi-term queries with a FILTER rewrite.

Posted by "gsmiller (via GitHub)" <gi...@apache.org>.
gsmiller commented on code in PR #12055:
URL: https://github.com/apache/lucene/pull/12055#discussion_r1119197629


##########
lucene/queryparser/src/java/org/apache/lucene/queryparser/flexible/standard/CommonQueryParserConfiguration.java:
##########
@@ -55,12 +58,14 @@ public interface CommonQueryParserConfiguration {
   public boolean getEnablePositionIncrements();
 
   /**
-   * By default, it uses {@link MultiTermQuery#CONSTANT_SCORE_REWRITE} when creating a prefix,
-   * wildcard and range queries. This implementation is generally preferable because it a) Runs
-   * faster b) Does not have the scarcity of terms unduly influence score c) avoids any {@link
-   * TooManyListenersException} exception. However, if your application really needs to use the
-   * old-fashioned boolean queries expansion rewriting and the above points are not relevant then
-   * use this change the rewrite method.
+   * By default QueryParser uses {@link
+   * org.apache.lucene.search.MultiTermQuery#CONSTANT_SCORE_BLENDED_REWRITE} when creating a {@link
+   * PrefixQuery}, {@link WildcardQuery} or {@link TermRangeQuery}. This implementation is generally
+   * preferable because it a) Runs faster b) Does not have the scarcity of terms unduly influence
+   * score c) avoids any {@link org.apache.lucene.search.IndexSearcher.TooManyClauses} exception.
+   * However, if your application really needs to use the old-fashioned {@link BooleanQuery}

Review Comment:
   Good idea, thanks!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] gsmiller commented on a diff in pull request #12055: Better skipping for multi-term queries with a FILTER rewrite.

Posted by "gsmiller (via GitHub)" <gi...@apache.org>.
gsmiller commented on code in PR #12055:
URL: https://github.com/apache/lucene/pull/12055#discussion_r1119215684


##########
lucene/CHANGES.txt:
##########
@@ -277,6 +277,9 @@ Improvements
 * GITHUB#12070: Compound file creation is no longer subject to merge throttling.
   (Adrien Grand)
 
+* GITHUB#12055: Better skipping support for multi-term queries that have a
+  FILTER rewrite. (Adrien Grand, Greg Miller)

Review Comment:
   Thanks, I missed this. I also realized the entry is in the 9.5 block. Updating the wording and moving under 9.6.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] gsmiller commented on pull request #12055: Better skipping for multi-term queries with a FILTER rewrite.

Posted by "gsmiller (via GitHub)" <gi...@apache.org>.
gsmiller commented on PR #12055:
URL: https://github.com/apache/lucene/pull/12055#issuecomment-1451951125

   Nightly bench runs have refreshed and show some nice improvements. I'll try to add some annotations soon.
   
   * https://home.apache.org/~mikemccand/lucenebench/Prefix3.html
   * https://home.apache.org/~mikemccand/lucenebench/Wildcard.html


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] gsmiller commented on pull request #12055: Better skipping for multi-term queries with a FILTER rewrite.

Posted by "gsmiller (via GitHub)" <gi...@apache.org>.
gsmiller commented on PR #12055:
URL: https://github.com/apache/lucene/pull/12055#issuecomment-1434796647

   @jpountz I've found that applying this same idea to `TermInSetQuery` is really helpful for performance in our use-cases at Amazon product search. It's nice because the behavior of `TermInSetQuery` gradually changes from a standard boolean query to an up-front rewrite approach. We've had some trouble in the past with the hard flip-over in the behavior (e.g., one more term in a disjunction leads to completely different behavior).
   
   Do you mind if I extend this PR to include a similar change to `TermInSetQuery`? Not sure where you are with this work, or if you were planning to pick it back up?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] rmuir commented on pull request #12055: Better skipping for multi-term queries with a FILTER rewrite.

Posted by "rmuir (via GitHub)" <gi...@apache.org>.
rmuir commented on PR #12055:
URL: https://github.com/apache/lucene/pull/12055#issuecomment-1435417952

   I suggest an easy path to success here:
   * Keep the simple filter rewrite without crazy boolean auto-optimizations, that does what the javadocs in MultiTermQuery says it does and nothing more. It will be useful as a bailout, e.g. for NIOFS users, in case the crazy heuristics dont work well. It will also make testing easy: there have been iterations here that are clearly incorrect and yet no tests fail. 
   * add a new auto rewrite, but first look at why it failed and was removed before and try to learn from those lessons. It is ok, for it to be the new "default" rewrite method, once it it is baked.
   * fix TermInSetQuery to extend MultiTermQuery so we don't duplicate the horror-show twice.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] jpountz commented on pull request #12055: Better skipping for multi-term queries with a FILTER rewrite.

Posted by "jpountz (via GitHub)" <gi...@apache.org>.
jpountz commented on PR #12055:
URL: https://github.com/apache/lucene/pull/12055#issuecomment-1438199152

   Thanks Greg for sharing more info about how it helped on Amazon Product search. Do your queries early terminate somehow (in which case I'd expect this change to help the most since it can skip evaluating the tail of long postings)?
   
   I like the idea of having multiple rewrite methods and possibly an `auto` method that tries to guess a sensible rewrite method given index statistics. It helps keep things simple without having a single rewrite method that needs to be heroic.
   
   Reuse of postings enums looks ok to me, we could improve naming and add more comments to make it more obviously ok, but we only create up to 16 postings enums from scratch, reuse otherwise, and make sure to never reuse a postings enum that is in the priority queue. The threshold of 16 looks conservative to me so I wouldn't worry about NIOFSDirectory, if we have a problem with NIOFSDirectory and this threshold of 16 then many simple boolean queries have problems too, which I don't think is the case in practice? The threshold on the minimum document frequency should also help here, e.g. a near-PK field would only accumulate hits into a DocIdSetBuilder and not pull postings enums?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org