Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2023/01/03 19:10:34 UTC

[GitHub] [lucene] gsmiller commented on a diff in pull request #12055: Better skipping for multi-term queries with a FILTER rewrite.

gsmiller commented on code in PR #12055:
URL: https://github.com/apache/lucene/pull/12055#discussion_r1060869958


##########
lucene/core/src/java/org/apache/lucene/search/MultiTermQueryConstantScoreWrapper.java:
##########
@@ -183,23 +182,31 @@ private WeightOrDocIdSet rewrite(LeafReaderContext context) throws IOException {
           }
           Query q = new ConstantScoreQuery(bq.build());
           final Weight weight = searcher.rewrite(q).createWeight(searcher, scoreMode, score());
-          return new WeightOrDocIdSet(weight);
+          return new WeightOrDocIdSetIterator(weight);
         }
 
         // Too many terms: go back to the terms we already collected and start building the bit set
-        DocIdSetBuilder builder = new DocIdSetBuilder(context.reader().maxDoc(), terms);
+        PriorityQueue<PostingsEnum> highFrequencyTerms =
+            new PriorityQueue<PostingsEnum>(collectedTerms.size()) {
+              @Override
+              protected boolean lessThan(PostingsEnum a, PostingsEnum b) {
+                return a.cost() < b.cost();
+              }
+            };
+        DocIdSetBuilder otherTerms = new DocIdSetBuilder(context.reader().maxDoc(), terms);

Review Comment:
   minor: Could we define `otherTerms` closer to where it first gets used? (e.g., L:207)



##########
lucene/core/src/java/org/apache/lucene/search/MultiTermQueryConstantScoreWrapper.java:
##########
@@ -211,32 +218,39 @@ private WeightOrDocIdSet rewrite(LeafReaderContext context) throws IOException {
                 new ConstantScoreQuery(
                     new TermQuery(new Term(query.field, termsEnum.term()), termStates));
             Weight weight = searcher.rewrite(q).createWeight(searcher, scoreMode, score());
-            return new WeightOrDocIdSet(weight);
+            return new WeightOrDocIdSetIterator(weight);
           }
-          builder.add(docs);
+          PostingsEnum dropped = highFrequencyTerms.insertWithOverflow(postings);
+          otherTerms.add(dropped);
+          postings = dropped;
         } while (termsEnum.next() != null);
 
-        return new WeightOrDocIdSet(builder.build());
+        List<DocIdSetIterator> disis = new ArrayList<>(highFrequencyTerms.size() + 1);
+        for (PostingsEnum pe : highFrequencyTerms) {
+          disis.add(pe);
+        }
+        disis.add(otherTerms.build().iterator());
+        DisiPriorityQueue subs = new DisiPriorityQueue(disis.size());
+        for (DocIdSetIterator disi : disis) {
+          subs.add(new DisiWrapper(disi));
+        }

Review Comment:
   Maybe I'm overlooking something silly, but can't we just do one pass like this?
   
   ```suggestion
   DisiPriorityQueue subs = new DisiPriorityQueue(highFrequencyTerms.size() + 1);
           for (DocIdSetIterator disi : highFrequencyTerms) {
             subs.add(new DisiWrapper(disi));
           }
           subs.add(new DisiWrapper(otherTerms.build().iterator()));
   ```



##########
lucene/core/src/java/org/apache/lucene/search/MultiTermQueryConstantScoreWrapper.java:
##########
@@ -183,23 +182,31 @@ private WeightOrDocIdSet rewrite(LeafReaderContext context) throws IOException {
           }
           Query q = new ConstantScoreQuery(bq.build());
           final Weight weight = searcher.rewrite(q).createWeight(searcher, scoreMode, score());
-          return new WeightOrDocIdSet(weight);
+          return new WeightOrDocIdSetIterator(weight);
         }
 
         // Too many terms: go back to the terms we already collected and start building the bit set

Review Comment:
   Can we update the comments to more accurately reflect the new logic? We don't really start building the bit set here.



##########
lucene/core/src/java/org/apache/lucene/search/MultiTermQueryConstantScoreWrapper.java:
##########
@@ -211,32 +218,39 @@ private WeightOrDocIdSet rewrite(LeafReaderContext context) throws IOException {
                 new ConstantScoreQuery(
                     new TermQuery(new Term(query.field, termsEnum.term()), termStates));
             Weight weight = searcher.rewrite(q).createWeight(searcher, scoreMode, score());
-            return new WeightOrDocIdSet(weight);
+            return new WeightOrDocIdSetIterator(weight);
           }
-          builder.add(docs);
+          PostingsEnum dropped = highFrequencyTerms.insertWithOverflow(postings);
+          otherTerms.add(dropped);
+          postings = dropped;
         } while (termsEnum.next() != null);
 
-        return new WeightOrDocIdSet(builder.build());
+        List<DocIdSetIterator> disis = new ArrayList<>(highFrequencyTerms.size() + 1);
+        for (PostingsEnum pe : highFrequencyTerms) {
+          disis.add(pe);
+        }
+        disis.add(otherTerms.build().iterator());
+        DisiPriorityQueue subs = new DisiPriorityQueue(disis.size());
+        for (DocIdSetIterator disi : disis) {
+          subs.add(new DisiWrapper(disi));
+        }

Review Comment:
   Also, it would be nice if we could get direct access to the underlying array backing `highFrequencyTerms`, then we could leverage `DisiPriorityQueue#addAll` to heapify everything at once.
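   For reference, here is a self-contained sketch of the trade-off being described, using `java.util.PriorityQueue` as a stand-in for Lucene's internal queues (the `HeapifyDemo` class name and the sample cost values are illustrative only, not from the PR): constructing a heap from a pre-filled collection heapifies all elements in one pass, which is roughly what direct array access plus `DisiPriorityQueue#addAll` would buy over wrapper-by-wrapper `add` calls.

   ```java
   import java.util.ArrayList;
   import java.util.List;
   import java.util.PriorityQueue;

   public class HeapifyDemo {
     // Poll the queue empty, collecting elements in heap order.
     static List<Integer> drain(PriorityQueue<Integer> pq) {
       List<Integer> out = new ArrayList<>();
       while (!pq.isEmpty()) {
         out.add(pq.poll());
       }
       return out;
     }

     public static void main(String[] args) {
       // Stand-ins for per-term iterator costs.
       List<Integer> costs = List.of(42, 7, 19, 3, 88);

       // One-shot heapify from the backing collection: O(n).
       PriorityQueue<Integer> bulk = new PriorityQueue<>(costs);

       // Element-by-element inserts, O(n log n) total: what the
       // two-pass wrap-and-add loop in the diff effectively does.
       PriorityQueue<Integer> oneByOne = new PriorityQueue<>();
       for (int c : costs) {
         oneByOne.add(c);
       }

       // Both approaches yield the same ascending poll order.
       System.out.println(drain(bulk));     // prints [3, 7, 19, 42, 88]
       System.out.println(drain(oneByOne)); // prints [3, 7, 19, 42, 88]
     }
   }
   ```

   The observable behavior is identical either way; the difference is only the cost of building the heap, which is why exposing the backing array is attractive here.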



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

