You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2022/08/08 17:34:01 UTC

[GitHub] [lucene] gsmiller opened a new pull request, #1062: Optimize TermInSetQuery for terms that match all docs in a segment

gsmiller opened a new pull request, #1062:
URL: https://github.com/apache/lucene/pull/1062

   ### Description (or a Jira issue link if you have one)
   
   This change introduces an optimization to `TermInSetQuery` when a term is present that matches all docs in a segment.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] LuXugang commented on a diff in pull request #1062: Optimize TermInSetQuery for terms that match all docs in a segment

Posted by GitBox <gi...@apache.org>.

LuXugang commented on code in PR #1062:
URL: https://github.com/apache/lucene/pull/1062#discussion_r946845549


##########
lucene/core/src/java/org/apache/lucene/search/TermInSetQuery.java:
##########
@@ -363,6 +370,29 @@ public boolean isCacheable(LeafReaderContext ctx) {
         // sets.
         return ramBytesUsed() <= RamUsageEstimator.QUERY_DEFAULT_RAM_BYTES_USED;
       }
+
+      static final class MatchAllDocIdSet extends DocIdSet {
+        private final int size;

Review Comment:
   Hi @gsmiller , I saw there was already an `EMPTY` implementation in class `DocIdSet `, and this implementation is commonly useful, should we move this implementation to  class `DocIdSet `  and name it like `all` or the other name ?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] LuXugang commented on a diff in pull request #1062: Optimize TermInSetQuery for terms that match all docs in a segment

Posted by GitBox <gi...@apache.org>.

LuXugang commented on code in PR #1062:
URL: https://github.com/apache/lucene/pull/1062#discussion_r950307252


##########
lucene/core/src/java/org/apache/lucene/search/TermInSetQuery.java:
##########
@@ -293,6 +296,9 @@ private WeightOrDocIdSet rewrite(LeafReaderContext context) throws IOException {
               builder.add(docs);

Review Comment:
   Hi @gsmiller , I mean the `termsEnum` in line 295 did not involved with the `reader.maxDoc() == termsEnum.docFreq()` calculation.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] gsmiller commented on a diff in pull request #1062: Optimize TermInSetQuery for terms that match all docs in a segment

Posted by GitBox <gi...@apache.org>.

gsmiller commented on code in PR #1062:
URL: https://github.com/apache/lucene/pull/1062#discussion_r951698471


##########
lucene/core/src/java/org/apache/lucene/search/TermInSetQuery.java:
##########
@@ -293,6 +296,9 @@ private WeightOrDocIdSet rewrite(LeafReaderContext context) throws IOException {
               builder.add(docs);

Review Comment:
   Oh, I get what you're saying now. Thanks @LuXugang! If I'm understanding correctly (I think I am now), you're pointing out that we can further optimize; I'm missing a check on the current TermEnum when dropping into this branch. I'll update the PR shortly. Thanks again!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] LuXugang commented on a diff in pull request #1062: Optimize TermInSetQuery for terms that match all docs in a segment

Posted by GitBox <gi...@apache.org>.

LuXugang commented on code in PR #1062:
URL: https://github.com/apache/lucene/pull/1062#discussion_r950307252


##########
lucene/core/src/java/org/apache/lucene/search/TermInSetQuery.java:
##########
@@ -293,6 +296,9 @@ private WeightOrDocIdSet rewrite(LeafReaderContext context) throws IOException {
               builder.add(docs);

Review Comment:
   Hi @gsmiller , I mean the termsEnum in line 295 did not involved with the reader.maxDoc() == termsEnum.docFreq() calculation



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] gsmiller commented on a diff in pull request #1062: Optimize TermInSetQuery for terms that match all docs in a segment

Posted by GitBox <gi...@apache.org>.

gsmiller commented on code in PR #1062:
URL: https://github.com/apache/lucene/pull/1062#discussion_r950251125


##########
lucene/core/src/java/org/apache/lucene/search/TermInSetQuery.java:
##########
@@ -293,6 +296,9 @@ private WeightOrDocIdSet rewrite(LeafReaderContext context) throws IOException {
               builder.add(docs);

Review Comment:
   @LuXugang I'm not sure I follow your question. `builder.add(docs)` will add all docs for the term's `PostingsEnum` to the `DocIdSetBuilder`. Or are you referring to the initialization of the builder here: `builder = new DocIdSetBuilder(reader.maxDoc(), terms);`? In this initialization, `reader.maxDoc()` is used to specify the number of possible docids (used to size the underlying bitset). `reader.maxDoc()` returns "one greater than the largest possible document number" (according to javadoc), so this should be correct.
   
   Sorry if I'm misunderstanding your question.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] LuXugang commented on a diff in pull request #1062: Optimize TermInSetQuery for terms that match all docs in a segment

Posted by GitBox <gi...@apache.org>.

LuXugang commented on code in PR #1062:
URL: https://github.com/apache/lucene/pull/1062#discussion_r949980320


##########
lucene/core/src/java/org/apache/lucene/search/TermInSetQuery.java:
##########
@@ -293,6 +296,9 @@ private WeightOrDocIdSet rewrite(LeafReaderContext context) throws IOException {
               builder.add(docs);

Review Comment:
   It seems like this term's `reader.maxDoc() == termsEnum.docFreq()` was skipped？



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] gsmiller commented on pull request #1062: Optimize TermInSetQuery for terms that match all docs in a segment

Posted by GitBox <gi...@apache.org>.

gsmiller commented on PR #1062:
URL: https://github.com/apache/lucene/pull/1062#issuecomment-1224477977

   Thanks @LuXugang! Merged and backported.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] gsmiller commented on a diff in pull request #1062: Optimize TermInSetQuery for terms that match all docs in a segment

Posted by GitBox <gi...@apache.org>.

gsmiller commented on code in PR #1062:
URL: https://github.com/apache/lucene/pull/1062#discussion_r948336077


##########
lucene/core/src/java/org/apache/lucene/search/TermInSetQuery.java:
##########
@@ -363,6 +370,29 @@ public boolean isCacheable(LeafReaderContext ctx) {
         // sets.
         return ramBytesUsed() <= RamUsageEstimator.QUERY_DEFAULT_RAM_BYTES_USED;
       }
+
+      static final class MatchAllDocIdSet extends DocIdSet {
+        private final int size;

Review Comment:
   Thanks for the suggestion @LuXugang. Yeah, I think exposing an `ALL` `DocIdSet` for general use is reasonable. I'll update the PR.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] LuXugang commented on a diff in pull request #1062: Optimize TermInSetQuery for terms that match all docs in a segment

Posted by GitBox <gi...@apache.org>.

LuXugang commented on code in PR #1062:
URL: https://github.com/apache/lucene/pull/1062#discussion_r950306914


##########
lucene/core/src/java/org/apache/lucene/search/TermInSetQuery.java:
##########
@@ -293,6 +296,9 @@ private WeightOrDocIdSet rewrite(LeafReaderContext context) throws IOException {
               builder.add(docs);

Review Comment:
   Hi @gsmiller , I mean the `termsEnum` in line 295 did not involved with the `reader.maxDoc() == termsEnum.docFreq() ` calculation



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] gsmiller merged pull request #1062: Optimize TermInSetQuery for terms that match all docs in a segment

Posted by GitBox <gi...@apache.org>.

gsmiller merged PR #1062:
URL: https://github.com/apache/lucene/pull/1062


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org