You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by "gsmiller (via GitHub)" <gi...@apache.org> on 2023/02/19 00:37:40 UTC

[GitHub] [lucene] gsmiller opened a new pull request, #12156: Remove custom TermInSetQuery implementation in favor of extending MultiTermQuery

gsmiller opened a new pull request, #12156:
URL: https://github.com/apache/lucene/pull/12156

   ## Description
   
   `TermInSetQuery`'s implementation is more-or-less an exactly copy of `MultiTermQuery` +  `MultiTermQueryConstantScoreWrapper`. This PR removes the custom `TermInSetQuery` implementation in favor of extending `MultiTermQuery` with `MultiTermQueryConstantScoreWrapper` as the default rewrite behavior.
   
   One nice benefit of this (beyond code cleanup) is that different rewrite methods can be provided for different behavior. Specifically, we can leverage `DocValuesRewriteMethod` to completely replace `SortedSetDocValuesSetQuery` (I would propose doing that in a follow up PR once we're happy with this change).
   
   A couple notes about the change:
   1. I needed to give `MultiTermQueryConstantScoreWrapper` a `ScoreSupplier` implementation so `TermInSetQuery` can still be used efficiently with `IndexOrDocValuesQuery`. This also required a new (optional) public API in `MultiTermQuery` where queries can expose their number of terms (where applicable).
   2. Retaining the existing `TermInSetQuery#rewrite` behavior was slightly non-trivial. I don't really like how I've done this in the PR, but I also can't think of a better solution. I'm not actually convinced we need to retain this, but I'm not sure if client code might rely on the existing behavior in places. Even if that's the case, it might not be a good enough argument to keep it, but I did find a breaking unit test if I didn't keep it. (`TestPresearcherMatchCollector.testMatchCollectorShowMatches` assumes the query will get rewritten to a BQ). I'm not familiar enough with the `monitor` package to assess if it would be reasonable to just change the unit test, or if there's something more important going on here. If someone is more familiar and/or has opinions on the need to retain the existing "rewrite" behavior, I'd love some feedback.
   
   ## Performance
   When this has been discussed in the past, there's been an open question around performance since the term intersection happens a bit differently (`seekCeil` vs. `seekExact`). I ran some benchmarks using a one-off tool (similar to #12151) and found no noticeable regressions/issues. Here's the output of my tool (which you can check out here: 
   [TiSBench.java.txt](https://github.com/apache/lucene/files/10775443/TiSBench.java.txt) )
   
   ### All Country Code Filter Terms
   | Approach     | Large Lead Terms | Medium Lead Terms | Small Lead Terms | No Lead Terms |
   |--------------|------------------|-------------------|------------------|---------------|
   | TiS Original | 206.12           | 203.00            | 201.95           | 119.97        |
   | TiS MTQ      | 193.67           | 190.54            | 189.43           | 117.02        |
   
   ### Medium Cardinality + High Cost Country Code Filter Terms
   | Approach     | Large Lead Terms | Medium Lead Terms | Small Lead Terms | No Lead Terms |
   |--------------|------------------|-------------------|------------------|---------------|
   | TiS Original | 132.42           | 127.95            | 126.15           | 77.80         |
   | TiS MTQ      | 130.49           | 125.97            | 124.25           | 77.31         |
   
   ### Low Cardinality + High Cost Country Code Filter Terms
   | Approach     | Large Lead Terms | Medium Lead Terms | Small Lead Terms | No Lead Terms |
   |--------------|------------------|-------------------|------------------|---------------|
   | TiS Original | 239.00           | 110.95            | 36.51            | 75.18         |
   | TiS MTQ      | 242.64           | 113.94            | 38.71            | 76.00         |
   
   ### Medium Cardinality + Low Cost Country Code Filter Terms
   | Approach     | Large Lead Terms | Medium Lead Terms | Small Lead Terms | No Lead Terms |
   |--------------|------------------|-------------------|------------------|---------------|
   | TiS Original | 2.51             | 1.99              | 1.67             | 0.25          |
   | TiS MTQ      | 2.54             | 2.01              | 1.67             | 0.28          |
   
   ### Low Cardinality + Low Cost Country Code Filter Terms
   | Approach     | Large Lead Terms | Medium Lead Terms | Small Lead Terms | No Lead Terms |
   |--------------|------------------|-------------------|------------------|---------------|
   | TiS Original | 2.34             | 2.07              | 1.88             | 0.44          |
   | TiS MTQ      | 2.20             | 1.91              | 1.68             | 0.41          |
   
   ### High Cardinality PK Filter Terms
   | Approach     | Large Lead Terms | Medium Lead Terms | Small Lead Terms | No Lead Terms |
   |--------------|------------------|-------------------|------------------|---------------|
   | TiS Original | 26.41            | 26.19             | 25.96            | 6.36          |
   | TiS MTQ      | 27.78            | 27.65             | 27.32            | 6.69          |
   
   ### Medium Cardinality PK Filter Terms
   | Approach     | Large Lead Terms | Medium Lead Terms | Small Lead Terms | No Lead Terms |
   |--------------|------------------|-------------------|------------------|---------------|
   | TiS Original | 2.19             | 2.13              | 2.15             | 0.49          |
   | TiS MTQ      | 2.33             | 2.28              | 2.25             | 0.50          |
   
   ### Low Cardinality PK Filter Terms
   | Approach     | Large Lead Terms | Medium Lead Terms | Small Lead Terms | No Lead Terms |
   |--------------|------------------|-------------------|------------------|---------------|
   | TiS Original | 1.76             | 1.74              | 1.74             | 0.34          |
   | TiS MTQ      | 1.32             | 1.29              | 1.28             | 0.22          |
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] MarcusSorealheis commented on pull request #12156: Remove custom TermInSetQuery implementation in favor of extending MultiTermQuery

Posted by "MarcusSorealheis (via GitHub)" <gi...@apache.org>.
MarcusSorealheis commented on PR #12156:
URL: https://github.com/apache/lucene/pull/12156#issuecomment-1436069482

   @gsmiller this is a great simplification. Thank you. I'm going to share with it our team.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] romseygeek commented on pull request #12156: Remove custom TermInSetQuery implementation in favor of extending MultiTermQuery

Posted by "romseygeek (via GitHub)" <gi...@apache.org>.
romseygeek commented on PR #12156:
URL: https://github.com/apache/lucene/pull/12156#issuecomment-1436004819

   > I'm not familiar enough with the monitor package to assess if it would be reasonable to just change the unit test, or if there's something more important going on here.
   
   It's fine to just change the test case here, we're only checking that the debug functionality on the monitor reports that a given query has matched and the specific string output isn't important.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] gsmiller commented on pull request #12156: Remove custom TermInSetQuery implementation in favor of extending MultiTermQuery

Posted by "gsmiller (via GitHub)" <gi...@apache.org>.
gsmiller commented on PR #12156:
URL: https://github.com/apache/lucene/pull/12156#issuecomment-1436051567

   > It's fine to just change the test case here, we're only checking that the debug functionality on the monitor reports that a given query has matched and the specific string output isn't important.
   
   OK thanks. I've removed the BQ rewrite logic and updated this test. I'm not convinced we need to actually rewrite to a BQ during the query rewrite, but that's the one big difference with this implementation.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] gsmiller commented on pull request #12156: Remove custom TermInSetQuery implementation in favor of extending MultiTermQuery

Posted by "gsmiller (via GitHub)" <gi...@apache.org>.
gsmiller commented on PR #12156:
URL: https://github.com/apache/lucene/pull/12156#issuecomment-1449098047

   I've rebased this work on top of `main`, picking up the significant changes in the `MultiTermQuery` space (#12055) and switching the default rewrite method for `TermInSetQuery` to the newly added `CONSTANT_SCORE_BLENDED_REWRITE`. I re-ran my simple benchmarking tooling (mentioned above) and results still look good.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] gsmiller commented on pull request #12156: Remove custom TermInSetQuery implementation in favor of extending MultiTermQuery

Posted by "gsmiller (via GitHub)" <gi...@apache.org>.
gsmiller commented on PR #12156:
URL: https://github.com/apache/lucene/pull/12156#issuecomment-1450145244

   Thanks @uschindler. 
   
   > I am not fully sure what default rewrite method is best here.
   
   The nice thing is it's easy to control now (bitset rewrite, boolean scoring, doc values post-filtering, etc.). Based on the benchmark wins in #12055 for other multi-term queries, I thought it would be good to use the same rewrite by default, at least for now. We can change the default easily if we learn more.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] gsmiller merged pull request #12156: Remove custom TermInSetQuery implementation in favor of extending MultiTermQuery

Posted by "gsmiller (via GitHub)" <gi...@apache.org>.
gsmiller merged PR #12156:
URL: https://github.com/apache/lucene/pull/12156


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org