Posted to issues@lucene.apache.org by "rquesada-tibco (via GitHub)" <gi...@apache.org> on 2024/01/24 11:24:23 UTC

[I] Improve AbstractMultiTermQueryConstantScoreWrapper#RewritingWeight ScorerSupplier cost estimation [lucene]

rquesada-tibco opened a new issue, #13029:
URL: https://github.com/apache/lucene/issues/13029

   ### Description
   
   We recently discovered a performance degradation on our project when going from Lucene 9.4 to 9.9. The cause seems to be a side effect of https://github.com/apache/lucene/commit/c6667e709f610669b57a6a1a6ab1cc80f4f0ebaf and https://github.com/apache/lucene/commit/3809106602a9675f4fd217b1090af4505d4ec2a7
   
   The situation is as follows: we have a `WildcardQuery` and a `TermInSetQuery` that are AND-combined within a `BooleanQuery`. This structure gets executed repeatedly, somewhat like a nested loop in which the `WildcardQuery` stays the same but the `TermInSetQuery` keeps changing its terms. In the old version this was fast because the `WildcardQuery` was cached in the `LRUQueryCache`; in the new version it no longer is, so the execution time of our scenario has increased.
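   
   For illustration, the query shape is roughly the following; the field names and terms are invented, only the structure matters:
   
   ```
   import java.util.List;
   import org.apache.lucene.index.Term;
   import org.apache.lucene.search.BooleanClause.Occur;
   import org.apache.lucene.search.BooleanQuery;
   import org.apache.lucene.search.Query;
   import org.apache.lucene.search.TermInSetQuery;
   import org.apache.lucene.search.WildcardQuery;
   import org.apache.lucene.util.BytesRef;
   
   class QueryShapeExample {
     static Query buildQuery(List<BytesRef> currentIds) {
       // The wildcard clause is identical on every iteration, so ideally it gets cached.
       Query wildcard = new WildcardQuery(new Term("name", "foo*bar"));
       // The set of ids changes on every iteration, so this clause never benefits from the cache.
       Query idFilter = new TermInSetQuery("id", currentIds);
       return new BooleanQuery.Builder()
           .add(wildcard, Occur.MUST)
           .add(idFilter, Occur.MUST)
           .build();
     }
   }
   ```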
   
   Why is our `WildcardQuery` no longer cached? It boils down to [this line](https://github.com/apache/lucene/blob/f16007c3eca7cbf89bf6c2f88907707cf30c0058/lucene/core/src/java/org/apache/lucene/search/LRUQueryCache.java#L771) in `LRUQueryCache`, where the caching is skipped if the cost estimate is too high:
   ```
   final ScorerSupplier supplier = in.scorerSupplier(context);
   ...
   final long cost = supplier.cost();
   ...
   // skip cache operation which would slow query down too much
   if (cost / skipCacheFactor > leadCost) {
   ...
   ```
   
   Before the upgrade to 9.9, that cost was provided by a `ConstantScoreWeight` returned by the old `MultiTermQueryConstantScoreWrapper` (itself returned by the default `RewriteMethod`), which in the end relied on the default `Weight#scorerSupplier` [implementation](https://github.com/apache/lucene/blob/f16007c3eca7cbf89bf6c2f88907707cf30c0058/lucene/core/src/java/org/apache/lucene/search/Weight.java#L147): the cost was simply `scorer.iterator().cost()`, and in **our case the `WildcardQuery` matches just one document, so the cost is 1**.
   
   After the upgrade, the default `RewriteMethod` has changed and this cost is now provided by `AbstractMultiTermQueryConstantScoreWrapper#RewritingWeight#scorerSupplier` [here](https://github.com/apache/lucene/blob/f16007c3eca7cbf89bf6c2f88907707cf30c0058/lucene/core/src/java/org/apache/lucene/search/AbstractMultiTermQueryConstantScoreWrapper.java#L263). For that purpose a private [estimateCost method](https://github.com/apache/lucene/blob/f16007c3eca7cbf89bf6c2f88907707cf30c0058/lucene/core/src/java/org/apache/lucene/search/AbstractMultiTermQueryConstantScoreWrapper.java#L295) was introduced, which bases the estimation on the [MultiTermQuery#getTermsCount](https://github.com/apache/lucene/blob/f16007c3eca7cbf89bf6c2f88907707cf30c0058/lucene/core/src/java/org/apache/lucene/search/MultiTermQuery.java#L316) value. The problem is that for our `WildcardQuery` (in fact for any subclass of `AutomatonQuery`) this value is unknown, i.e. `-1`, so the `estimateCost` method just [returns](https://github.com/apache/lucene/blob/f16007c3eca7cbf89bf6c2f88907707cf30c0058/lucene/core/src/java/org/apache/lucene/search/AbstractMultiTermQueryConstantScoreWrapper.java#L309) `terms.getSumDocFreq()`. That is a large overestimation in our case, it prevents the caching, and it leads to the performance degradation.
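   
   For clarity, my reading of the current estimation logic is roughly the following (paraphrased from the linked code for readability; the linked source is authoritative):
   
   ```
   import java.io.IOException;
   import org.apache.lucene.index.Terms;
   
   final class CostEstimateSketch {
     // Paraphrase of the linked estimateCost(...) logic, as I read it:
     static long estimateCost(Terms terms, long queryTermsCount) throws IOException {
       if (queryTermsCount == -1) {
         // Term count unknown (WildcardQuery, or any AutomatonQuery subclass):
         // assume every indexed term could match, i.e. the sum of doc freqs over all terms.
         return terms.getSumDocFreq();
       }
       // Term count known (e.g. TermInSetQuery): one doc per query term, plus a ceiling
       // on the "extra" docs beyond the first one for every indexed term.
       long potentialExtraCost = terms.getSumDocFreq();
       long indexedTermCount = terms.size();
       if (indexedTermCount != -1) {
         potentialExtraCost -= indexedTermCount;
       }
       return queryTermsCount + potentialExtraCost;
     }
   }
   ```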
   
   I understand that I can work around this by writing my own customized `RewriteMethod`.
   The question is: could we improve `AbstractMultiTermQueryConstantScoreWrapper#RewritingWeight#scorerSupplier#cost` so that, if the MTQ cannot provide a term count (`getTermsCount() == -1`), we fall back to `scorer.iterator().cost()`?
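   
   A rough sketch of what I mean (not actual Lucene code; `buildScorer()` is a placeholder for whatever the supplier's `get()` would produce, and I realize that materializing the scorer inside `cost()` is itself debatable):
   
   ```
   // Hypothetical fallback inside RewritingWeight's ScorerSupplier:
   @Override
   public long cost() {
     final long queryTermsCount = q.getTermsCount();
     if (queryTermsCount != -1) {
       return estimateCost(terms, queryTermsCount); // current behavior when the count is known
     }
     // The MTQ cannot report a term count (WildcardQuery & friends): use the iterator's
     // own cost, as the pre-9.9 ConstantScoreWeight path effectively did.
     return buildScorer().iterator().cost(); // placeholder, for illustration only
   }
   ```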




Re: [I] Improve AbstractMultiTermQueryConstantScoreWrapper#RewritingWeight ScorerSupplier cost estimation [lucene]

Posted by "msfroh (via GitHub)" <gi...@apache.org>.
msfroh commented on issue #13029:
URL: https://github.com/apache/lucene/issues/13029#issuecomment-2014023729

   I think the `else` clause for the cost estimate is also not great. 
   
   I came across this same problem where a user was essentially running a single-term `TermInSetQuery` (that actually matched a single doc) ANDed with a numeric range query that matches millions of docs. I was shocked when profiler output showed a lot of time spent in `BKDReader` -- the `IndexOrDocValuesQuery` wrapping the range query should clearly have taken the doc-values path.
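   
   For context, that query looked roughly like the sketch below; the field names and bounds are invented, the point is that the range clause is wrapped in an `IndexOrDocValuesQuery` and should take the doc-values path when the lead cost is tiny:
   
   ```
   import java.util.List;
   import org.apache.lucene.document.LongPoint;
   import org.apache.lucene.document.SortedNumericDocValuesField;
   import org.apache.lucene.search.BooleanClause.Occur;
   import org.apache.lucene.search.BooleanQuery;
   import org.apache.lucene.search.IndexOrDocValuesQuery;
   import org.apache.lucene.search.Query;
   import org.apache.lucene.search.TermInSetQuery;
   import org.apache.lucene.util.BytesRef;
   
   class RangePlusSingleTermExample {
     static Query build() {
       // Matches a single document.
       Query idFilter = new TermInSetQuery("id", List.of(new BytesRef("doc-42")));
       // Matches millions of documents; with a tiny lead cost this should be driven by
       // doc values rather than by walking the BKD tree.
       Query range = new IndexOrDocValuesQuery(
           LongPoint.newRangeQuery("timestamp", 0L, 1_000_000L),
           SortedNumericDocValuesField.newSlowRangeQuery("timestamp", 0L, 1_000_000L));
       return new BooleanQuery.Builder()
           .add(idFilter, Occur.MUST)
           .add(range, Occur.MUST)
           .build();
     }
   }
   ```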
   
   Let's say you have a string field with 10 million distinct values (so 10 million terms), and they match 20 million documents (with individual terms matching 1-3 docs, say). My read is that this `estimateCost()` function will say that a `TermInSetQuery` over a single term has cost 10,000,001 (i.e. 20 million `sumDocFreq` minus 10 million for the terms `size`, plus 1 for the query term count). 
   
   I get that the absolute worst case is that 9,999,999 terms each have doc freq 1 and the remaining term has doc freq 10,000,001, but this feels silly as a cost estimate for a query that is just going to rewrite to a single `TermQuery` with cost <= 3.
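   
   Spelling out that arithmetic with the numbers above:
   
   ```
   class CostMathExample {
     public static void main(String[] args) {
       long sumDocFreq = 20_000_000L;   // total docs across all terms of the field
       long termsSize = 10_000_000L;    // distinct indexed terms of the field
       long queryTermsCount = 1L;       // single-term TermInSetQuery
   
       // estimateCost() when the term count is known:
       long cost = queryTermsCount + (sumDocFreq - termsSize);
       System.out.println(cost);        // 10000001, vs. an actual cost of <= 3
     }
   }
   ```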




Re: [I] Improve AbstractMultiTermQueryConstantScoreWrapper#RewritingWeight ScorerSupplier cost estimation [lucene]

Posted by "msfroh (via GitHub)" <gi...@apache.org>.
msfroh commented on issue #13029:
URL: https://github.com/apache/lucene/issues/13029#issuecomment-2014084615

   I get that part of the point of this cost estimate is to avoid the (potentially-expensive) rewrite if, e.g. we can do a doc-value rewrite instead, but I'm thinking we could do something a little bit more term-aware.
   
   How expensive is `MultiTermQuery#getTermsEnum()`? I think it's cheap, but I'm not positive. Maybe we could get it, walk through it summing doc freqs, and give up if the number of terms gets too big.
   
   We could go down that path only if the cost estimate from the existing logic is very high. I can try sketching out a PR with a test for it.
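   
   Roughly what I have in mind is something like the helper below (hypothetical, not existing Lucene code; `TERM_LIMIT` is an arbitrary cutoff). The caller would only try this when the existing estimate is very high, and would keep that estimate when the walk gives up:
   
   ```
   import java.io.IOException;
   import org.apache.lucene.index.TermsEnum;
   import org.apache.lucene.util.BytesRef;
   
   final class TermAwareCostSketch {
     private static final int TERM_LIMIT = 128; // arbitrary cutoff, purely illustrative
   
     // Walks the query's TermsEnum summing doc freqs, and gives up (returning -1)
     // once too many terms have been visited.
     static long sumDocFreqs(TermsEnum termsEnum) throws IOException {
       long sum = 0;
       int visited = 0;
       for (BytesRef term = termsEnum.next(); term != null; term = termsEnum.next()) {
         if (++visited > TERM_LIMIT) {
           return -1; // too many terms: fall back to the existing estimate
         }
         sum += termsEnum.docFreq();
       }
       return sum;
     }
   }
   ```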

