You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2022/09/02 00:57:18 UTC

[GitHub] [lucene] gsmiller opened a new pull request, #11741: DRAFT: Experiment with intersecting TermInSetQuery terms up-front to better estimate cost

gsmiller opened a new pull request, #11741:
URL: https://github.com/apache/lucene/pull/11741

   …estimate cost
   
   ### Description
   
   Here's a rough sketch of what it might look like to intersect `TermInSetQuery` terms when creating a `ScoreSupplier` to more effectively estimate cost (see #11740)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] jpountz commented on pull request #11741: DRAFT: Experiment with intersecting TermInSetQuery terms up-front to better estimate cost

Posted by GitBox <gi...@apache.org>.

jpountz commented on PR #11741:
URL: https://github.com/apache/lucene/pull/11741#issuecomment-1241681411

   > If we assume a scenario where we have a TermInSetQuery over very selective terms (low docFreqs for each), we'd want to use the index query unless there's another clause that can lead that query that's significantly more restrictive. So there's no benefit of using IndexOrDocValuesQuery in that scenario, but moving the term lookup to the scorerSupplier also shouldn't hurt this case since we have to do it anyway.
   
   I'm not sure if this is true. I've seen users run `TermInSetQuery`s with 10k terms or more, a typical use-case being implementing some form of join where a first query collects IDs of interest and a second query uses them as a filter. Terms dictionary lookups tend to be expensive, so looking up these 10k terms is not cheap, and if `IndexOrDocValuesQuery` decides that the doc-values approach is better because of costs, then Lucene will perform these 10k lookups again in the terms dictionary of the doc values field, which would be wasteful?
   
   Not directly related to this discussion, but `TermInSetQuery` feels more complicated than ranges when it comes to figuring out the best data structure to run the query, and maybe we should fold the logic about whether or not to use doc-values directly into `TermInSetQuery` instead of expecting users to build an `IndexOrDocValuesQuery` themselves? This would allow more sophisticated logic, such as making different decisions depending on whether we expect terms to be selective or not?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] jpountz commented on pull request #11741: DRAFT: Experiment with intersecting TermInSetQuery terms up-front to better estimate cost

Posted by GitBox <gi...@apache.org>.

jpountz commented on PR #11741:
URL: https://github.com/apache/lucene/pull/11741#issuecomment-1240710691

   I guess that the main downside of this approach is that the terms lookups are the bottleneck of a `TermInSetQuery` when the included terms have low docFreqs. So moving the cost to `scorerSupplier` would kill the benefits of using an `IndexOrDocValuesQuery`?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] gsmiller commented on pull request #11741: DRAFT: Experiment with intersecting TermInSetQuery terms up-front to better estimate cost

Posted by GitBox <gi...@apache.org>.

gsmiller commented on PR #11741:
URL: https://github.com/apache/lucene/pull/11741#issuecomment-1241017654

   @jpountz thanks for the feedback! If we assume a scenario where we have a `TermInSetQuery` over very selective terms (low docFreqs for each), we'd want to use the index query unless there's another clause that can lead that query that's significantly more restrictive. So there's no benefit of using `IndexOrDocValuesQuery` in that scenario, but moving the term lookup to the `scorerSupplier` also shouldn't hurt this case since we have to do it anyway.
   
   On the other hand, with today's implementation, we may not know that the `TermInSetQuery` is very selective (e.g., maybe there are terms in the field, but not in the query, that match a large number of documents). In this case, it may be beneficial to use the index-based query, but—because of our naive cost heuristic—we could end up using a doc-values query because we're significantly over-estimating the cost of the index query.
   
   I think the case where this approach would really hurt us is when the `TermInSetQuery` is not particularly restrictive—to the point that we end up using the doc-values query—but we have to pay this up-front cost to look up terms just to do decide we don't want to use the index-based query after all. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] gsmiller commented on pull request #11741: DRAFT: Experiment with intersecting TermInSetQuery terms up-front to better estimate cost

Posted by GitBox <gi...@apache.org>.

gsmiller commented on PR #11741:
URL: https://github.com/apache/lucene/pull/11741#issuecomment-1242553881

   > I'm not sure if this is true. I've seen users run TermInSetQuerys with 10k terms or more, a typical use-case being implementing some form of join where a first query collects IDs of interest and a second query uses them as a filter.
   
   Right, that's fair. I'm curious about the cases you've seen though. The cost estimate for `TermInSetQuery` is pretty perfectly suited for this use-case right now (assuming the key uniquely identifies a document). Putting cost estimation aside though, I'd be curious when a doc-values approach would be more efficient here. Looking up the 10k terms has to be done for both query approaches, so I'd only expect the doc-values approach to be more efficient if there's a lead iterator with relatively few documents (relative to the number of unique terms in the join). Is that the sort of case you have in mind?
   
   > Terms dictionary lookups tend to be expensive, so looking up these 10k terms is not cheap, and if IndexOrDocValuesQuery decides that the doc-values approach is better because of costs, then Lucene will perform these 10k lookups again in the terms dictionary of the doc values field, which would be wasteful?
   
   +1. I think we're saying roughly the same thing. The problematic case with doing term lookup as part of cost estimation is duplicating that work when deciding to use doc-values. It's too bad we have to effectively do the same work twice in this case. It would be super cool if we could find a way to reuse the term lookup work, but that's obviously very non-trivial since the term dictionaries are completely different implementations, types of fields, etc.
   
   > Not directly related to this discussion, but TermInSetQuery feels more complicated than ranges when it comes to figuring out the best data structure to run the query, and maybe we should fold the logic about whether or not to use doc-values directly into TermInSetQuery instead of expecting users to build an IndexOrDocValuesQuery themselves? This would allow more sophisticated logic, such as making different decisions depending on whether we expect terms to be selective or not?
   
   I like this idea a lot actually. I can imagine a fairly different type of heuristic we might want to use as compared to numeric ranges. For instance, the number of terms matters. Whether-or-not we can identify that the field is a unique key matters. Hmmm... seems worth experimenting with a bit. Thanks for the idea!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] gsmiller closed pull request #11741: DRAFT: Experiment with intersecting TermInSetQuery terms up-front to better estimate cost

Posted by "gsmiller (via GitHub)" <gi...@apache.org>.

gsmiller closed pull request #11741: DRAFT: Experiment with intersecting TermInSetQuery terms up-front to better estimate cost
URL: https://github.com/apache/lucene/pull/11741


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org