Posted to issues@lucene.apache.org by "Adrien Grand (Jira)" <ji...@apache.org> on 2019/12/24 14:02:00 UTC

[jira] [Commented] (LUCENE-9107) CommonsTermsQuery with huge no. of terms slower with top-k scoring

    [ https://issues.apache.org/jira/browse/LUCENE-9107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17002848#comment-17002848 ] 

Adrien Grand commented on LUCENE-9107:
--------------------------------------

CommonTermsQuery probably makes the issue worse by having clauses on multiple levels of boolean queries (see e.g. how the nested boolean queries perform worse than single-level boolean queries in the nightly benchmarks http://people.apache.org/~mikemccand/lucenebench/AndMedOrHighHigh.html), but this is an issue with BooleanQuery too. We have complex logic that tries to skip as many hits as possible, but when this logic is defeated, which is typically the case when
 - there are lots of clauses,
 - or clauses have about the same max scores,
 - or maximum score upper bounds are highly overestimated (ClassicSimilarity might contribute a bit here too),
then we need to pay the price for this overhead without getting any benefits.
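
To make the second and third conditions concrete, here is a minimal, self-contained sketch of a MaxScore-style partition. It is not Lucene's actual scorer code: the class name, the method name and the score values are all made up for illustration. Given per-clause score upper bounds and the current minimum competitive score, it counts how many clauses are "non-essential", i.e. clauses whose combined upper bounds cannot produce a competitive hit on their own, so documents that match only those clauses can be skipped.

```java
import java.util.Arrays;

// Hypothetical illustration only; not part of Lucene.
public class SkipSketch {

    /**
     * Sorts the per-clause score upper bounds in ascending order and returns
     * the length of the longest prefix whose summed bounds stay strictly
     * below the minimum competitive score. A document matching only clauses
     * from that prefix cannot enter the top-k.
     */
    public static int nonEssentialClauses(double[] maxScores, double minCompetitiveScore) {
        double[] sorted = maxScores.clone();
        Arrays.sort(sorted);
        double sum = 0;
        int count = 0;
        for (double bound : sorted) {
            if (sum + bound >= minCompetitiveScore) {
                break; // this clause alone (with the cheaper ones) could be competitive
            }
            sum += bound;
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        double minCompetitiveScore = 4.0; // assumed current top-k threshold

        // Skewed upper bounds: four weak clauses and one strong clause.
        double[] skewed = {0.5, 0.5, 0.5, 0.5, 6.0};
        // Similar (or uniformly overestimated) upper bounds.
        double[] flat = {3.0, 3.0, 3.0, 3.0, 3.0};

        System.out.println(nonEssentialClauses(skewed, minCompetitiveScore)); // prints 4
        System.out.println(nonEssentialClauses(flat, minCompetitiveScore));   // prints 1
    }
}
```

With skewed bounds, four of the five clauses can be ignored when looking for candidate documents. With similar bounds, or with bounds that are overestimated to the same level, only one clause can be ignored, so almost nothing is skipped while the bookkeeping cost remains.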

What latency do you get if you run a pure disjunction with these clauses instead of a CommonTermsQuery?

> CommonsTermsQuery with huge no. of terms slower with top-k scoring
> ------------------------------------------------------------------
>
>                 Key: LUCENE-9107
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9107
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: core/search
>    Affects Versions: 8.3
>            Reporter: Tommaso Teofili
>            Priority: Major
>
> In [1] a {{CommonTermsQuery}} is used to run a query with lots of (duplicate) terms. Using a max term frequency cutoff of 0.999 for low frequency terms, the query, although big, finishes in around 200-300ms with Lucene 7.6.0.
> However, after upgrading the code to Lucene 8.x, the same query takes 2-3s instead [2].
> After digging into it a bit, the slowdown seems to come from top-k scoring, which is enabled by default in version 8, though it is not clear where exactly in the code the time is spent.
> When switching back to complete hit scoring [3], the query time goes back to the initial 200-300ms, also in Lucene 8.3.x.
> It'd be nice to understand why this is happening and whether it only concerns {{CommonTermsQuery}} or affects {{BooleanQuery}} as well.
> If this depends on the data and application involved (Anserini in this case), the application should handle it; otherwise, if it is a regression/bug in Lucene, it'd be nice to fix it.
> [1] : https://github.com/tteofili/Anserini-embeddings/blob/nnsearch/src/main/java/io/anserini/embeddings/nn/fw/FakeWordsRunner.java
> [2] : https://github.com/castorini/anserini/blob/master/src/main/java/io/anserini/analysis/vectors/ApproximateNearestNeighborEval.java
> [3] : https://github.com/tteofili/anserini/blob/ann-paper-reproduce/src/main/java/io/anserini/analysis/vectors/ApproximateNearestNeighborEval.java#L174



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
