Posted to issues@lucene.apache.org by "Tommaso Teofili (Jira)" <ji...@apache.org> on 2019/12/23 09:17:00 UTC
[jira] [Updated] (LUCENE-9107) CommonsTermsQuery with huge no. of terms slower with top-k scoring
[ https://issues.apache.org/jira/browse/LUCENE-9107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tommaso Teofili updated LUCENE-9107:
------------------------------------
Description:
In [1] a {{CommonTermsQuery}} is used to run a query with a large number of (duplicate) terms. With a max term frequency cutoff of 0.999 for low-frequency terms, the query, although big, finishes in around 200-300 ms with Lucene 7.6.0.
However, after upgrading the code to Lucene 8.x, the same query takes 2-3 s instead.
After some digging, the slowdown appears to come from top-k scoring, which is enabled by default as of version 8, although it is not yet clear where exactly in the code the time is spent.
When switching back to complete hit scoring [3], the query time returns to the initial 200-300 ms, also in Lucene 8.3.x.
I am looking into why this happens and whether it affects only {{CommonTermsQuery}} or {{BooleanQuery}} as well.
[1] : https://github.com/tteofili/Anserini-embeddings/blob/nnsearch/src/main/java/io/anserini/embeddings/nn/fw/FakeWordsRunner.java
[3] : https://github.com/tteofili/anserini/blob/ann-paper-reproduce/src/main/java/io/anserini/analysis/vectors/ApproximateNearestNeighborEval.java#L174
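For readers who want to compare the two scoring modes, here is a minimal sketch of one way to force complete hit scoring, assuming the Lucene 8.x API. It is not confirmed that [3] does exactly this. The class name {{CompleteScoringSearch}} and both method names are hypothetical and exist only for illustration. The sketch relies on {{TopScoreDocCollector.create(numHits, totalHitsThreshold)}}: passing {{Integer.MAX_VALUE}} as the threshold asks the collector to count every hit, which in 8.x prevents it from skipping non-competitive documents.

```java
import java.io.IOException;

import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.TopScoreDocCollector;

// Hypothetical helper, written against the Lucene 8.x API (an assumption).
public final class CompleteScoringSearch {

  private CompleteScoringSearch() {}

  // Default path: in Lucene 8.x this lets the searcher use top-k scoring,
  // so documents that cannot make it into the top k may be skipped.
  public static TopDocs searchTopK(IndexSearcher searcher, Query query, int k)
      throws IOException {
    return searcher.search(query, k);
  }

  // Complete scoring: a total-hits threshold of Integer.MAX_VALUE means the
  // collector must visit every matching document, so no hits are skipped.
  public static TopDocs searchComplete(IndexSearcher searcher, Query query, int k)
      throws IOException {
    TopScoreDocCollector collector = TopScoreDocCollector.create(k, Integer.MAX_VALUE);
    searcher.search(query, collector);
    return collector.topDocs();
  }
}
```

Timing {{searchTopK}} against {{searchComplete}} with the same {{CommonTermsQuery}}, and then with an equivalent {{BooleanQuery}}, would be one way to check whether the slowdown is specific to {{CommonTermsQuery}}.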
> CommonsTermsQuery with huge no. of terms slower with top-k scoring
> ------------------------------------------------------------------
>
> Key: LUCENE-9107
> URL: https://issues.apache.org/jira/browse/LUCENE-9107
> Project: Lucene - Core
> Issue Type: Bug
> Components: core/search
> Affects Versions: 8.3
> Reporter: Tommaso Teofili
> Priority: Major