Posted to issues@lucene.apache.org by "Zach Chen (Jira)" <ji...@apache.org> on 2021/05/02 04:17:00 UTC
[jira] [Comment Edited] (LUCENE-9335) Add a bulk scorer for disjunctions that does dynamic pruning
[ https://issues.apache.org/jira/browse/LUCENE-9335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17337948#comment-17337948 ]
Zach Chen edited comment on LUCENE-9335 at 5/2/21, 4:16 AM:
------------------------------------------------------------
I tried modifying the _CreateQueries_ class in luceneutil to generate OR queries with 5 clauses, but ran into issues running it. As a quick hack, I instead combined queries from the OrHighHigh, OrHighMed and OrHighLow tasks into a new OrHighHighMedHighLow task. I've attached the resulting file _wikimedium.10M.nostopwords.tasks_ to this ticket.
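For anyone who wants to reproduce something similar, here is a rough sketch of one way such a merge could be done. It is not the script used for the attached file. It assumes each tasks line looks like `Category: term term # comment`, and it assumes the last term of an OrHighMed query is the medium-frequency one; `TaskCombiner`, `terms` and `combine` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of luceneutil.
final class TaskCombiner {

    /** Drops the "Category:" prefix and any trailing "# ..." comment, returning the bare terms. */
    static List<String> terms(String taskLine) {
        String body = taskLine.substring(taskLine.indexOf(':') + 1);
        int hash = body.indexOf('#');
        if (hash >= 0) {
            body = body.substring(0, hash);
        }
        List<String> out = new ArrayList<>();
        for (String t : body.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    /** Builds one five-clause line from one query of each source category. */
    static String combine(String highHigh, String highMed, String highLow) {
        List<String> hm = terms(highMed);
        List<String> clauses = new ArrayList<>(terms(highHigh)); // High High
        clauses.add(hm.get(hm.size() - 1));                      // Med (assumed to be the last term)
        clauses.addAll(terms(highLow));                          // High Low
        return "OrHighHighMedHighLow: " + String.join(" ", clauses);
    }
}
```

Calling `combine` once per triple of source lines and writing the results to a file would give a tasks file with the new category. Duplicate terms across the three source queries are not removed in this sketch.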
Here are the luceneutil results from 2 runs for each implementation:
Scorer [https://github.com/apache/lucene/pull/101]
{code:java}
Task                  QPS baseline  StdDev   QPS my_modified_version  StdDev   Pct diff              p-value
OrHighHighMedHighLow  30.97         (6.2%)   24.92                    (4.4%)   -19.5% ( -28% - -9%)  0.000
PKLookup              223.53        (2.4%)   228.10                   (3.7%)   2.0% ( -3% - 8%)      0.037
{code}
{code:java}
Task                  QPS baseline  StdDev   QPS my_modified_version  StdDev   Pct diff              p-value
OrHighHighMedHighLow  32.83         (3.4%)   34.00                    (5.1%)   3.6% ( -4% - 12%)     0.009
PKLookup              217.86        (2.8%)   228.14                   (4.2%)   4.7% ( -2% - 12%)     0.000
{code}
BulkScorer [https://github.com/apache/lucene/pull/113]
{code:java}
Task                  QPS baseline  StdDev   QPS my_modified_version  StdDev   Pct diff              p-value
PKLookup              197.84        (4.1%)   207.79                   (4.2%)   5.0% ( -3% - 13%)     0.000
OrHighHighMedHighLow  32.50         (16.7%)  35.79                    (9.9%)   10.1% ( -14% - 44%)   0.020
{code}
{code:java}
Task                  QPS baseline  StdDev   QPS my_modified_version  StdDev   Pct diff              p-value
OrHighHighMedHighLow  28.61         (5.4%)   22.28                    (4.2%)   -22.1% ( -30% - -13%) 0.000
PKLookup              227.38        (2.6%)   233.05                   (2.7%)   2.5% ( -2% - 8%)      0.003
{code}
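As a quick sanity check on the tables above, the Pct diff column appears to be the relative change of the candidate QPS against the baseline QPS. The sketch below recomputes it for the four OrHighHighMedHighLow rows; `PctDiff` is a hypothetical helper name, and the formula is an assumption that happens to agree with the reported values, not something taken from the luceneutil source.

```java
// Hypothetical helper for re-deriving the "Pct diff" column from the two QPS columns.
final class PctDiff {

    /** Relative change of candidate QPS against baseline QPS, in percent. */
    static double pctDiff(double baselineQps, double candidateQps) {
        return 100.0 * (candidateQps - baselineQps) / baselineQps;
    }

    public static void main(String[] args) {
        // {baseline QPS, candidate QPS} for OrHighHighMedHighLow, copied from the tables above.
        double[][] runs = {
            {30.97, 24.92}, // Scorer, run 1
            {32.83, 34.00}, // Scorer, run 2
            {32.50, 35.79}, // BulkScorer, run 1
            {28.61, 22.28}, // BulkScorer, run 2
        };
        for (double[] run : runs) {
            System.out.printf("%.2f -> %.2f : %.1f%%%n", run[0], run[1], pctDiff(run[0], run[1]));
        }
    }
}
```

Note that for both implementations the two runs disagree in sign on OrHighHighMedHighLow (roughly -20% in one run, a small gain in the other), so the results look noisy rather than conclusive.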
> Add a bulk scorer for disjunctions that does dynamic pruning
> ------------------------------------------------------------
>
> Key: LUCENE-9335
> URL: https://issues.apache.org/jira/browse/LUCENE-9335
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Adrien Grand
> Priority: Minor
> Attachments: wikimedium.10M.nostopwords.tasks
>
> Time Spent: 2.5h
> Remaining Estimate: 0h
>
> Lucene often gets benchmarked against other engines, e.g. against Tantivy and PISA at [https://tantivy-search.github.io/bench/] or against research prototypes in Table 1 of [https://cs.uwaterloo.ca/~jimmylin/publications/Grand_etal_ECIR2020_preprint.pdf]. Given that top-level disjunctions of term queries are commonly used for benchmarking, it would be nice to optimize this case a bit more. I suspect that we could make fewer per-document decisions by implementing a BulkScorer instead of a Scorer.
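To illustrate what "fewer per-document decisions" could mean, here is a toy sketch that makes one skip-or-score decision per window of doc IDs instead of one per document. It is not Lucene's BulkScorer API and not the code in either pull request; `WindowedDisjunctionSketch`, the dense `scores` array and the fixed window size are all assumptions made for the example.

```java
import java.util.PriorityQueue;

// Toy model: scores[c][doc] is clause c's contribution to doc (0 means no match).
final class WindowedDisjunctionSketch {

    /**
     * Collects the top-k summed scores into the min-heap and returns how many
     * documents were scored individually (documents in skipped windows are not counted).
     */
    static int topK(float[][] scores, int numDocs, int windowSize, int k, PriorityQueue<Float> heap) {
        int scored = 0;
        for (int start = 0; start < numDocs; start += windowSize) {
            int end = Math.min(start + windowSize, numDocs);

            // Upper bound for the window: sum of each clause's max score inside it.
            // A real index would read precomputed per-block maxima; scanning here keeps the toy short.
            float bound = 0f;
            for (float[] clause : scores) {
                float max = 0f;
                for (int d = start; d < end; d++) {
                    max = Math.max(max, clause[d]);
                }
                bound += max;
            }

            // Single decision for the whole window: skip it if nothing in it can beat the heap.
            if (heap.size() == k && bound <= heap.peek()) {
                continue;
            }

            for (int d = start; d < end; d++) {
                float sum = 0f;
                for (float[] clause : scores) {
                    sum += clause[d];
                }
                scored++;
                if (sum <= 0f) {
                    continue; // no clause matched this doc
                }
                if (heap.size() < k) {
                    heap.add(sum);
                } else if (sum > heap.peek()) {
                    heap.poll();
                    heap.add(sum);
                }
            }
        }
        return scored;
    }
}
```

The point of the sketch is only the control flow: once the heap is full, a whole window whose bound cannot beat the current k-th best score is dropped with a single comparison.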
--
This message was sent by Atlassian Jira
(v8.3.4#803005)