Posted to issues@lucene.apache.org by "Zach Chen (Jira)" <ji...@apache.org> on 2021/05/02 04:17:00 UTC
[jira] [Comment Edited] (LUCENE-9335) Add a bulk scorer for disjunctions that does dynamic pruning

    [ https://issues.apache.org/jira/browse/LUCENE-9335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17337948#comment-17337948 ] 

Zach Chen edited comment on LUCENE-9335 at 5/2/21, 4:16 AM:
------------------------------------------------------------

I tried to modify the _CreateQueries_ class in luceneutil to generate OR queries with 5 clauses, but ran into issues running it. As a quick hack, I instead combined queries from OrHighHigh, OrHighMed and OrHighLow to create a new OrHighHighMedHighLow task. I've attached the resulting file _wikimedium.10M.nostopwords.tasks_ to this ticket.
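For reference, the kind of quick hack described above could look roughly like the sketch below. It is only an illustration: it assumes each tasks-file line has the form `Category: term term # comment`, and the helper names, the sample terms and the `max_clauses` cut-off are hypothetical. The exact rule used to build the attached file is not shown here.

```python
from collections import defaultdict


def parse_task_line(line):
    """Split 'Category: query text # comment' into (category, [terms])."""
    body = line.split("#", 1)[0].strip()  # drop a trailing comment, if any
    if not body or ":" not in body:
        return None
    category, query = body.split(":", 1)
    return category.strip(), query.split()


def combine_or_tasks(lines, sources, new_category, max_clauses=5):
    """Merge the i-th query of every source category into one OR query."""
    by_category = defaultdict(list)
    for line in lines:
        parsed = parse_task_line(line)
        if parsed and parsed[0] in sources:
            by_category[parsed[0]].append(parsed[1])

    combined = []
    for group in zip(*(by_category[name] for name in sources)):
        terms = []
        for query_terms in group:
            for term in query_terms:
                if term not in terms:  # keep each clause only once
                    terms.append(term)
        # Assumption: truncate to the wanted number of clauses.
        combined.append(f"{new_category}: {' '.join(terms[:max_clauses])}")
    return combined


# Hypothetical input lines, not taken from the attached tasks file.
sample = [
    "OrHighHigh: alpha beta # made-up terms",
    "OrHighMed: gamma delta",
    "OrHighLow: epsilon zeta",
]
for task in combine_or_tasks(
    sample, ["OrHighHigh", "OrHighMed", "OrHighLow"], "OrHighHighMedHighLow"
):
    print(task)
```

Zipping the categories pairs the first OrHighHigh query with the first OrHighMed and OrHighLow queries, and so on, so the number of combined queries equals the size of the smallest category.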

Here are the luceneutil results from 2 runs for each implementation:

Scorer [https://github.com/apache/lucene/pull/101]
{code:java}
                    Task  QPS baseline   StdDev  QPS my_modified_version   StdDev                Pct diff  p-value
    OrHighHighMedHighLow         30.97   (6.2%)                    24.92   (4.4%)  -19.5% ( -28% -   -9%)    0.000
                PKLookup        223.53   (2.4%)                   228.10   (3.7%)    2.0% (  -3% -    8%)    0.037
{code}
{code:java}
                    Task  QPS baseline   StdDev  QPS my_modified_version   StdDev                Pct diff  p-value
    OrHighHighMedHighLow         32.83   (3.4%)                    34.00   (5.1%)    3.6% (  -4% -   12%)    0.009
                PKLookup        217.86   (2.8%)                   228.14   (4.2%)    4.7% (  -2% -   12%)    0.000
{code}
BulkScorer [https://github.com/apache/lucene/pull/113]
{code:java}
                    Task  QPS baseline   StdDev  QPS my_modified_version   StdDev                Pct diff  p-value
                PKLookup        197.84   (4.1%)                   207.79   (4.2%)    5.0% (  -3% -   13%)    0.000
    OrHighHighMedHighLow         32.50  (16.7%)                    35.79   (9.9%)   10.1% ( -14% -   44%)    0.020
{code}
{code:java}
                    Task  QPS baseline   StdDev  QPS my_modified_version   StdDev                Pct diff  p-value
    OrHighHighMedHighLow         28.61   (5.4%)                    22.28   (4.2%)  -22.1% ( -30% -  -13%)    0.000
                PKLookup        227.38   (2.6%)                   233.05   (2.7%)    2.5% (  -2% -    8%)    0.003
{code}
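As a sanity check on how to read the Pct diff column, the small sketch below recomputes it for OrHighHighMedHighLow from the two QPS columns. It assumes Pct diff is simply the relative change of my_modified_version against baseline; the function name is made up, and the interval printed in parentheses is not reproduced here.

```python
def pct_diff(qps_baseline, qps_candidate):
    """Relative change of the candidate QPS versus the baseline, in percent."""
    return 100.0 * (qps_candidate - qps_baseline) / qps_baseline


# (label, baseline QPS, my_modified_version QPS), copied from the tables above
runs = [
    ("Scorer run 1", 30.97, 24.92),
    ("Scorer run 2", 32.83, 34.00),
    ("BulkScorer run 1", 32.50, 35.79),
    ("BulkScorer run 2", 28.61, 22.28),
]

for label, baseline, candidate in runs:
    print(f"{label}: {pct_diff(baseline, candidate):+.1f}%")
```

Under that assumption the recomputed values match the reported -19.5%, 3.6%, 10.1% and -22.1%.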
 



> Add a bulk scorer for disjunctions that does dynamic pruning
> ------------------------------------------------------------
>
>                 Key: LUCENE-9335
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9335
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Priority: Minor
>         Attachments: wikimedium.10M.nostopwords.tasks
>
>          Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> Lucene often gets benchmarked against other engines, e.g. against Tantivy and PISA at [https://tantivy-search.github.io/bench/] or against research prototypes in Table 1 of [https://cs.uwaterloo.ca/~jimmylin/publications/Grand_etal_ECIR2020_preprint.pdf]. Given that top-level disjunctions of term queries are commonly used for benchmarking, it would be nice to optimize this case a bit more. I suspect that we could make fewer per-document decisions by implementing a BulkScorer instead of a Scorer.
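
To make the "fewer per-document decisions" idea concrete, here is a toy sketch of window-level dynamic pruning for a top-k disjunction. It is not Lucene's BulkScorer API and not the code in either pull request: postings are plain dicts of doc ID to score, every name is hypothetical, and the per-window upper bound is computed by brute force purely for illustration.

```python
import heapq


def top_k_per_doc(postings, k):
    """Baseline: score every matching doc, one decision per document."""
    scores = {}
    for clause in postings:
        for doc, score in clause.items():
            scores[doc] = scores.get(doc, 0.0) + score
    # Higher score first; ties go to the lower doc ID.
    return heapq.nlargest(k, scores.items(), key=lambda item: (item[1], -item[0]))


def top_k_windowed(postings, k, max_doc, window=4):
    """Toy bulk scoring: decide once per window whether it can be skipped."""
    heap = []  # min-heap of (score, -doc) holding the current top k
    skipped = 0
    for start in range(0, max_doc, window):
        end = min(start + window, max_doc)
        # Upper bound: each clause adds at most its best score in this window.
        bound = sum(
            max((s for d, s in clause.items() if start <= d < end), default=0.0)
            for clause in postings
        )
        if len(heap) == k and bound <= heap[0][0]:
            skipped += 1  # no doc in this window can enter the top k
            continue
        for doc in range(start, end):
            score = sum(clause.get(doc, 0.0) for clause in postings)
            if score == 0.0:
                continue  # doc matches no clause
            entry = (score, -doc)
            if len(heap) < k:
                heapq.heappush(heap, entry)
            elif entry > heap[0]:
                heapq.heapreplace(heap, entry)
    top = [(-neg_doc, score) for score, neg_doc in sorted(heap, reverse=True)]
    return top, skipped


# Three hypothetical clauses: doc ID -> score contribution.
postings = [
    {0: 2.0, 1: 1.0, 9: 0.5},
    {0: 1.5, 5: 0.5, 10: 0.5},
    {1: 2.0, 6: 0.25},
]
print(top_k_per_doc(postings, k=2))
print(top_k_windowed(postings, k=2, max_doc=12))
```

In this toy data both functions return the same top 2, and the windowed version skips the last two windows after a single comparison each instead of looking at their documents one by one.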
