Posted to issues@lucene.apache.org by "Michael Gibney (Jira)" <ji...@apache.org> on 2021/06/18 01:47:00 UTC

[jira] [Comment Edited] (LUCENE-9204) Move span queries to the queries module

    [ https://issues.apache.org/jira/browse/LUCENE-9204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17365106#comment-17365106 ] 

Michael Gibney edited comment on LUCENE-9204 at 6/18/21, 1:46 AM:
------------------------------------------------------------------

I hope it's OK to post this here: I've [added benchmarks|https://github.com/mikemccand/luceneutil/pull/133] to quantify the performance of these different approaches. The index is 500k docs from wikimedium. Baseline and candidate code are identical, since for now I'm comparing different queries, not different code.

First, a realistic use-case, somewhat contrived to exercise {{pullUpDisjunctions()}}:
{code:java}
# (body:us|united-states health|health-care policy|public-policy law|legal-aspects)~10

           Task QPS baseline      StdDev    QPS candidate      StdDev                  Pct diff   p-value
    IntervalDis        20.34     (11.4%)            19.83      (9.2%)     -2.5% ( -20% -   20%)     0.446
 IntervalMinDis        34.03      (9.9%)            35.22      (9.5%)      3.5% ( -14% -   25%)     0.251
        SpanDis        63.63     (10.4%)            68.56     (11.0%)      7.8% ( -12% -   32%)     0.022
{code}
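
Since baseline and candidate run the same code, the {{Pct diff}} column here only reflects run-to-run noise; the comparison of interest is between tasks. A minimal sketch of that comparison, using the baseline QPS figures from the table above (the {{relative_qps}} helper is a hypothetical name for illustration, not part of luceneutil):

```python
# Baseline QPS copied from the table above.
# Baseline and candidate run identical code, so only one column is needed.
baseline_qps = {
    "IntervalDis": 20.34,
    "IntervalMinDis": 34.03,
    "SpanDis": 63.63,
}


def relative_qps(qps_by_task, reference):
    """Return each task's QPS as a multiple of the reference task's QPS.

    Hypothetical helper for illustration only.
    """
    ref = qps_by_task[reference]
    return {task: qps / ref for task, qps in qps_by_task.items()}


ratios = relative_qps(baseline_qps, "IntervalDis")
for task, ratio in ratios.items():
    print(f"{task:>15}: {ratio:.2f}x IntervalDis")
```

On these numbers SpanDis runs at roughly 3.1x the QPS of IntervalDis, and IntervalMinDis at roughly 1.7x. With standard deviations around 10%, these ratios are rough indications rather than precise measurements.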
 

Next, an intensive use case, contrived to illustrate how performance changes as the number of internal disjunctions increases:
{code:java}
# (body:smith a|in-the)~10
# (body:smith a|in-the the|in-the)~10
# (body:smith a|in-the the|in-the a|in-the)~10
# (body:smith a|in-the the|in-the a|in-the the|in-the)~10
# (body:smith a|in-the the|in-the a|in-the the|in-the a|in-the)~10
# (body:smith a|in-the the|in-the a|in-the the|in-the a|in-the the|in-the)~10
# NOTE: "smith" is arbitrary; just to push QPS numbers into a more human-friendly range

           Task QPS baseline  StdDev  QPS candidate  StdDev               Pct diff  p-value
   IntervalDis1        82.47  (2.3%)          81.27  (1.9%)  -1.5% (  -5% -    2%)    0.276
   IntervalDis2        25.96  (1.3%)          25.91  (1.7%)  -0.2% (  -3% -    2%)    0.851
   IntervalDis3         9.46  (2.3%)           9.46  (3.4%)  -0.0% (  -5% -    5%)    0.986
   IntervalDis4         3.69  (2.1%)           3.69  (2.3%)   0.1% (  -4% -    4%)    0.962
   IntervalDis5         1.57  (1.1%)           1.56  (0.9%)  -0.7% (  -2% -    1%)    0.282
   IntervalDis6         0.66  (0.6%)           0.66  (1.5%)  -0.6% (  -2% -    1%)    0.414
IntervalMinDis1       130.06  (5.6%)         129.07  (4.8%)  -0.8% ( -10% -   10%)    0.817
IntervalMinDis2       115.44  (6.3%)         116.59  (4.2%)   1.0% (  -8% -   12%)    0.769
IntervalMinDis3        97.24  (5.0%)          99.19  (7.6%)   2.0% ( -10% -   15%)    0.625
IntervalMinDis4       100.28  (8.0%)         101.31  (3.1%)   1.0% (  -9% -   13%)    0.791
IntervalMinDis5       102.01  (8.0%)         101.34  (6.2%)  -0.6% ( -13% -   14%)    0.886
IntervalMinDis6        99.96  (2.2%)          97.27  (7.0%)  -2.7% ( -11% -    6%)    0.410
       SpanDis1        81.13  (4.0%)          80.34  (2.1%)  -1.0% (  -6% -    5%)    0.630
       SpanDis2        45.01  (1.6%)          44.21  (1.5%)  -1.8% (  -4% -    1%)    0.068
       SpanDis3        31.01  (2.0%)          31.21  (1.9%)   0.6% (  -3% -    4%)    0.608
       SpanDis4        24.36  (2.2%)          23.01  (5.7%)  -5.6% ( -13% -    2%)    0.042
       SpanDis5        19.76  (4.0%)          20.22  (3.5%)   2.3% (  -4% -   10%)    0.324
       SpanDis6        17.29  (4.5%)          16.74  (5.9%)  -3.2% ( -12% -    7%)    0.340
{code}
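
To make the scaling trend easier to read, the following sketch computes how much QPS each task family loses per added disjunction clause. It uses only the baseline QPS figures from the table above; {{step_slowdowns}} is a hypothetical helper name, not part of luceneutil.

```python
# Baseline QPS for tasks 1..6 of each family, copied from the table above.
baseline_qps = {
    "IntervalDis": [82.47, 25.96, 9.46, 3.69, 1.57, 0.66],
    "IntervalMinDis": [130.06, 115.44, 97.24, 100.28, 102.01, 99.96],
    "SpanDis": [81.13, 45.01, 31.01, 24.36, 19.76, 17.29],
}


def step_slowdowns(qps):
    """Ratio of each QPS value to the next one.

    A value above 1 means the additional clause reduced throughput.
    Hypothetical helper for illustration only.
    """
    return [prev / cur for prev, cur in zip(qps, qps[1:])]


slowdowns = {task: step_slowdowns(qps) for task, qps in baseline_qps.items()}
totals = {task: qps[0] / qps[-1] for task, qps in baseline_qps.items()}

for task, steps in slowdowns.items():
    per_step = ", ".join(f"{s:.2f}x" for s in steps)
    print(f"{task:>15}: per step [{per_step}], overall {totals[task]:.1f}x")
```

On these figures IntervalDis loses between about 2.3x and 3.2x per added clause (about 125x from task 1 to task 6), SpanDis loses between about 1.1x and 1.8x per clause (about 4.7x overall), and IntervalMinDis drops about 1.3x overall, holding near 100 QPS from the third task onward.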
 

For good measure, I added two tasks that compare non-positional disjunctions across the two implementations: SpanOrQuery and DisjunctionIntervalsSource. (FWIW, I'd guess the performance gap between straight disjunctions could be closed without too much work.)
{code:java}
#  (body:trash|waste|garbage|recycling|refuse)

            Task QPS baseline   StdDev QPS candidate   StdDev               Pct diff  p-value
    PlainSpanDis        80.92  (11.3%)         82.80  (17.5%)   2.3% ( -23% -   35%)    0.619
PlainIntervalDis       142.66  (10.8%)        154.38  (13.6%)   8.2% ( -14% -   36%)    0.035
{code}
 



> Move span queries to the queries module
> ---------------------------------------
>
>                 Key: LUCENE-9204
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9204
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Alan Woodward
>            Assignee: Alan Woodward
>            Priority: Major
>             Fix For: main (9.0)
>
>          Time Spent: 1h
>  Remaining Estimate: 0h
>
> We have a slightly odd situation currently, with two parallel query structures for building complex positional queries: the long-standing span queries, in core; and interval queries, in the queries module.  Given that interval queries solve at least some of the problems we've had with Spans, I think we should be pushing users more towards these implementations.  It's counter-intuitive to do that when Spans are in core though.  I've opened this issue to discuss moving the spans package as a whole to the queries module.


