Posted to issues@lucene.apache.org by "Michael Gibney (Jira)" <ji...@apache.org> on 2021/06/18 01:47:00 UTC
[jira] [Comment Edited] (LUCENE-9204) Move span queries to the queries module
[ https://issues.apache.org/jira/browse/LUCENE-9204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17365106#comment-17365106 ]
Michael Gibney edited comment on LUCENE-9204 at 6/18/21, 1:46 AM:
------------------------------------------------------------------
I hope it's ok to post this here; I've [added benchmarks|https://github.com/mikemccand/luceneutil/pull/133] with the goal of quantifying performance for these different approaches. The index is 500k docs from wikimedium. Baseline and candidate code are the same, since I'm initially seeking to compare different queries, not different code; the "Pct diff" and p-value columns therefore only reflect run-to-run noise, and the comparison of interest is between tasks.
First, a realistic use-case, somewhat contrived to exercise {{pullUpDisjunctions()}}:
{code:java}
# (body:us|united-states health|health-care policy|public-policy law|legal-aspects)~10
Task             QPS baseline  StdDev  QPS candidate  StdDev  Pct diff              p-value
IntervalDis             20.34  (11.4%)         19.83  (9.2%)   -2.5% ( -20% - 20%)  0.446
IntervalMinDis          34.03  (9.9%)          35.22  (9.5%)    3.5% ( -14% - 25%)  0.251
SpanDis                 63.63  (10.4%)         68.56  (11.0%)   7.8% ( -12% - 32%)  0.022
{code}
Next, an intensive use-case, contrived to illustrate how performance changes as the number of internal disjunctions increases:
{code:java}
# (body:smith a|in-the)~10
# (body:smith a|in-the the|in-the)~10
# (body:smith a|in-the the|in-the a|in-the)~10
# (body:smith a|in-the the|in-the a|in-the the|in-the)~10
# (body:smith a|in-the the|in-the a|in-the the|in-the a|in-the)~10
# (body:smith a|in-the the|in-the a|in-the the|in-the a|in-the the|in-the)~10
# NOTE: "smith" is arbitrary; just to push QPS numbers into a more human-friendly range
Task             QPS baseline  StdDev  QPS candidate  StdDev  Pct diff              p-value
IntervalDis1            82.47  (2.3%)          81.27  (1.9%)   -1.5% ( -5% - 2%)    0.276
IntervalDis2            25.96  (1.3%)          25.91  (1.7%)   -0.2% ( -3% - 2%)    0.851
IntervalDis3             9.46  (2.3%)           9.46  (3.4%)   -0.0% ( -5% - 5%)    0.986
IntervalDis4             3.69  (2.1%)           3.69  (2.3%)    0.1% ( -4% - 4%)    0.962
IntervalDis5             1.57  (1.1%)           1.56  (0.9%)   -0.7% ( -2% - 1%)    0.282
IntervalDis6             0.66  (0.6%)           0.66  (1.5%)   -0.6% ( -2% - 1%)    0.414
IntervalMinDis1        130.06  (5.6%)         129.07  (4.8%)   -0.8% ( -10% - 10%)  0.817
IntervalMinDis2        115.44  (6.3%)         116.59  (4.2%)    1.0% ( -8% - 12%)   0.769
IntervalMinDis3         97.24  (5.0%)          99.19  (7.6%)    2.0% ( -10% - 15%)  0.625
IntervalMinDis4        100.28  (8.0%)         101.31  (3.1%)    1.0% ( -9% - 13%)   0.791
IntervalMinDis5        102.01  (8.0%)         101.34  (6.2%)   -0.6% ( -13% - 14%)  0.886
IntervalMinDis6         99.96  (2.2%)          97.27  (7.0%)   -2.7% ( -11% - 6%)   0.410
SpanDis1                81.13  (4.0%)          80.34  (2.1%)   -1.0% ( -6% - 5%)    0.630
SpanDis2                45.01  (1.6%)          44.21  (1.5%)   -1.8% ( -4% - 1%)    0.068
SpanDis3                31.01  (2.0%)          31.21  (1.9%)    0.6% ( -3% - 4%)    0.608
SpanDis4                24.36  (2.2%)          23.01  (5.7%)   -5.6% ( -13% - 2%)   0.042
SpanDis5                19.76  (4.0%)          20.22  (3.5%)    2.3% ( -4% - 10%)   0.324
SpanDis6                17.29  (4.5%)          16.74  (5.9%)   -3.2% ( -12% - 7%)   0.340
{code}
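One rough way to read the scaling in the table above is to compare each task's baseline QPS with the next one in the same family. The Python sketch below is not part of the luceneutil PR; it only does arithmetic on the baseline QPS values copied from the table, and the helper names ({{step_slowdowns}}, {{total_slowdown}}) are made up for illustration.

```python
# Baseline QPS copied from the table above, ordered by task suffix (1..6),
# i.e. by increasing number of internal disjunctions.
baseline_qps = {
    "IntervalDis":    [82.47, 25.96, 9.46, 3.69, 1.57, 0.66],
    "IntervalMinDis": [130.06, 115.44, 97.24, 100.28, 102.01, 99.96],
    "SpanDis":        [81.13, 45.01, 31.01, 24.36, 19.76, 17.29],
}


def step_slowdowns(qps):
    """QPS of each task divided by QPS of the next one (> 1 means it got slower)."""
    return [prev / cur for prev, cur in zip(qps, qps[1:])]


def total_slowdown(qps):
    """QPS of the first task divided by QPS of the last one."""
    return qps[0] / qps[-1]


for task, qps in baseline_qps.items():
    steps = ", ".join(f"{r:.2f}x" for r in step_slowdowns(qps))
    print(f"{task:<15} per-step: {steps}   total: {total_slowdown(qps):.1f}x")
```

On these numbers, IntervalDis slows down by roughly 2.3x to 3.2x with each added disjunction (about 125x from IntervalDis1 to IntervalDis6), SpanDis slows down about 4.7x overall, and IntervalMinDis stays within about 1.3x. This is a single run on one index, so the ratios are indicative only.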
For good measure, I added two tasks that compare non-positional disjunctions across different implementations: SpanOrQuery and DisjunctionIntervalsSource. (fwiw, I'd guess the performance gap between straight disjunctions could probably be closed without too much work?)
{code:java}
# (body:trash|waste|garbage|recycling|refuse)
Task             QPS baseline  StdDev  QPS candidate  StdDev  Pct diff              p-value
PlainSpanDis            80.92  (11.3%)         82.80  (17.5%)   2.3% ( -23% - 35%)  0.619
PlainIntervalDis       142.66  (10.8%)        154.38  (13.6%)   8.2% ( -14% - 36%)  0.035
{code}
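To put a number on the gap between the two plain disjunction tasks, the result rows can be parsed and compared directly. The sketch below is illustrative only: the regular expression is an assumption based on the row layout shown in this comment, not a format documented by luceneutil, and {{parse_row}} is a hypothetical helper.

```python
import re

# Rows copied from the table above.
RESULTS = """
PlainSpanDis 80.92 (11.3%) 82.80 (17.5%) 2.3% ( -23% - 35%) 0.619
PlainIntervalDis 142.66 (10.8%) 154.38 (13.6%) 8.2% ( -14% - 36%) 0.035
"""

# Assumed layout: task, QPS baseline, (StdDev%), QPS candidate, (StdDev%),
# Pct diff%, ( low% - high% ), p-value. Any run of whitespace separates columns.
ROW = re.compile(
    r"^(?P<task>\S+)\s+"
    r"(?P<qps_base>[\d.]+)\s+\(\s*(?P<sd_base>[\d.]+)%\)\s+"
    r"(?P<qps_cand>[\d.]+)\s+\(\s*(?P<sd_cand>[\d.]+)%\)\s+"
    r"(?P<pct>-?[\d.]+)%\s+\(\s*(?P<lo>-?\d+)%\s+-\s+(?P<hi>-?\d+)%\)\s+"
    r"(?P<p>[\d.]+)$"
)


def parse_row(line):
    """Parse one result row into a dict; every field except 'task' becomes a float."""
    match = ROW.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized result row: {line!r}")
    fields = match.groupdict()
    return {k: (v if k == "task" else float(v)) for k, v in fields.items()}


by_task = {row["task"]: row for row in map(parse_row, RESULTS.strip().splitlines())}
ratio = by_task["PlainIntervalDis"]["qps_base"] / by_task["PlainSpanDis"]["qps_base"]
print(f"PlainIntervalDis / PlainSpanDis baseline QPS: {ratio:.2f}x")
```

On the baseline column this gives a ratio of about 1.76 between PlainIntervalDis and PlainSpanDis. Given StdDev values above 10% for both tasks, that figure should be treated as approximate.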
> Move span queries to the queries module
> ---------------------------------------
>
> Key: LUCENE-9204
> URL: https://issues.apache.org/jira/browse/LUCENE-9204
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Alan Woodward
> Assignee: Alan Woodward
> Priority: Major
> Fix For: main (9.0)
>
> Time Spent: 1h
> Remaining Estimate: 0h
>
> We have a slightly odd situation currently, with two parallel query structures for building complex positional queries: the long-standing span queries, in core; and interval queries, in the queries module. Given that interval queries solve at least some of the problems we've had with Spans, I think we should be pushing users more towards these implementations. It's counter-intuitive to do that when Spans are in core though. I've opened this issue to discuss moving the spans package as a whole to the queries module.