You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by "Michael Sokolov (Jira)" <ji...@apache.org> on 2020/02/03 22:14:00 UTC
[jira] [Commented] (LUCENE-8929) Early Terminating CollectorManager

    [ https://issues.apache.org/jira/browse/LUCENE-8929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17029331#comment-17029331 ] 

Michael Sokolov commented on LUCENE-8929:
-----------------------------------------

This has been sitting for a while, and since I ended up needing something similar at ${DAY_JOB}, I'm sharing our solution. The basic idea is the same as what [~atris] described above, but ends up combining both (1) and (2) in the description, if I understand what was meant there. The comments in {{MaxScoreAccumulator}} go into the details. The basic idea is that we periodically share number of docs collected + current worst score from each leaf. Then once we have (globally) enough hits, and a leaf can determine it is no longer competitive (compared to the other leaves – ie it is the current worst), we terminate it. Then we revise the number of hits required by the remaining leaves, and if they are sufficient to cover that new reduced threshold, we can update the global worst score (to a better one), enabling a tighter bound for the remaining leaves. If the updating period is reduced to *every hit* then I think we essentially get a global priority queue, but this is too costly due to synchronization, so we back off to a reduced frequency.

I tested perf with luceneutil creating a sorted index over `DayOfYear` and another one over `LastModDate`. The day of year sorted queries got about 20% QPS gain with this change. All the tests below were run with 16 searcher threads (this only applies to concurrent searches across multiple segments):

||                    Task||  QPS before||      StdDev||   QPS after||      StdDev||                Pct diff||
|    LowTermDayOfYearSort|      586.56|      (2.8%)|      716.24      |(1.2%)|   22.1% |(  17% -   26%)|
|   HighTermDayOfYearSort|      580.73|      (2.8%)|      721.21      |(1.2%)|   24.2% |(  19% -   28%)|

Last mod date queries see a much smaller improvement. I believe this is because the values are indexed roughly in order, causing simlar values to be grouped together in the same segment, which is an adversarial case for this algorithm.  

||                    Task||  QPS before||      StdDev||   QPS after||      StdDev||                Pct diff||
|      LowTermLastModSort|     1034.49|      (4.3%)|     1037.92      |(3.1%)|    0.3% |(  -6% -    8%)|
|     HighTermLastModSort|     1056.17|      (4.7%)|     1109.77      |(3.5%)|    5.1% |(  -2% -   13%)|

Performance is unchanged when the conditions are not right, ie the index is not sorted by the query's sort, or relevance ranking is used:

||                    Task||  QPS before||      StdDev||   QPS after||      StdDev||                Pct diff||
|   HighTermDayOfYearSort|       35.67|      (2.2%)|       35.45      |(2.4%)|   -0.6% |(  -5% -    4%)|
|              OrHighHigh|       14.07|      (2.3%)|       14.01      |(3.1%)|   -0.4% |(  -5% -    5%)|
|                 LowTerm|       35.23|      (2.2%)|       35.11      |(2.4%)|   -0.3% |(  -4% -    4%)|
|                 MedTerm|       34.81|      (2.1%)|       34.70      |(2.2%)|   -0.3% |(  -4% -    4%)|
|              AndHighLow|       34.68|      (2.2%)|       34.58      |(2.5%)|   -0.3% |(  -4% -    4%)|
|                Wildcard|       30.06|      (1.8%)|       29.98      |(2.2%)|   -0.3% |(  -4% -    3%)|
|                HighTerm|       34.99|      (2.1%)|       34.90      |(2.4%)|   -0.3% |(  -4% -    4%)|
|               OrHighLow|       32.55|      (1.9%)|       32.47      |(2.3%)|   -0.3% |(  -4% -    3%)|
|             AndHighHigh|       25.04|      (2.7%)|       24.98      |(2.7%)|   -0.3% |(  -5% -    5%)|
|                PKLookup|       58.95|      (1.0%)|       58.80      |(1.2%)|   -0.2% |(  -2% -    2%)|
|              AndHighMed|       30.45|      (2.2%)|       30.38      |(2.3%)|   -0.2% |(  -4% -    4%)|
|               OrHighMed|       28.19|      (2.4%)|       28.13      |(2.8%)|   -0.2% |(  -5% -    5%)|

> Early Terminating CollectorManager
> ----------------------------------
>
>                 Key: LUCENE-8929
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8929
>             Project: Lucene - Core
>          Issue Type: Sub-task
>            Reporter: Atri Sharma
>            Priority: Major
>          Time Spent: 3h
>  Remaining Estimate: 0h
>
> We should have an early terminating collector manager which accurately tracks hits across all of its collectors and determines when there are enough hits, allowing all the collectors to abort.
> The options for the same are:
> 1) Shared total count : Global "scoreboard" where all collectors update their current hit count. At the end of each document's collection, collector checks if N > threshold, and aborts if true
> 2) State Reporting Collectors: Collectors report their total number of counts collected periodically using a callback mechanism, and get a proceed or abort decision.
> 1) has the overhead of synchronization in the hot path, 2) can collect unnecessary hits before aborting.
> I am planning to work on 2), unless objections



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org