Posted to dev@lucene.apache.org by "Adrien Grand (JIRA)" <ji...@apache.org> on 2015/07/03 14:14:04 UTC

[jira] [Updated] (LUCENE-6645) BKD tree queries should use BitDocIdSet.Builder

     [ https://issues.apache.org/jira/browse/LUCENE-6645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adrien Grand updated LUCENE-6645:
---------------------------------
    Attachment: LUCENE-6645.patch

I played a bit with the benchmark and got similar results (1.76 sec on trunk and more than 4 sec with the patch). This is a worst case for BitDocIdSetBuilder, since it always starts by building a SparseFixedBitSet and only later upgrades it to a FixedBitSet. Still, it is disappointing that it is so much slower than building a FixedBitSet directly.

I've experimented with a more brute-force approach (see attached patch) that uses a plain int[] instead of a SparseFixedBitSet for the sparse case, and it seems to perform better: the benchmark runs in 1.76 sec on trunk and 2.70 sec with the patch when the builder is configured to use the int[] for up to maxDoc / 128 documents. It goes down to 1.96 sec with a threshold of maxDoc / 2048. Maybe this is what we should use instead of BitDocIdSetBuilder?
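
To make the idea concrete, here is a rough sketch of the "buffer doc IDs in an int[], then upgrade to a bit set" approach. It is not the attached patch: the class name IntBufferDocIdSetBuilder, the divisor parameter, and the use of java.util.BitSet as a stand-in for FixedBitSet are all assumptions made so that the example is self-contained.

```java
import java.util.Arrays;
import java.util.BitSet;

// Hypothetical sketch, not the code from LUCENE-6645.patch.
// java.util.BitSet stands in for Lucene's FixedBitSet.
final class IntBufferDocIdSetBuilder {
    private final int maxDoc;
    private final int threshold; // buffer at most this many entries (duplicates included)
    private int[] buffer = new int[16];
    private int size = 0;
    private BitSet bits = null;  // non-null once upgraded to the dense form

    // divisor is the N in "maxDoc / N", e.g. 128 or 2048.
    IntBufferDocIdSetBuilder(int maxDoc, int divisor) {
        this.maxDoc = maxDoc;
        this.threshold = Math.max(1, maxDoc / divisor);
    }

    void add(int doc) {
        if (bits == null && size == threshold) {
            upgrade(); // the buffer is full: switch to the dense representation
        }
        if (bits != null) {
            bits.set(doc);
            return;
        }
        if (size == buffer.length) {
            buffer = Arrays.copyOf(buffer, size * 2);
        }
        buffer[size++] = doc; // unsorted append; sorting is deferred to build()
    }

    private void upgrade() {
        bits = new BitSet(maxDoc);
        for (int i = 0; i < size; i++) {
            bits.set(buffer[i]);
        }
        buffer = null;
    }

    boolean isDense() {
        return bits != null;
    }

    // Returns the collected doc IDs in increasing order, without duplicates.
    int[] build() {
        if (bits != null) {
            return bits.stream().toArray();
        }
        Arrays.sort(buffer, 0, size);
        int n = 0;
        for (int i = 0; i < size; i++) {
            if (n == 0 || buffer[i] != buffer[n - 1]) {
                buffer[n++] = buffer[i];
            }
        }
        return Arrays.copyOf(buffer, n);
    }
}
```

In this sketch the sparse path pays for one sort at the end instead of maintaining a sparse bit set on every add, and a smaller threshold (a larger divisor) makes the switch to the dense form happen sooner.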

I tried to see how this affects our luceneutil benchmark and there is barely any change:

{noformat}
                    Task    QPS base      StdDev   QPS patch      StdDev                Pct diff
                  Fuzzy1       74.41     (18.3%)       69.59     (19.4%)   -6.5% ( -37% -   38%)
                 LowTerm      761.39      (2.4%)      749.20      (3.6%)   -1.6% (  -7% -    4%)
            OrNotHighLow      877.81      (2.2%)      867.60      (5.3%)   -1.2% (  -8% -    6%)
            OrHighNotMed       76.63      (2.1%)       75.89      (2.7%)   -1.0% (  -5% -    3%)
                 MedTerm      309.75      (1.3%)      306.86      (2.6%)   -0.9% (  -4% -    2%)
              OrHighHigh       26.86      (5.4%)       26.64      (3.3%)   -0.8% (  -9% -    8%)
           OrNotHighHigh       67.94      (1.0%)       67.42      (2.0%)   -0.8% (  -3% -    2%)
                HighTerm      132.28      (1.4%)      131.29      (1.7%)   -0.7% (  -3% -    2%)
                 Respell       78.71      (2.8%)       78.14      (3.2%)   -0.7% (  -6% -    5%)
               LowPhrase      121.23      (0.8%)      120.47      (1.3%)   -0.6% (  -2% -    1%)
            OrHighNotLow      112.94      (2.3%)      112.25      (2.5%)   -0.6% (  -5% -    4%)
            OrNotHighMed      223.81      (2.4%)      222.52      (3.8%)   -0.6% (  -6% -    5%)
               OrHighLow       71.79      (4.3%)       71.39      (3.3%)   -0.6% (  -7% -    7%)
             MedSpanNear       23.33      (1.1%)       23.21      (1.8%)   -0.5% (  -3% -    2%)
             AndHighHigh       62.01      (1.9%)       61.71      (3.6%)   -0.5% (  -5% -    5%)
               OrHighMed       41.79      (5.5%)       41.61      (3.6%)   -0.4% (  -9% -    9%)
              AndHighMed       90.86      (2.0%)       90.61      (2.8%)   -0.3% (  -5% -    4%)
        HighSloppyPhrase       47.43      (4.6%)       47.33      (4.8%)   -0.2% (  -9% -    9%)
              HighPhrase       28.36      (1.6%)       28.30      (1.3%)   -0.2% (  -3% -    2%)
               MedPhrase      147.25      (1.4%)      146.99      (1.6%)   -0.2% (  -3% -    2%)
         LowSloppyPhrase       37.07      (2.2%)       37.03      (2.3%)   -0.1% (  -4% -    4%)
         MedSloppyPhrase      156.95      (3.7%)      156.80      (3.6%)   -0.1% (  -7% -    7%)
             LowSpanNear       29.05      (2.2%)       29.02      (2.0%)   -0.1% (  -4% -    4%)
           OrHighNotHigh       61.13      (1.5%)       61.08      (1.6%)   -0.1% (  -3% -    3%)
            HighSpanNear       15.36      (1.7%)       15.36      (1.8%)    0.0% (  -3% -    3%)
                Wildcard      111.57      (3.1%)      113.05      (2.1%)    1.3% (  -3% -    6%)
                  IntNRQ        7.49      (7.3%)        7.60      (5.2%)    1.4% ( -10% -   14%)
                 Prefix3       72.81      (4.6%)       74.18      (4.1%)    1.9% (  -6% -   11%)
              AndHighLow      974.36      (3.0%)      994.46      (2.9%)    2.1% (  -3% -    8%)
                  Fuzzy2       47.42     (16.1%)       53.71     (16.5%)   13.3% ( -16% -   54%)
{noformat}

I suspect this is because the multi-term queries in this benchmark match some high-frequency terms, so the upgrade to a FixedBitSet happens quickly anyway.

> BKD tree queries should use BitDocIdSet.Builder
> -----------------------------------------------
>
>                 Key: LUCENE-6645
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6645
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>         Attachments: LUCENE-6645.patch, LUCENE-6645.patch
>
>
> When I was iterating on BKD tree originally I remember trying to use this builder (which makes a sparse bit set at first and then upgrades to dense if enough bits get set) and being disappointed with its performance.
> I wound up just making a FixedBitSet every time, but this is obviously wasteful for small queries.
> It could be the perf was poor because I was always .or'ing in DISIs that had 512 - 1024 hits each time (the size of each leaf cell in the BKD tree)?  I also had to make my own DISI wrapper around each leaf cell... maybe that was the source of the slowness, not sure.
> I also sort of wondered whether the SmallDocSet in the spatial module (backed by a SentinelIntSet) might be faster ... though it'd need to be sorted in the end, after building, before returning to Lucene.


