Posted to dev@lucene.apache.org by "Michael McCandless (JIRA)" <ji...@apache.org> on 2013/03/02 15:39:13 UTC

[jira] [Updated] (LUCENE-4795) Add FacetsCollector based on SortedSetDocValues

     [ https://issues.apache.org/jira/browse/LUCENE-4795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-4795:
---------------------------------------

    Attachment: LUCENE-4795.patch

New patch ... I think it's close but there are still some nocommits.

I switched to a FacetsAccumulator (SortedSetDVAccumulator) instead of
XXXCollector because:

  * It's a fairer comparison, since it now does all counting "in the
    end", matching trunk; counting in the end was a bit faster than
    count-as-you-go when we last tested.

  * It means you can use this class with DrillSideways ... I fixed
    TestDrillSideways to test it (passes!).
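To illustrate the difference between the two counting styles, here is a rough sketch in plain Java. It is not the code in the patch: the class name, the method names and the int[][] docOrds layout (the ords attached to each doc) are made up for illustration.

```java
import java.util.BitSet;

/** Hypothetical illustration only; not part of LUCENE-4795.patch. */
final class CountInTheEnd {

  /** Count-as-you-go: bump the counters as each hit is collected. */
  static int[] countAsYouGo(int[] hits, int[][] docOrds, int numOrds) {
    int[] counts = new int[numOrds];
    for (int doc : hits) {
      for (int ord : docOrds[doc]) {
        counts[ord]++;
      }
    }
    return counts;
  }

  /** Count "in the end": only record matching docs while collecting,
   *  then make a single counting pass over the recorded docs. */
  static int[] countInTheEnd(int[] hits, int[][] docOrds, int numOrds) {
    BitSet matched = new BitSet(docOrds.length);
    for (int doc : hits) {
      matched.set(doc); // collect phase: remember the hit, nothing else
    }
    int[] counts = new int[numOrds];
    for (int doc = matched.nextSetBit(0); doc >= 0; doc = matched.nextSetBit(doc + 1)) {
      for (int ord : docOrds[doc]) {
        counts[ord]++; // counting phase, after collection is done
      }
    }
    return counts;
  }
}
```

Both methods return the same counts; they only differ in when the counting work happens.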

I also got a custom topK impl working.

The facets are the same as trunk, except for tie-break differences.
The new collector is better in this regard: it breaks ties in an
understandable-to-the-end-user way (by ord = Unicode sort order),
unlike the taxo index, which breaks ties by "order in which label was
indexed into taxo index" (confusing to end user).
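A minimal sketch of a top-K selection with that tie-break, assuming counts are held in an int[] indexed by ord and that a smaller ord sorts first. The OrdTopK class and its topK method are hypothetical names, not the custom topK impl from the patch.

```java
import java.util.PriorityQueue;

/** Hypothetical illustration only; not part of LUCENE-4795.patch. */
final class OrdTopK {

  /** Returns up to k ords with a nonzero count, highest count first;
   *  equal counts are ordered by smaller ord first. */
  static int[] topK(int[] counts, int k) {
    // Min-heap whose head is the "worst" entry seen so far:
    // lower count, or equal count with the larger ord.
    PriorityQueue<Integer> pq = new PriorityQueue<>((a, b) -> {
      if (counts[a] != counts[b]) {
        return Integer.compare(counts[a], counts[b]);
      }
      return Integer.compare(b, a);
    });
    for (int ord = 0; ord < counts.length; ord++) {
      if (counts[ord] == 0) {
        continue; // skip labels that never matched
      }
      pq.add(ord);
      if (pq.size() > k) {
        pq.poll(); // evict the current worst entry
      }
    }
    // The heap pops worst-first, so fill the result from the back.
    int[] result = new int[pq.size()];
    for (int i = result.length - 1; i >= 0; i--) {
      result[i] = pq.poll();
    }
    return result;
  }
}
```

With counts {3, 5, 5, 0, 3, 1} and k = 3, ords 1 and 2 tie at 5 and ords 0 and 4 tie at 3; the smaller ord wins each tie, so the result is ords 1, 2, 0.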

I first went down the road of making a TaxoReader that wraps a
SlowCompositeReaderWrapper ... but this became problematic because a
DV instance is not thread-safe, yet TaxoReader's APIs are supposed to
be thread-safe.  I also really didn't like making 3 int[maxOrd] to
handle "hierarchy" when SortedSetDV facets only support a 2-level
hierarchy (dim + child).

So I backed off of that and made a separate State object, which you
must re-init after every top-reader-reopen, and it does the heavyish
stuff.

Current results (base = trunk w/ allbutdim, comp = patch, full wikibig
index with 5 flat dims):

{noformat}
                    Task    QPS base      StdDev    QPS comp      StdDev                Pct diff
                HighTerm        9.36      (1.9%)        7.02      (3.6%)  -25.0% ( -29% -  -19%)
                 MedTerm       53.21      (1.5%)       40.65      (2.8%)  -23.6% ( -27% -  -19%)
               OrHighLow       13.25      (2.1%)       10.55      (3.4%)  -20.4% ( -25% -  -15%)
               OrHighMed       25.77      (1.9%)       20.90      (3.1%)  -18.9% ( -23% -  -14%)
              OrHighHigh       13.03      (2.2%)       10.63      (3.2%)  -18.4% ( -23% -  -13%)
                 LowTerm      146.28      (1.7%)      120.22      (1.7%)  -17.8% ( -20% -  -14%)
{noformat}

                
> Add FacetsCollector based on SortedSetDocValues
> -----------------------------------------------
>
>                 Key: LUCENE-4795
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4795
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/facet
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>         Attachments: LUCENE-4795.patch, LUCENE-4795.patch, LUCENE-4795.patch, pleaseBenchmarkMe.patch
>
>
> Recently (LUCENE-4765) we added multi-valued DocValues field
> (SortedSetDocValuesField), and this can be used for faceting in Solr
> (SOLR-4490).  I think we should also add support in the facet module?
> It'd be an option with different tradeoffs.  Eg, it wouldn't require
> the taxonomy index, since the main index handles label/ord resolving.
> There are at least two possible approaches:
>   * On every reopen, build the seg -> global ord map, and then on
>     every collect, get the seg ord, map it to the global ord space,
>     and increment counts.  This adds cost during reopen in proportion
>     to number of unique terms ...
>   * On every collect, increment counts based on the seg ords, and then
>     do a "merge" in the end just like distributed faceting does.
> The first approach is much easier so I built a quick prototype using
> that.  The prototype does the counting, but it does NOT do the top K
> facets gathering in the end, and it doesn't "know" parent/child ord
> relationships, so there's tons more to do before this is real.  I also
> was unsure how to properly integrate it since the existing classes
> seem to expect that you use a taxonomy index to resolve ords.
> I ran a quick performance test.  base = trunk except I disabled the
> "compute top-K" in FacetsAccumulator to make the comparison fair; comp
> = using the prototype collector in the patch:
> {noformat}
>                     Task    QPS base      StdDev    QPS comp      StdDev                Pct diff
>                OrHighLow       18.79      (2.5%)       14.36      (3.3%)  -23.6% ( -28% -  -18%)
>                 HighTerm       21.58      (2.4%)       16.53      (3.7%)  -23.4% ( -28% -  -17%)
>                OrHighMed       18.20      (2.5%)       13.99      (3.3%)  -23.2% ( -28% -  -17%)
>                  Prefix3       14.37      (1.5%)       11.62      (3.5%)  -19.1% ( -23% -  -14%)
>                  LowTerm      130.80      (1.6%)      106.95      (2.4%)  -18.2% ( -21% -  -14%)
>               OrHighHigh        9.60      (2.6%)        7.88      (3.5%)  -17.9% ( -23% -  -12%)
>              AndHighHigh       24.61      (0.7%)       20.74      (1.9%)  -15.7% ( -18% -  -13%)
>                   Fuzzy1       49.40      (2.5%)       43.48      (1.9%)  -12.0% ( -15% -   -7%)
>          MedSloppyPhrase       27.06      (1.6%)       23.95      (2.3%)  -11.5% ( -15% -   -7%)
>                  MedTerm       51.43      (2.0%)       46.21      (2.7%)  -10.2% ( -14% -   -5%)
>                   IntNRQ        4.02      (1.6%)        3.63      (4.0%)   -9.7% ( -15% -   -4%)
>                 Wildcard       29.14      (1.5%)       26.46      (2.5%)   -9.2% ( -13% -   -5%)
>         HighSloppyPhrase        0.92      (4.5%)        0.87      (5.8%)   -5.4% ( -15% -    5%)
>              MedSpanNear       29.51      (2.5%)       27.94      (2.2%)   -5.3% (  -9% -    0%)
>             HighSpanNear        3.55      (2.4%)        3.38      (2.0%)   -4.9% (  -9% -    0%)
>               AndHighMed      108.34      (0.9%)      104.55      (1.1%)   -3.5% (  -5% -   -1%)
>          LowSloppyPhrase       20.50      (2.0%)       20.09      (4.2%)   -2.0% (  -8% -    4%)
>                LowPhrase       21.60      (6.0%)       21.26      (5.1%)   -1.6% ( -11% -   10%)
>                   Fuzzy2       53.16      (3.9%)       52.40      (2.7%)   -1.4% (  -7% -    5%)
>              LowSpanNear        8.42      (3.2%)        8.45      (3.0%)    0.3% (  -5% -    6%)
>                  Respell       45.17      (4.3%)       45.38      (4.4%)    0.5% (  -7% -    9%)
>                MedPhrase      113.93      (5.8%)      115.02      (4.9%)    1.0% (  -9% -   12%)
>               AndHighLow      596.42      (2.5%)      617.12      (2.8%)    3.5% (  -1% -    8%)
>               HighPhrase       17.30     (10.5%)       18.36      (9.1%)    6.2% ( -12% -   28%)
> {noformat}
> I'm impressed that this approach is only ~24% slower in the worst
> case!  I think this means it's a good option to make available?  Yes
> it has downsides (NRT reopen more costly, small added RAM usage,
> slightly slower faceting), but it's also simpler (no taxo index to
> manage).
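
The first approach in the issue description above (build the seg -> global ord map on reopen, then map seg ords to global ords on every collect) can be sketched in plain Java as follows. This is an illustration under simplifying assumptions, not the prototype's code: each segment's terms are given as a sorted String[] of unique terms, String order stands in for the index's term order, and all names are hypothetical.

```java
import java.util.Arrays;
import java.util.TreeSet;

/** Hypothetical illustration only; not part of LUCENE-4795.patch. */
final class GlobalOrdCounting {

  /** Merges the per-segment term dictionaries into one sorted, de-duplicated
   *  global dictionary. Cost grows with the number of unique terms. */
  static String[] mergeTerms(String[][] segTerms) {
    TreeSet<String> all = new TreeSet<>();
    for (String[] terms : segTerms) {
      all.addAll(Arrays.asList(terms));
    }
    return all.toArray(new String[0]);
  }

  /** Built once per reopen: map[seg][segOrd] is the global ord of that term. */
  static int[][] buildOrdMap(String[][] segTerms, String[] globalTerms) {
    int[][] map = new int[segTerms.length][];
    for (int seg = 0; seg < segTerms.length; seg++) {
      map[seg] = new int[segTerms[seg].length];
      for (int ord = 0; ord < segTerms[seg].length; ord++) {
        // Every segment term is present in the sorted global dictionary.
        map[seg][ord] = Arrays.binarySearch(globalTerms, segTerms[seg][ord]);
      }
    }
    return map;
  }

  /** segDocOrds[seg][i] holds the seg ords of the i-th matching doc in that
   *  segment; each one is mapped to the global ord space and counted. */
  static int[] count(int[][][] segDocOrds, int[][] ordMap, int numGlobalOrds) {
    int[] counts = new int[numGlobalOrds];
    for (int seg = 0; seg < segDocOrds.length; seg++) {
      for (int[] docOrds : segDocOrds[seg]) {
        for (int segOrd : docOrds) {
          counts[ordMap[seg][segOrd]]++;
        }
      }
    }
    return counts;
  }
}
```

The second approach would instead keep one counts array per segment, indexed by seg ord, and merge them by term at the end.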
