Posted to dev@lucene.apache.org by "Shai Erera (JIRA)" <ji...@apache.org> on 2013/10/26 21:10:31 UTC

[jira] [Commented] (LUCENE-5308) explore per-dimension fixed-width ordinal encoding

    [ https://issues.apache.org/jira/browse/LUCENE-5308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13806157#comment-13806157 ] 

Shai Erera commented on LUCENE-5308:
------------------------------------

It's a nice idea. You could probably do the "right" thing if you extend FacetFields and override the CountingListBuilder to generate this fixed encoding. And instead of a FacetsAccumulator, write a FacetsAggregator which decodes per-document, and then you get top-K computation for free ... I think?
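
For illustration, here's a minimal standalone sketch of the per-document decoding step (plain Java, not the actual CountingListBuilder/FacetsAggregator APIs; the per-dimension byte offsets and widths just follow the example layout from the issue description below):

{noformat}
// Hypothetical sketch: decode a per-document fixed-width byte[] into
// per-dimension counts, assuming each dimension was assigned a fixed
// byte slice up front (e.g. date = bytes 0-1, imageCount = byte 2,
// username = bytes 3-6).  Not the actual facet module code.
public class FixedWidthCountsSketch {

  // byte offset and width (in bytes) pre-assigned to each dimension
  static final int[] DIM_OFFSET = {0, 2, 3};   // date, imageCount, username
  static final int[] DIM_WIDTH  = {2, 1, 4};

  // counts[dim][ord] - a private ordinal space per dimension
  final int[][] counts;

  FixedWidthCountsSketch(int[] uniqueValueCount) {
    counts = new int[uniqueValueCount.length][];
    for (int dim = 0; dim < uniqueValueCount.length; dim++) {
      counts[dim] = new int[uniqueValueCount[dim]];
    }
  }

  // called once per matching document with that document's packed byte[]
  void aggregate(byte[] packed) {
    for (int dim = 0; dim < DIM_OFFSET.length; dim++) {
      int ord = 0;
      for (int b = 0; b < DIM_WIDTH[dim]; b++) {
        ord = (ord << 8) | (packed[DIM_OFFSET[dim] + b] & 0xFF);
      }
      counts[dim][ord]++;   // no shared ordinal space to decode
    }
  }
}
{noformat}

Top-K per dimension would then just scan its own counts[dim] array.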

I guess if an app truly knows it has a fixed taxonomy, and that every document contains all facets, this could be useful. Maybe instead of asking the app to specify uniqueValueCount, we write a FixedTaxonomyWriter which the app has to build up front, with all categories, before it adds any documents, and we compute uniqueValueCount ourselves?
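
Roughly, the flow could look like this (a loose sketch; FixedTaxonomyWriter, freeze() and FixedFacetFields are hypothetical names that don't exist yet, CategoryPath is the existing class):

{noformat}
// 1. Build the complete taxonomy up front, before indexing any documents:
FixedTaxonomyWriter taxo = new FixedTaxonomyWriter(taxoDir);   // hypothetical
taxo.addCategory(new CategoryPath("date", "2013/10/26"));
taxo.addCategory(new CategoryPath("username", "shaie"));
// ... every category the app will ever use ...
taxo.freeze();   // hypothetical: seals the taxonomy, computes uniqueValueCount per dimension

// 2. Index documents; a category that wasn't added up front hits a hard exception:
FixedFacetFields facetFields = new FixedFacetFields(taxo);     // hypothetical
facetFields.addFields(doc, Arrays.asList(
    new CategoryPath("date", "2013/10/26"),
    new CategoryPath("username", "shaie")));
indexWriter.addDocument(doc);
{noformat}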

I mean, this could eliminate the app making mistakes - e.g. FixedFacetFields would not even get a TaxonomyWriter, so if the app tries to use a CategoryPath which wasn't added already, it hits a hard exception. Hmm, it also hits one today if uniqueValueCount is smaller than the ord a CategoryPath gets, but I think this makes it clearer that you must create the taxonomy up front.

Although, I can see an app saying "I don't know which categories I'll see, but there will never be more than X of them" ... so maybe a uniqueValueCount constraint is good as well.

Net/net, this is a very limited solution which an app needs to think about before using it. If it matches the app's needs, it can speed things up. I wonder what the speedups will be when it's fully "productized", and whether it will still be worth keeping in the code base.

I do think though that we could consider a per-CategoryList ordinal space (separate issue) by default. It requires heavy changes to the taxonomy index and supporting code, but maybe it would be worth it too (compression-wise, and hopefully decoding time too).

> explore per-dimension fixed-width ordinal encoding
> --------------------------------------------------
>
>                 Key: LUCENE-5308
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5308
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/facet
>            Reporter: Michael McCandless
>         Attachments: LUCENE-5308.patch
>
>
> I've been testing performance of Solr vs Lucene facets, and one area
> where Solr's "fcs" method shines (low RAM, high faceting perf) is in
> low-cardinality dimensions.
> I suspect the gains are because with the field-cache entries the ords
> are encoded in "column-stride" form, and are private to that dim (vs
> facet module's shared ord space).
> So I thought about whether we could do something like this in the
> facet module ...
> I.e., if we know certain documents will have a specific set of
> single-valued dimensions, we can pick an encoding format for the
> per-doc byte[] "globally" for all such documents, and use private ord
> space per-dimension to improve compression.
> The basic idea is to pre-assign up-front (before the segment is
> written) which bytes belong to which dim.  E.g., date takes bytes 0-1
> (<= 65536 unique labels), imageCount takes byte 2 (<= 256
> unique labels), username takes bytes 3-6 (<= 16.8 M unique labels),
> etc.  This only works for single-valued dims, and only works if all
> docs (or at least an identifiable subset?) have all dims.
> To test this idea, I made a hacked up prototype patch; it has tons of
> limitations so we clearly can't commit it, but I was able to test full
> wikipedia en with 6 facet dims (date, username, refCount, imageCount,
> sectionCount, subSectionCount, subSubSectionCount).
> Trunk (base) requires 181 MB of net doc values to hold the facet ords,
> while the patch requires 183 MB.
> Perf:
> {noformat}
> Report after iter 19:
>                     Task    QPS base      StdDev    QPS comp      StdDev                Pct diff
>                  Respell       54.30      (3.1%)       54.02      (2.7%)   -0.5% (  -6% -    5%)
>          MedSloppyPhrase        3.58      (5.6%)        3.60      (6.0%)    0.6% ( -10% -   12%)
>             OrNotHighLow       63.58      (6.8%)       64.03      (6.9%)    0.7% ( -12% -   15%)
>         HighSloppyPhrase        3.80      (7.4%)        3.84      (7.1%)    1.1% ( -12% -   16%)
>              LowSpanNear        8.93      (3.5%)        9.09      (4.6%)    1.8% (  -6% -   10%)
>                LowPhrase       12.15      (6.4%)       12.43      (7.2%)    2.3% ( -10% -   17%)
>               AndHighLow      402.54      (1.4%)      425.23      (2.3%)    5.6% (   1% -    9%)
>          LowSloppyPhrase       39.53      (1.6%)       42.01      (1.9%)    6.3% (   2% -    9%)
>              MedSpanNear       26.54      (2.8%)       28.39      (3.6%)    7.0% (   0% -   13%)
>               HighPhrase        4.01      (8.1%)        4.30      (9.7%)    7.4% (  -9% -   27%)
>                   Fuzzy2       44.01      (2.3%)       47.43      (1.8%)    7.8% (   3% -   12%)
>             OrNotHighMed       32.64      (4.7%)       35.22      (5.5%)    7.9% (  -2% -   19%)
>                   Fuzzy1       62.24      (2.1%)       67.35      (1.9%)    8.2% (   4% -   12%)
>                MedPhrase      129.06      (4.9%)      141.14      (6.2%)    9.4% (  -1% -   21%)
>               AndHighMed       27.71      (0.7%)       30.32      (1.1%)    9.4% (   7% -   11%)
>             HighSpanNear        5.15      (3.5%)        5.63      (4.2%)    9.5% (   1% -   17%)
>              AndHighHigh       24.98      (0.7%)       27.89      (1.1%)   11.7% (   9% -   13%)
>            OrNotHighHigh       15.13      (2.0%)       17.90      (2.6%)   18.3% (  13% -   23%)
>                 Wildcard        9.06      (1.4%)       10.85      (2.6%)   19.8% (  15% -   24%)
>            OrHighNotHigh        8.84      (1.8%)       10.64      (2.6%)   20.3% (  15% -   25%)
>               OrHighHigh        3.73      (1.6%)        4.51      (2.4%)   20.9% (  16% -   25%)
>                OrHighLow        5.22      (1.5%)        6.34      (2.5%)   21.4% (  17% -   25%)
>             OrHighNotLow        8.94      (1.6%)       10.95      (2.5%)   22.5% (  18% -   26%)
>                  Prefix3       27.61      (1.2%)       33.90      (2.3%)   22.8% (  19% -   26%)
>                OrHighMed       11.72      (1.6%)       14.56      (2.3%)   24.3% (  20% -   28%)
>             OrHighNotMed       14.74      (1.5%)       18.34      (2.2%)   24.5% (  20% -   28%)
>                  MedTerm       26.37      (1.2%)       32.85      (2.7%)   24.6% (  20% -   28%)
>                   IntNRQ        2.61      (1.2%)        3.25      (3.0%)   24.7% (  20% -   29%)
>                 HighTerm       19.69      (1.3%)       25.33      (3.0%)   28.7% (  23% -   33%)
>                  LowTerm      131.50      (1.3%)      170.49      (3.0%)   29.7% (  25% -   34%)
> {noformat}
> I think the gains are sizable, and the increase in index size quite
> minor (in another test with fewer dims I saw the index size get a bit
> smaller) ... at least for this specific test.
> However, finding a clean solution here will be tricky...
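
For illustration, here's a small standalone sketch of the byte-slicing scheme described in the quoted description above: each dimension's slice width is derived from its unique label count, and its private ordinal is packed big-endian into its pre-assigned slice (just an illustration, not the attached patch):

{noformat}
public class FixedLayoutSketch {

  // smallest number of bytes that can hold 'uniqueLabels' distinct ordinals
  static int bytesFor(int uniqueLabels) {
    int bytes = 1;
    while ((1L << (8 * bytes)) < uniqueLabels) {
      bytes++;
    }
    return bytes;   // e.g. 256 -> 1 byte, 65536 -> 2 bytes
  }

  // pack each dimension's private ordinal into its pre-assigned byte slice
  static byte[] pack(int[] ords, int[] widths) {
    int total = 0;
    for (int w : widths) {
      total += w;
    }
    byte[] packed = new byte[total];   // same length for every document
    int offset = 0;
    for (int dim = 0; dim < ords.length; dim++) {
      for (int b = 0; b < widths[dim]; b++) {
        // most significant byte of the slice first
        packed[offset + b] = (byte) (ords[dim] >>> (8 * (widths[dim] - 1 - b)));
      }
      offset += widths[dim];
    }
    return packed;
  }
}
{noformat}

The same total width is used for every document, which is why this only works when all docs carry all of the (single-valued) dims.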



--
This message was sent by Atlassian JIRA
(v6.1#6144)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org