Posted to dev@lucene.apache.org by "Michael McCandless (JIRA)" <ji...@apache.org> on 2013/10/26 20:27:30 UTC

[jira] [Updated] (LUCENE-5308) explore per-dimension fixed-width ordinal encoding

     [ https://issues.apache.org/jira/browse/LUCENE-5308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-5308:
---------------------------------------

    Attachment: LUCENE-5308.patch

A totally hacked-up, but I think working, patch.


> explore per-dimension fixed-width ordinal encoding
> --------------------------------------------------
>
>                 Key: LUCENE-5308
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5308
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/facet
>            Reporter: Michael McCandless
>         Attachments: LUCENE-5308.patch
>
>
> I've been testing performance of Solr vs Lucene facets, and one area
> where Solr's "fcs" method shines (low RAM, high faceting perf) is in
> low-cardinality dimensions.
>
> I suspect the gains are because the field-cache entries encode the
> ords in "column-stride" form, private to each dim (vs the facet
> module's shared ord space).
>
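> To make the gap concrete, here is a small illustration of mine (not
> Solr or facet-module code): a fixed-width ord needs only enough bytes
> to cover the largest ord in its space, so a private, per-dim ord
> space shrinks low-cardinality dims that a shared space would inflate.
> {noformat}
> // Illustration only, not a real Lucene/Solr API: bytes needed for
> // one fixed-width ordinal, given the largest ord it must represent.
> static int bytesPerOrd(long maxOrd) {
>   int bytes = 1;
>   while (bytes < 8 && (maxOrd >>> (8 * bytes)) != 0) {
>     bytes++;
>   }
>   return bytes;
> }
>
> // Shared ord space: a 256-label dim pays for the global label count,
> // e.g. bytesPerOrd(16_777_215) == 3.
> // Private ord space: the same dim needs only bytesPerOrd(255) == 1.
> {noformat}
>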
> So I thought about whether we could do something like this in the
> facet module ...
>
> I.e., if we know certain documents will have a specific set of
> single-valued dimensions, we can pick an encoding format for the
> per-doc byte[] "globally" for all such documents, and use a private
> ord space per dimension to improve compression.
>
> The basic idea is to pre-assign up-front (before the segment is
> written) which bytes belong to which dim.  E.g., date takes bytes 0-1
> (<= 65536 unique labels), imageCount takes byte 2 (<= 256 unique
> labels), username takes bytes 3-5 (<= 16.8 M unique labels), etc.
> This only works for single-valued dims, and only works if all docs
> (or at least an identifiable subset?) have all dims.
>
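> As a rough sketch of that layout (my own hypothetical code, not what
> the attached patch does), each dim gets a fixed byte slice sized from
> its own cardinality, and ords are packed/unpacked big-endian:
> {noformat}
> // Hypothetical sketch, not the patch's actual code.
> class FixedWidthOrdLayout {
>   final int[] offset; // first byte of each dim's slice in the per-doc byte[]
>   final int[] width;  // slice width in bytes, from the dim's cardinality
>
>   FixedWidthOrdLayout(int[] dimCardinality) {
>     offset = new int[dimCardinality.length];
>     width = new int[dimCardinality.length];
>     int pos = 0;
>     for (int dim = 0; dim < dimCardinality.length; dim++) {
>       offset[dim] = pos;
>       width[dim] = bytesPerOrd(dimCardinality[dim] - 1);
>       pos += width[dim];
>     }
>   }
>
>   static int bytesPerOrd(long maxOrd) { // as in the earlier sketch
>     int bytes = 1;
>     while (bytes < 8 && (maxOrd >>> (8 * bytes)) != 0) bytes++;
>     return bytes;
>   }
>
>   // Store dim's private ord into its pre-assigned slice.
>   void encode(byte[] doc, int dim, int ord) {
>     for (int i = width[dim] - 1; i >= 0; i--) {
>       doc[offset[dim] + i] = (byte) ord;
>       ord >>>= 8;
>     }
>   }
>
>   // Read dim's private ord back out of the per-doc byte[].
>   int decode(byte[] doc, int dim) {
>     int ord = 0;
>     for (int i = 0; i < width[dim]; i++) {
>       ord = (ord << 8) | (doc[offset[dim] + i] & 0xFF);
>     }
>     return ord;
>   }
> }
> {noformat}
> With the cardinalities above (65536, 256, 16.8 M) this assigns date
> bytes 0-1, imageCount byte 2, and username bytes 3-5, i.e. 6 bytes
> per doc for those three dims.
>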
> To test this idea, I made a hacked-up prototype patch; it has tons of
> limitations so we clearly can't commit it, but I was able to test full
> English Wikipedia with 7 facet dims (date, username, refCount,
> imageCount, sectionCount, subSectionCount, subSubSectionCount).
>
> Trunk (base) requires 181 MB of net doc values to hold the facet ords,
> while the patch requires 183 MB.
>
> Perf:
> {noformat}
> Report after iter 19:
>                     Task    QPS base      StdDev    QPS comp      StdDev                Pct diff
>                  Respell       54.30      (3.1%)       54.02      (2.7%)   -0.5% (  -6% -    5%)
>          MedSloppyPhrase        3.58      (5.6%)        3.60      (6.0%)    0.6% ( -10% -   12%)
>             OrNotHighLow       63.58      (6.8%)       64.03      (6.9%)    0.7% ( -12% -   15%)
>         HighSloppyPhrase        3.80      (7.4%)        3.84      (7.1%)    1.1% ( -12% -   16%)
>              LowSpanNear        8.93      (3.5%)        9.09      (4.6%)    1.8% (  -6% -   10%)
>                LowPhrase       12.15      (6.4%)       12.43      (7.2%)    2.3% ( -10% -   17%)
>               AndHighLow      402.54      (1.4%)      425.23      (2.3%)    5.6% (   1% -    9%)
>          LowSloppyPhrase       39.53      (1.6%)       42.01      (1.9%)    6.3% (   2% -    9%)
>              MedSpanNear       26.54      (2.8%)       28.39      (3.6%)    7.0% (   0% -   13%)
>               HighPhrase        4.01      (8.1%)        4.30      (9.7%)    7.4% (  -9% -   27%)
>                   Fuzzy2       44.01      (2.3%)       47.43      (1.8%)    7.8% (   3% -   12%)
>             OrNotHighMed       32.64      (4.7%)       35.22      (5.5%)    7.9% (  -2% -   19%)
>                   Fuzzy1       62.24      (2.1%)       67.35      (1.9%)    8.2% (   4% -   12%)
>                MedPhrase      129.06      (4.9%)      141.14      (6.2%)    9.4% (  -1% -   21%)
>               AndHighMed       27.71      (0.7%)       30.32      (1.1%)    9.4% (   7% -   11%)
>             HighSpanNear        5.15      (3.5%)        5.63      (4.2%)    9.5% (   1% -   17%)
>              AndHighHigh       24.98      (0.7%)       27.89      (1.1%)   11.7% (   9% -   13%)
>            OrNotHighHigh       15.13      (2.0%)       17.90      (2.6%)   18.3% (  13% -   23%)
>                 Wildcard        9.06      (1.4%)       10.85      (2.6%)   19.8% (  15% -   24%)
>            OrHighNotHigh        8.84      (1.8%)       10.64      (2.6%)   20.3% (  15% -   25%)
>               OrHighHigh        3.73      (1.6%)        4.51      (2.4%)   20.9% (  16% -   25%)
>                OrHighLow        5.22      (1.5%)        6.34      (2.5%)   21.4% (  17% -   25%)
>             OrHighNotLow        8.94      (1.6%)       10.95      (2.5%)   22.5% (  18% -   26%)
>                  Prefix3       27.61      (1.2%)       33.90      (2.3%)   22.8% (  19% -   26%)
>                OrHighMed       11.72      (1.6%)       14.56      (2.3%)   24.3% (  20% -   28%)
>             OrHighNotMed       14.74      (1.5%)       18.34      (2.2%)   24.5% (  20% -   28%)
>                  MedTerm       26.37      (1.2%)       32.85      (2.7%)   24.6% (  20% -   28%)
>                   IntNRQ        2.61      (1.2%)        3.25      (3.0%)   24.7% (  20% -   29%)
>                 HighTerm       19.69      (1.3%)       25.33      (3.0%)   28.7% (  23% -   33%)
>                  LowTerm      131.50      (1.3%)      170.49      (3.0%)   29.7% (  25% -   34%)
> {noformat}
> I think the gains are sizable, and the increase in index size is quite
> minor (in another test with fewer dims I saw the index size get a bit
> smaller) ... at least for this specific test.
>
> However, finding a clean solution here will be tricky...



--
This message was sent by Atlassian JIRA
(v6.1#6144)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org