Posted to dev@lucene.apache.org by "Michael McCandless (JIRA)" <ji...@apache.org> on 2013/10/26 20:27:30 UTC
[jira] [Updated] (LUCENE-5308) explore per-dimension fixed-width ordinal encoding
[ https://issues.apache.org/jira/browse/LUCENE-5308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael McCandless updated LUCENE-5308:
---------------------------------------
Attachment: LUCENE-5308.patch
A totally hacked-up, but I think working, patch.
> explore per-dimension fixed-width ordinal encoding
> --------------------------------------------------
>
> Key: LUCENE-5308
> URL: https://issues.apache.org/jira/browse/LUCENE-5308
> Project: Lucene - Core
> Issue Type: Improvement
> Components: modules/facet
> Reporter: Michael McCandless
> Attachments: LUCENE-5308.patch
>
>
> I've been testing performance of Solr vs Lucene facets, and one area
> where Solr's "fcs" method shines (low RAM, high faceting perf) is in
> low-cardinality dimensions.
> I suspect the gains come from the field-cache entries encoding the
> ords in "column-stride" form, private to each dim (vs the facet
> module's shared ord space).
> So I thought about whether we could do something like this in the
> facet module ...
> I.e., if we know certain documents will have a specific set of
> single-valued dimensions, we can pick an encoding format for the
> per-doc byte[] "globally" for all such documents, and use private ord
> space per-dimension to improve compression.
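To make the compression argument concrete, here is a small sketch (my own illustration, not code from the patch; the cardinalities are assumptions chosen to match the example widths below) comparing bits per doc for one shared ord space vs private per-dimension ord spaces:

```java
// Hypothetical cardinalities for three single-valued dims (assumptions,
// not measured values): date=65536, imageCount=256, username=16.8M.
public class OrdSpaceMath {
  // Bits needed to store any ordinal in [0, cardinality).
  static int bitsFor(long cardinality) {
    return cardinality <= 1 ? 1 : 64 - Long.numberOfLeadingZeros(cardinality - 1);
  }

  public static void main(String[] args) {
    long[] cards = {65536L, 256L, 16_800_000L};

    // Shared ord space: all dims draw ords from one global range, so each
    // of the three per-doc ords needs enough bits for the combined range.
    long sharedRange = 0;
    for (long c : cards) sharedRange += c;
    int sharedBitsPerDoc = cards.length * bitsFor(sharedRange);

    // Private per-dim ord spaces: each ord only needs bits for its own dim.
    int privateBitsPerDoc = 0;
    for (long c : cards) privateBitsPerDoc += bitsFor(c);

    System.out.println("shared:  " + sharedBitsPerDoc + " bits/doc");   // 75
    System.out.println("private: " + privateBitsPerDoc + " bits/doc");  // 49
  }
}
```

With these made-up numbers the private ord spaces need 49 bits/doc instead of 75, because the two small dims no longer pay for the large dim's ord range.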
> The basic idea is to pre-assign up-front (before the segment is
> written) which bytes belong to which dim. E.g., date takes bytes 0-1
> (<= 65536 unique labels), imageCount takes byte 2 (<= 256 unique
> labels), username takes bytes 3-6 (<= 16.8 M unique labels), etc.
> This only works for single-valued dims, and only works if all docs
> (or at least an identifiable subset?) have all dims.
> To test this idea, I made a hacked up prototype patch; it has tons of
> limitations so we clearly can't commit it, but I was able to test full
> wikipedia en with 7 facet dims (date, username, refCount, imageCount,
> sectionCount, subSectionCount, subSubSectionCount).
> Trunk (base) requires 181 MB of net doc values to hold the facet ords,
> while the patch requires 183 MB.
> Perf:
> {noformat}
> Report after iter 19:
>              Task   QPS base   StdDev    QPS comp   StdDev   Pct diff
>           Respell      54.30   (3.1%)       54.02   (2.7%)     -0.5% (  -6% -   5%)
>   MedSloppyPhrase       3.58   (5.6%)        3.60   (6.0%)      0.6% ( -10% -  12%)
>      OrNotHighLow      63.58   (6.8%)       64.03   (6.9%)      0.7% ( -12% -  15%)
>  HighSloppyPhrase       3.80   (7.4%)        3.84   (7.1%)      1.1% ( -12% -  16%)
>       LowSpanNear       8.93   (3.5%)        9.09   (4.6%)      1.8% (  -6% -  10%)
>         LowPhrase      12.15   (6.4%)       12.43   (7.2%)      2.3% ( -10% -  17%)
>        AndHighLow     402.54   (1.4%)      425.23   (2.3%)      5.6% (   1% -   9%)
>   LowSloppyPhrase      39.53   (1.6%)       42.01   (1.9%)      6.3% (   2% -   9%)
>       MedSpanNear      26.54   (2.8%)       28.39   (3.6%)      7.0% (   0% -  13%)
>        HighPhrase       4.01   (8.1%)        4.30   (9.7%)      7.4% (  -9% -  27%)
>            Fuzzy2      44.01   (2.3%)       47.43   (1.8%)      7.8% (   3% -  12%)
>      OrNotHighMed      32.64   (4.7%)       35.22   (5.5%)      7.9% (  -2% -  19%)
>            Fuzzy1      62.24   (2.1%)       67.35   (1.9%)      8.2% (   4% -  12%)
>         MedPhrase     129.06   (4.9%)      141.14   (6.2%)      9.4% (  -1% -  21%)
>        AndHighMed      27.71   (0.7%)       30.32   (1.1%)      9.4% (   7% -  11%)
>      HighSpanNear       5.15   (3.5%)        5.63   (4.2%)      9.5% (   1% -  17%)
>       AndHighHigh      24.98   (0.7%)       27.89   (1.1%)     11.7% (   9% -  13%)
>     OrNotHighHigh      15.13   (2.0%)       17.90   (2.6%)     18.3% (  13% -  23%)
>          Wildcard       9.06   (1.4%)       10.85   (2.6%)     19.8% (  15% -  24%)
>     OrHighNotHigh       8.84   (1.8%)       10.64   (2.6%)     20.3% (  15% -  25%)
>        OrHighHigh       3.73   (1.6%)        4.51   (2.4%)     20.9% (  16% -  25%)
>         OrHighLow       5.22   (1.5%)        6.34   (2.5%)     21.4% (  17% -  25%)
>      OrHighNotLow       8.94   (1.6%)       10.95   (2.5%)     22.5% (  18% -  26%)
>           Prefix3      27.61   (1.2%)       33.90   (2.3%)     22.8% (  19% -  26%)
>         OrHighMed      11.72   (1.6%)       14.56   (2.3%)     24.3% (  20% -  28%)
>      OrHighNotMed      14.74   (1.5%)       18.34   (2.2%)     24.5% (  20% -  28%)
>           MedTerm      26.37   (1.2%)       32.85   (2.7%)     24.6% (  20% -  28%)
>            IntNRQ       2.61   (1.2%)        3.25   (3.0%)     24.7% (  20% -  29%)
>          HighTerm      19.69   (1.3%)       25.33   (3.0%)     28.7% (  23% -  33%)
>           LowTerm     131.50   (1.3%)      170.49   (3.0%)     29.7% (  25% -  34%)
> {noformat}
> I think the gains are sizable, and the increase in index size quite
> minor (in another test with fewer dims I saw the index size get a bit
> smaller) ... at least for this specific test.
> However, finding a clean solution here will be tricky...
--
This message was sent by Atlassian JIRA
(v6.1#6144)
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org