Posted to dev@lucene.apache.org by "Shai Erera (JIRA)" <ji...@apache.org> on 2013/10/26 21:10:31 UTC
[jira] [Commented] (LUCENE-5308) explore per-dimension fixed-width
ordinal encoding
[ https://issues.apache.org/jira/browse/LUCENE-5308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13806157#comment-13806157 ]
Shai Erera commented on LUCENE-5308:
------------------------------------
It's a nice idea. You could probably do the "right" thing if you extend FacetFields and override the CountingListBuilder to generate this fixed encoding. And instead of a FacetsAccumulator, write a FacetsAggregator which decodes per-document, and then you get top-K computation for free ... I think?
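To make "decodes per-document" concrete, here is a rough standalone sketch. It does not use the facet module's classes at all; FixedWidthCounting, its method names, and the big-endian byte order are assumptions made up for illustration, not what the attached patch does.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: count ordinals stored at a fixed position in each per-doc byte[]. */
final class FixedWidthCounting {

    /** Reads the big-endian ordinal stored in doc[offset .. offset + width - 1]. */
    static int decode(byte[] doc, int offset, int width) {
        int ord = 0;
        for (int i = 0; i < width; i++) {
            ord = (ord << 8) | (doc[offset + i] & 0xFF);
        }
        return ord;
    }

    /** Counts how often each ordinal of one dimension occurs in the matching docs. */
    static int[] count(byte[][] matchingDocs, int offset, int width, int uniqueValueCount) {
        int[] counts = new int[uniqueValueCount];
        for (byte[] doc : matchingDocs) {
            // Fails with ArrayIndexOutOfBoundsException if an ord exceeds uniqueValueCount.
            counts[decode(doc, offset, width)]++;
        }
        return counts;
    }

    /** Returns up to k ordinals with the highest non-zero counts; ties go to the lower ordinal. */
    static List<Integer> topK(int[] counts, int k) {
        List<Integer> ords = new ArrayList<>();
        for (int ord = 0; ord < counts.length; ord++) {
            if (counts[ord] > 0) {
                ords.add(ord);
            }
        }
        ords.sort((a, b) -> counts[a] != counts[b]
                ? Integer.compare(counts[b], counts[a])
                : Integer.compare(a, b));
        return new ArrayList<>(ords.subList(0, Math.min(k, ords.size())));
    }
}
```

Since each dimension sits at a known offset, counting one dimension never has to touch the bytes of the others. The sort in topK is only there to keep the sketch short; a bounded priority queue would be the usual choice for large ordinal spaces.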
I guess if an app truly knows it has a fixed taxonomy, and that every document contains all facets, this could be useful. Maybe instead of asking the app to specify uniqueValueCount we write a FixedTaxonomyWriter where the app has to build once before it adds any document, with all categories, and we compute uniqueValueCount ourselves?
I mean, this could keep the app from making mistakes - e.g. FixedFacetFields would not even get a TaxonomyWriter, so if the app tries to find a CategoryPath which wasn't added already, it hits a hard exception. Hmm, now it also hits it if the uniqueValueCount is smaller than an ord a CP gets, but I think it makes things clearer that you must create the taxonomy up front.
Although, I can see an app saying "I don't know which categories I'll see, but there will never be more than X of them" ... so maybe a uniqueValueCount constraint is good as well.
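Roughly what I have in mind for the build-once variant, as a standalone sketch. FixedLabelSet and its methods are hypothetical names and have nothing to do with the real TaxonomyWriter API; it just shows computing uniqueValueCount (and the byte width) from the labels given up front, and failing hard on anything else.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: a label set for one dimension that is fixed before any document is added. */
final class FixedLabelSet {
    private final Map<String, Integer> ordByLabel = new HashMap<>();

    /** Assigns ordinals 0, 1, 2, ... in the order labels are given; duplicates are ignored. */
    FixedLabelSet(List<String> labels) {
        for (String label : labels) {
            ordByLabel.putIfAbsent(label, ordByLabel.size());
        }
    }

    /** Computed from the labels, so the app never has to specify it. */
    int uniqueValueCount() {
        return ordByLabel.size();
    }

    /** Smallest number of whole bytes that can hold the largest ordinal. */
    int bytesPerOrdinal() {
        int maxOrd = Math.max(0, ordByLabel.size() - 1);
        int bytes = 1;
        while ((maxOrd >>>= 8) != 0) {
            bytes++;
        }
        return bytes;
    }

    /** Hard failure for any label that was not declared up front. */
    int ordinal(String label) {
        Integer ord = ordByLabel.get(label);
        if (ord == null) {
            throw new IllegalStateException("label was not added up front: " + label);
        }
        return ord;
    }
}
```

The "never more than X of them" case would need a different shape: keep assigning ordinals on the fly and only throw once the count passes X.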
Net/net, this is a very limited solution which an app needs to think about before using it. If it matches app's needs, it can speed things up. I wonder what the speedups will be when it's fully "productized", and whether it will still be worth keeping in the code base.
I do think though that we could think about per CategoryList ordinal space (separate issue) by default. It requires heavy changes to the taxonomy index and supporting code, but maybe it will be worth it too (compression-wise and hopefully decoding time too).
> explore per-dimension fixed-width ordinal encoding
> --------------------------------------------------
>
> Key: LUCENE-5308
> URL: https://issues.apache.org/jira/browse/LUCENE-5308
> Project: Lucene - Core
> Issue Type: Improvement
> Components: modules/facet
> Reporter: Michael McCandless
> Attachments: LUCENE-5308.patch
>
>
> I've been testing performance of Solr vs Lucene facets, and one area
> where Solr's "fcs" method shines (low RAM, high faceting perf) is in
> low-cardinality dimensions.
> I suspect the gains are because with the field-cache entries the ords
> are encoded in "column-stride" form, and are private to that dim (vs
> facet module's shared ord space).
> So I thought about whether we could do something like this in the
> facet module ...
> I.e., if we know certain documents will have a specific set of
> single-valued dimensions, we can pick an encoding format for the
> per-doc byte[] "globally" for all such documents, and use private ord
> space per-dimension to improve compression.
> The basic idea is to pre-assign up-front (before the segment is
> written) which bytes belong to which dim. E.g., date takes bytes 0-1
> (<= 65536 unique labels), imageCount takes byte 2 (<= 256
> unique labels), username takes bytes 3-6 (<= 16.8 M unique labels),
> etc. This only works for single-valued dims, and only works if all
> docs (or at least an identifiable subset?) have all dims.
> To test this idea, I made a hacked up prototype patch; it has tons of
> limitations so we clearly can't commit it, but I was able to test full
> wikipedia en with 6 facet dims (date, username, refCount, imageCount,
> sectionCount, subSectionCount, subSubSectionCount).
> Trunk (base) requires 181 MB of net doc values to hold the facet ords,
> while the patch requires 183 MB.
> Perf:
> {noformat}
> Report after iter 19:
> Task               QPS base   StdDev QPS comp   StdDev Pct diff
> Respell               54.30   (3.1%)    54.02   (2.7%)   -0.5%  ( -6% -   5%)
> MedSloppyPhrase        3.58   (5.6%)     3.60   (6.0%)    0.6%  (-10% -  12%)
> OrNotHighLow          63.58   (6.8%)    64.03   (6.9%)    0.7%  (-12% -  15%)
> HighSloppyPhrase       3.80   (7.4%)     3.84   (7.1%)    1.1%  (-12% -  16%)
> LowSpanNear            8.93   (3.5%)     9.09   (4.6%)    1.8%  ( -6% -  10%)
> LowPhrase             12.15   (6.4%)    12.43   (7.2%)    2.3%  (-10% -  17%)
> AndHighLow           402.54   (1.4%)   425.23   (2.3%)    5.6%  (  1% -   9%)
> LowSloppyPhrase       39.53   (1.6%)    42.01   (1.9%)    6.3%  (  2% -   9%)
> MedSpanNear           26.54   (2.8%)    28.39   (3.6%)    7.0%  (  0% -  13%)
> HighPhrase             4.01   (8.1%)     4.30   (9.7%)    7.4%  ( -9% -  27%)
> Fuzzy2                44.01   (2.3%)    47.43   (1.8%)    7.8%  (  3% -  12%)
> OrNotHighMed          32.64   (4.7%)    35.22   (5.5%)    7.9%  ( -2% -  19%)
> Fuzzy1                62.24   (2.1%)    67.35   (1.9%)    8.2%  (  4% -  12%)
> MedPhrase            129.06   (4.9%)   141.14   (6.2%)    9.4%  ( -1% -  21%)
> AndHighMed            27.71   (0.7%)    30.32   (1.1%)    9.4%  (  7% -  11%)
> HighSpanNear           5.15   (3.5%)     5.63   (4.2%)    9.5%  (  1% -  17%)
> AndHighHigh           24.98   (0.7%)    27.89   (1.1%)   11.7%  (  9% -  13%)
> OrNotHighHigh         15.13   (2.0%)    17.90   (2.6%)   18.3%  ( 13% -  23%)
> Wildcard               9.06   (1.4%)    10.85   (2.6%)   19.8%  ( 15% -  24%)
> OrHighNotHigh          8.84   (1.8%)    10.64   (2.6%)   20.3%  ( 15% -  25%)
> OrHighHigh             3.73   (1.6%)     4.51   (2.4%)   20.9%  ( 16% -  25%)
> OrHighLow              5.22   (1.5%)     6.34   (2.5%)   21.4%  ( 17% -  25%)
> OrHighNotLow           8.94   (1.6%)    10.95   (2.5%)   22.5%  ( 18% -  26%)
> Prefix3               27.61   (1.2%)    33.90   (2.3%)   22.8%  ( 19% -  26%)
> OrHighMed             11.72   (1.6%)    14.56   (2.3%)   24.3%  ( 20% -  28%)
> OrHighNotMed          14.74   (1.5%)    18.34   (2.2%)   24.5%  ( 20% -  28%)
> MedTerm               26.37   (1.2%)    32.85   (2.7%)   24.6%  ( 20% -  28%)
> IntNRQ                 2.61   (1.2%)     3.25   (3.0%)   24.7%  ( 20% -  29%)
> HighTerm              19.69   (1.3%)    25.33   (3.0%)   28.7%  ( 23% -  33%)
> LowTerm              131.50   (1.3%)   170.49   (3.0%)   29.7%  ( 25% -  34%)
> {noformat}
> I think the gains are sizable, and the increase in index size quite
> minor (in another test with fewer dims I saw the index size get a bit
> smaller) ... at least for this specific test.
> However, finding a clean solution here will be tricky...