Posted to dev@lucene.apache.org by "Michael McCandless (JIRA)" <ji...@apache.org> on 2013/01/16 23:10:13 UTC
[jira] [Updated] (LUCENE-4609) Write a PackedIntsEncoder/Decoder for facets
[ https://issues.apache.org/jira/browse/LUCENE-4609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael McCandless updated LUCENE-4609:
---------------------------------------
Attachment: LUCENE-4609.patch
Patch with a "custom" packed ints encoder/decoder (it does not use our PackedInts APIs). It uses only as many bytes as necessary, and packs bpv (bits per value) and "leftoverBits" into a single-byte header.
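To make the header idea concrete, here is a minimal sketch of one way such an encoder/decoder could work. It is not the code from the attached patch: the class name HeaderPackedOrdinals, the header layout (upper 5 bits hold bpv - 1, lower 3 bits hold the number of unused bits in the last byte), and the MSB-first bit order are all assumptions made for illustration, and the bit-at-a-time loops favor clarity over speed.

```java
// Hypothetical sketch, not the LUCENE-4609 patch. Assumed header layout:
// upper 5 bits = bpv - 1, lower 3 bits = unused ("leftover") bits in the last byte.
final class HeaderPackedOrdinals {

  /** Packs non-negative ordinals using the smallest bpv that fits the largest value. */
  static byte[] encode(int[] values) {
    int or = 0;
    for (int v : values) {
      if (v < 0) {
        throw new IllegalArgumentException("ordinals must be non-negative");
      }
      or |= v;
    }
    // Bits needed for the largest value; use at least 1 so an all-zero input still works.
    int bpv = Math.max(1, 32 - Integer.numberOfLeadingZeros(or));
    long totalBits = (long) values.length * bpv;
    int numBytes = (int) ((totalBits + 7) / 8);
    int leftoverBits = (int) (numBytes * 8L - totalBits); // always 0..7
    byte[] out = new byte[1 + numBytes];
    out[0] = (byte) (((bpv - 1) << 3) | leftoverBits);
    long bitPos = 0;
    for (int v : values) {
      // Write each value MSB-first, one bit at a time.
      for (int b = bpv - 1; b >= 0; b--, bitPos++) {
        if (((v >>> b) & 1) != 0) {
          out[1 + (int) (bitPos >>> 3)] |= (byte) (1 << (7 - (int) (bitPos & 7)));
        }
      }
    }
    return out;
  }

  /** Reads the header, derives the value count from the buffer length, and unpacks. */
  static int[] decode(byte[] buf) {
    int header = buf[0] & 0xFF;
    int bpv = (header >>> 3) + 1;
    int leftoverBits = header & 7;
    long totalBits = (buf.length - 1) * 8L - leftoverBits;
    int[] values = new int[(int) (totalBits / bpv)];
    long bitPos = 0;
    for (int i = 0; i < values.length; i++) {
      int v = 0;
      for (int b = 0; b < bpv; b++, bitPos++) {
        int bit = (buf[1 + (int) (bitPos >>> 3)] >>> (7 - (int) (bitPos & 7))) & 1;
        v = (v << 1) | bit;
      }
      values[i] = v;
    }
    return values;
  }
}
```

Under these assumptions, the five ordinals {3, 17, 1000, 0, 42} need 10 bits each (50 bits total), so they encode to 1 header byte plus 7 data bytes, with 6 leftover bits recorded in the header. Storing the leftover bit count is what lets the decoder recover the exact number of values from the buffer length alone.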
I tested on the first 1M Wikipedia docs, and performance is much worse than the current default in trunk. Admittedly the comparison is not quite fair: trunk has a specialized vInt/dGap decoder, while the patch leaves dGap separate from the packed ints decode, and this decoder will be slower than the optimized oal.util.PackedInts. But perf is so far off that I find it hard to believe PackedInts can match vInt even after optimizing.
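For readers unfamiliar with the vInt/dGap combination mentioned above, the following is a minimal sketch of the general technique: sorted ordinals are stored as gaps from the previous value, and each gap is written as a variable-length int (7 data bits per byte, high bit set when more bytes follow). The class name DGapVIntOrdinals is hypothetical and this is not the specialized decoder in trunk; it only illustrates why fusing the two steps lets the decoder rebuild ordinals in a single pass over the bytes.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

// Hypothetical sketch of dGap + vInt coding; not Lucene's actual facet encoder.
final class DGapVIntOrdinals {

  /** Encodes ascending, non-negative ordinals as vInt-coded gaps. */
  static byte[] encode(int[] sortedOrdinals) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    int prev = 0;
    for (int ord : sortedOrdinals) {
      int gap = ord - prev; // non-negative because the input is sorted
      prev = ord;
      // Emit 7 bits at a time, low-order bits first; high bit marks continuation.
      while ((gap & ~0x7F) != 0) {
        out.write((gap & 0x7F) | 0x80);
        gap >>>= 7;
      }
      out.write(gap);
    }
    return out.toByteArray();
  }

  /** Decodes vInts and undoes the gaps in the same loop (the "fused" form). */
  static int[] decode(byte[] buf) {
    int[] result = new int[buf.length]; // each value uses at least one byte
    int count = 0;
    int prev = 0;
    int value = 0;
    int shift = 0;
    for (byte b : buf) {
      value |= (b & 0x7F) << shift;
      if ((b & 0x80) != 0) {
        shift += 7; // more bytes follow for this gap
      } else {
        prev += value; // gap complete: add it to the running ordinal
        result[count++] = prev;
        value = 0;
        shift = 0;
      }
    }
    return Arrays.copyOf(result, count);
  }
}
```

With this scheme a small gap costs a single byte regardless of how large the ordinals themselves are, whereas a packed encoding pays the same number of bits for every value in the document.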
Trunk gets these results:
{noformat}
Task      QPS base  StdDev  QPS comp  StdDev  Pct diff
PKLookup    203.77  (1.8%)    202.25  (1.8%)  -0.7% ( -4% - 2%)
HighTerm     20.43  (1.8%)     20.53  (0.8%)   0.5% ( -2% - 3%)
MedTerm      33.12  (1.7%)     33.30  (0.9%)   0.5% ( -2% - 3%)
LowTerm      87.55  (3.0%)     88.59  (2.5%)   1.2% ( -4% - 6%)
{noformat}
Patch gets this:
{noformat}
Task      QPS base  StdDev  QPS comp  StdDev  Pct diff
HighTerm     10.82  (3.6%)     10.69  (4.4%)  -1.2% ( -8% - 7%)
MedTerm      19.33  (3.2%)     19.10  (4.0%)  -1.2% ( -8% - 6%)
LowTerm      67.75  (2.8%)     67.11  (3.0%)  -0.9% ( -6% - 5%)
PKLookup    196.49  (1.0%)    196.24  (1.9%)  -0.1% ( -3% - 2%)
{noformat}
(NOTE: base and comp are the same in each run, so ignore the differences within each run (it's noise) and compare the absolute numbers across the two runs, i.e. HighTerm gets ~20.43 QPS with trunk but ~10.82 with the patch.)
Also: trunk took ~63 MB for the DV files while the patch took ~84 MB. Net/net I think postings compress better with PackedInts than facet ords do (at least for the 9 facet fields I'm using in Wikipedia).
> Write a PackedIntsEncoder/Decoder for facets
> --------------------------------------------
>
> Key: LUCENE-4609
> URL: https://issues.apache.org/jira/browse/LUCENE-4609
> Project: Lucene - Core
> Issue Type: New Feature
> Components: modules/facet
> Reporter: Shai Erera
> Priority: Minor
> Attachments: LUCENE-4609.patch, LUCENE-4609.patch
>
>
> Today the facets API lets you write IntEncoder/Decoder to encode/decode the category ordinals. We have several such encoders, including VInt (default), and block encoders.
> It would be interesting to implement and benchmark a PackedIntsEncoder/Decoder, with potentially two variants: (1) receives bitsPerValue up front, when you e.g. know that you have a small taxonomy and the max value you can see and (2) one that decides for each doc on the optimal bitsPerValue, writes it as a header in the byte[] or something.