Posted to dev@lucene.apache.org by "Michael McCandless (JIRA)" <ji...@apache.org> on 2013/01/16 23:10:13 UTC

[jira] [Updated] (LUCENE-4609) Write a PackedIntsEncoder/Decoder for facets

     [ https://issues.apache.org/jira/browse/LUCENE-4609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-4609:
---------------------------------------

    Attachment: LUCENE-4609.patch

Patch, w/ a "custom" packed ints encoder/decoder (i.e. not using our PackedInts APIs).  It uses only as many bytes as are necessary, and packs bpv (bits per value) & "leftoverBits" into a single-byte header.
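
For illustration only (this is not the code in the attached patch): a minimal sketch of a packed ints encoder/decoder with a single-byte header. The header layout is an assumption: the high 5 bits hold bpv-1 and the low 3 bits hold the number of unused padding bits in the last byte, which is one plausible reading of "leftoverBits". The class and method names are hypothetical.

```java
import java.util.Arrays;

class PackedOrdsSketch {

  /** Packs non-negative ints; byte 0 is the header: ((bpv - 1) << 3) | unusedBitsInLastByte. */
  static byte[] encode(int[] values) {
    if (values.length == 0) {
      return new byte[0]; // assumption: an empty input encodes to zero bytes
    }
    int max = 0;
    for (int v : values) {
      if (v < 0) {
        throw new IllegalArgumentException("negative value: " + v);
      }
      max = Math.max(max, v);
    }
    // Bits needed for the largest value (at least 1, so all-zero input still works).
    int bpv = Math.max(1, 32 - Integer.numberOfLeadingZeros(max));
    int totalBits = values.length * bpv;
    int numBytes = (totalBits + 7) / 8;     // only as many bytes as necessary
    int unused = numBytes * 8 - totalBits;  // 0..7 padding bits in the last byte
    byte[] out = new byte[1 + numBytes];
    out[0] = (byte) (((bpv - 1) << 3) | unused);
    int bitPos = 0;
    for (int v : values) {
      for (int b = bpv - 1; b >= 0; b--) {  // write each value MSB first
        if (((v >>> b) & 1) != 0) {
          out[1 + (bitPos >>> 3)] |= (byte) (0x80 >>> (bitPos & 7));
        }
        bitPos++;
      }
    }
    return out;
  }

  /** Reverses encode(): reads the header, derives the value count, then unpacks bit by bit. */
  static int[] decode(byte[] in) {
    if (in.length == 0) {
      return new int[0];
    }
    int header = in[0] & 0xFF;
    int bpv = (header >>> 3) + 1;
    int unused = header & 7;
    int count = ((in.length - 1) * 8 - unused) / bpv;
    int[] values = new int[count];
    int bitPos = 0;
    for (int i = 0; i < count; i++) {
      int v = 0;
      for (int b = 0; b < bpv; b++) {
        int bit = (in[1 + (bitPos >>> 3)] >>> (7 - (bitPos & 7))) & 1;
        v = (v << 1) | bit;
        bitPos++;
      }
      values[i] = v;
    }
    return values;
  }

  public static void main(String[] args) {
    int[] ords = {3, 5, 7, 2};
    byte[] packed = encode(ords);
    System.out.println(packed.length + " bytes, round trip ok: "
        + Arrays.equals(ords, decode(packed)));
  }
}
```

The bit-at-a-time loops keep the sketch short; a real decoder would read whole bytes or longs, which is where most of the optimization room mentioned below would come from.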

I tested on the first 1M Wikipedia docs ... and performance is much worse than the current default in trunk... admittedly it's not quite fair (trunk has a specialized vInt/dGap decoder, but the patch leaves dGap separate from the packed int decode), and admittedly this decoder will be slower than the optimized oal.util.PackedInts ... but perf is so far off that I find it hard to believe PackedInts can match vInt even after optimizing.
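
To make the "specialized vInt/dGap decoder" point concrete, here is a hedged sketch of a decoder that undoes the dGap (delta) step inside the same loop that reads the vInt bytes, rather than as a separate pass. It assumes ordinals are sorted ascending and uses a Lucene DataOutput-style vInt (7 payload bits per byte, low-order bits first, high bit set when more bytes follow); the byte order of trunk's actual facet encoder may differ, and the class name is hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

class DGapVIntSketch {

  /** Writes each ordinal as the vInt of its gap from the previous ordinal. */
  static byte[] encode(int[] sortedOrds) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    int prev = 0;
    for (int ord : sortedOrds) {
      int gap = ord - prev; // non-negative because the input is sorted
      prev = ord;
      while ((gap & ~0x7F) != 0) {
        out.write((gap & 0x7F) | 0x80); // 7 payload bits, high bit = "more follows"
        gap >>>= 7;
      }
      out.write(gap);
    }
    return out.toByteArray();
  }

  /** Decodes vInt and re-accumulates the gaps in one pass. */
  static int[] decode(byte[] in) {
    int[] ords = new int[in.length]; // upper bound: every value takes at least one byte
    int count = 0;
    int prev = 0;
    int pos = 0;
    while (pos < in.length) {
      int b = in[pos++];
      int gap = b & 0x7F;
      for (int shift = 7; b < 0; shift += 7) { // b < 0 means the high bit was set
        b = in[pos++];
        gap |= (b & 0x7F) << shift;
      }
      prev += gap;          // dGap undone here, fused with the vInt read
      ords[count++] = prev;
    }
    return Arrays.copyOf(ords, count);
  }

  public static void main(String[] args) {
    int[] ords = {3, 10, 300, 301};
    byte[] bytes = encode(ords);
    System.out.println(bytes.length + " bytes, round trip ok: "
        + Arrays.equals(ords, decode(bytes)));
  }
}
```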

Trunk gets these results:
{noformat}
                    Task    QPS base      StdDev    QPS comp      StdDev                Pct diff
                PKLookup      203.77      (1.8%)      202.25      (1.8%)   -0.7% (  -4% -    2%)
                HighTerm       20.43      (1.8%)       20.53      (0.8%)    0.5% (  -2% -    3%)
                 MedTerm       33.12      (1.7%)       33.30      (0.9%)    0.5% (  -2% -    3%)
                 LowTerm       87.55      (3.0%)       88.59      (2.5%)    1.2% (  -4% -    6%)
{noformat}

Patch gets this:
{noformat}
                    Task    QPS base      StdDev    QPS comp      StdDev                Pct diff
                HighTerm       10.82      (3.6%)       10.69      (4.4%)   -1.2% (  -8% -    7%)
                 MedTerm       19.33      (3.2%)       19.10      (4.0%)   -1.2% (  -8% -    6%)
                 LowTerm       67.75      (2.8%)       67.11      (3.0%)   -0.9% (  -6% -    5%)
                PKLookup      196.49      (1.0%)      196.24      (1.9%)   -0.1% (  -3% -    2%)
{noformat}

(NOTE: base/comp are the same in each run, so ignore the differences w/in each run (it's noise) and compare absolute across the two runs ... ie HighTerm gets ~20.43 QPS with trunk but ~10.82 with patch).
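
As a worked version of that cross-run comparison, the small sketch below takes the "QPS base" column from each of the two tables above and computes the relative change from trunk to patch. The class and method names are hypothetical; the numbers are copied from the tables.

```java
import java.util.Locale;

class CrossRunDiff {

  /** Percent change in QPS going from trunk to patch (negative = slower). */
  static double pctChange(double trunkQps, double patchQps) {
    return 100.0 * (patchQps - trunkQps) / trunkQps;
  }

  public static void main(String[] args) {
    String[] tasks = {"HighTerm", "MedTerm", "LowTerm", "PKLookup"};
    double[] trunk = {20.43, 33.12, 87.55, 203.77}; // QPS base, trunk run
    double[] patch = {10.82, 19.33, 67.75, 196.49}; // QPS base, patch run
    for (int i = 0; i < tasks.length; i++) {
      System.out.printf(Locale.ROOT, "%10s %8.2f -> %8.2f  %+6.1f%%%n",
          tasks[i], trunk[i], patch[i], pctChange(trunk[i], patch[i]));
    }
  }
}
```

That works out to roughly -47% for HighTerm, -42% for MedTerm, -23% for LowTerm and -4% for PKLookup.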

Also: trunk took ~63 MB for the DV files while patch took ~84 MB (about a third more).  Net/net I think postings compress better with PackedInts than facet ords do (at least for these 9 facet fields I'm using in Wikipedia)...
                
> Write a PackedIntsEncoder/Decoder for facets
> --------------------------------------------
>
>                 Key: LUCENE-4609
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4609
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: modules/facet
>            Reporter: Shai Erera
>            Priority: Minor
>         Attachments: LUCENE-4609.patch, LUCENE-4609.patch
>
>
> Today the facets API lets you write IntEncoder/Decoder to encode/decode the category ordinals. We have several such encoders, including VInt (default), and block encoders.
> It would be interesting to implement and benchmark a PackedIntsEncoder/Decoder, with potentially two variants: (1) receives bitsPerValue up front, when you e.g. know that you have a small taxonomy and the max value you can see and (2) one that decides for each doc on the optimal bitsPerValue, writes it as a header in the byte[] or something.
