You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucene.apache.org by "Adrien Grand (JIRA)" <ji...@apache.org> on 2016/05/13 12:53:12 UTC

[jira] [Commented] (LUCENE-4702) Terms dictionary compression

    [ https://issues.apache.org/jira/browse/LUCENE-4702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15282672#comment-15282672 ] 

Adrien Grand commented on LUCENE-4702:
--------------------------------------

I've had interest in this patch again for compression of uid fields. For instance if you are using flake ids (https://github.com/boundary/flake 16 bytes: 8 bytes for the timestamp, 6 for the mac address and 2 for a counter), there is a lot of redundancy in the suffixes due to the mac address component. It is possible to help compression by moving the mac address closer to the beginning of the id (so that the prefix compression that the terms dict performs helps, and since Lucene does not need ids to be sortable) but then this makes lookups slower since we need to walk more bytes of the id before realizing it does not belong to a segment. 

For instance I simulated a 100 docs/s indexing rate with 4 unique mac addresses, and the size of the terms dictionary decreased from 16.4 bytes per id to ~13 bytes per id (-20%) with the patch. Maybe we could fold it into the block tree terms dict but only enable it on unique fields and/or if the codec is configured with BEST_COMPRESSION.

> Terms dictionary compression
> ----------------------------
>
>                 Key: LUCENE-4702
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4702
>             Project: Lucene - Core
>          Issue Type: Wish
>            Reporter: Adrien Grand
>            Assignee: Adrien Grand
>            Priority: Trivial
>         Attachments: LUCENE-4702.patch, LUCENE-4702.patch
>
>
> I've done a quick test with the block tree terms dictionary by replacing a call to IndexOutput.writeBytes to write suffix bytes with a call to LZ4.compressHC to test the peformance hit. Interestingly, search performance was very good (see comparison table below) and the tim files were 14% smaller (from 150432 bytes overall to 129516).
> {noformat}
>                     TaskQPS baseline      StdDevQPS compressed      StdDev                Pct diff
>                   Fuzzy1      111.50      (2.0%)       78.78      (1.5%)  -29.4% ( -32% -  -26%)
>                   Fuzzy2       36.99      (2.7%)       28.59      (1.5%)  -22.7% ( -26% -  -18%)
>                  Respell      122.86      (2.1%)      103.89      (1.7%)  -15.4% ( -18% -  -11%)
>                 Wildcard      100.58      (4.3%)       94.42      (3.2%)   -6.1% ( -13% -    1%)
>                  Prefix3      124.90      (5.7%)      122.67      (4.7%)   -1.8% ( -11% -    9%)
>                OrHighLow      169.87      (6.8%)      167.77      (8.0%)   -1.2% ( -15% -   14%)
>                  LowTerm     1949.85      (4.5%)     1929.02      (3.4%)   -1.1% (  -8% -    7%)
>               AndHighLow     2011.95      (3.5%)     1991.85      (3.3%)   -1.0% (  -7% -    5%)
>               OrHighHigh      155.63      (6.7%)      154.12      (7.9%)   -1.0% ( -14% -   14%)
>              AndHighHigh      341.82      (1.2%)      339.49      (1.7%)   -0.7% (  -3% -    2%)
>                OrHighMed      217.55      (6.3%)      216.16      (7.1%)   -0.6% ( -13% -   13%)
>                   IntNRQ       53.10     (10.9%)       52.90      (8.6%)   -0.4% ( -17% -   21%)
>                  MedTerm      998.11      (3.8%)      994.82      (5.6%)   -0.3% (  -9% -    9%)
>              MedSpanNear       60.50      (3.7%)       60.36      (4.8%)   -0.2% (  -8% -    8%)
>             HighSpanNear       19.74      (4.5%)       19.72      (5.1%)   -0.1% (  -9% -    9%)
>              LowSpanNear      101.93      (3.2%)      101.82      (4.4%)   -0.1% (  -7% -    7%)
>               AndHighMed      366.18      (1.7%)      366.93      (1.7%)    0.2% (  -3% -    3%)
>                 PKLookup      237.28      (4.0%)      237.96      (4.2%)    0.3% (  -7% -    8%)
>                MedPhrase      173.17      (4.7%)      174.69      (4.7%)    0.9% (  -8% -   10%)
>          LowSloppyPhrase      180.91      (2.6%)      182.79      (2.7%)    1.0% (  -4% -    6%)
>                LowPhrase      374.64      (5.5%)      379.11      (5.8%)    1.2% (  -9% -   13%)
>                 HighTerm      253.14      (7.9%)      256.97     (11.4%)    1.5% ( -16% -   22%)
>               HighPhrase       19.52     (10.6%)       19.83     (11.0%)    1.6% ( -18% -   25%)
>          MedSloppyPhrase      141.90      (2.6%)      144.11      (2.5%)    1.6% (  -3% -    6%)
>         HighSloppyPhrase       25.26      (4.8%)       25.97      (5.0%)    2.8% (  -6% -   13%)
> {noformat}
> Only queries which are very terms-dictionary-intensive got a performance hit (Fuzzy, Fuzzy2, Respell, Wildcard), other queries including Prefix3 behaved (surprisingly) well.
> Do you think of it as something worth exploring?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org