Posted to dev@lucene.apache.org by "Adrien Grand (JIRA)" <ji...@apache.org> on 2013/01/21 14:30:12 UTC

[jira] [Updated] (LUCENE-4702) Terms dictionary compression

     [ https://issues.apache.org/jira/browse/LUCENE-4702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adrien Grand updated LUCENE-4702:
---------------------------------

    Attachment: LUCENE-4702.patch

Here is the patch I used (very ugly, for testing only).
                
> Terms dictionary compression
> ----------------------------
>
>                 Key: LUCENE-4702
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4702
>             Project: Lucene - Core
>          Issue Type: Wish
>            Reporter: Adrien Grand
>            Assignee: Adrien Grand
>            Priority: Trivial
>         Attachments: LUCENE-4702.patch
>
>
> I've done a quick test with the block tree terms dictionary: I replaced the call to IndexOutput.writeBytes that writes suffix bytes with a call to LZ4.compressHC to measure the performance hit. Interestingly, search performance was very good (see the comparison table below) and the .tim files were 14% smaller (from 150432 bytes overall to 129516).
> {noformat}
>                     Task  QPS baseline      StdDev  QPS compressed      StdDev                Pct diff
>                   Fuzzy1      111.50      (2.0%)       78.78      (1.5%)  -29.4% ( -32% -  -26%)
>                   Fuzzy2       36.99      (2.7%)       28.59      (1.5%)  -22.7% ( -26% -  -18%)
>                  Respell      122.86      (2.1%)      103.89      (1.7%)  -15.4% ( -18% -  -11%)
>                 Wildcard      100.58      (4.3%)       94.42      (3.2%)   -6.1% ( -13% -    1%)
>                  Prefix3      124.90      (5.7%)      122.67      (4.7%)   -1.8% ( -11% -    9%)
>                OrHighLow      169.87      (6.8%)      167.77      (8.0%)   -1.2% ( -15% -   14%)
>                  LowTerm     1949.85      (4.5%)     1929.02      (3.4%)   -1.1% (  -8% -    7%)
>               AndHighLow     2011.95      (3.5%)     1991.85      (3.3%)   -1.0% (  -7% -    5%)
>               OrHighHigh      155.63      (6.7%)      154.12      (7.9%)   -1.0% ( -14% -   14%)
>              AndHighHigh      341.82      (1.2%)      339.49      (1.7%)   -0.7% (  -3% -    2%)
>                OrHighMed      217.55      (6.3%)      216.16      (7.1%)   -0.6% ( -13% -   13%)
>                   IntNRQ       53.10     (10.9%)       52.90      (8.6%)   -0.4% ( -17% -   21%)
>                  MedTerm      998.11      (3.8%)      994.82      (5.6%)   -0.3% (  -9% -    9%)
>              MedSpanNear       60.50      (3.7%)       60.36      (4.8%)   -0.2% (  -8% -    8%)
>             HighSpanNear       19.74      (4.5%)       19.72      (5.1%)   -0.1% (  -9% -    9%)
>              LowSpanNear      101.93      (3.2%)      101.82      (4.4%)   -0.1% (  -7% -    7%)
>               AndHighMed      366.18      (1.7%)      366.93      (1.7%)    0.2% (  -3% -    3%)
>                 PKLookup      237.28      (4.0%)      237.96      (4.2%)    0.3% (  -7% -    8%)
>                MedPhrase      173.17      (4.7%)      174.69      (4.7%)    0.9% (  -8% -   10%)
>          LowSloppyPhrase      180.91      (2.6%)      182.79      (2.7%)    1.0% (  -4% -    6%)
>                LowPhrase      374.64      (5.5%)      379.11      (5.8%)    1.2% (  -9% -   13%)
>                 HighTerm      253.14      (7.9%)      256.97     (11.4%)    1.5% ( -16% -   22%)
>               HighPhrase       19.52     (10.6%)       19.83     (11.0%)    1.6% ( -18% -   25%)
>          MedSloppyPhrase      141.90      (2.6%)      144.11      (2.5%)    1.6% (  -3% -    6%)
>         HighSloppyPhrase       25.26      (4.8%)       25.97      (5.0%)    2.8% (  -6% -   13%)
> {noformat}
> Only queries that are very terms-dictionary-intensive took a performance hit (Fuzzy1, Fuzzy2, Respell, Wildcard); other queries, including Prefix3, behaved (surprisingly) well.
> Do you think this is worth exploring?
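
To make the experiment described above concrete, here is a minimal, self-contained sketch of the idea: compress one block of concatenated suffix bytes before writing it, then compare sizes. It is not the code from the attached patch. Since the JDK does not ship LZ4, java.util.zip.Deflater stands in for LZ4.compressHC, and the class name, method names, and sample suffixes are all made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical sketch: NOT the LUCENE-4702 patch. Deflater is a stand-in for LZ4.
public class SuffixCompressionSketch {

    // Compress one block of concatenated term suffix bytes.
    static byte[] compress(byte[] suffixBytes) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        try {
            deflater.setInput(suffixBytes);
            deflater.finish();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[256];
            while (!deflater.finished()) {
                int n = deflater.deflate(buf);
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } finally {
            deflater.end(); // release native resources
        }
    }

    // Restore a block; the reader is assumed to know the original length.
    static byte[] decompress(byte[] compressed, int originalLength) throws DataFormatException {
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(compressed);
            byte[] restored = new byte[originalLength];
            int off = 0;
            while (off < originalLength && !inflater.finished()) {
                int n = inflater.inflate(restored, off, originalLength - off);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new DataFormatException("truncated or unsupported input");
                }
                off += n;
            }
            if (off != originalLength) {
                throw new DataFormatException("unexpected decompressed length");
            }
            return restored;
        } finally {
            inflater.end();
        }
    }

    public static void main(String[] args) throws DataFormatException {
        // Made-up suffixes of terms that share a prefix within one block.
        String[] suffixes = {"ed", "es", "ing", "ion", "ions", "ional", "ionally", "or", "ors"};
        ByteArrayOutputStream block = new ByteArrayOutputStream();
        for (int repeat = 0; repeat < 20; repeat++) { // repeat to mimic a larger block
            for (String s : suffixes) {
                byte[] b = s.getBytes(StandardCharsets.UTF_8);
                block.write(b, 0, b.length);
            }
        }
        byte[] raw = block.toByteArray();
        byte[] packed = compress(raw);
        byte[] restored = decompress(packed, raw.length);

        System.out.println("raw bytes:        " + raw.length);
        System.out.println("compressed bytes: " + packed.length);
        System.out.println("round trip ok:    " + Arrays.equals(raw, restored));
    }
}
```

The trade-off matches the benchmark table: every block that a query visits must be decompressed before its suffixes can be scanned, so queries that walk many terms-dictionary blocks pay the most. Very small blocks may not shrink at all, because the compressed form carries fixed overhead.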
