Posted to dev@lucene.apache.org by "Adrien Grand (JIRA)" <ji...@apache.org> on 2016/05/13 12:53:12 UTC
[jira] [Commented] (LUCENE-4702) Terms dictionary compression
[ https://issues.apache.org/jira/browse/LUCENE-4702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15282672#comment-15282672 ]
Adrien Grand commented on LUCENE-4702:
--------------------------------------
I've become interested in this patch again for compressing uid fields. For instance, if you are using flake ids (https://github.com/boundary/flake; 16 bytes: 8 for the timestamp, 6 for the MAC address and 2 for a counter), there is a lot of redundancy in the suffixes due to the MAC address component. It is possible to help compression by moving the MAC address closer to the beginning of the id, so that the prefix compression performed by the terms dict applies to it (Lucene does not need ids to be sortable). However, this makes lookups slower, since we need to walk more bytes of the id before realizing that it does not belong to a segment.
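To make the layout trade-off concrete, here is a minimal sketch of the two byte orders discussed above. The class and method names (FlakeIdLayout, timestampFirst, macFirst) are hypothetical and belong neither to Lucene nor to the flake project; the only thing taken from the comment is the 8 + 6 + 2 byte split.

```java
import java.nio.ByteBuffer;

// Hypothetical illustration only: not Lucene code and not the flake library.
final class FlakeIdLayout {

    // Layout described in the comment: 8-byte timestamp, 6-byte MAC, 2-byte counter.
    static byte[] timestampFirst(long timestamp, byte[] mac, short counter) {
        return ByteBuffer.allocate(16)
                .putLong(timestamp)
                .put(mac, 0, 6)
                .putShort(counter)
                .array();
    }

    // Alternative layout: MAC moved to the front so that ids from the same
    // host share a longer prefix.
    static byte[] macFirst(long timestamp, byte[] mac, short counter) {
        return ByteBuffer.allocate(16)
                .put(mac, 0, 6)
                .putLong(timestamp)
                .putShort(counter)
                .array();
    }

    // Number of leading bytes that two ids have in common.
    static int commonPrefixLength(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        int i = 0;
        while (i < n && a[i] == b[i]) {
            i++;
        }
        return i;
    }

    public static void main(String[] args) {
        byte[] mac = {0x00, 0x1B, 0x44, 0x11, 0x3A, (byte) 0xB7}; // made-up MAC
        long t1 = 1000L;
        long t2 = 1010L; // a slightly later timestamp from the same host

        int tsFirst = commonPrefixLength(
                timestampFirst(t1, mac, (short) 0), timestampFirst(t2, mac, (short) 0));
        int macFirstLen = commonPrefixLength(
                macFirst(t1, mac, (short) 0), macFirst(t2, mac, (short) 0));

        System.out.println("shared prefix, timestamp first: " + tsFirst + " bytes");
        System.out.println("shared prefix, MAC first:       " + macFirstLen + " bytes");
    }
}
```

With the MAC first, two ids from the same host share the 6 MAC bytes in addition to the leading timestamp bytes, which is what helps prefix compression. The same longer shared prefix is also why a lookup for an absent id has to compare more bytes before it can be rejected.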
As an experiment, I simulated a 100 docs/s indexing rate with 4 unique MAC addresses, and the size of the terms dictionary decreased from 16.4 bytes per id to ~13 bytes per id (-20%) with the patch. Maybe we could fold it into the block tree terms dict but only enable it on unique fields and/or if the codec is configured with BEST_COMPRESSION.
> Terms dictionary compression
> ----------------------------
>
> Key: LUCENE-4702
> URL: https://issues.apache.org/jira/browse/LUCENE-4702
> Project: Lucene - Core
> Issue Type: Wish
> Reporter: Adrien Grand
> Assignee: Adrien Grand
> Priority: Trivial
> Attachments: LUCENE-4702.patch, LUCENE-4702.patch
>
>
> I ran a quick test with the block tree terms dictionary, replacing the call to IndexOutput.writeBytes that writes suffix bytes with a call to LZ4.compressHC, in order to measure the performance hit. Interestingly, search performance was very good (see comparison table below) and the tim files were 14% smaller (from 150432 bytes overall to 129516 bytes).
> {noformat}
> Task QPS baseline StdDev QPS compressed StdDev Pct diff
> Fuzzy1 111.50 (2.0%) 78.78 (1.5%) -29.4% ( -32% - -26%)
> Fuzzy2 36.99 (2.7%) 28.59 (1.5%) -22.7% ( -26% - -18%)
> Respell 122.86 (2.1%) 103.89 (1.7%) -15.4% ( -18% - -11%)
> Wildcard 100.58 (4.3%) 94.42 (3.2%) -6.1% ( -13% - 1%)
> Prefix3 124.90 (5.7%) 122.67 (4.7%) -1.8% ( -11% - 9%)
> OrHighLow 169.87 (6.8%) 167.77 (8.0%) -1.2% ( -15% - 14%)
> LowTerm 1949.85 (4.5%) 1929.02 (3.4%) -1.1% ( -8% - 7%)
> AndHighLow 2011.95 (3.5%) 1991.85 (3.3%) -1.0% ( -7% - 5%)
> OrHighHigh 155.63 (6.7%) 154.12 (7.9%) -1.0% ( -14% - 14%)
> AndHighHigh 341.82 (1.2%) 339.49 (1.7%) -0.7% ( -3% - 2%)
> OrHighMed 217.55 (6.3%) 216.16 (7.1%) -0.6% ( -13% - 13%)
> IntNRQ 53.10 (10.9%) 52.90 (8.6%) -0.4% ( -17% - 21%)
> MedTerm 998.11 (3.8%) 994.82 (5.6%) -0.3% ( -9% - 9%)
> MedSpanNear 60.50 (3.7%) 60.36 (4.8%) -0.2% ( -8% - 8%)
> HighSpanNear 19.74 (4.5%) 19.72 (5.1%) -0.1% ( -9% - 9%)
> LowSpanNear 101.93 (3.2%) 101.82 (4.4%) -0.1% ( -7% - 7%)
> AndHighMed 366.18 (1.7%) 366.93 (1.7%) 0.2% ( -3% - 3%)
> PKLookup 237.28 (4.0%) 237.96 (4.2%) 0.3% ( -7% - 8%)
> MedPhrase 173.17 (4.7%) 174.69 (4.7%) 0.9% ( -8% - 10%)
> LowSloppyPhrase 180.91 (2.6%) 182.79 (2.7%) 1.0% ( -4% - 6%)
> LowPhrase 374.64 (5.5%) 379.11 (5.8%) 1.2% ( -9% - 13%)
> HighTerm 253.14 (7.9%) 256.97 (11.4%) 1.5% ( -16% - 22%)
> HighPhrase 19.52 (10.6%) 19.83 (11.0%) 1.6% ( -18% - 25%)
> MedSloppyPhrase 141.90 (2.6%) 144.11 (2.5%) 1.6% ( -3% - 6%)
> HighSloppyPhrase 25.26 (4.8%) 25.97 (5.0%) 2.8% ( -6% - 13%)
> {noformat}
> Only queries that are very terms-dictionary-intensive took a performance hit (Fuzzy1, Fuzzy2, Respell, Wildcard); other queries, including Prefix3, behaved (surprisingly) well.
> Do you think this is worth exploring?
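The idea in the description (compress the suffix bytes of a block instead of writing them raw) can be sketched roughly as follows. This is not the attached patch: to stay runnable with the JDK alone, it uses java.util.zip.Deflater as a stand-in for LZ4.compressHC, and the class name, the sampleSuffixes helper, and the 9-byte suffix shape are all assumptions made for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical sketch: Deflater stands in for LZ4.compressHC; not the LUCENE-4702 patch.
final class SuffixBlockCompression {

    // Builds the concatenated suffix bytes of one made-up block: each suffix is
    // 1 varying timestamp byte + a 6-byte MAC (4 distinct hosts) + a 2-byte counter.
    static byte[] sampleSuffixes(int count) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < count; i++) {
            out.write(i);                      // low timestamp byte, varies per term
            for (int j = 0; j < 5; j++) {
                out.write(0x1B);               // shared MAC bytes
            }
            out.write(i % 4);                  // last MAC byte: 4 unique hosts
            out.write(0);                      // counter, high byte
            out.write(0);                      // counter, low byte
        }
        return out.toByteArray();
    }

    // Compresses the whole suffix buffer in one shot, in place of a raw write.
    static byte[] compress(byte[] raw) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[256];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Restores the raw suffix bytes; the reader is assumed to know the raw length.
    static byte[] decompress(byte[] compressed, int rawLength) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        byte[] raw = new byte[rawLength];
        int off = 0;
        try {
            while (off < rawLength && !inflater.finished()) {
                int n = inflater.inflate(raw, off, rawLength - off);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    break; // truncated or unexpected input
                }
                off += n;
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException("corrupt suffix block", e);
        } finally {
            inflater.end();
        }
        return raw;
    }

    public static void main(String[] args) {
        byte[] raw = sampleSuffixes(64);
        byte[] compressed = compress(raw);
        System.out.println("raw suffix bytes:        " + raw.length);
        System.out.println("compressed suffix bytes: " + compressed.length);
    }
}
```

The cost side of the trade-off is visible in decompress: a reader has to inflate the suffix buffer before it can scan terms in the block, which is consistent with the terms-dictionary-intensive tasks being the ones that slowed down in the table.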