Posted to dev@lucene.apache.org by "Bruno Roustant (JIRA)" <ji...@apache.org> on 2019/04/03 14:52:00 UTC

[jira] [Commented] (LUCENE-8753) New PostingFormat - UniformSplit

    [ https://issues.apache.org/jira/browse/LUCENE-8753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16808799#comment-16808799 ] 

Bruno Roustant commented on LUCENE-8753:
----------------------------------------

Here are the luceneutil benchmark results on the wikimedium500k data set, run with Java 8. The run is somewhat dated because it used Lucene 7.1; it would be good to rerun it against master.

 

Report after iter 19:
Task                  QPS blocktree   StdDev  QPS uniformsplit   StdDev  Pct diff
Fuzzy1                       508.47   (3.8%)            221.37   (0.9%)    -56.5%  ( -58% -  -53%)
Fuzzy2                       171.73   (6.4%)             80.62   (1.4%)    -53.1%  ( -57% -  -48%)
PKLookup                     182.47   (2.4%)            149.62   (2.5%)    -18.0%  ( -22% -  -13%)
Wildcard                    1788.74   (5.9%)           1729.37   (4.5%)     -3.3%  ( -12% -    7%)
IntNRQ                      1561.48   (2.1%)           1564.33   (1.9%)      0.2%  (  -3% -    4%)
Prefix3                     1759.69   (5.0%)           1829.74   (4.8%)      4.0%  (  -5% -   14%)
HighTermDayOfYearSort        586.06   (5.4%)            622.34   (8.2%)      6.2%  (  -6% -   20%)
MedPhrase                   1204.85   (5.5%)           1282.89   (7.7%)      6.5%  (  -6% -   20%)
HighSpanNear                 590.88   (4.1%)            629.64   (6.1%)      6.6%  (  -3% -   17%)
OrHighMed                   1101.48   (4.5%)           1220.75   (6.2%)     10.8%  (   0% -   22%)
HighTermMonthSort           2617.10   (2.6%)           2916.34   (4.6%)     11.4%  (   4% -   19%)
HighPhrase                   961.04   (5.5%)           1073.62   (6.0%)     11.7%  (   0% -   24%)
MedSloppyPhrase              604.56  (13.3%)            680.31  (13.7%)     12.5%  ( -12% -   45%)
LowSloppyPhrase              954.87   (8.1%)           1075.67   (5.4%)     12.7%  (   0% -   28%)
MedSpanNear                  737.14   (5.8%)            830.68   (8.3%)     12.7%  (  -1% -   28%)
OrHighHigh                   811.57   (5.7%)            915.01   (6.2%)     12.7%  (   0% -   26%)
AndHighMed                  1157.45   (5.3%)           1317.78   (5.1%)     13.9%  (   3% -   25%)
AndHighHigh                 1095.29   (5.7%)           1254.16   (4.9%)     14.5%  (   3% -   26%)
HighSloppyPhrase             880.42   (8.2%)           1009.72   (7.0%)     14.7%  (   0% -   32%)
LowPhrase                   1245.33   (6.0%)           1473.57   (4.4%)     18.3%  (   7% -   30%)
Respell                       81.10  (12.7%)             99.43  (10.3%)     22.6%  (   0% -   52%)
HighTerm                    3733.81   (6.1%)           4599.96   (6.8%)     23.2%  (   9% -   38%)
OrHighLow                   1960.13   (6.2%)           2415.81   (6.0%)     23.2%  (  10% -   37%)
MedTerm                     4411.60   (4.9%)           5450.56   (5.8%)     23.6%  (  12% -   35%)
LowSpanNear                 1944.27   (5.3%)           2416.29   (4.5%)     24.3%  (  13% -   36%)
AndHighLow                  1978.10   (7.6%)           2500.74   (5.8%)     26.4%  (  12% -   43%)
LowTerm                     4949.24   (4.8%)           6589.86   (5.3%)     33.1%  (  22% -   45%)
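
For readers checking the numbers, the "Pct diff" column appears to be the relative QPS change of uniformsplit over blocktree. The sketch below assumes that definition (it matches the rows above to one decimal place) and is not taken from luceneutil itself; the class and method names are hypothetical. It does not attempt to reproduce the parenthesized range.

```java
import java.util.Locale;

final class PctDiffSketch {

  /** Relative change of the candidate QPS over the baseline QPS, in percent. */
  static double pctDiff(double baselineQps, double candidateQps) {
    return 100.0 * (candidateQps - baselineQps) / baselineQps;
  }

  public static void main(String[] args) {
    // QPS values copied from the Fuzzy1, PKLookup and LowTerm rows of the report.
    System.out.printf(Locale.ROOT, "Fuzzy1   %.1f%%%n", pctDiff(508.47, 221.37));
    System.out.printf(Locale.ROOT, "PKLookup %.1f%%%n", pctDiff(182.47, 149.62));
    System.out.printf(Locale.ROOT, "LowTerm  %.1f%%%n", pctDiff(4949.24, 6589.86));
  }
}
```

A negative value means uniformsplit served fewer queries per second than blocktree for that task.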

> New PostingFormat - UniformSplit
> --------------------------------
>
>                 Key: LUCENE-8753
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8753
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/codecs
>    Affects Versions: 8.0
>            Reporter: Bruno Roustant
>            Priority: Major
>         Attachments: Uniform Split Technique.pdf
>
>
> This is a proposal to add a new PostingsFormat called "UniformSplit" with 4 objectives:
> - Clear design and simple code.
> - Easily extensible, for both the logic and the index format.
> - Light memory usage with a very compact FST.
> - Focus on efficient TermQuery, PhraseQuery and PrefixQuery performance.
> (The attached PDF explains the technique visually and in more detail.)
> The principle is to split the list of terms into blocks and use an FST to access the blocks, not as a prefix trie but with a seek-floor pattern. To select the block boundaries, there is a target average block size (in number of terms) with an allowed delta variation (10%); the terms within that range are compared and the one with the minimal distinguishing prefix is selected.
> There are also several optimizations inside the block to make it more compact and speed up the loading/scanning.
> The luceneutil benchmark comparing UniformSplit with BlockTree shows interesting performance. The results are in the first comment.
>  
> Although the precise percentages vary between runs, three main points stand out:
> - TermQuery and PhraseQuery are improved.
> - PrefixQuery and WildcardQuery are OK.
> - Fuzzy queries are clearly slower, because BlockTree is heavily optimized for them.
> Compared to BlockTree, FST size is reduced by 15%, and segment writing time is reduced by 20%. So this PostingsFormat scales to large numbers of docs, as BlockTree does.
>  
>  This initial version passes all Lucene tests. Use “ant test -Dtests.codec=UniformSplitTesting” to test with this PostingsFormat.
> Subjectively, we think we have fulfilled our goal of code simplicity. We have also already exercised this PostingsFormat's extensibility to create a different flavor for our own use case.
>  
>  Contributors: Juan Camilo Rodriguez Duran, Bruno Roustant, David Smiley
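
The block selection and seek-floor lookup described in the issue text can be illustrated with a small sketch. This is not the code from the patch: it assumes that "minimal distinguishing prefix" means the shortest prefix of a term that sorts after the preceding term, it uses a java.util.TreeMap in place of the FST, and it works on String terms rather than bytes. All class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class UniformSplitSketch {

  /** Length of the shortest prefix of term that sorts after previous (input sorted and distinct). */
  static int mdpLength(String previous, String term) {
    int limit = Math.min(previous.length(), term.length());
    int i = 0;
    while (i < limit && previous.charAt(i) == term.charAt(i)) {
      i++;
    }
    return i + 1;
  }

  /** Splits sorted, distinct terms into blocks keyed by the distinguishing prefix of each block's first term. */
  static TreeMap<String, List<String>> split(List<String> sortedTerms, int targetSize, double delta) {
    TreeMap<String, List<String>> blocks = new TreeMap<>();
    int min = Math.max(1, (int) Math.floor(targetSize * (1 - delta)));
    int max = Math.max(min, (int) Math.ceil(targetSize * (1 + delta)));
    int start = 0;
    String key = ""; // the first block gets the empty key, which sorts before every term
    while (start < sortedTerms.size()) {
      int end = sortedTerms.size(); // default: the remaining terms form the last block
      String nextKey = null;
      if (sortedTerms.size() - start > max) {
        // Among the candidate boundaries within the allowed window around the
        // target size, keep the first one with the shortest distinguishing prefix.
        int bestLen = Integer.MAX_VALUE;
        for (int i = start + min; i <= start + max; i++) {
          int len = mdpLength(sortedTerms.get(i - 1), sortedTerms.get(i));
          if (len < bestLen) {
            bestLen = len;
            end = i;
          }
        }
        nextKey = sortedTerms.get(end).substring(0, bestLen);
      }
      blocks.put(key, new ArrayList<>(sortedTerms.subList(start, end)));
      start = end;
      key = nextKey;
    }
    return blocks;
  }

  /** Seek-floor lookup: find the block with the greatest key <= term, then scan that block. */
  static boolean contains(TreeMap<String, List<String>> blocks, String term) {
    Map.Entry<String, List<String>> block = blocks.floorEntry(term);
    return block != null && block.getValue().contains(term);
  }

  public static void main(String[] args) {
    List<String> terms = Arrays.asList("apple", "apply", "apt", "banana", "band",
        "bandit", "cat", "cater", "dog", "door", "zebra");
    // A small target size and a wide delta keep the example readable;
    // the issue text mentions a 10% delta on a larger target size.
    TreeMap<String, List<String>> blocks = split(terms, 4, 0.25);
    blocks.forEach((k, v) -> System.out.println("\"" + k + "\" -> " + v));
    System.out.println(contains(blocks, "band"));
    System.out.println(contains(blocks, "bandana"));
  }
}
```

Because each key sorts after the last term of the previous block and is a prefix of the first term of its own block, the floor of any indexed term among the keys lands on the block that holds it. Short keys are what keep the index structure small, which is the motivation for preferring boundaries with a short distinguishing prefix.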


