Posted to dev@lucene.apache.org by "Han Jiang (JIRA)" <ji...@apache.org> on 2013/08/02 17:57:51 UTC

[jira] [Updated] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary

     [ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Han Jiang updated LUCENE-3069:
------------------------------

    Attachment: LUCENE-3069.patch

Uploaded patch.

It is optimized for WildcardQuery, and I ran a quick test on 1M wiki data:
{noformat}
                    Task    QPS base      StdDev    QPS comp      StdDev                Pct diff
                PKLookup      314.63      (1.5%)      314.64      (1.2%)    0.0% (  -2% -    2%)
                  Fuzzy1       91.32      (3.7%)       92.50      (1.6%)    1.3% (  -3% -    6%)
                 Respell      104.54      (3.9%)      106.97      (1.6%)    2.3% (  -2% -    8%)
                  Fuzzy2       38.22      (4.1%)       39.16      (1.2%)    2.5% (  -2% -    8%)
                Wildcard      109.56      (3.1%)      273.42      (5.0%)  149.6% ( 137% -  162%)
{noformat}

and TempFSTOrd vs. Lucene41, on 1M data:
{noformat}
                    Task    QPS base      StdDev    QPS comp      StdDev                Pct diff
                 Respell      134.85      (3.7%)      106.30      (0.6%)  -21.2% ( -24% -  -17%)
                  Fuzzy2       47.78      (4.1%)       39.03      (0.9%)  -18.3% ( -22% -  -13%)
                  Fuzzy1      112.02      (3.0%)       91.95      (0.6%)  -17.9% ( -20% -  -14%)
                Wildcard      326.68      (3.5%)      273.41      (1.9%)  -16.3% ( -20% -  -11%)
                PKLookup      194.61      (1.8%)      314.24      (0.7%)   61.5% (  57% -   65%)
{noformat}

But I'm not happy with it :( The hack here is to spend another big block on storing the last byte of each term. So for a wildcard query like ab*c, we have external information that tells us the ord of the nearest term matching *c. Knowing the ord, we can use an approach similar to getByOutput to jump to the next target term.

Previously, we had to walk the FST all the way to the stop node to find out whether the last byte is 'c', so this optimization accounts for a big chunk of the speedup.
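
As a rough illustration of the last-byte idea (this is not the code from the patch), the side block can be thought of as one byte per term, indexed by ord. Everything in the sketch is hypothetical: the class name, the method names, the use of a plain sorted String array in place of the FST, and the linear scan used to find the next candidate ord.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: a per-ord "last byte" side array for suffix checks.
// A sorted String[] stands in for the term dictionary; the real patch works on an FST.
class LastByteSkipSketch {

    // lastBytes[ord] holds the final UTF-8 byte of the term with that ord.
    // Assumes every term is non-empty.
    static byte[] buildLastBytes(String[] sortedTerms) {
        byte[] lastBytes = new byte[sortedTerms.length];
        for (int ord = 0; ord < sortedTerms.length; ord++) {
            byte[] utf8 = sortedTerms[ord].getBytes(StandardCharsets.UTF_8);
            lastBytes[ord] = utf8[utf8.length - 1];
        }
        return lastBytes;
    }

    // Returns the smallest ord >= fromOrd whose term ends with `suffix`,
    // or -1 if there is none. No FST traversal is needed for this check.
    static int nextOrdEndingWith(byte[] lastBytes, int fromOrd, byte suffix) {
        for (int ord = Math.max(fromOrd, 0); ord < lastBytes.length; ord++) {
            if (lastBytes[ord] == suffix) {
                return ord;
            }
        }
        return -1;
    }
}
```

With the sorted terms ab, abc, abd, abxc, ac, b, a query like ab*c would ask for the next ord at or after the current one whose last byte is 'c', then seek to that ord (the step that resembles getByOutput) instead of walking every term to its end. The candidate still has to be checked against the rest of the pattern.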

However, I don't really like this patch :( We have to increase the index size (521M => 530M), and the code becomes messy, since we always have to look ahead to the next arc on the current stack.
                
> Lucene should have an entirely memory resident term dictionary
> --------------------------------------------------------------
>
>                 Key: LUCENE-3069
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3069
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/index, core/search
>    Affects Versions: 4.0-ALPHA
>            Reporter: Simon Willnauer
>            Assignee: Han Jiang
>              Labels: gsoc2013
>             Fix For: 5.0, 4.5
>
>         Attachments: df-ttf-estimate.txt, example.png, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch
>
>
> FST based TermDictionary has been a great improvement yet it still uses a delta codec file for scanning to terms. Some environments have enough memory available to keep the entire FST based term dict in memory. We should add a TermDictionary implementation that encodes all needed information for each term into the FST (custom fst.Output) and builds a FST from the entire term not just the delta.
