Posted to dev@lucene.apache.org by "Michael McCandless (JIRA)" <ji...@apache.org> on 2011/01/03 12:58:46 UTC

[jira] Commented: (LUCENE-2843) Add variable-gap terms index impl.

    [ https://issues.apache.org/jira/browse/LUCENE-2843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12976709#action_12976709 ] 

Michael McCandless commented on LUCENE-2843:
--------------------------------------------

As a first test, I made a policy that's identical to the fixed-gap
terms index, i.e., it picks every 32nd term as the index term.
So this is really a test of packed int/byte arrays vs. an FST.
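
For illustration, here is a minimal sketch of what an "every Nth term"
selection policy could look like.  It is not the code from the patch:
the TermSelector interface, the EveryNthSelector class and the use of
String terms are all hypothetical stand-ins, and treating the first
term as an index term is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

final class EveryNthTermSketch {

    /** Hypothetical stand-in for a term selection policy; NOT the patch's API. */
    interface TermSelector {
        boolean isIndexTerm(String term);
    }

    /** Picks the first term seen and every Nth term after it (assumed convention). */
    static final class EveryNthSelector implements TermSelector {
        private final int interval;
        private int count; // number of terms seen so far

        EveryNthSelector(int interval) {
            if (interval < 1) {
                throw new IllegalArgumentException("interval must be >= 1");
            }
            this.interval = interval;
        }

        @Override
        public boolean isIndexTerm(String term) {
            boolean selected = (count % interval) == 0;
            count++;
            return selected;
        }
    }

    /** Feeds terms, in sorted order, through the selector and collects the chosen ones. */
    static List<String> selectIndexTerms(List<String> sortedTerms, TermSelector selector) {
        List<String> indexTerms = new ArrayList<>();
        for (String term : sortedTerms) {
            if (selector.isIndexTerm(term)) {
                indexTerms.add(term);
            }
        }
        return indexTerms;
    }

    public static void main(String[] args) {
        List<String> terms = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            terms.add(String.format("t%03d", i));
        }
        // With interval 32, the terms at positions 0, 32, 64 and 96 are selected.
        System.out.println(selectIndexTerms(terms, new EveryNthSelector(32)));
    }
}
```

A variable-gap policy would differ only in how isIndexTerm decides,
e.g. by looking at the term itself rather than a running count.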

On the 10M Wikipedia test index, the resulting terms index files (=
RAM used by SegmentReader) are ~38% smaller (~52% once optimized -- FST
"scales up" well).

Here's the query perf vs trunk:

||Query||QPS base||QPS vargap||Pct diff||
|spanFirst(unit, 5)|17.13|16.75|{color:red}-2.2%{color}|
|"unit state"~3|5.31|5.20|{color:red}-2.1%{color}|
|spanNear([unit, state], 10, true)|4.59|4.52|{color:red}-1.4%{color}|
|"unit state"|7.86|7.77|{color:red}-1.1%{color}|
|+nebraska +state|204.74|202.85|{color:red}-0.9%{color}|
|+unit +state|11.37|11.30|{color:red}-0.6%{color}|
|doctimesecnum:[10000 TO 60000]|9.74|9.76|{color:green}0.2%{color}|
|unit~1.0|21.70|21.82|{color:green}0.6%{color}|
|unit*|26.18|26.55|{color:green}1.4%{color}|
|state|29.29|29.75|{color:green}1.6%{color}|
|uni*|15.06|15.32|{color:green}1.7%{color}|
|unit state|10.73|10.93|{color:green}1.9%{color}|
|unit~2.0|21.05|21.45|{color:green}1.9%{color}|
|un*d|77.10|79.65|{color:green}3.3%{color}|
|u*d|26.41|28.81|{color:green}9.1%{color}|
|united~1.0|102.27|116.88|{color:green}14.3%{color}|
|united~2.0|25.47|31.18|{color:green}22.4%{color}|

It's great that for the seek-intensive fuzzy queries, the FST-based
seeking is substantially faster.  For the other queries, the difference
in term seek time is in the noise.

I think we should make this (VariableGapTermsIndex) terms index impl
the default (for Standard codec).


> Add variable-gap terms index impl.
> ----------------------------------
>
>                 Key: LUCENE-2843
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2843
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 4.0
>
>         Attachments: LUCENE-2843.patch
>
>
> PrefixCodedTermsReader/Writer (used by all "real" core codecs) already
> supports pluggable terms index impls.
> The only impl we have now is FixedGapTermsIndexReader/Writer, which
> picks every Nth (default 32) term and holds it in efficient packed
> int/byte arrays in RAM.  This is already an enormous improvement (RAM
> reduction, init time) over 3.x.
> This patch adds another impl, VariableGapTermsIndexReader/Writer,
> which lets you specify an arbitrary IndexTermSelector to pick which
> terms are indexed, and then uses an FST to hold the indexed terms.
> This is typically even more memory efficient than packed int/byte
> arrays, though it does not support ord(), so it's not quite a fair
> comparison.
> I had to relax the terms index plugin api for
> PrefixCodedTermsReader/Writer to not assume that the terms index impl
> supports ord.
> I also did some cleanup of the FST/FSTEnum APIs and impls, and broke
> out separate seekCeil and seekFloor in FSTEnum.  E.g., we need seekFloor
> when the FST is used as a terms index, but seekCeil when it's holding
> all terms in the index (which is how SimpleText uses FSTs).
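
The seekFloor/seekCeil distinction in the description above can be
pictured with a sorted map from the JDK.  This is only an analogy, not
the FSTEnum API: the class and method names and the file offsets below
are hypothetical, and String ordering stands in for the byte ordering
of real terms (the two agree for the ASCII terms used here).

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

final class FloorCeilSketch {

    /**
     * Terms-index style lookup (floor): find the greatest indexed term that is
     * <= target and return where scanning of the full terms should start.
     * Returns null if target sorts before every indexed term.
     */
    static Long scanStartFor(TreeMap<String, Long> termsIndex, String target) {
        Map.Entry<String, Long> floor = termsIndex.floorEntry(target);
        return floor == null ? null : floor.getValue();
    }

    /**
     * All-terms style lookup (ceiling): find the smallest term that is
     * >= target.  Returns null if target sorts after every term.
     */
    static String firstTermAtOrAfter(TreeSet<String> allTerms, String target) {
        return allTerms.ceiling(target);
    }

    public static void main(String[] args) {
        // Hypothetical index terms mapped to hypothetical file offsets.
        TreeMap<String, Long> termsIndex = new TreeMap<>();
        termsIndex.put("apple", 0L);
        termsIndex.put("mango", 4096L);
        termsIndex.put("tiger", 8192L);

        // "orange" is not an index term; the floor is "mango", so scanning
        // would start at that entry's offset.
        System.out.println(scanStartFor(termsIndex, "orange"));

        TreeSet<String> allTerms = new TreeSet<>(termsIndex.keySet());
        // With every term present, the ceiling is the next term at or after
        // the target, here "tiger".
        System.out.println(firstTermAtOrAfter(allTerms, "orange"));
    }
}
```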
