Posted to dev@lucene.apache.org by "Michael McCandless (JIRA)" <ji...@apache.org> on 2011/01/03 12:58:46 UTC
[jira] Commented: (LUCENE-2843) Add variable-gap terms index impl.
[ https://issues.apache.org/jira/browse/LUCENE-2843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12976709#action_12976709 ]
Michael McCandless commented on LUCENE-2843:
--------------------------------------------
As a first test, I just made a policy that's identical to the fixed
gap terms index, i.e., it just picks every 32nd term as the index term.
So this is really a test of the packed int/byte arrays vs. the FST.
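To make the "every 32nd term" policy concrete, here is a minimal, self-contained sketch. The TermSelector interface and the EveryNthTermSelector class are hypothetical stand-ins invented for illustration; they are not the actual IndexTermSelector API from the patch, and the choice to always select the first term is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

public class EveryNthTermDemo {
    /** Hypothetical stand-in for a term-selection policy; NOT the patch's actual API. */
    interface TermSelector {
        boolean isIndexTerm(String term);
    }

    /** Selects the first term seen and every interval-th term after it (assumed behavior). */
    static final class EveryNthTermSelector implements TermSelector {
        private final int interval;
        private int count;

        EveryNthTermSelector(int interval) {
            if (interval < 1) {
                throw new IllegalArgumentException("interval must be >= 1");
            }
            this.interval = interval;
        }

        @Override
        public boolean isIndexTerm(String term) {
            boolean selected = (count % interval) == 0;
            count++;
            return selected;
        }
    }

    /** Walks the terms in sorted order and keeps only those the selector picks. */
    static List<String> selectIndexTerms(List<String> sortedTerms, TermSelector selector) {
        List<String> indexed = new ArrayList<>();
        for (String term : sortedTerms) {
            if (selector.isIndexTerm(term)) {
                indexed.add(term);
            }
        }
        return indexed;
    }

    public static void main(String[] args) {
        List<String> terms = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            terms.add(String.format("term%03d", i));
        }
        // prints [term000, term032, term064, term096]
        System.out.println(selectIndexTerms(terms, new EveryNthTermSelector(32)));
    }
}
```

A variable-gap policy would differ only in the body of isIndexTerm, for example by looking at the term itself rather than a running count.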
On the 10M Wikipedia test index, the resulting terms index files (=
RAM used by SegmentReader) are ~38% smaller (~52% once optimized -- FST
"scales up" well).
Here's the query perf vs trunk:
||Query||QPS base||QPS vargap||Pct diff||
|spanFirst(unit, 5)|17.13|16.75|{color:red}-2.2%{color}|
|"unit state"~3|5.31|5.20|{color:red}-2.1%{color}|
|spanNear([unit, state], 10, true)|4.59|4.52|{color:red}-1.4%{color}|
|"unit state"|7.86|7.77|{color:red}-1.1%{color}|
|+nebraska +state|204.74|202.85|{color:red}-0.9%{color}|
|+unit +state|11.37|11.30|{color:red}-0.6%{color}|
|doctimesecnum:[10000 TO 60000]|9.74|9.76|{color:green}0.2%{color}|
|unit~1.0|21.70|21.82|{color:green}0.6%{color}|
|unit*|26.18|26.55|{color:green}1.4%{color}|
|state|29.29|29.75|{color:green}1.6%{color}|
|uni*|15.06|15.32|{color:green}1.7%{color}|
|unit state|10.73|10.93|{color:green}1.9%{color}|
|unit~2.0|21.05|21.45|{color:green}1.9%{color}|
|un*d|77.10|79.65|{color:green}3.3%{color}|
|u*d|26.41|28.81|{color:green}9.1%{color}|
|united~1.0|102.27|116.88|{color:green}14.3%{color}|
|united~2.0|25.47|31.18|{color:green}22.4%{color}|
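The "Pct diff" column appears to be the relative change in QPS from base to vargap. The sketch below assumes the formula (vargap - base) / base * 100, which reproduces the table's values to one decimal place for the rows checked; the class and method names are illustrative only.

```java
import java.util.Locale;

public class PctDiffDemo {
    /** Relative change from base to candidate, in percent (assumed formula). */
    static double pctDiff(double baseQps, double candidateQps) {
        return (candidateQps - baseQps) / baseQps * 100.0;
    }

    public static void main(String[] args) {
        // Values copied from the table above.
        System.out.println(String.format(Locale.ROOT, "%.1f", pctDiff(17.13, 16.75)));   // spanFirst(unit, 5)
        System.out.println(String.format(Locale.ROOT, "%.1f", pctDiff(102.27, 116.88))); // united~1.0
        System.out.println(String.format(Locale.ROOT, "%.1f", pctDiff(25.47, 31.18)));   // united~2.0
    }
}
```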
It's great that for the seek-intensive fuzzy queries (united~1.0 and
united~2.0), the FST-based seeking is substantially faster (14.3% and
22.4%). For other queries the term seek time is in the noise.
I think we should make this (VariableGapTermsIndex) terms index impl
the default (for Standard codec).
> Add variable-gap terms index impl.
> ----------------------------------
>
> Key: LUCENE-2843
> URL: https://issues.apache.org/jira/browse/LUCENE-2843
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Index
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Fix For: 4.0
>
> Attachments: LUCENE-2843.patch
>
>
> PrefixCodedTermsReader/Writer (used by all "real" core codecs) already
> supports pluggable terms index impls.
> The only impl we have now is FixedGapTermsIndexReader/Writer, which
> picks every Nth (default 32) term and holds it in efficient packed
> int/byte arrays in RAM. This is already an enormous improvement (RAM
> reduction, init time) over 3.x.
> This patch adds another impl, VariableGapTermsIndexReader/Writer,
> which lets you specify an arbitrary IndexTermSelector to pick which
> terms are indexed, and then uses an FST to hold the indexed terms.
> This is typically even more memory efficient than packed int/byte
> arrays, though it does not support ord(), so it's not quite a fair
> comparison.
> I had to relax the terms index plugin api for
> PrefixCodedTermsReader/Writer to not assume that the terms index impl
> supports ord.
> I also did some cleanup of the FST/FSTEnum APIs and impls, and broke
> out separate seekCeil and seekFloor in FSTEnum. E.g., we need seekFloor
> when the FST is used as a terms index but seekCeil when it's holding
> all terms in the index (i.e., the way SimpleText uses FSTs).
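The seekFloor/seekCeil distinction in the description above can be illustrated without an FST at all. The sketch below uses java.util.TreeMap as a stand-in for the sorted structure; it is not the FSTEnum API, and the term strings and file-pointer values are made up. With a sparse terms index, a lookup wants the greatest indexed term at or before the target (floor), since scanning starts from that block. When every term is held, a lookup wants the target itself or the next term after it (ceiling).

```java
import java.util.Map;
import java.util.TreeMap;

public class FloorVsCeilDemo {
    /** Sparse index: a few indexed terms mapped to made-up block start offsets. */
    static TreeMap<String, Long> sampleIndex() {
        TreeMap<String, Long> index = new TreeMap<>();
        index.put("apple", 0L);
        index.put("mango", 4096L);
        index.put("tiger", 8192L);
        return index;
    }

    /** Floor lookup: offset of the block that could contain target, or null if none. */
    static Long blockStartFor(TreeMap<String, Long> index, String target) {
        Map.Entry<String, Long> entry = index.floorEntry(target);
        return entry == null ? null : entry.getValue();
    }

    /** Ceiling lookup: the target if present, else the next larger term, or null. */
    static String seekCeil(TreeMap<String, Long> allTerms, String target) {
        return allTerms.ceilingKey(target);
    }

    public static void main(String[] args) {
        TreeMap<String, Long> index = sampleIndex();
        // "orange" sorts after "mango" and before "tiger".
        System.out.println(blockStartFor(index, "orange")); // prints 4096
        System.out.println(seekCeil(index, "orange"));      // prints tiger
        System.out.println(blockStartFor(index, "aardvark")); // prints null (before first indexed term)
    }
}
```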