Posted to dev@lucene.apache.org by "Michael McCandless (JIRA)" <ji...@apache.org> on 2014/08/08 11:35:12 UTC
[jira] [Updated] (LUCENE-5879) Add auto-prefix terms to block tree terms dict
[ https://issues.apache.org/jira/browse/LUCENE-5879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael McCandless updated LUCENE-5879:
---------------------------------------
Attachment: LUCENE-5879.patch
Initial work-in-progress patch: tests do NOT consistently pass, there
are still bugs/corner cases (e.g. when an auto-prefix term is the last
term in a block...).
This change very much requires LUCENE-5268 (pull API for postings
format), which I'd like to backport to 4.x along with this: the patch makes
an initial pass through all terms to identify "good" prefixes, using the
same algorithm block tree uses to assign terms to blocks, just with
different block sizes. I haven't picked defaults yet, but e.g. you
could state that an auto prefix term should expand to 100-200 terms
and then the first pass picks prefix terms to handle that.
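
To make the first pass concrete, here is a minimal sketch of how prefixes might be picked from a term set given a minimum and maximum expansion size. It is not the block tree algorithm from the patch: the class and method names (PrefixPickerSketch, pickPrefixes) are hypothetical, terms are plain Java strings instead of byte sequences, and the top-down descent is an assumed simplification.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Hypothetical illustration only; not code from the LUCENE-5879 patch.
final class PrefixPickerSketch {

    /** Returns prefixes that each match between minTerms and maxTerms of the given distinct terms. */
    static List<String> pickPrefixes(List<String> terms, int minTerms, int maxTerms) {
        List<String> picked = new ArrayList<>();
        descend("", terms, minTerms, maxTerms, picked);
        return picked;
    }

    private static void descend(String prefix, List<String> terms,
                                int minTerms, int maxTerms, List<String> picked) {
        // Collect the terms that fall under this prefix.
        List<String> matching = new ArrayList<>();
        for (String term : terms) {
            if (term.startsWith(prefix)) {
                matching.add(term);
            }
        }
        if (matching.size() < minTerms) {
            return; // too few terms: not worth a prefix term
        }
        if (!prefix.isEmpty() && matching.size() <= maxTerms) {
            picked.add(prefix); // within bounds: accept this prefix
            return;
        }
        // Too many terms: try each one-character-longer prefix, in sorted order.
        TreeSet<Character> nextChars = new TreeSet<>();
        for (String term : matching) {
            if (term.length() > prefix.length()) {
                nextChars.add(term.charAt(prefix.length()));
            }
        }
        for (char c : nextChars) {
            descend(prefix + c, matching, minTerms, maxTerms, picked);
        }
    }
}
```

With the terms bar, baz, fooa, foob, fooc, food and bounds of 2 to 3, this sketch accepts only b: the prefix foo matches four terms (too many), while each longer prefix matches one (too few), so nothing is emitted for that region.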
The problem is inherently over-constrained: a given set of prefixes
like fooa*, foob*, fooc*, etc. may have too-few terms each, but then
their common prefix foo* would have way too many. For this case it
creates "floored" prefix terms, e.g. foo[a-e]*, foo[f-p]*, foo[q-z]*.
On the 2nd pass, when it writes the actual terms, it inserts these
auto-prefix terms at the right places.
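
As an illustration of the floored case, the sketch below groups the terms under one prefix by their next character and greedily packs consecutive characters into ranges of bounded size. The names (FlooredPrefixSketch, floorPrefix, maxTermsPerRange) are hypothetical, terms are treated as Java strings, and the greedy packing is an assumption; the patch reuses block tree's own logic, which may split ranges differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; not code from the LUCENE-5879 patch.
final class FlooredPrefixSketch {

    /** Splits the terms under prefix into consecutive lead-character ranges. */
    static List<String> floorPrefix(String prefix, List<String> terms, int maxTermsPerRange) {
        // Count terms per character that directly follows the prefix.
        // A term equal to the prefix itself has no lead character and is skipped.
        TreeMap<Character, Integer> countByLead = new TreeMap<>();
        for (String term : terms) {
            if (term.startsWith(prefix) && term.length() > prefix.length()) {
                countByLead.merge(term.charAt(prefix.length()), 1, Integer::sum);
            }
        }

        List<String> ranges = new ArrayList<>();
        char start = 0;
        char end = 0;
        int inRange = 0;
        for (Map.Entry<Character, Integer> e : countByLead.entrySet()) {
            // Close the current range if adding this character would overflow it.
            if (inRange > 0 && inRange + e.getValue() > maxTermsPerRange) {
                ranges.add(format(prefix, start, end));
                inRange = 0;
            }
            if (inRange == 0) {
                start = e.getKey();
            }
            end = e.getKey();
            inRange += e.getValue();
        }
        if (inRange > 0) {
            ranges.add(format(prefix, start, end));
        }
        return ranges;
    }

    private static String format(String prefix, char start, char end) {
        return start == end
            ? prefix + start + "*"
            : prefix + "[" + start + "-" + end + "]*";
    }
}
```

A single lead character with more terms than maxTermsPerRange still gets its own range here; a fuller version would recurse into it.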
Currently it only works for DOCS_ONLY fields, and it uses a
FixedBitSet(maxDoc) when writing each prefix term.
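
For a DOCS_ONLY field, the postings of a prefix term are just the union of the doc IDs of every term it covers. A minimal sketch of that union follows, using java.util.BitSet as a stand-in for Lucene's FixedBitSet; the class and method names are hypothetical and the postings are assumed to be plain int arrays.

```java
import java.util.BitSet;
import java.util.List;

// Hypothetical illustration only; java.util.BitSet stands in for FixedBitSet.
final class PrefixPostingsSketch {

    /** ORs the doc IDs of every covered term into one bit set sized for maxDoc. */
    static BitSet unionDocs(List<int[]> postingsPerTerm, int maxDoc) {
        BitSet docs = new BitSet(maxDoc);
        for (int[] postings : postingsPerTerm) {
            for (int doc : postings) {
                docs.set(doc);
            }
        }
        return docs;
    }

    /** Reads the set bits back as ascending, de-duplicated doc IDs. */
    static int[] toSortedDocIds(BitSet docs) {
        return docs.stream().toArray();
    }
}
```

The bit set costs memory proportional to maxDoc for each prefix term written, regardless of how few documents match, which is the trade-off of this simple approach.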
These auto-prefix terms are fully hidden from all the normal
Terms/Enum APIs, statistics, etc. They are only used in
Terms.intersect, if you pass a new flag allowing them to be used.
I haven't done anything about the document / searching side of things:
this is just a low level change at this point, for the terms dict.
Maybe we need a new FieldType boolean "computeAutoPrefixTerms" or some
such; currently it's just exposed as additional params to the block
tree terms dict writer.
I think this would mean NumericRangeQuery/Filter can just rewrite to
ordinary TermRangeQuery/Filter, and the numeric fields just become
sugar for encoding their numeric values as sortable binary terms.
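
To show what "sortable binary terms" could look like, here is a minimal sketch that encodes a long as 8 big-endian bytes with the sign bit flipped, so that unsigned byte-wise term order matches numeric order. This is one common encoding and an assumption here, not necessarily the one Lucene would use; the class and method names are hypothetical.

```java
// Hypothetical illustration only; not Lucene's NumericUtils.
final class SortableLongTerms {

    /** Encodes value so that unsigned byte order equals numeric order. */
    static byte[] encode(long value) {
        long bits = value ^ Long.MIN_VALUE; // flip the sign bit: negatives sort first
        byte[] term = new byte[8];
        for (int i = 7; i >= 0; i--) {      // big-endian: most significant byte first
            term[i] = (byte) bits;
            bits >>>= 8;
        }
        return term;
    }

    /** Inverse of encode. */
    static long decode(byte[] term) {
        long bits = 0;
        for (int i = 0; i < 8; i++) {
            bits = (bits << 8) | (term[i] & 0xFFL);
        }
        return bits ^ Long.MIN_VALUE;
    }

    /** Compares two terms as unsigned bytes, the way a terms dict orders them. */
    static int compareTerms(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }
}
```

With terms encoded this way, a numeric range maps to a contiguous range of terms, which is what lets a numeric range query become an ordinary term range query.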
> Add auto-prefix terms to block tree terms dict
> ----------------------------------------------
>
> Key: LUCENE-5879
> URL: https://issues.apache.org/jira/browse/LUCENE-5879
> Project: Lucene - Core
> Issue Type: New Feature
> Components: core/codecs
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Fix For: 5.0, 4.10
>
> Attachments: LUCENE-5879.patch
>
>
> This cool idea to generalize numeric/trie fields came from Adrien:
> Today, when we index a numeric field (LongField, etc.) we pre-compute
> (via NumericTokenStream) outside of indexer/codec which prefix terms
> should be indexed.
> But this can be inefficient: you set a static precisionStep, and
> always add those prefix terms regardless of how the terms in the field
> are actually distributed. Yet typically in real world applications
> the terms have a non-random distribution.
> So, it should be better if instead the terms dict decides where it
> makes sense to insert prefix terms, based on how dense the terms are
> in each region of term space.
> This way we can speed up query time for both term (e.g. infix
> suggester) and numeric ranges, and it should let us use less index
> space and get faster range queries.
>
> This would also mean that min/maxTerm for a numeric field would now be
> correct, vs today where the externally computed prefix terms are
> placed after the full precision terms, causing hairy code like
> NumericUtils.getMaxInt/Long. So optimizations like LUCENE-5860 become
> feasible.
> The terms dict can also do tricks not possible if you must live on top
> of its APIs, e.g. to handle the adversarial/over-constrained case when a
> given prefix has too many terms following it but finer prefixes
> have too few (what block tree calls "floor term blocks").