Posted to dev@lucene.apache.org by "Michael McCandless (JIRA)" <ji...@apache.org> on 2014/08/08 11:35:12 UTC

[jira] [Updated] (LUCENE-5879) Add auto-prefix terms to block tree terms dict

     [ https://issues.apache.org/jira/browse/LUCENE-5879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-5879:
---------------------------------------

    Attachment: LUCENE-5879.patch

Initial work-in-progress patch: tests do NOT consistently pass, and
there are still bugs/corner cases (e.g. when an auto-prefix term is
the last term in a block...).

This change very much requires LUCENE-5268 (pull API for postings
format), which I'd like to backport to 4.x along with this: the terms
dict writer makes an initial pass through all terms to identify "good"
prefixes, using the same algorithm block tree uses to assign terms to
blocks, just with different block sizes.  I haven't picked defaults
yet, but e.g. you could state that an auto-prefix term should expand
to 100-200 terms, and then the first pass picks prefix terms to
satisfy that.
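
To make the first pass concrete, here is a minimal sketch of choosing
prefixes whose term count falls within a target range.  It is NOT the
patch's code and not block tree's actual (stack-based) algorithm: it
is a simplified top-down recursion over sorted strings, and the names
PrefixPicker, pick, minTerms and maxTerms are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not taken from the LUCENE-5879 patch.
final class PrefixPicker {

  /** Returns prefixes covering between minTerms and maxTerms terms.
   *  The input list must be sorted and free of duplicates. */
  static List<String> pick(List<String> terms, int minTerms, int maxTerms) {
    List<String> out = new ArrayList<>();
    pick(terms, 0, terms.size(), 0, minTerms, maxTerms, out);
    return out;
  }

  // All terms in [from, to) share their first `depth` characters.
  private static void pick(List<String> terms, int from, int to, int depth,
                           int minTerms, int maxTerms, List<String> out) {
    int i = from;
    // A term equal to the shared prefix sorts first and has no next char.
    while (i < to && terms.get(i).length() == depth) {
      i++;
    }
    while (i < to) {
      char c = terms.get(i).charAt(depth);
      int j = i + 1;
      while (j < to && terms.get(j).charAt(depth) == c) {
        j++;
      }
      int count = j - i;
      if (count > maxTerms) {
        // Too many terms under this prefix: look for longer prefixes.
        pick(terms, i, j, depth + 1, minTerms, maxTerms, out);
      } else if (count >= minTerms) {
        out.add(terms.get(i).substring(0, depth + 1));
      }
      // count < minTerms: too few terms, so no prefix term is emitted.
      i = j;
    }
  }
}
```

With terms bar, baz, fooa, foob, fooc, food, qux and a target of 2-3
terms, this sketch selects only "b": the prefix "f" covers four terms
(too many), while each of fooa, foob, fooc, food covers one (too few).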

The problem is inherently over-constrained: a given set of prefixes
like fooa*, foob*, fooc*, etc. may have too few terms each, but then
their common prefix foo* would have way too many.  For this case it
creates "floored" prefix terms, e.g. foo[a-e]*, foo[f-p]*, foo[q-z]*.
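
One way to picture the floored prefixes is a greedy grouping of the
characters that follow the common prefix.  The sketch below assumes
such a greedy rule purely for illustration (the patch may split
differently); FlooredPrefixes, split, leadChars and counts are
hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not taken from the LUCENE-5879 patch.
final class FlooredPrefixes {

  /**
   * leadChars holds the sorted, distinct characters that follow prefix;
   * counts[i] is the number of terms starting with prefix + leadChars[i].
   * Consecutive characters are grouped until a group reaches minTerms.
   */
  static List<String> split(String prefix, char[] leadChars, int[] counts,
                            int minTerms) {
    List<String> out = new ArrayList<>();
    int start = 0;
    int sum = 0;
    for (int i = 0; i < leadChars.length; i++) {
      sum += counts[i];
      boolean last = (i == leadChars.length - 1);
      // The final group is emitted even if it is still below minTerms.
      if (sum >= minTerms || last) {
        out.add(prefix + "[" + leadChars[start] + "-" + leadChars[i] + "]*");
        start = i + 1;
        sum = 0;
      }
    }
    return out;
  }
}
```

For prefix "foo" with one term under each of a through f and minTerms
of 2, this yields foo[a-b]*, foo[c-d]*, foo[e-f]*.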

On the 2nd pass, when it writes the actual terms, it inserts these
auto-prefix terms at the right places.

Currently it only works for DOCS_ONLY fields, and it uses a
FixedBitSet(maxDoc) when writing each prefix term.
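
For a DOCS_ONLY field the postings of a prefix term are just the union
of the doc IDs of the terms it covers, which is why a bit set sized to
maxDoc is enough.  A minimal sketch, using java.util.BitSet as a
stand-in for Lucene's FixedBitSet; PrefixPostings and union are
hypothetical names, and plain int[] arrays stand in for real postings:

```java
import java.util.BitSet;
import java.util.List;

// Hypothetical illustration only; not taken from the LUCENE-5879 patch.
final class PrefixPostings {

  /** ORs the doc IDs of every covered term into one bit set. */
  static BitSet union(int maxDoc, List<int[]> postingsOfCoveredTerms) {
    BitSet bits = new BitSet(maxDoc);
    for (int[] docs : postingsOfCoveredTerms) {
      for (int doc : docs) {
        bits.set(doc); // duplicates across terms collapse to a single bit
      }
    }
    return bits;
  }
}
```

Iterating the set bits in order then gives the sorted doc IDs to write
as the prefix term's postings.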

These auto-prefix terms are fully hidden from all the normal
Terms/Enum APIs, statistics, etc.  They are only used in
Terms.intersect, if you pass a new flag allowing them to be used.

I haven't done anything about the document / searching side of things:
this is just a low level change at this point, for the terms dict.
Maybe we need a new FieldType boolean "computeAutoPrefixTerms" or some
such; currently it's just exposed as additional params to the block
tree terms dict writer.

I think this would mean NumericRangeQuery/Filter can just rewrite to
ordinary TermRangeQuery/Filter, and the numeric fields just become
sugar for encoding their numeric values as sortable binary terms.
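
"Sortable binary terms" here means an encoding whose unsigned byte
order matches numeric order, so a plain term range covers the numeric
range.  A minimal sketch for long values, assuming the common
sign-bit-flip plus big-endian layout; this is an illustration, not
necessarily the encoding Lucene would use, and SortableLong and its
methods are hypothetical names:

```java
// Hypothetical illustration only; not taken from the LUCENE-5879 patch.
final class SortableLong {

  /** Encodes v as 8 big-endian bytes with the sign bit flipped. */
  static byte[] encode(long v) {
    long flipped = v ^ Long.MIN_VALUE; // negatives now sort before positives
    byte[] b = new byte[8];
    for (int i = 7; i >= 0; i--) {
      b[i] = (byte) flipped;
      flipped >>>= 8;
    }
    return b;
  }

  /** Lexicographic comparison treating each byte as unsigned. */
  static int compareUnsigned(byte[] a, byte[] b) {
    int n = Math.min(a.length, b.length);
    for (int i = 0; i < n; i++) {
      int d = (a[i] & 0xFF) - (b[i] & 0xFF);
      if (d != 0) {
        return d;
      }
    }
    return a.length - b.length;
  }
}
```

With this layout, encode(-1) sorts before encode(0), which sorts
before encode(1), so a range over the encoded terms matches the
corresponding numeric range.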


> Add auto-prefix terms to block tree terms dict
> ----------------------------------------------
>
>                 Key: LUCENE-5879
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5879
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: core/codecs
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 5.0, 4.10
>
>         Attachments: LUCENE-5879.patch
>
>
> This cool idea to generalize numeric/trie fields came from Adrien:
> Today, when we index a numeric field (LongField, etc.) we pre-compute
> (via NumericTokenStream) outside of indexer/codec which prefix terms
> should be indexed.
> But this can be inefficient: you set a static precisionStep, and
> always add those prefix terms regardless of how the terms in the field
> are actually distributed.  Yet typically in real world applications
> the terms have a non-random distribution.
> So, it should be better if instead the terms dict decides where it
> makes sense to insert prefix terms, based on how dense the terms are
> in each region of term space.
> This way we can speed up query time for both term (e.g. infix
> suggester) and numeric ranges, and it should let us use less index
> space and get faster range queries.
>  
> This would also mean that min/maxTerm for a numeric field would now be
> correct, vs today where the externally computed prefix terms are
> placed after the full precision terms, causing hairy code like
> NumericUtils.getMaxInt/Long.  So optimizations like LUCENE-5860
> become feasible.
> The terms dict can also do tricks not possible if you must live on top
> of its APIs, e.g. to handle the adversarial/over-constrained case when a
> given prefix has too many terms following it but finer prefixes
> have too few (what block tree calls "floor term blocks").


