Posted to dev@lucene.apache.org by "Robert Muir (JIRA)" <ji...@apache.org> on 2015/03/18 17:12:39 UTC
[jira] [Commented] (LUCENE-5879) Add auto-prefix terms to block tree terms dict

    [ https://issues.apache.org/jira/browse/LUCENE-5879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14367385#comment-14367385 ] 

Robert Muir commented on LUCENE-5879:
-------------------------------------

{quote}
There are maybe some problems with the patch: is it horrible to store
CompiledAutomaton on PrefixQuery/TermRangeQuery because of query
caching...? Should I instead recompute it for every segment in
.getTermsEnum? Or store it on the weight? Hmm or in a shared
attribute, like FuzzyQuery (what a hack)?
{quote}

Why would this be an issue? It's still index-independent; e.g., AutomatonQuery does this too.

{quote}
It allocates one FixedBitSet(maxDoc) at write time, per segment, to
hold all docs matching each auto-prefix term ... maybe that's too
costly? I could switch to more sparse impls (roaring, sparse,
BitDocIdSet.Builder?) but I suspect typically we will require fairly
dense bitsets anyway for the short prefixes. We end up OR'ing many
terms together at write time...
{quote}

If you use BitDocIdSet.Builder, I think it works well either way. SparseFixedBitSet also has an optimized or(DISI).
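
To make the write-time cost concrete, here is a minimal sketch of OR'ing the postings of every term under one prefix into a single bitset sized to maxDoc. It is only an illustration under stated assumptions: it uses java.util.BitSet and an in-memory TreeMap as stand-ins for Lucene's bit set classes and terms dict, and the names PrefixBitsSketch and docsForPrefix are hypothetical, not part of the patch.

```java
import java.util.BitSet;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration; java.util.BitSet stands in for a Lucene bit set.
final class PrefixBitsSketch {

    /**
     * ORs the doc IDs of every term that starts with {@code prefix}
     * into one bitset with room for {@code maxDoc} documents.
     */
    static BitSet docsForPrefix(TreeMap<String, int[]> postings, String prefix, int maxDoc) {
        BitSet bits = new BitSet(maxDoc);
        // Terms sharing a prefix are contiguous in sorted order, so start at
        // the first term >= prefix and stop at the first one that does not match.
        for (Map.Entry<String, int[]> e : postings.tailMap(prefix, true).entrySet()) {
            if (!e.getKey().startsWith(prefix)) {
                break;
            }
            for (int doc : e.getValue()) {
                bits.set(doc);
            }
        }
        return bits;
    }
}
```

For a short prefix that covers most terms, the resulting set tends to be dense, which is the case the quoted concern is about; a sparse implementation only pays off for the longer, rarer prefixes.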

{quote}
It only works for IndexOptions.DOCS fields; I think that's fine?
{quote}

Yes, I think so, since the user gets an exception if they screw this up. 

{quote}
I created a FixedBitPostingsEnum, FixedBitTermsEnum, both package
private under oal.index, so I can send the bit set to PostingsConsumer
at write time. Maybe there's a cleaner way?
{quote}

I don't understand why these need to be tied to FixedBitSet. Seems like they can use the more generic BitSet API at least (nothing fixed-specific about them).
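
The point about not depending on the fixed implementation can be sketched as follows. This is an assumption-laden illustration, not the patch's code: SimpleBits is a hypothetical minimal interface (not Lucene's actual BitSet class), BitsDocIterator is a hypothetical stand-in for the postings enum, and java.util.BitSet is used only as one possible dense backing store.

```java
import java.util.BitSet;

// Hypothetical sketch: a doc ID iterator written against a minimal bit set
// interface, so dense and sparse implementations can both be plugged in.
final class GenericBitsIteratorSketch {

    /** Hypothetical minimal bit set view; not Lucene's real API. */
    interface SimpleBits {
        int length();

        /** Returns the first set bit at or after index, or -1 if there is none. */
        int nextSetBit(int index);
    }

    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    /** Iterates set bits in increasing order using only SimpleBits. */
    static final class BitsDocIterator {
        private final SimpleBits bits;
        private int doc = -1;

        BitsDocIterator(SimpleBits bits) {
            this.bits = bits;
        }

        int nextDoc() {
            if (doc == NO_MORE_DOCS) {
                return doc; // already exhausted
            }
            int start = doc + 1;
            if (start >= bits.length()) {
                doc = NO_MORE_DOCS;
                return doc;
            }
            int next = bits.nextSetBit(start);
            doc = next < 0 ? NO_MORE_DOCS : next;
            return doc;
        }
    }

    /** Adapts java.util.BitSet as a dense SimpleBits of the given length. */
    static SimpleBits dense(final BitSet b, final int maxDoc) {
        return new SimpleBits() {
            @Override
            public int length() {
                return maxDoc;
            }

            @Override
            public int nextSetBit(int index) {
                return b.nextSetBit(index);
            }
        };
    }
}
```

Nothing in the iterator refers to how the bits are stored, so swapping in a sparse implementation would only require another adapter.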

{quote}
Maybe the changes should be moved to lucene/misc or lucene/codecs, not
core? But this would mean yet another fork of block tree...
{quote}

Alternatively, the code could stay where it is. Lucene50PF wires zeros and doesn't have any options for now. In codecs/ we could have an AutoPrefixPF that exposes it, and make it experimental or something? This way, when we feel comfortable, we can "expose" it in the default index format by adding ctors there and removing the experimental one.

> Add auto-prefix terms to block tree terms dict
> ----------------------------------------------
>
>                 Key: LUCENE-5879
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5879
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: core/codecs
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 5.0, Trunk
>
>         Attachments: LUCENE-5879.patch, LUCENE-5879.patch, LUCENE-5879.patch, LUCENE-5879.patch, LUCENE-5879.patch, LUCENE-5879.patch, LUCENE-5879.patch, LUCENE-5879.patch, LUCENE-5879.patch, LUCENE-5879.patch, LUCENE-5879.patch
>
>
> This cool idea to generalize numeric/trie fields came from Adrien:
> Today, when we index a numeric field (LongField, etc.) we pre-compute
> (via NumericTokenStream) outside of indexer/codec which prefix terms
> should be indexed.
> But this can be inefficient: you set a static precisionStep, and
> always add those prefix terms regardless of how the terms in the field
> are actually distributed.  Yet typically in real world applications
> the terms have a non-random distribution.
> So, it should be better if instead the terms dict decides where it
> makes sense to insert prefix terms, based on how dense the terms are
> in each region of term space.
> This way we can speed up query time for both term (e.g. infix
> suggester) and numeric ranges, and it should let us use less index
> space and get faster range queries.
>  
> This would also mean that min/maxTerm for a numeric field would now be
> correct, vs today where the externally computed prefix terms are
> placed after the full precision terms, causing hairy code like
> NumericUtils.getMaxInt/Long.  So optimizations like LUCENE-5860 become
> feasible.
> The terms dict can also do tricks not possible if you must live on top
> of its APIs, e.g. to handle the adversary/over-constrained case when a
> given prefix has too many terms following it but finer prefixes
> have too few (what block tree calls "floor term blocks").

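The idea in the description, that prefix terms should follow the actual term distribution rather than a static precisionStep, can be illustrated with a deliberately naive sketch. It is not the algorithm in the attached patches: the class AutoPrefixSelectionSketch, the method selectPrefixes, and the minTerms threshold are all hypothetical, and the sketch counts terms per prefix in memory instead of working inside the block tree writer.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of density-driven prefix selection.
final class AutoPrefixSelectionSketch {

    /**
     * Returns every prefix shared by at least {@code minTerms} of the given
     * distinct terms, mapped to the number of terms it covers.
     */
    static Map<String, Integer> selectPrefixes(List<String> terms, int minTerms) {
        TreeMap<String, Integer> counts = new TreeMap<>();
        for (String term : terms) {
            // A term also matches the prefix equal to itself, so include the full length.
            for (int len = 1; len <= term.length(); len++) {
                counts.merge(term.substring(0, len), 1, Integer::sum);
            }
        }
        // Keep only prefixes in dense regions of term space.
        counts.values().removeIf(c -> c < minTerms);
        return counts;
    }
}
```

With terms "aa1", "aa2", "aa3", "ab1", "zz9" and minTerms of 3, only the prefixes "a" and "aa" survive; the sparse "zz" region gets no prefix term at all, which is the space saving the description is after.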

