Posted to dev@lucene.apache.org by "David Smiley (JIRA)" <ji...@apache.org> on 2014/09/21 21:33:35 UTC

[jira] [Comment Edited] (LUCENE-5879) Add auto-prefix terms to block tree terms dict

    [ https://issues.apache.org/jira/browse/LUCENE-5879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142590#comment-14142590 ] 

David Smiley edited comment on LUCENE-5879 at 9/21/14 7:33 PM:
---------------------------------------------------------------

Some more questions:

bq. It's per-segment, so each segment will look at how its terms fall and find "good" places to insert the auto-prefix terms.

So for the whole segment, does it decide to insert auto-prefixes at specific byte lengths (e.g. 3, 5, and 7)?  Or does it vary based on specific terms?  I'm hoping it's smart enough to vary based on specific terms.  For example, if hypothetically there were lots of terms sharing the common prefix "BGA", then it might decide that "BGA" makes a good auto-prefix without inserting prefixes for all terms at length 3, since many of the others might not make good prefixes.  Make sense?
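
To make the "vary based on specific terms" idea concrete, here is a minimal sketch.  It is not the algorithm in the LUCENE-5879 patch; the class name, method name, and the selection rule (keep a prefix when at least minTerms terms share it and no longer prefix covers exactly the same terms) are assumptions made up for illustration, and it works on Strings rather than on the bytes of a real terms dict.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; this is NOT how the block tree terms dict picks prefixes.
class TermDrivenPrefixSketch {

  /**
   * Returns the prefixes shared by at least minTerms of the given distinct terms,
   * keeping only the longest prefix for each group of terms.
   */
  static List<String> choosePrefixes(List<String> terms, int minTerms) {
    // Count how many terms start with each prefix (every length, including the full term).
    Map<String, Integer> counts = new TreeMap<>();
    for (String term : terms) {
      for (int len = 1; len <= term.length(); len++) {
        counts.merge(term.substring(0, len), 1, Integer::sum);
      }
    }

    List<String> chosen = new ArrayList<>();
    for (Map.Entry<String, Integer> candidate : counts.entrySet()) {
      String prefix = candidate.getKey();
      int count = candidate.getValue();
      if (count < minTerms) {
        continue; // too few terms to be worth a prefix term
      }
      // Skip this prefix if a one-character-longer prefix covers exactly the same terms.
      boolean coveredByLonger = false;
      for (Map.Entry<String, Integer> other : counts.entrySet()) {
        String key = other.getKey();
        if (key.length() == prefix.length() + 1
            && key.startsWith(prefix)
            && other.getValue() == count) {
          coveredByLonger = true;
          break;
        }
      }
      if (!coveredByLonger) {
        chosen.add(prefix);
      }
    }
    return chosen;
  }
}
```

With the terms BGA1, BGA2, BGA3, BGA4, CAT, DOG and XYZ and minTerms = 3, this picks only "BGA": "B" and "BG" cover the same four terms as the longer "BGA", and no other length-3 prefix has enough terms under it.  A fixed-length scheme would instead treat "BGA", "CAT", "DOG" and "XYZ" alike.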

At a low level, do I take advantage of this the same way I would at a high level: create a PrefixQuery, get its Weight, then get the Scorer to iterate docIds?  Or is there a lower-level path?  Although there is some elegance in not introducing new APIs, I think it's worth exploring exposing prefix & range capabilities on the TermsEnum in some way.

Do you envision other postings formats being able to re-use the logic here?  That would be nice.

In your future tuning, I suggest you give the ability to vary how conservative vs. aggressive the prefixing is near the very beginning and very end of the term (assuming known common lengths).  In the FlexPrefixTree Varun (GSoC) worked on, the number of leaves per level is configurable at each level (i.e. prefix length)... and it's better to have little prefixing at the very top and little at the bottom too.  At the top, prefixes only help for queries that span massive portions of the possible term space (which in spatial is rare; likely in other apps too).  And at the bottom (long prefixes), just shy of the maximum length (say 7 bytes out of 8 for a double), there is marginal value because in the spatial search algorithm the bottom detail is scanned over (e.g. TermsEnum.next()) instead of seeked to, because the data is less dense and it's adjacent.  This principle may apply to numeric-range queries depending on how they are coded; I'm not sure.
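
As a rough sketch of what such per-level tuning could look like (this is neither FlexPrefixTree's configuration nor anything in the patch; the class, the method names, and every threshold below are hypothetical):

```java
// Hypothetical illustration only: a per-prefix-length threshold table.
class PerLevelPrefixPolicySketch {

  /** Marker meaning "never insert a prefix term at this length". */
  static final int NEVER = Integer.MAX_VALUE;

  // minTermsByLength[i] is the minimum number of terms that must share a
  // prefix of length i + 1 (in bytes) before a prefix term is inserted there.
  private final int[] minTermsByLength;

  PerLevelPrefixPolicySketch(int[] minTermsByLength) {
    this.minTermsByLength = minTermsByLength.clone();
  }

  /** Decides whether a prefix of the given byte length, covering the given number of terms, gets a prefix term. */
  boolean accept(int prefixLength, int termsUnderPrefix) {
    if (prefixLength < 1 || prefixLength > minTermsByLength.length) {
      return false; // no policy for this length
    }
    int threshold = minTermsByLength[prefixLength - 1];
    return threshold != NEVER && termsUnderPrefix >= threshold;
  }

  /** Made-up thresholds for 8-byte terms: nothing at the top, aggressive in the middle, conservative at the bottom. */
  static PerLevelPrefixPolicySketch forEightByteTerms() {
    return new PerLevelPrefixPolicySketch(new int[] {
        NEVER, NEVER,   // lengths 1-2: only help queries spanning huge parts of term space
        25, 25, 25,     // lengths 3-5: aggressive
        200,            // length 6: conservative
        NEVER           // length 7: bottom detail is scanned, not seeked
    });
  }
}
```

The point is only that the threshold is looked up per prefix length instead of being a single segment-wide setting; the actual numbers would have to come from benchmarking.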


> Add auto-prefix terms to block tree terms dict
> ----------------------------------------------
>
>                 Key: LUCENE-5879
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5879
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: core/codecs
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 5.0, Trunk
>
>         Attachments: LUCENE-5879.patch, LUCENE-5879.patch, LUCENE-5879.patch, LUCENE-5879.patch, LUCENE-5879.patch, LUCENE-5879.patch, LUCENE-5879.patch
>
>
> This cool idea to generalize numeric/trie fields came from Adrien:
> Today, when we index a numeric field (LongField, etc.) we pre-compute
> (via NumericTokenStream) outside of indexer/codec which prefix terms
> should be indexed.
> But this can be inefficient: you set a static precisionStep, and
> always add those prefix terms regardless of how the terms in the field
> are actually distributed.  Yet typically in real world applications
> the terms have a non-random distribution.
> So, it should be better if instead the terms dict decides where it
> makes sense to insert prefix terms, based on how dense the terms are
> in each region of term space.
> This way we can speed up query time for both term (e.g. infix
> suggester) and numeric ranges, and it should let us use less index
> space and get faster range queries.
>  
> This would also mean that min/maxTerm for a numeric field would now be
> correct, vs today where the externally computed prefix terms are
> placed after the full precision terms, causing hairy code like
> NumericUtils.getMaxInt/Long.  So optimizations like LUCENE-5860 become
> feasible.
> The terms dict can also do tricks not possible if you must live on top
> of its APIs, e.g. to handle the adversarial/over-constrained case when a
> given prefix has too many terms following it but finer prefixes
> have too few (what block tree calls "floor term blocks").


