You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucene.apache.org by "Ferenczi Jim (JIRA)" <ji...@apache.org> on 2016/08/23 23:03:20 UTC

[jira] [Updated] (LUCENE-7423) AutoPrefixPostingsFormat: a PostingsFormat optimized for prefix queries on text fields.

     [ https://issues.apache.org/jira/browse/LUCENE-7423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ferenczi Jim updated LUCENE-7423:
---------------------------------
    Attachment: LUCENE-7423.patch

> AutoPrefixPostingsFormat: a PostingsFormat optimized for prefix queries on text fields.
> ---------------------------------------------------------------------------------------
>
>                 Key: LUCENE-7423
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7423
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: modules/sandbox
>            Reporter: Ferenczi Jim
>            Priority: Minor
>         Attachments: LUCENE-7423.patch
>
>
> The autoprefix terms dict added in https://issues.apache.org/jira/browse/LUCENE-5879 has been removed with https://issues.apache.org/jira/browse/LUCENE-7317.
> The new points API is now used to do efficient range queries but the replacement for prefix string queries is unclear. The edge ngrams could be used instead but they have a lot of drawbacks and are hard to configure correctly. The completion postings format is also a good replacement but it requires to have a big FST in RAM and it cannot be intersected with other fields. 
> This patch is a proposal for a new PostingsFormat optimized for prefix query on string fields. It detects prefixes that match "enough" terms and writes auto-prefix terms into their own virtual field.
>  At search time the virtual field is used to speed up prefix queries that match "enough" terms.
> The auto-prefix terms are built in two pass:
> * The first pass builds a compact prefix tree. Since the terms enum is sorted the prefixes are flushed on the fly depending on the input. For each prefix we build its corresponding inverted lists using a DocIdSetBuilder. The first pass visits each term of the field TermsEnum only once. When a prefix is flushed from the prefix tree its inverted lists is dumped into a temporary file for further use. This is necessary since the prefixes are not sorted when they are removed from the tree. The selected auto prefixes are sorted at the end of the first pass.
> * The second pass is a sorted scan of the prefixes and the temporary file is used to read the corresponding inverted lists.
> The patch is just a POC and there are rooms for optimizations but the first results are promising:
> I tested the patch with the geonames dataset. I indexed all the titles with the KeywordAnalyzer and compared the index/merge time and the size of the indices. 
> The edge ngram index (with a min edge ngram size of 2 and a max of 20) takes 572M on disk and it took 130s to index and optimize the 11M titles. 
> The auto prefix index takes 287M on disk and took 70s to index and optimize the same 11M titles. Among the 287M, only 170M are used for the auto prefix fields and the rest is for the regular keyword field. All the auto prefixes were generated for this test (at least 2 terms per auto-prefix).  
> The queries have similar performance since we are sure on both sides that one inverted list can answer any prefix query.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org