Posted to java-user@lucene.apache.org by wilqor <wi...@gmail.com> on 2017/08/28 15:18:15 UTC

Getting most frequent terms from single-token field values in a subset of Lucene documents

Hello,

I have a Lucene index whose documents contain a number of single-token text
fields, each indexed as a StringField. For each field I would like to query
the most frequent values (for example, the top 10 by number of occurrences)
and to run that query efficiently on a subset of the documents.

I have tried out several approaches with Lucene 6.4.2:
- using HighFreqTerms like:
    HighFreqTerms.getHighFreqTerms(
        searcher.getIndexReader(),
        TOP_N_COUNT,
        fieldName,
        new HighFreqTerms.DocFreqComparator()
    );
This is convenient, but it works on the whole index reader and cannot be
restricted to a subset of documents.

- using GroupingSearch. This lets me filter the documents in the index, but
the resulting groups cannot be sorted by the number of occurrences. As a
workaround I could request a large number of result groups and then sort
them by hit count myself. That is far from ideal: in the worst case every
document has a unique field value, so there are as many groups as matching
documents and memory consumption is high.

- using the Facets API, adding a FacetField for each document field and
using FastTaxonomyFacetCounts to query the top N values. With this
approach I can both filter the documents and get the most frequent
single-token values without loading all the groups into memory. The main
disadvantage is slower indexing: in my case indexing takes twice as long
when a FacetField is added for each StringField.
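
To make the memory concern with the GroupingSearch workaround concrete,
here is a minimal plain-Java sketch of what it amounts to. It is not Lucene
code: TopValues and topN are hypothetical names, and the input simply stands
in for the field value of each document that matched the filter. The
per-value counter map is the part that grows with the number of distinct
values; picking the top N out of it only needs a heap of N entries.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical helper for illustration only; not part of Lucene.
final class TopValues {

    // Returns the n most frequent values, most frequent first.
    // Ties on count are broken by value, ascending, so the result is stable.
    static List<Map.Entry<String, Integer>> topN(Iterable<String> values, int n) {
        // One counter per distinct value. In the worst case (all values
        // unique) this map has one entry per matching document.
        Map<String, Integer> counts = new HashMap<>();
        for (String value : values) {
            counts.merge(value, 1, Integer::sum);
        }

        // Ordering of the final result: higher count first, then value.
        Comparator<Map.Entry<String, Integer>> best = (a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        };

        // Bounded heap with the worst retained entry at the head, so it can
        // be evicted cheaply whenever the heap grows past n entries.
        PriorityQueue<Map.Entry<String, Integer>> heap =
            new PriorityQueue<>(best.reversed());
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            heap.offer(entry);
            if (heap.size() > n) {
                heap.poll();
            }
        }

        List<Map.Entry<String, Integer>> result = new ArrayList<>(heap);
        result.sort(best);
        return result;
    }
}
```

For example, calling TopValues.topN on the values a, b, a, c, b, a with
n = 2 keeps a (3 occurrences) and b (2 occurrences). The heap stays small,
but the counter map does not, which is the cost I would like to avoid.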

Is there another or better way to get the most frequent single-token
field values in a Lucene index while still being able to filter documents?

Thanks.