Posted to user@lucenenet.apache.org by Trevor Watson <tw...@datassimilate.com> on 2012/05/03 17:54:19 UTC

Top Terms in Subset

We're attempting to create a tag (word) cloud of the terms in a subset of 
documents (or in the entire index, if the subset is the entire index).

Is there a way to get the top terms in a given field for a subset of 
documents?

My idea was to write a HitCollector that collects the terms of each 
document as it passes through the collector.

This could be data-intensive and slow if there are a large number of 
terms or documents.

I'm looking for something like the "top terms" option in Luke.

Thanks in advance,

RE: Top Terms in Subset

Posted by Simon Svensson <si...@devhost.se>.
Hi,

This sounds a lot like faceted search. You do need a Collector that grabs
the reader and document ID for each hit, and then you iterate over those to
read all the terms. The terms can be cached in memory using FieldCache,
assuming you have only one term per field; otherwise, write a custom
caching implementation that supports multiple terms per field.
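
The counting step described above can be sketched independently of the
Lucene.NET API. The following is a minimal Python illustration, not Lucene
code: `doc_terms` is a hypothetical in-memory mapping from document ID to the
terms of one field (standing in for whatever cache or term lookup is used),
and `hits` is the list of document IDs that a collector would have gathered.
The names and the choice to count each term once per document are
assumptions made for the example.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def top_terms(
    doc_terms: Dict[int, List[str]],
    hits: Iterable[int],
    n: int = 10,
) -> List[Tuple[str, int]]:
    """Return the n most frequent terms among the documents in `hits`.

    doc_terms is a hypothetical stand-in for a per-document term cache;
    it is not part of any Lucene API.
    """
    counts: Counter = Counter()
    for doc_id in hits:
        # Count each term once per document, so the result is the number
        # of matching documents that contain the term.
        counts.update(set(doc_terms.get(doc_id, ())))
    return counts.most_common(n)


# Example data: four documents, of which three matched the query.
example_terms = {
    0: ["lucene", "search"],
    1: ["lucene", "index"],
    2: ["search", "cloud"],
    3: ["lucene"],
}
example_hits = [0, 1, 3]
result = top_terms(example_terms, example_hits, n=3)
```

With this layout, the cost per query is proportional to the number of hits
times the terms per document, which is why caching the per-document terms
matters for large result sets.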

The top-terms view in Luke is very slow; I guess it does a full iteration
every time. It's open source, so you can check the actual implementation if
you want to copy it.

// Simon
