Posted to user@lucenenet.apache.org by Trevor Watson <tw...@datassimilate.com> on 2012/05/03 17:54:19 UTC
Top Terms in Subset
We're attempting to create a tag (word) cloud of the terms in a subset of
documents (or in the entire index, if the subset covers the whole index).
Is there a way to get the top terms in a given field for a subset of
documents?
The idea I had was to create a HitCollector that collects the terms of
each document as it passes through.
This could be highly data intensive and slow if there are a large number
of terms or documents.
We're looking for something like the "top terms" option in Luke.
Thanks in advance,
RE: Top Terms in Subset
Posted by Simon Svensson <si...@devhost.se>.
Hi,
This sounds a lot like faceted search. You do need a Collector that grabs
reader+documentId from a search; you then iterate over those to read all the
terms. The terms can be cached in memory using FieldCache, assuming that you
have only one term per field, or you can write a custom caching
implementation that supports multiple terms per field.
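As a rough illustration of the counting step described above, here is a
sketch in plain Python. It is not Lucene.NET code and uses no Lucene API.
field_cache stands in for an in-memory per-document cache holding one term
per document, multi_cache for a hypothetical custom cache holding several
terms per document, and hits for the document ids a Collector would gather.
All of these names are assumptions made for the example.

```python
from collections import Counter


def top_terms_single(field_cache, hits, n):
    """Top n terms among the hit documents, one term per document.

    field_cache: list where index = document id and value = that document's
                 term (a stand-in for an in-memory per-document cache).
    hits:        iterable of document ids gathered during the search.
    """
    counts = Counter(field_cache[doc_id] for doc_id in hits)
    return counts.most_common(n)


def top_terms_multi(multi_cache, hits, n):
    """Top n terms among the hit documents, several terms per document.

    multi_cache: list where index = document id and value = list of terms
                 (a stand-in for a hypothetical custom cache).
    """
    counts = Counter()
    for doc_id in hits:
        # Count each term once per document, so the result is the number
        # of hit documents that contain the term.
        counts.update(set(multi_cache[doc_id]))
    return counts.most_common(n)
```

The work grows with the number of hits (and with the terms per hit in the
multi-term case), not with the size of the whole index, provided the
per-document lookup is already in memory.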
The top-terms view in Luke is very slow; I guess it does a full iteration
every time. It's open source, so you can check the actual implementation if
you want to copy it.
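For comparison, a full iteration of that kind could look like the sketch
below. This is an assumption about the general technique, not Luke's actual
code. term_doc_freqs is a hypothetical iterable of (term, document
frequency) pairs for one field across the whole index.

```python
import heapq


def top_terms_by_doc_freq(term_doc_freqs, n):
    """Return the n (term, doc_freq) pairs with the highest doc_freq.

    Every pair is visited once, which is why this gets slow when the
    field has a very large number of distinct terms.
    """
    return heapq.nlargest(n, term_doc_freqs, key=lambda pair: pair[1])
```

Note that this ranks terms over the entire index, so it does not answer the
subset question on its own.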
// Simon