Posted to solr-user@lucene.apache.org by Chris Hostetter <ho...@fucit.org> on 2010/01/01 04:55:15 UTC
Re: Using IDF to find Collocations and SIPs . . ?
: The Collocations would be similar to facets except I am also trying to get
: multi word phrases as well as single terms. So suppose I could write
Assuming I understand what you want, I would look into using the
ShingleFilter to build up tokens consisting of N->M adjacent words, then
you could just facet on that field to see the really common "phrases", or
use the TermsComponent to get them as well...
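
As a rough illustration of the shingling idea (phrases built from runs of
adjacent tokens, then counted per document), here is a minimal pure-Python
sketch. It is not Solr's implementation; the function name, the size limits
and the sample documents are all made up for the example.

```python
from collections import Counter


def shingles(tokens, min_size=2, max_size=3):
    """Yield every run of min_size..max_size adjacent tokens, space-joined."""
    for start in range(len(tokens)):
        for size in range(min_size, max_size + 1):
            if start + size <= len(tokens):
                yield " ".join(tokens[start:start + size])


# Made-up sample documents, split on whitespace for simplicity.
docs = [
    "the quick brown fox",
    "the quick brown dog",
    "a slow brown fox",
]

# Count how many documents contain each shingle (roughly what a facet
# count on a shingled field would report).
doc_counts = Counter()
for doc in docs:
    doc_counts.update(set(shingles(doc.split())))

print(doc_counts.most_common(3))
```

The most frequent entries are the candidate "common phrases"; in Solr the
analysis chain would do the tokenizing and shingling at index time.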
: The highly unusual phrases on the other hand requires getting a handle on
: the IDF which at present only appears to be available via the explain
: function of debugging.
...as I mentioned, you can use the TermsComponent to get terms and their
document count ... it has a terms.maxcount param so you can use that to
limit the output to only terms that appear in no more than X documents.
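
A hedged sketch of what such a request might look like, plus a textbook IDF
computed from a document count. The host, the "/terms" handler path and the
field name "shingles" are assumptions for illustration, and log(N / df) is
the generic textbook form, not necessarily the exact formula Lucene uses
when scoring.

```python
import math
from urllib.parse import urlencode

# Hypothetical endpoint: host and handler path depend on your solrconfig.xml.
base_url = "http://localhost:8983/solr/terms"

params = {
    "terms": "true",
    "terms.fl": "shingles",  # assumed name of the field to read terms from
    "terms.maxcount": 5,     # only terms that appear in at most 5 documents
    "terms.limit": 100,      # cap on how many terms come back
}
url = base_url + "?" + urlencode(params)
print(url)


def idf(doc_freq, num_docs):
    """Textbook inverse document frequency: log(N / df)."""
    return math.log(num_docs / doc_freq)


# A term found in 2 of 1000 documents scores higher than one found in 500.
print(idf(2, 1000) > idf(500, 1000))
```

Since the TermsComponent already returns the document count for each term,
the IDF can be derived on the client side without going through the explain
output.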
That said: These are possible ways of solving these types of problems
using Solr, which can be handy if you are already running Solr for other
things in general -- but if you are just trying to do a one-time analysis
of a large corpus of data (or even a many-time analysis of a corpus that
changes very frequently) w/o needing any of Solr's other features then you
may find that you can accomplish this type of task much more simply (and
probably faster) with some simple map/reduce jobs in Hadoop.
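
To give a feel for the map/reduce shape of this kind of job, here is a small
in-memory Python sketch. It does not use Hadoop at all: the sort stands in
for the shuffle phase, and the function names, corpus and threshold are
invented for the example.

```python
from itertools import groupby
from operator import itemgetter


def map_doc(text):
    """Map step: emit (term, 1) once per distinct term in a document."""
    for term in set(text.lower().split()):
        yield term, 1


def reduce_counts(pairs):
    """Reduce step: sort by term (the 'shuffle'), then sum each group."""
    for term, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield term, sum(count for _, count in group)


# Made-up corpus; a real job would read documents from distributed storage.
corpus = ["solr uses lucene", "lucene builds an index", "hadoop runs jobs"]

pairs = [pair for doc in corpus for pair in map_doc(doc)]
doc_freq = dict(reduce_counts(pairs))

# Keep only rare terms: those appearing in no more than max_docs documents.
max_docs = 1
rare = sorted(term for term, df in doc_freq.items() if df <= max_docs)
print(rare)
```

The same two steps work for phrases if the map step emits shingles instead
of single terms.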
-Hoss