Posted to solr-user@lucene.apache.org by Subscriptions <su...@metaheuristica.com> on 2009/12/28 03:43:56 UTC
Using IDF to find Collocations and SIPs . . ?
I am trying to write a query analyzer to pull:
1. Common phrases (also known as Collocations) within a query
2. Highly unusual phrases (also known as Statistically Improbable
Phrases or SIPs) within a query
The Collocations would be similar to facets, except that I am also trying to
get multi-word phrases as well as single terms. So I suppose I could write
something that does a chained query off the facet query, looking for words in
proximity. Conceptually (as I understand it) this should just be a question
of using the IDF (inverse document frequency, i.e. a measure of how rare a
term is across the index: the fewer documents contain it, the higher its IDF).
* Has anyone tried to write an analyzer that looks for the words
that typically occur within a given proximity of another word?
The highly unusual phrases, on the other hand, require getting a handle on
the IDF, which at present appears to be available only via the explain
function of debugging.
* Has anyone written something to go directly after the IDF score
only?
* If I do have to go down the path of writing this from scratch, is
the org.apache.lucene.search.Similarity class the one to leverage?
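For reference, the IDF being asked about depends only on a term's document frequency and the total number of documents, so it can be computed outside of explain output once those two counts are known. The following is a minimal sketch in Python, assuming the formula used by Lucene's classic DefaultSimilarity, idf = 1 + ln(numDocs / (docFreq + 1)). The function name and the sample counts are illustrative assumptions, not part of any Lucene or Solr API.

```python
import math


def lucene_classic_idf(doc_freq: int, num_docs: int) -> float:
    """Return 1 + ln(num_docs / (doc_freq + 1)).

    Assumption: this mirrors the formula of Lucene's classic
    DefaultSimilarity; the function itself is a hypothetical helper.
    """
    return 1.0 + math.log(num_docs / (doc_freq + 1))


# Illustrative counts only: a 1000-document index.
num_docs = 1000
common_idf = lucene_classic_idf(doc_freq=900, num_docs=num_docs)
rare_idf = lucene_classic_idf(doc_freq=2, num_docs=num_docs)

# A term found in few documents scores higher than one found in most.
print(f"common term idf={common_idf:.3f}, rare term idf={rare_idf:.3f}")
```

A phrase indexed as a single token can be scored the same way, which is what makes the rare-phrase (SIP) side of the question a matter of document counts.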
Most grateful for any feedback or insights,
Christopher
Re: Using IDF to find Collocations and SIPs . . ?
Posted by Otis Gospodnetic <ot...@yahoo.com>.
Christopher,
It's not Lucene or Solr, but have a look at http://www.sematext.com/products/key-phrase-extractor/index.html
There is an unofficial demo for it (uses Reuters news feeds with two 1-week-long windows for SIPs):
http://www.sematext.com/demo/kpe/i.html
(it looks like the CollateFilter option on the left is kaput, so ignore it -- though that filter is actually quite useful and without it you may see some phrase overlap)
Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch
Re: Using IDF to find Collocations and SIPs . . ?
Posted by Chris Hostetter <ho...@fucit.org>.
: The Collocations would be similar to facets except I am also trying to get
: multi word phrases as well as single terms. So suppose I could write
Assuming I understand what you want, I would look into using the
ShingleFilter to build up Tokens consisting of N to M consecutive words; then
you could just facet on that field to see the really common "phrases", or use
the TermsComponent to get them as well...
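The token-combining idea above (word n-grams, often called shingles) can be illustrated outside of Solr. This is a plain-Python sketch of the concept, not the Lucene filter's API: the function names, the whitespace tokenization, and the sample documents are all assumptions made for illustration. It builds phrases of N to M consecutive words and counts how many documents contain each one, which is roughly what faceting on a shingled field would report.

```python
from collections import Counter


def shingles(tokens, min_size=2, max_size=3):
    """Return every run of min_size..max_size consecutive tokens as one phrase."""
    out = []
    for n in range(min_size, max_size + 1):
        for i in range(len(tokens) - n + 1):
            out.append(" ".join(tokens[i:i + n]))
    return out


def phrase_doc_counts(docs, min_size=2, max_size=3):
    """Count the number of documents in which each phrase occurs."""
    counts = Counter()
    for text in docs:
        tokens = text.lower().split()  # naive whitespace tokenization (assumption)
        # set() so a phrase repeated inside one document is counted once
        counts.update(set(shingles(tokens, min_size, max_size)))
    return counts


# Hypothetical sample documents.
docs = ["New York City subway", "New York State fair", "York minster"]
for phrase, count in phrase_doc_counts(docs).most_common(3):
    print(count, phrase)
```

Sorting by descending count surfaces collocation candidates; the same counts sorted ascending give candidates for unusual phrases.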
: The highly unusual phrases on the other hand requires getting a handle on
: the IDF which at present only appears to be available via the explain
: function of debugging.
...as I mentioned, you can use the TermsComponent to get terms and their
document counts ... it has a terms.maxcount param, so you can use that to
limit the output to only terms that appear in no more than X documents.
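A request along these lines might be assembled as sketched below. The terms, terms.fl, terms.maxcount, terms.limit and wt parameters are standard TermsComponent and Solr request parameters; the host, the /terms handler path, and the field name "phrases" are assumptions about a particular installation and schema, so adjust them to match yours. The sketch only builds the URL and does not contact a server.

```python
from urllib.parse import urlencode


def terms_request_url(base_url, field, max_count, limit=50):
    """Build a TermsComponent URL for terms appearing in at most max_count documents."""
    params = {
        "terms": "true",
        "terms.fl": field,            # field to read terms from
        "terms.maxcount": max_count,  # upper bound on document frequency
        "terms.limit": limit,         # number of terms to return
        "wt": "json",
    }
    return f"{base_url}?{urlencode(params)}"


# Assumed endpoint and hypothetical field name.
url = terms_request_url("http://localhost:8983/solr/terms", "phrases", max_count=5)
print(url)
```

Pointing terms.fl at a field that holds multi-word tokens would return rare phrases rather than rare single terms.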
That said: these are possible ways of solving these types of problems
using Solr, which can be handy if you are already building a Solr index for
other purposes -- but if you are just trying to do a one-time analysis of a
large corpus of data (or even a many-time analysis of a corpus that
changes very frequently) w/o needing any of Solr's other features, then you
may find that you can accomplish this type of task much more simply (and
probably faster) with some simple map/reduce jobs in Hadoop.
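The shape of such a job can be sketched in a single process. This is plain Python that imitates the map, shuffle and reduce phases; it does not use the Hadoop API, and the function names, the two-word phrase size, and the three-document corpus are hypothetical. The map step emits one (phrase, document id) pair per distinct phrase in a document, and the reduce step turns each phrase's list of ids into a document frequency.

```python
from collections import defaultdict


def map_doc(doc_id, text, n=2):
    """Map: emit (phrase, doc_id) once per distinct n-word phrase in a document."""
    tokens = text.lower().split()
    phrases = {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
    return [(phrase, doc_id) for phrase in sorted(phrases)]


def shuffle(pairs):
    """Shuffle: group emitted values by key, as the framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phrase(phrase, doc_ids):
    """Reduce: the document frequency is the number of distinct documents."""
    return phrase, len(set(doc_ids))


def document_frequencies(corpus, n=2):
    pairs = [pair for doc_id, text in corpus.items() for pair in map_doc(doc_id, text, n)]
    return dict(reduce_phrase(p, ids) for p, ids in shuffle(pairs).items())


# Hypothetical three-document corpus.
corpus = {
    "d1": "solr is built on lucene",
    "d2": "lucene powers solr search",
    "d3": "solr is built for search",
}
df = document_frequencies(corpus)
common = sorted(p for p, c in df.items() if c >= 2)  # collocation candidates
rare = sorted(p for p, c in df.items() if c == 1)    # unusual-phrase candidates
print(common)
print(rare)
```

In a real job the corpus would be split across mappers and the thresholds chosen relative to corpus size; the fixed cutoffs here are for illustration only.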
-Hoss
Re: Using IDF to find Collocations and SIPs . . ?
Posted by Siddhartha Pahade <pa...@gmail.com>.
Please unsubscribe me.