Posted to solr-user@lucene.apache.org by Subscriptions <su...@metaheuristica.com> on 2009/12/28 03:43:56 UTC

Using IDF to find Collocations and SIPs?

I am trying to write a query analyzer to pull:

1. Common phrases (also known as collocations) within a query

2. Highly unusual phrases (also known as Statistically Improbable
   Phrases, or SIPs) within a query

The collocations would be similar to facets, except that I am also trying
to get multi-word phrases as well as single terms. So I suppose I could
write something that does a chained query off the facet query, looking for
words in proximity. Conceptually (as I understand it) this should just be a
question of using the IDF (inverse document frequency, i.e. a measure of
how rare a term is across the index: the fewer documents contain the term,
the higher its IDF).

* Has anyone tried to write an analyzer that looks for the words
  that typically occur within a given proximity of another word?

The highly unusual phrases, on the other hand, require getting a handle on
the IDF, which at present only appears to be available via the explain
output of debugging.

* Has anyone written something to go directly after the IDF score
  only?

* If I do have to go down the path of writing this from scratch, is
  the org.apache.lucene.search.Similarity class the one to leverage?

Most grateful for any feedback or insights,

Christopher


Re: Using IDF to find Collocations and SIPs?

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Christopher,

It's not Lucene or Solr, but have a look at http://www.sematext.com/products/key-phrase-extractor/index.html 


There is an unofficial demo for it (it uses Reuters news feeds, with two 1-week-long windows for SIPs):

  http://www.sematext.com/demo/kpe/i.html

(it looks like the CollateFilter option on the left is kaput, so ignore it -- though that filter is actually quite useful and without it you may see some phrase overlap)

Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch





Re: Using IDF to find Collocations and SIPs?

Posted by Chris Hostetter <ho...@fucit.org>.
: The Collocations would be similar to facets except I am also trying to get
: multi word phrases as well as single terms. So suppose I could write

Assuming I understand what you want, I would look into using the 
ShingleFilter to build up Tokens consisting of N to M adjacent words, then 
you could just facet on that field to see the really common "phrases", or 
use the TermsComponent to get them as well...
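
To make the shingle idea concrete, here is a minimal sketch in plain Python 
of what shingling plus a per-document count would surface. It is not Solr 
or Lucene code: the names shingles and shingle_doc_counts, the whitespace 
tokenization, and the three sample documents are all assumptions made for 
illustration only.

```python
from collections import Counter


def shingles(tokens, min_size=2, max_size=3):
    """Return every run of min_size..max_size adjacent tokens, space-joined."""
    out = []
    for n in range(min_size, max_size + 1):
        for i in range(len(tokens) - n + 1):
            out.append(" ".join(tokens[i:i + n]))
    return out


def shingle_doc_counts(docs, min_size=2, max_size=3):
    """Count, for each shingle, how many documents contain it at least once."""
    counts = Counter()
    for text in docs:
        tokens = text.lower().split()  # naive whitespace tokenizer (assumption)
        # set() so a shingle repeated inside one document is counted once
        counts.update(set(shingles(tokens, min_size, max_size)))
    return counts


# Made-up sample documents, for illustration only.
docs = [
    "new york stock exchange opens",
    "the new york stock exchange closed early",
    "stock prices in new york",
]
counts = shingle_doc_counts(docs)
top = counts.most_common(3)  # most widespread shingles first
```

Sorting the counts from the top gives collocation candidates, much as 
faceting on a shingled field would; sorting from the bottom gives 
candidates for the unusual phrases.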

: The highly unusual phrases on the other hand requires getting a handle on
: the IDF which at present only appears to be available via the explain
: function of debugging. 

...as I mentioned, you can use the TermsComponent to get terms and their 
document count ... it has a terms.maxcount param so you can use that to 
limit the output to only terms that appear in no more than X documents.
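
Once you have the document counts, the IDF itself is a small calculation. 
The sketch below builds a TermsComponent request URL and turns document 
counts into IDF values using 1 + ln(numDocs / (docFreq + 1)), which, as far 
as I know, is what Lucene's DefaultSimilarity computes. The handler path 
/solr/terms, the field name "phrases", the helper names, and the sample 
counts are hypothetical; the total document count has to come from 
elsewhere (for example the numFound of a match-all query).

```python
import math
from urllib.parse import urlencode


def terms_url(base, field, max_count, limit=50):
    """Build a TermsComponent request URL (handler path is an assumption)."""
    params = {
        "terms": "true",
        "terms.fl": field,            # field to list terms from
        "terms.maxcount": max_count,  # only terms in at most this many docs
        "terms.limit": limit,
        "wt": "json",
    }
    return base + "?" + urlencode(params)


def idf(doc_freq, num_docs):
    """IDF as 1 + ln(numDocs / (docFreq + 1)); rarer terms score higher."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def rank_by_idf(doc_counts, num_docs):
    """Return (term, idf) pairs, rarest (highest IDF) first."""
    scored = ((term, idf(df, num_docs)) for term, df in doc_counts.items())
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical endpoint and field name; adjust to your own configuration.
url = terms_url("http://localhost:8983/solr/terms", "phrases", max_count=5)

# Made-up document counts standing in for a TermsComponent response.
sample = {"new york": 900, "stock exchange": 120, "quantum tulip": 2}
ranked = rank_by_idf(sample, num_docs=1000)
```

Here "quantum tulip" comes out on top because it appears in the fewest 
documents, which is the behaviour you want for SIP candidates.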

That said: These are possible ways of solving these types of problems 
using Solr, which can be handy if you are already running Solr for other 
things in general -- but if you are just trying to do a one-time analysis 
of a large corpus of data (or even a many-time analysis of a corpus that 
changes very frequently) w/o needing any of Solr's other features then you 
may find that you can accomplish this type of task much more simply (and 
probably faster) with some simple map/reduce jobs in Hadoop.
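
The shape of such a job can be sketched in a single Python process. This 
is not Hadoop code: mapper, reducer and run_job are hypothetical names, the 
sort stands in for the shuffle step, and the corpus is made up. The map 
phase emits each distinct two-word phrase once per document, and the reduce 
phase counts the documents per phrase.

```python
from itertools import groupby


def mapper(doc_id, text, n=2):
    """Map phase: emit (phrase, doc_id) once per distinct n-word phrase."""
    tokens = text.lower().split()  # naive whitespace tokenizer (assumption)
    phrases = {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
    for phrase in phrases:
        yield phrase, doc_id


def reducer(phrase, doc_ids):
    """Reduce phase: document frequency = number of distinct documents."""
    return phrase, len(set(doc_ids))


def run_job(corpus, n=2):
    """Run map, shuffle (sort by key) and reduce in-process."""
    pairs = sorted(
        kv for doc_id, text in corpus.items() for kv in mapper(doc_id, text, n)
    )
    return dict(
        reducer(phrase, [doc_id for _, doc_id in group])
        for phrase, group in groupby(pairs, key=lambda kv: kv[0])
    )


# Made-up corpus, for illustration only.
corpus = {
    "d1": "interest rates rise again",
    "d2": "central bank holds interest rates",
    "d3": "interest rates and inflation",
}
doc_freq = run_job(corpus)
common = [p for p, df in doc_freq.items() if df >= 3]    # collocation candidates
rare = sorted(p for p, df in doc_freq.items() if df == 1)  # SIP candidates
```

The thresholds (3 and 1) only make sense for this toy corpus; on real data 
you would pick cut-offs relative to the corpus size.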


-Hoss


Re: Using IDF to find Collocations and SIPs?

Posted by Siddhartha Pahade <pa...@gmail.com>.
pl unsubscribe me
