
Posted to java-user@lucene.apache.org by Mathias Silbermann <ma...@web.de> on 2010/03/25 20:07:36 UTC

adapting lucene's practical scoring function

Dear Lucene Users,

I'd like to use Lucene to find scientific papers in the index that are
similar to a given paper from the index. This seems to be possible using
the MoreLikeThis feature or by wrapping the given document in a query
composed of several other queries (BooleanQuery). The similarity is
calculated according to Lucene's Practical Scoring Function, defined in
the JavaDoc of class Similarity.

What I am trying to do is to calculate the "semantic document
similarity". One example similarity function for that purpose is given on
page two of the paper "Corpus-based and Knowledge-based Measures of Text
Semantic Similarity" by Rada Mihalcea (formula 1). Instead of using the TF
and IDF values, it uses IDF values and the relatedness between the unique
words of the two documents being compared. First, it sums up the
relatedness of each unique word in document 1 to its most related word in
document 2, multiplied by that word's IDF value. The same procedure is then
done for document 2. After that, the two sums are averaged.
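
For reference, the procedure described above can be written out roughly as
follows. This is a sketch built from the description, not a quotation of the
paper. The notation is assumed for illustration: T_1 and T_2 are the sets of
unique words of the two documents, and maxSim(w, T) is the relatedness of w
to its most related word in T. The division by the summed IDF values is not
mentioned in the description above; it is included on the assumption that it
matches formula 1 of the paper and should be checked against it.

```latex
\mathrm{sim}(T_1, T_2) = \frac{1}{2}\left(
  \frac{\sum_{w \in T_1} \mathrm{maxSim}(w, T_2)\,\mathrm{idf}(w)}
       {\sum_{w \in T_1} \mathrm{idf}(w)}
  +
  \frac{\sum_{w \in T_2} \mathrm{maxSim}(w, T_1)\,\mathrm{idf}(w)}
       {\sum_{w \in T_2} \mathrm{idf}(w)}
\right)
```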

My question is: Given that I am able to store WordNet words extracted from
the documents in the index and pre-calculate the word-word similarities, is
it possible, and does it make sense (e.g. from the computational effort
point of view), to adapt the Practical Scoring Function to such a function
of semantic document similarity? And where (in which class) is the Practical
Scoring Function implemented, i.e. where are the values of TF, IDF, Boost...
put together?

Regards,
Mathias Silbermann

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: adapting lucene's practical scoring function

Posted by Grant Ingersoll <gs...@apache.org>.

On Mar 25, 2010, at 3:07 PM, Mathias Silbermann wrote:

> Dear Lucene Users,
> 
> I'd like to use Lucene to find scientific papers in the index that are similar to a given paper from the
> index. This seems to be possible using the MoreLikeThis-feature or wrapping the given document
> in a query composed of several other queries (BooleanQuery). The similarity is calculated
> according to Lucene's Practical Scoring Function defined in the JavaDoc of class Similarity.
> 
> What I am trying to do is to calculate the "semantic document similarity". One example similarity
> function for that purpose is given on page two of the paper "Corpus-based and Knowledge-based
> Measures of Text Semantic Similarity" by Rada Mihalcea (formula 1). Instead of using the TF and
> IDF values, it uses IDF values and the relatedness between the unique words of the two documents
> being compared. First, it sums up the relatedness of each unique word in document 1 to its most
> related word in document 2, multiplied by that word's IDF value. The same procedure is then done for document 2.
> After that, the sums are averaged.
> 

Interesting.

> My question is: Given I am able to store WordNet-Words extracted from the documents in the
> index and pre-calculate the word-word similarities, is it possible / does it make sense (e.g. from
> the (computational) effort point of view) to adapt the Practical Scoring Function to such a function
> of semantic document similarity? And where (in which class) is the Practical Scoring Function
> implemented, i.e. where are the values of TF, IDF, Boost... put together?
> 

This stuff is all done in the Scorer for a specific query (see TermQuery/TermScorer for an example).  
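
To make "put together" a bit more concrete, here is a rough plain-Java sketch of how the factors named in the Similarity JavaDoc combine into one score. It uses no Lucene classes and is not Lucene's actual code. The factor definitions follow the defaults documented for DefaultSimilarity as far as I understand them, the query boost is assumed to be 1, and the class PracticalScoringSketch, the holder TermStats, and the method score are hypothetical names used only for illustration.

```java
import java.util.List;

/** Plain-Java sketch of combining the scoring factors; no Lucene classes are used. */
class PracticalScoringSketch {

    /** Hypothetical holder for the statistics of one query term. */
    static final class TermStats {
        final int freqInDoc;  // occurrences of the term in the scored document
        final int docFreq;    // number of documents containing the term
        final double boost;   // boost of this term in the query

        TermStats(int freqInDoc, int docFreq, double boost) {
            this.freqInDoc = freqInDoc;
            this.docFreq = docFreq;
            this.boost = boost;
        }
    }

    // Factor definitions assumed to match the documented defaults.
    static double tf(int freq) {
        return Math.sqrt(freq);
    }

    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    static double coord(int overlap, int maxOverlap) {
        return (double) overlap / maxOverlap;
    }

    static double queryNorm(double sumOfSquaredWeights) {
        return 1.0 / Math.sqrt(sumOfSquaredWeights);
    }

    /**
     * score = coord * queryNorm * sum over matching terms of
     *         tf * idf^2 * boost * fieldNorm
     * fieldNorm stands in for the index-time norm of the field.
     */
    static double score(List<TermStats> queryTerms, int numDocs, double fieldNorm) {
        double sumOfSquaredWeights = 0.0;
        int overlap = 0;
        for (TermStats t : queryTerms) {
            double weight = idf(t.docFreq, numDocs) * t.boost;
            sumOfSquaredWeights += weight * weight;
            if (t.freqInDoc > 0) {
                overlap++;
            }
        }
        if (overlap == 0) {
            return 0.0; // no query term occurs in the document
        }
        double sum = 0.0;
        for (TermStats t : queryTerms) {
            if (t.freqInDoc == 0) {
                continue; // non-matching terms contribute nothing to the sum
            }
            double idf = idf(t.docFreq, numDocs);
            sum += tf(t.freqInDoc) * idf * idf * t.boost * fieldNorm;
        }
        return coord(overlap, queryTerms.size()) * queryNorm(sumOfSquaredWeights) * sum;
    }
}
```

In Lucene itself these pieces are spread over the Similarity, the query's Weight, and its Scorer, so a custom scoring scheme usually means touching more than one of them.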

Just thinking out loud here, but I think you will need to write your own Query to do this. I'm not entirely certain what that means for you, though. A FunctionQuery might help, too. It also seems, just possibly, that Lucene is a bit of overkill here other than as a way to get IDF values. Can't you just create a big matrix (maybe w/ Hadoop and HBase or something similar) of your precomputed similarities and then just do lookups against it for each document?
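
A minimal in-memory sketch of the lookup idea, in plain Java with no Lucene, Hadoop, or HBase involved. Everything here is an assumption for illustration: the class SemanticSimilaritySketch and its methods are hypothetical, the "matrix" is a hash map keyed by word pair, missing pairs count as relatedness 0, identical words count as 1, words without an IDF value get weight 1, and each directed sum is normalized by the total IDF weight (which I believe matches the paper, but check).

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

/** Sketch: document similarity from precomputed word-word relatedness and IDF values. */
class SemanticSimilaritySketch {

    // Stand-in for the precomputed similarity matrix, keyed by an ordered word pair.
    private final Map<String, Double> relatedness = new HashMap<>();
    // IDF value per word, e.g. exported from the index.
    private final Map<String, Double> idf = new HashMap<>();

    void putRelatedness(String a, String b, double value) {
        relatedness.put(key(a, b), value);
    }

    void putIdf(String word, double value) {
        idf.put(word, value);
    }

    // Symmetric key so that (a, b) and (b, a) hit the same entry.
    private static String key(String a, String b) {
        return a.compareTo(b) <= 0 ? a + "|" + b : b + "|" + a;
    }

    double relatedness(String a, String b) {
        if (a.equals(b)) {
            return 1.0; // a word is taken to be fully related to itself
        }
        return relatedness.getOrDefault(key(a, b), 0.0);
    }

    /** IDF-weighted average of each word's best match in the other document. */
    double directed(Set<String> from, Set<String> to) {
        double weighted = 0.0;
        double total = 0.0;
        for (String w : from) {
            double best = 0.0;
            for (String other : to) {
                best = Math.max(best, relatedness(w, other));
            }
            double weight = idf.getOrDefault(w, 1.0);
            weighted += best * weight;
            total += weight;
        }
        return total == 0.0 ? 0.0 : weighted / total;
    }

    /** Average of the two directed scores. */
    double similarity(Set<String> doc1, Set<String> doc2) {
        return 0.5 * (directed(doc1, doc2) + directed(doc2, doc1));
    }
}
```

The inner loop is quadratic in the number of unique words per document pair, so comparing one paper against a whole collection this way is where the computational effort goes.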

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene: http://www.lucidimagination.com/search

