Posted to java-user@lucene.apache.org by Mathias Silbermann <ma...@web.de> on 2010/03/25 20:07:36 UTC
adapting lucene's practical scoring function
Dear Lucene Users,
I'd like to use Lucene to find scientific papers in the index that are
similar to a given paper from the index. This seems to be possible using
the MoreLikeThis feature or by wrapping the given document in a query
composed of several other queries (a BooleanQuery). The similarity is
calculated according to Lucene's Practical Scoring Function, defined in
the JavaDoc of the Similarity class.
What I am trying to do is to calculate the "semantic document
similarity". One example similarity function for that purpose is given
on page two of the paper "Corpus-based and Knowledge-based Measures of
Text Semantic Similarity" by Rada Mihalcea et al. (formula 1). Instead
of using the TF and IDF values, it uses the IDF values and the
relatedness between the unique words of the two documents being
compared. First, it sums up the relatedness of each unique word in
document 1 to its most related word in document 2, multiplied by that
word's IDF value. The same procedure is then applied to document 2.
After that, the two sums are averaged.
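Written out, the measure described above looks roughly as follows. This is a reconstruction from the description, not a quotation: it assumes that each sum is normalized by the total IDF weight of its own document before the two are averaged, and the symbols are illustrative ($T_1$ and $T_2$ are the sets of unique words of the two documents, and $\mathrm{maxSim}(w, T)$ is the relatedness of $w$ to its most related word in $T$).

```latex
\mathrm{sim}(T_1, T_2) = \frac{1}{2}\left(
  \frac{\sum_{w \in T_1} \mathrm{maxSim}(w, T_2)\,\mathrm{idf}(w)}{\sum_{w \in T_1} \mathrm{idf}(w)}
  +
  \frac{\sum_{w \in T_2} \mathrm{maxSim}(w, T_1)\,\mathrm{idf}(w)}{\sum_{w \in T_2} \mathrm{idf}(w)}
\right)
```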
My question is: given that I am able to store the WordNet words
extracted from the documents in the index and to pre-calculate the
word-word similarities, is it possible, and does it make sense (e.g. in
terms of computational effort), to adapt the Practical Scoring Function
to such a function of semantic document similarity? And where (in which
class) is the Practical Scoring Function implemented, i.e. where are the
values of TF, IDF, boost, etc. put together?
Regards,
Mathias Silbermann
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: adapting lucene's practical scoring function
Posted by Grant Ingersoll <gs...@apache.org>.
On Mar 25, 2010, at 3:07 PM, Mathias Silbermann wrote:
> Dear Lucene Users,
>
> I'd like to use Lucene to find scientific papers in the index that are similar to a given paper from the
> index. This seems to be possible using the MoreLikeThis feature or by wrapping the given document
> in a query composed of several other queries (a BooleanQuery). The similarity is calculated
> according to Lucene's Practical Scoring Function, defined in the JavaDoc of the Similarity class.
>
> What I am trying to do is to calculate the "semantic document similarity". One example similarity
> function for that purpose is given on page two of the paper "Corpus-based and Knowledge-based
> Measures of Text Semantic Similarity" by Rada Mihalcea et al. (formula 1). Instead of using the TF and
> IDF values, it uses the IDF values and the relatedness between the unique words of the two documents
> being compared. First, it sums up the relatedness of each unique word in document 1 to its most
> related word in document 2, multiplied by that word's IDF value. The same procedure is then applied
> to document 2. After that, the two sums are averaged.
>
Interesting.
> My question is: given that I am able to store the WordNet words extracted from the documents in the
> index and to pre-calculate the word-word similarities, is it possible, and does it make sense (e.g. in
> terms of computational effort), to adapt the Practical Scoring Function to such a function
> of semantic document similarity? And where (in which class) is the Practical Scoring Function
> implemented, i.e. where are the values of TF, IDF, boost, etc. put together?
>
This stuff is all done in the Scorer for a specific query (see TermQuery/TermScorer for an example).
Just thinking out loud here, but I think you will need to write your own Query to do this. I'm not entirely certain what that means for you, though. A FunctionQuery might help, too. It also seems, just possibly, that Lucene is a bit of overkill here other than using it to get the IDF values. Can't you just create a big matrix (maybe w/ Hadoop and HBase or something similar) of your precomputed similarities and then just do lookups for the words of each document?
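To make the lookup idea a bit more concrete, here is a minimal sketch in plain Java (no Lucene involved), assuming the IDF values and the pairwise relatedness scores have already been computed. The class name SemanticSimilarity, its methods, and the in-memory maps are hypothetical stand-ins for whatever store actually holds the precomputed values. It further assumes that each directional sum is normalized by the total IDF of its document before the two are averaged, that a word is fully related (1.0) to itself, and that unknown pairs have relatedness 0.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Sketch: IDF-weighted document similarity over precomputed word-word relatedness. */
class SemanticSimilarity {

    // Hypothetical lookup tables; in practice these would be loaded from
    // wherever the precomputed values live (a file, HBase, ...).
    private final Map<String, Double> idf = new HashMap<>();
    private final Map<String, Double> relatedness = new HashMap<>();

    void setIdf(String word, double value) {
        idf.put(word, value);
    }

    void setRelatedness(String a, String b, double value) {
        relatedness.put(key(a, b), value);
    }

    // Order-independent key, so (a, b) and (b, a) share one entry.
    private static String key(String a, String b) {
        return a.compareTo(b) <= 0 ? a + "\t" + b : b + "\t" + a;
    }

    // Assumption: identical words score 1.0, unknown pairs score 0.0.
    double relatedness(String a, String b) {
        if (a.equals(b)) {
            return 1.0;
        }
        return relatedness.getOrDefault(key(a, b), 0.0);
    }

    // For each word in 'from', take its best match in 'to', weight it by the
    // word's IDF, and normalize by the total IDF of 'from'.
    double directed(Set<String> from, Set<String> to) {
        double weighted = 0.0;
        double totalIdf = 0.0;
        for (String word : from) {
            double best = 0.0;
            for (String other : to) {
                best = Math.max(best, relatedness(word, other));
            }
            double weight = idf.getOrDefault(word, 0.0);
            weighted += best * weight;
            totalIdf += weight;
        }
        return totalIdf == 0.0 ? 0.0 : weighted / totalIdf;
    }

    // Average of the two directional scores, so the result is symmetric.
    double similarity(Set<String> doc1, Set<String> doc2) {
        return 0.5 * (directed(doc1, doc2) + directed(doc2, doc1));
    }

    public static void main(String[] args) {
        SemanticSimilarity sim = new SemanticSimilarity();
        sim.setIdf("cat", 2.0);
        sim.setIdf("dog", 1.0);
        sim.setIdf("car", 1.0);
        sim.setRelatedness("cat", "dog", 0.5);

        Set<String> doc1 = new HashSet<>(Arrays.asList("cat", "car"));
        Set<String> doc2 = new HashSet<>(Arrays.asList("dog"));
        System.out.println(sim.similarity(doc1, doc2));
    }
}
```

Note that the inner loop makes this quadratic in the number of unique words per document pair, so comparing one paper against a whole collection this way can get expensive; that cost is the main thing to weigh before pushing the computation into a custom Query.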
--------------------------
Grant Ingersoll
http://www.lucidimagination.com/
Search the Lucene ecosystem using Solr/Lucene: http://www.lucidimagination.com/search