
Posted to java-user@lucene.apache.org by sdeck <sc...@gmail.com> on 2007/03/10 01:27:47 UTC

Find related question

Hello,
I run Nutch and get a whole slew of articles. When I display search
results, there may be 5-6 articles that have different titles, and most of
the body text is the same, but I want to group them all under one result.
These are usually AP articles that all newspapers repurpose.

When using the MoreLikeThis functionality, the articles that are returned
may or may not be similar. When I run the query, the scores returned can
range from .1 to .4 for the first 2 hits. It usually returns around
50 results, with the last score coming in fairly close to 0. Usually, the
first hit is the exact same article as the one I am trying to find related
articles for. I know that the score value has no real meaning though,
because it is computed from the query and other factors and then
normalized.

So, should I be taking (hit score/1) to use as a percentage value to see
what other articles might be similar after that first hit? Try and normalize
the similarity basically? Am I off my rocker?

Or, is there possibly a way to use Carrot2 to find related articles for a
given document?

Thanks,
Scott

-- 
View this message in context: http://www.nabble.com/Find-related-question-tf3379250.html#a9405661
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Find related question

Posted by markharw00d <ma...@yahoo.co.uk>.

> most of the body text is the same, but I want to group them all under one result.

I created this analyzer class to identify content that was "mostly 
similar" but not necessarily identical.
http://issues.apache.org/jira/browse/LUCENE-725

If you feed a small set of documents through it (say, your search
results), it will only emit tokens for sequences of words that it hasn't
seen before. Documents that have significantly more tokens going into
NovelAnalyzer than coming out of it are near-duplicates of documents
already seen.

Cheers
Mark
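
The idea Mark describes above (compare how many word sequences go in with
how many come out as "not seen before") can be illustrated with a small
standalone sketch. This is not the NovelAnalyzer code from LUCENE-725 and
does not use the Lucene API; the class name NoveltySketch, the method
noveltyRatio, and the sequence length of 3 words are all assumptions made
for illustration only.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class NoveltySketch {
    // Hypothetical parameter: number of consecutive words per sequence.
    private static final int SEQUENCE_LENGTH = 3;

    // Word sequences seen in all documents fed through so far.
    private final Set<String> seen = new HashSet<>();

    /**
     * Returns the fraction (0.0 to 1.0) of this document's word sequences
     * that were not seen before, either in earlier documents or earlier in
     * this one. A low value suggests a near-duplicate.
     */
    public double noveltyRatio(String text) {
        // Lowercase and split on anything that is not a letter or digit.
        List<String> words = new ArrayList<>();
        for (String w : text.toLowerCase().split("[^\\p{L}\\p{N}]+")) {
            if (!w.isEmpty()) {
                words.add(w);
            }
        }
        int total = 0;  // sequences "going in"
        int novel = 0;  // sequences "coming out" (not seen before)
        for (int i = 0; i + SEQUENCE_LENGTH <= words.size(); i++) {
            String seq = String.join(" ", words.subList(i, i + SEQUENCE_LENGTH));
            total++;
            if (seen.add(seq)) {  // add() returns true only for a new sequence
                novel++;
            }
        }
        // Documents too short to form a sequence are treated as fully novel.
        return total == 0 ? 1.0 : (double) novel / total;
    }
}
```

Feeding the search results through one NoveltySketch instance in rank order
and grouping any document whose ratio falls below some cutoff under an
earlier hit would be one way to use it; the cutoff itself would need tuning
on real data.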





	
	
		