Posted to java-user@lucene.apache.org by David Spencer <da...@tropo.com> on 2005/01/17 22:46:23 UTC

MoreLikeThis and other similarity query generators checked in + online demo

Based on mail from Doug I wrote a "more like this" query generator, 
named, well, MoreLikeThis. Bruce Ritchie and Mark Harwood contributed 
changes (especially term vector support) and bug fixes. Thanks to everyone.

I've checked in the code to the sandbox under contributions/similarity.

It ends up in the package org.apache.lucene.search.similar -- hope 
that makes sense.
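
Roughly, a "more like this" generator picks the terms of the source doc 
that look most distinctive and ORs them together. Here is a minimal 
sketch of that idea in plain Java. It is not the checked-in code: the 
class and method names, the tf*idf weighting, and the docFreq map 
standing in for index statistics are all illustrative assumptions.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class InterestingTermsSketch {

    // Hypothetical helper (not the MoreLikeThis API): rank the terms of one
    // document by tf * idf and keep at most maxTerms of them.
    static List<String> topTerms(List<String> docTokens,
                                 Map<String, Integer> docFreq,
                                 int numDocs,
                                 int maxTerms) {
        // Term frequency within the source document.
        Map<String, Integer> tf = new HashMap<>();
        for (String t : docTokens) {
            tf.merge(t, 1, Integer::sum);
        }

        List<Map.Entry<String, Double>> scored = new ArrayList<>();
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            int df = docFreq.getOrDefault(e.getKey(), 0);
            if (df == 0) {
                continue; // term unknown to the (assumed) index, skip it
            }
            // One common idf variant; the exact formula is an assumption.
            double idf = Math.log((double) numDocs / (df + 1)) + 1.0;
            scored.add(new AbstractMap.SimpleEntry<>(e.getKey(), e.getValue() * idf));
        }

        // Highest score first.
        scored.sort((a, b) -> Double.compare(b.getValue(), a.getValue()));

        List<String> out = new ArrayList<>();
        for (int i = 0; i < Math.min(maxTerms, scored.size()); i++) {
            out.add(scored.get(i).getKey());
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Integer> docFreq = new HashMap<>();
        docFreq.put("chess", 10);
        docFreq.put("kasparov", 2);
        docFreq.put("the", 1000);
        List<String> tokens =
            Arrays.asList("chess", "chess", "kasparov", "the", "the", "the");
        // Common words like "the" score low and drop out of the top terms.
        System.out.println(topTerms(tokens, docFreq, 1000, 2));
    }
}
```

The selected terms would then become the optional clauses of the 
generated query.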

I also created a class, SimilarityQueries, to hold other methods of 
similarity query generation. The two methods in there are "dumber" 
variations that use the entire source of the target doc to form a large 
query.
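
To give a rough idea of what "use the entire source" means, here is a 
minimal sketch in plain Java. It is not the SimilarityQueries code: the 
class name, the formQuery method, the regex tokenizer, and the 
field:term output format are assumptions made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class WholeDocQuerySketch {

    // Hypothetical helper: every distinct non-stop word in the source text
    // becomes one optional clause of a (potentially very large) query.
    static String formQuery(String body, String field, Set<String> stopWords) {
        // LinkedHashSet drops duplicates but keeps first-seen order.
        Set<String> seen = new LinkedHashSet<>();
        for (String tok : body.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (tok.isEmpty() || stopWords.contains(tok)) {
                continue;
            }
            seen.add(tok);
        }

        List<String> clauses = new ArrayList<>();
        for (String term : seen) {
            clauses.add(field + ":" + term);
        }
        // Space-separated clauses, read as "any of these terms may match".
        return String.join(" ", clauses);
    }

    public static void main(String[] args) {
        Set<String> stop = new HashSet<>(Arrays.asList("and", "is"));
        System.out.println(
            formQuery("Kasparov plays chess, and chess is hard", "body", stop));
    }
}
```

Unlike the term-selecting approach, nothing here weighs terms against 
each other, so the query grows with the size of the source doc.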

Javadoc is here:

http://www.searchmorph.com/pub/jakarta-lucene-sandbox/contributions/similarity/build/docs/api/org/apache/lucene/search/similar/package-summary.html

There is also an online demo. The pages below compare the 3 variations 
on detecting similar docs. The timing info (3 numbers w/ "(ms)") may be 
suspect. Also note that if you scroll to the bottom you can see the 
queries that were generated.

Here's a page showing docs similar to the entry for Iraq:

http://www.searchmorph.com/kat/wikipedia-compare.jsp?s=Iraq

And here's one for docs similar to the one on Garry Kasparov (he knows 
how to play chess :) ):

http://www.searchmorph.com/kat/wikipedia-compare.jsp?s=Garry_Kasparov


To get to it, start here:

http://www.searchmorph.com/kat/wikipedia.jsp

Search for something, then follow a "cmp" link on the search results page:

http://www.searchmorph.com/kat/wikipedia.jsp?s=iraq

Make sense? Useful? Has anyone done any other variations (e.g. cosine 
measure)?

- Dave

