Posted to solr-user@lucene.apache.org by Chris Hostetter <ho...@fucit.org> on 2011/12/07 00:17:33 UTC

Re: Attempting to achieve something similar to PostgreSQL's pg_trgm / K-NN combo with Solr

: I'm working on using trigrams for similarity matching on some data, 
: where there's a canonical name and lots of personalised variants, e.g.:
: 
: canonical: "My Wonderful Thing"
: variant: "My Wonderful Thing (for Matt Patterson)"

I'm really not sure why you would need trigrams for something like this 
... just doing something basic like whitespace tokenization and using 
length norms should allow any of these queries...

  q=My Wonderful Thing       
     ...basic 3 clause term query
  q="My Wonderful Thing"
     ...strict phrase query
  q="My Wonderful Thing"~5
     ...sloppy phrase query, allowing other words mixed in

...to match both docs, with the canonical version scoring higher because 
of its shorter length.
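
To make the length norm point concrete, here is a rough sketch in Python. 
It assumes the classic Lucene-style norm of 1 / sqrt(number of terms) and 
plain whitespace tokenization; real scoring also involves tf, idf and a 
lossy encoding of the norm, none of which is modelled here. The function 
name length_norm is made up for illustration.

```python
import math

def length_norm(text):
    """Approximate a classic Lucene-style length norm: 1 / sqrt(term count).

    Assumption: whitespace tokenization only, no stopwords, no boosts.
    """
    terms = text.split()  # whitespace tokenization
    return 1.0 / math.sqrt(len(terms))

canonical = "My Wonderful Thing"
variant = "My Wonderful Thing (for Matt Patterson)"

canonical_norm = length_norm(canonical)  # 3 terms
variant_norm = length_norm(variant)      # 6 terms

# The shorter field gets the larger norm, so with identical term matches
# the canonical name should come out ahead.
print(canonical_norm, variant_norm)
```

With the same three query terms matching in both docs, the larger norm on 
the shorter field is what pushes the canonical name up.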

: I really want the canonical version to be returned first in the results 
: list, and the setup I have now is returning results like:
: 
: * "My Wonderful Thing (for Somebody Else)"
: * "My Wonderful Thing (for Yet Another Person)"
: * "My Wonderful Thing"

what exactly does your query look like?  are you doing a quoted phrase 
search? what does the debugQuery output tell you about those matches?
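
If it helps, this is a small Python sketch for building a request with 
debugQuery turned on. The host, port, handler path and the "name" field 
are assumptions about a default local install, not anything taken from 
your setup, so adjust them to match.

```python
from urllib.parse import urlencode

# Assumed location of a local Solr select handler; change to suit.
SOLR_SELECT = "http://localhost:8983/solr/select"

def debug_url(query, fields="name,score"):
    """Build a select URL that asks Solr to explain how each doc scored.

    "name" is a hypothetical field; "score" is Solr's score pseudo-field.
    """
    params = {"q": query, "fl": fields, "debugQuery": "true"}
    return SOLR_SELECT + "?" + urlencode(params)

url = debug_url('"My Wonderful Thing"~5')
print(url)
```

Fetching that URL should return the usual results plus a debug section 
with a score explanation per document, which is where the extra matches 
would show up.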

The order you are getting is probably because the variants are getting 
additional matches for some of the trigrams in the various names of 
people.  I don't see any specific cases in those contrived examples but 
for instance "My Wonderful Thing (for Raymond Fuller)" will match a 
basic query for the trigrams better than just "My Wonderful Thing" because 
"ond" and "ful" appear twice in the variant but only once each in the 
canonical.  (this is why real examples are critical)
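
A rough way to check this kind of overlap outside Solr is to count 
trigrams per name, sketched below in Python. It assumes lowercasing, 
stripping punctuation and generating trigrams within each word only; your 
actual analyzer chain may well differ (for example by producing grams 
that span whitespace), so treat the counts as indicative.

```python
from collections import Counter

def trigrams(text):
    """Count per-word character trigrams (assumed analysis, see above)."""
    counts = Counter()
    for word in text.lower().split():
        # Drop punctuation such as parentheses before generating grams.
        word = "".join(ch for ch in word if ch.isalnum())
        for i in range(len(word) - 2):
            counts[word[i:i + 3]] += 1
    return counts

canonical = trigrams("My Wonderful Thing")
variant = trigrams("My Wonderful Thing (for Raymond Fuller)")

# Trigrams from the canonical name that occur more often in the variant.
repeated = sorted(g for g, n in variant.items()
                  if canonical[g] > 0 and n > canonical[g])
print(repeated)
```

Any trigram listed in repeated contributes a higher term frequency for 
the variant than for the canonical name, which is enough to reorder 
results when the query is a plain OR of trigrams.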


-Hoss