Posted to solr-user@lucene.apache.org by Matt Patterson <ma...@reprocessed.org> on 2011/11/24 18:47:50 UTC

Attempting to achieve something similar to PostgreSQL's pg_trgm / K-NN combo with Solr

Hello,

I'm working on using trigrams for similarity matching on some data, where there's a canonical name and lots of personalised variants, e.g.:

canonical: "My Wonderful Thing"
variant: "My Wonderful Thing (for Matt Patterson)"

Using the pg_trgm index type (http://wiki.postgresql.org/wiki/What's_new_in_PostgreSQL_9.1#Extensions) and the K-Nearest-Neighbour operator in Postgres 9.1, I get pretty good results. I want to do something similar using Solr; for one thing, it feels like there's a lot more room to tweak and optimise this than with Postgres. Being new to Solr, I'm a little unsure about exactly what to do. I've set up a test Solr instance using a configuration like this: https://gist.github.com/1391468.

This is working, inasmuch as it's returning results, but the data set I'm working with is somewhat polluted, and even with regular manual cleaning it will probably always be a bit polluted. So, we have names in the data like:

"My Wonderful Thing"
"My Wonderful Thing (for Somebody Else)"
"My Wonderful Thing (for Yet Another Person)"

I really want the canonical version to be returned first in the results list, and the setup I have now is returning results like:

* "My Wonderful Thing (for Somebody Else)"
* "My Wonderful Thing (for Yet Another Person)"
* "My Wonderful Thing"
* "Other name with Wonderful or Thing in it"

With the Postgres pg_trgm index and <-> K-NN operator I get results like:

* "My Wonderful Thing"
* "My Wonderful Thing (for Somebody Else)"
* "My Wonderful Thing (for Yet Another Person)"
* "Other name with Wonderful or Thing in it"

Which is better, and I guess the difference is to do with the way that the distance between the search term and each result is calculated.
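
For reference, the distance that pg_trgm's <-> operator reports can be approximated outside Postgres. The following is only a rough sketch, assuming the documented pg_trgm behaviour (lower-case the text, split it into alphanumeric words, pad each word with two leading spaces and one trailing space, then compare the sets of trigrams). The function names are illustrative and are not part of Postgres or Solr.

```python
import re


def trigrams(text):
    """Return the set of padded, lower-cased trigrams for every word in text."""
    grams = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        # Assumption: two spaces before and one space after each word,
        # as described in the pg_trgm documentation.
        padded = "  " + word + " "
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def similarity(a, b):
    """Shared trigrams divided by all distinct trigrams (set overlap)."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def distance(a, b):
    """Approximation of the <-> operator: one minus the similarity."""
    return 1.0 - similarity(a, b)


names = [
    "My Wonderful Thing (for Somebody Else)",
    "My Wonderful Thing (for Yet Another Person)",
    "My Wonderful Thing",
    "Other name with Wonderful or Thing in it",
]
query = "My Wonderful Thing"

# Nearest first, which is what ORDER BY name <-> 'query' would do.
ranked = sorted(names, key=lambda name: distance(query, name))
for name in ranked:
    print(f"{distance(query, name):.3f}  {name}")
```

Because every extra trigram in a variant enlarges the union without enlarging the overlap, longer variants are pushed further away from the query under this measure, which is consistent with the canonical name coming first.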

So, is there something I can do to change the way ranking is calculated? Also, is there a good place to start reading about this kind of similarity searching and Solr? Everything I've looked at so far seems to cover this kind of n-gram approach very lightly at best.

Thanks,

Matt

Re: Attempting to achieve something similar to PostgreSQL's pg_trgm / K-NN combo with Solr

Posted by Chris Hostetter <ho...@fucit.org>.

: I'm working on using trigrams for similarity matching on some data, 
: where there's a canonical name and lots of personalised variants, e.g.:
: 
: canonical: "My Wonderful Thing"
: variant: "My Wonderful Thing (for Matt Patterson)"

I'm really not sure why you would need trigrams for something like this 
... just doing something basic like whitespace tokenization and using 
length norms should allow any of these queries...

  q=My Wonderful Thing       
     ...basic 3 clause term query
  q="My Wonderful Thing"
     ...strict phrase query
  q="My Wonderful Thing"~5
     ...sloppy phrase query, allowing other words mixed in

...to match both docs, with the canonical version scoring higher because 
of its length.
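
As a rough illustration of the length norm effect: the classic Lucene TF-IDF similarity computes the length norm of a field as 1/sqrt(number of terms). The sketch below assumes plain whitespace tokenization and deliberately ignores every other scoring factor (tf, idf, coord, boosts, and the lossy single-byte encoding Lucene uses when storing norms), so the numbers only show the direction of the effect. The helper name is made up for this example.

```python
import math


def length_norm(field_text):
    """Classic Lucene-style length norm: 1 / sqrt(number of terms)."""
    # Assumption: whitespace tokenization, so "(for" counts as one term.
    num_terms = len(field_text.split())
    return 1.0 / math.sqrt(num_terms)


canonical = "My Wonderful Thing"
variant = "My Wonderful Thing (for Somebody Else)"

# The shorter field gets the larger norm, so identical term matches
# contribute more to the canonical document's score.
print(f"{length_norm(canonical):.3f}  {canonical}")
print(f"{length_norm(variant):.3f}  {variant}")
```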

: I really want the canonical version to be returned first in the results 
: list, and the setup I have now is returning results like:
: 
: * "My Wonderful Thing (for Somebody Else)"
: * "My Wonderful Thing (for Yet Another Person)"
: * "My Wonderful Thing"

what exactly does your query look like?  are you doing a quoted phrase 
search? what does the debugQuery output tell you about those matches?

The order you are getting is probably because the variants are getting 
additional matches for some of the trigrams in the various names of 
people.  I don't see any specific cases in those contrived examples, but 
for instance "My Wonderful Thing (for Raymond Fuller)" will match a 
basic query for the trigrams better than just "My Wonderful Thing" because 
"ond" and "ful" appear twice in the variant but only once each in the 
canonical.  (this is why real examples are critical)
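
One way to check a case like this by hand is to count the trigrams in each name. This is only a sketch: it assumes lower-cased character trigrams taken within each word, which may not match the analyzer configured in the gist, and trigram_counts is a hypothetical helper, not a Solr API.

```python
import re
from collections import Counter


def trigram_counts(text):
    """Count lower-cased character trigrams within each word of text."""
    counts = Counter()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        counts.update(word[i:i + 3] for i in range(len(word) - 2))
    return counts


canonical = trigram_counts("My Wonderful Thing")
variant = trigram_counts("My Wonderful Thing (for Raymond Fuller)")

# Query trigrams that occur more often in the variant than in the
# canonical name; these raise the variant's term frequency for the query.
for gram in sorted(canonical):
    if variant[gram] > canonical[gram]:
        print(gram, canonical[gram], variant[gram])
```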


-Hoss