Posted to solr-user@lucene.apache.org by Ryszard Szopa <ry...@gmail.com> on 2009/08/20 21:01:45 UTC

solr and approximate string matching

Hi,

I've been using Solr for some time in the simplest possible way (as a
backend to a search engine for English documents) and I've been really
happy about it. However, now I need to do something which is a bit
non-standard, and unfortunately I am desperately stuck. To make things
more complicated, I am using Solr in a Django application through
Haystack [http://haystacksearch.org], but I am pretty sure that
there's no funny business going on between Haystack and Solr.

So, we have a database of movies and series, and as the data comes
from many sources of varying reliability, we'd like to be able to do
fuzzy string matching on the titles of episodes (the default matching
mechanisms operate at the word level, which is not good enough for short
strings, like titles). I had used n-gram approximate matching in the
past, and I was very happy to find that Lucene (and Solr) supports
something like this out of the box.

I assumed that I need a special field type for this, so I added the
following field-type to my schema.xml:

   <fieldType
       name="trigrams"
       stored="true"
       class="solr.StrField">
     <analyzer type="index">
       <tokenizer
           class="solr.analysis.NGramTokenizerFactory"
           minGramSize="3"
           maxGramSize="5"
           />
       <filter class="solr.LowerCaseFilterFactory"/>
     </analyzer>
   </fieldType>

and changed the appropriate field in the schema to:

<field name="title" type="trigrams" indexed="true" stored="true"
multiValued="false" />

However, this is not working as I expected. The query analysis looks
correct, but I don't get any results, which makes me believe that
something goes wrong at index time (i.e. the title is indexed like a
default string field instead of a trigram field).

Moreover, I would like to be able to do something more. I'd like to
lowercase the string, remove all punctuation marks and spaces, remove
English stopwords and THEN change the string into trigrams. However,
the filters are applied only after the string has been tokenized...
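
For illustration, the ordering described above (normalize first, split
into n-grams last) might be sketched roughly as follows. This is an
untested sketch, not a confirmed fix: the name "title_ngrams" is
hypothetical, and it assumes that solr.KeywordTokenizerFactory,
solr.PatternReplaceFilterFactory and solr.NGramFilterFactory are
available in the Solr version in use. It does not cover stopword
removal, and the pattern shown would also strip non-ASCII letters.

   <fieldType name="title_ngrams" class="solr.TextField">
     <!-- no type attribute: the same chain is used at index and query time -->
     <analyzer>
       <!-- keep the whole title as a single token so that the
            filters below see the complete string -->
       <tokenizer class="solr.KeywordTokenizerFactory"/>
       <filter class="solr.LowerCaseFilterFactory"/>
       <!-- drop everything that is not a lowercase ASCII letter or
            digit, i.e. punctuation and spaces -->
       <filter class="solr.PatternReplaceFilterFactory"
               pattern="[^a-z0-9]" replacement="" replace="all"/>
       <!-- only now cut the normalized string into trigrams -->
       <filter class="solr.NGramFilterFactory"
               minGramSize="3" maxGramSize="3"/>
     </analyzer>
   </fieldType>

The idea is that the n-gram step is a filter rather than the tokenizer,
so it can come after the normalization steps.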

Could you please suggest a solution to this problem?

Thanks in advance for your answers.

 -- Ryszard Szopa

-- 
http://gryziemy.net
http://robimy.net

Re: solr and approximate string matching

Posted by Ryszard Szopa <ry...@gmail.com>.
Hi,

On Sun, Aug 30, 2009 at 9:32 PM, Shalin Shekhar
Mangar <sh...@gmail.com> wrote:

> The best way to debug this kind of problem is to look at analysis.jsp
> and/or use debugQuery=on on the query to see exactly how it is being parsed.
>
> Can you post the output of your query with debugQuery=on?

Thanks a lot for your answer. Fortunately, I've managed to deal with
the problem myself, and it turned out to be mostly unrelated to
the schema. I was using AND as the default operator, and that didn't
play nicely with n-grams.
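
To spell out why: with AND, every n-gram produced from the query has to
be present in the indexed title, so a single wrong character, which
changes every n-gram that contains it, is enough to rule a document
out. With OR, a document only needs to share some of the n-grams, and
the more it shares, the higher it scores. As a rough sketch, and
assuming the default operator is set in schema.xml rather than
somewhere else in the stack, the relevant line would look like this:

   <!-- in schema.xml: match documents that share any of the query's
        n-grams instead of requiring all of them -->
   <solrQueryParser defaultOperator="OR"/>

The operator can also be overridden for a single request with the
q.op=OR parameter, which leaves the default for other fields untouched.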

  -- RS


Re: solr and approximate string matching

Posted by Shalin Shekhar Mangar <sh...@gmail.com>.
On Fri, Aug 21, 2009 at 12:31 AM, Ryszard Szopa <ry...@gmail.com> wrote:

>
> So, we have a database of movies and series, and as the data comes
> from many sources of varying reliability, we'd like to be able to do
> fuzzy string matching on the titles of episodes (the default matching
> mechanisms operate at the word level, which is not good enough for short
> strings, like titles). I had used n-gram approximate matching in the
> past, and I was very happy to find that Lucene (and Solr) supports
> something like this out of the box.
>
> I assumed that I need a special field type for this, so I added the
> following field-type to my schema.xml:
>
>   <fieldType
>       name="trigrams"
>       stored="true"
>       class="solr.StrField">
>     <analyzer type="index">
>       <tokenizer
>           class="solr.analysis.NGramTokenizerFactory"
>           minGramSize="3"
>           maxGramSize="5"
>           />
>       <filter class="solr.LowerCaseFilterFactory"/>
>     </analyzer>
>   </fieldType>
>
> and changed the appropriate field in the schema to:
>
> <field name="title" type="trigrams" indexed="true" stored="true"
> multiValued="false" />
>
> However, this is not working as I expected. The query analysis looks
> correct, but I don't get any results, which makes me believe that
> something goes wrong at index time (i.e. the title is indexed like a
> default string field instead of a trigram field).
>

The best way to debug this kind of problem is to look at analysis.jsp
and/or use debugQuery=on on the query to see exactly how it is being parsed.

Can you post the output of your query with debugQuery=on?

-- 
Regards,
Shalin Shekhar Mangar.