Posted to solr-user@lucene.apache.org by Ryan Wilson <rp...@gmail.com> on 2013/05/14 20:49:01 UTC

Strange fuzzy behavior in 4.2.1

Hello all,

I am trying to determine the cause of some odd behaviour when performing
fuzzy queries in Solr 4.2.1. I have a field that is configured as follows:

<field type="textSomeField" indexed="true" stored="false"
multiValued="false" name="stuff" />

<fieldType name="textSomeField" omitTermFreqAndPositions="false"
omitNorms="true" termVectors="false" termPositions="false"
termOffsets="false" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
            <tokenizer class="solr.WhitespaceTokenizerFactory"/>
            <filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1" preserveOriginal="1" generateNumberParts="0"
catenateWords="1" catenateNumbers="1" catenateAll="0"/>
            <filter class="solr.LowerCaseFilterFactory"/>
            <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
    <analyzer type="query">
            <tokenizer class="solr.WhitespaceTokenizerFactory"/>
            <filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1" preserveOriginal="1" generateNumberParts="0"
catenateWords="1" catenateNumbers="1" catenateAll="0"/>
            <filter class="solr.LowerCaseFilterFactory"/>
            <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
</fieldType>

Fuzzy searches on this field (and others) get some darned weird results.
For example, the names julie, julia, julian, julio, and juliar are indexed.
The following occurs:

stuff:(julia~1) - only finds julia
stuff:(julie~1) - finds julia and julie
stuff:(julian~1) - only finds julian
stuff:(julin~1) - finds julian, julia, julie, etc
stuff:(juliz~1) - finds julia, julio, julie, etc

This is one of the simple examples of the behaviour we are seeing. I will
happily provide more if necessary.

My question is: why exactly am I getting these results from fuzzy queries?
My understanding is that fuzzy matching is based on the Levenshtein distance
between the query term and each indexed term. Therefore julia, julie, and
julio should each return the others' names with an edit distance of 1, yet
that is definitely not the behaviour I am observing. I am uncertain whether
I have done something wrong with the indexing or the querying, or am simply
misunderstanding how fuzzy matching works. Any help or clarification would
be appreciated.
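
To make that expectation concrete, here is a minimal Python sketch of plain
Levenshtein distance applied to the five indexed names. It is not Solr's or
Lucene's implementation: the function names are made up for illustration, it
ignores transpositions, and it assumes the indexed terms are exactly the
lowercase names listed above. It only shows which terms lie within one edit
of each query term.

```python
def levenshtein(a, b):
    """Plain edit distance: insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or keep)
        prev = curr
    return prev[-1]


# Assumption: these are the exact terms in the index, already lowercased.
INDEXED = ["julie", "julia", "julian", "julio", "juliar"]


def expected_matches(query, max_edits=1):
    """Indexed terms within max_edits of the query (hypothetical helper)."""
    return [t for t in INDEXED if levenshtein(query, t) <= max_edits]


for q in ["julia", "julie", "julian", "julin", "juliz"]:
    print(q, "->", expected_matches(q))
```

Under these assumptions julia~1 would be expected to match all five names,
and julian~1 would be expected to match julia and juliar as well as julian,
which is not what the queries above return.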

Regards,
Ryan Wilson
rpwilson1@gmail.com

Re: Strange fuzzy behavior in 4.2.1

Posted by Jack Krupansky <ja...@basetechnology.com>.

Any chance you may have had a different analyzer or parameter values when 
you indexed compared to now? Like, maybe the data wasn't originally indexed 
as lower case?
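
As a rough illustration of why the casing would matter, the Python sketch
below computes a case-sensitive edit distance. The capitalized terms are
hypothetical index contents, not something known about this index, and the
function is a generic Levenshtein implementation rather than Lucene's. It
only shows that a case difference uses up one edit on its own.

```python
def edit_distance(a, b):
    """Case-sensitive Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,
                            curr[j - 1] + 1,
                            prev[j - 1] + (0 if ca == cb else 1)))
        prev = curr
    return prev[-1]


query = "julia"
# Hypothetical: terms as they might look if indexed without lowercasing.
for term in ["Julia", "Julie", "julia", "julie"]:
    print(query, term, edit_distance(query, term))
```

With a budget of one edit, the lowercase query would still reach "Julia" but
no longer "Julie", because the capital letter already counts as one
substitution.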

Or, that maybe some of the term occurrences have adjacent punctuation 
(comma, period, parentheses, etc.) that the word delimiter filter normally 
removes, but that filter won't be called when analysis is performed for a 
fuzzy query (or wildcard)?

For example, "(Julie)", "Julie," or "Julie." would not match.

-- Jack Krupansky
