Posted to java-user@lucene.apache.org by Paul Taylor <pa...@fastmail.fm> on 2012/02/01 13:47:23 UTC

Re: Does Fuzzy Search scores the same as Exact Match

On 28/01/2012 11:22, Uwe Schindler wrote:
>>>>> -----Original Message-----
>>>>> From: Paul Taylor [mailto:paul_t100@fastmail.fm]
>>>>> Sent: Saturday, January 28, 2012 10:33 AM
>>>>> To: 'java-user@lucene.apache.org'
>>>>> Subject: Does Fuzzy Search scores the same as Exact Match
>>>>>
>>>>> All things being equal, does a fuzzy match give the same score as an
>>>>> exact match?
>>>>> i.e. if I do a search for farmin and it matches two docs, one on term
>>>>> farmin, the other on term farming, will it score farming higher or
>>>>> score both the same?
>>>>
>>>> YES, depends on the Fuzzy configuration (rewrite method,...), but
>>>> the default does so!
>>>>
>>>> Uwe
>>>>
>>>>
>>> So how do I change it, seems like a funny default to have.
>> Maybe I was not clear, it should score "farming" higher than "farmin" by
>> default, but the default rewrite mode also takes TF/IDF into account (in
>> addition).
> Maybe there was some confusion in your original question, to make it clear:
> If you search for "farming", "farming" (exact match) should score higher
> than "farmin" (distance 1). With default rewrite mode this is correct for
> boosting, but if a typo is more unlikely in the corpus, then based on TF-IDF
> the score can still be different. You can prohibit that by using the right
> rewrite mode that *only* takes Levenshtein distance as inverse boost and does
> not use TF-IDF => http://goo.gl/0eJ47
>
>> You can change that by a different rewrite method:
>>
>> The default is: http://goo.gl/JhHOA (which combines the standard vector
>> model with additionally boosting exact matches - we have that for backwards
>> compatibility only, it's not what most users expect)
>>
>> The better one is: http://goo.gl/0eJ47, which does not take TF/IDF into
>> account and only boosts by Levenshtein distance.
>>
>> You can disable fuzzy boosting altogether:
>> Additionally http://goo.gl/VWlkW provides two other scoring models (TF/IDF
>> only, no boosting - or constant score at all)
>>
>> Uwe
>>
>>
Hi

I tried the rewrite method you suggested for the fuzzy query,
new MultiTermQuery.TopTermsBoostOnlyBooleanQueryRewrite(100). It ignores
the query-side idf, which makes sense because rare query terms then
aren't boosted. But it also ignores the idf and fieldNorm of the
matching document, which seems wrong to me because those still seem
relevant. The end result is that I get a lot of identical scores once I
normalize them, and a document that matches one term in a two-term field
scores no better than a document that matches one term in a three-term
field, which doesn't seem right.
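
For reference, a minimal sketch of the length effect described above,
assuming the DefaultSimilarity behaviour of Lucene 3.x, where the length
norm is 1/sqrt(number of terms in the field) before it is compressed
into a single byte. This is plain Java with no Lucene dependency; the
class and method names are made up for illustration.

```java
// Hypothetical helper, not part of Lucene: mirrors the 1/sqrt(numTerms)
// length normalisation (field boost assumed to be 1.0, byte encoding ignored).
final class LengthNormSketch {

    static float lengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    public static void main(String[] args) {
        float twoTermField = lengthNorm(2);   // roughly 0.707
        float threeTermField = lengthNorm(3); // roughly 0.577
        System.out.println("two-term field norm:   " + twoTermField);
        System.out.println("three-term field norm: " + threeTermField);
        // When fieldNorm is part of the score the shorter field wins;
        // with a constant score both documents tie.
        System.out.println("shorter field scores higher: " + (twoTermField > threeTermField));
    }
}
```

The norm actually stored in the index is lossy, so the values seen in
explain output are coarser than these, but the ordering is the same.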

In contrast, when I keep the default rewrite I get a better spread of
scores, but unfortunately what clearly seems to be the best document
doesn't always score highest because of the query idf problem.

Isn't there a way to get something in between these two extremes: keep
the fieldWeight part of the calculation that you get with the default,
but multiply it by the ConstantScore value instead of queryWeight?
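
One way to read that request, written out as arithmetic only. The sketch
below is plain Java, not Lucene code, and every name in it is
hypothetical. It assumes the explain() decomposition queryWeight =
boost * idf * queryNorm and fieldWeight = tf * idf * fieldNorm, and
compares the boost-only score, the default score, and the in-between
variant for one term matched in two fields of different length.

```java
// Plain-Java sketch of the scoring arithmetic; not a Lucene API.
// All class and method names are hypothetical.
final class InBetweenScoreSketch {

    // Document-side part as shown by explain(): tf * idf * fieldNorm
    static float fieldWeight(float tf, float idf, float fieldNorm) {
        return tf * idf * fieldNorm;
    }

    // Boost-only rewrite: boost * queryNorm, nothing from the document
    static float constantScore(float boost, float queryNorm) {
        return boost * queryNorm;
    }

    // Default rewrite: queryWeight (boost * idf * queryNorm) * fieldWeight
    static float defaultScore(float boost, float queryNorm, float tf, float idf, float fieldNorm) {
        return (boost * idf * queryNorm) * fieldWeight(tf, idf, fieldNorm);
    }

    // In-between variant: constant-score query part * fieldWeight
    static float inBetweenScore(float boost, float queryNorm, float tf, float idf, float fieldNorm) {
        return constantScore(boost, queryNorm) * fieldWeight(tf, idf, fieldNorm);
    }

    public static void main(String[] args) {
        float boost = 0.8f, queryNorm = 0.42322022f, tf = 1.0f, idf = 1.6931472f;
        float shortField = 1.0f, longField = 0.5f; // assumed fieldNorm values

        System.out.println("constant:   " + constantScore(boost, queryNorm)
                + " for both documents");
        System.out.println("in between: " + inBetweenScore(boost, queryNorm, tf, idf, shortField)
                + " vs " + inBetweenScore(boost, queryNorm, tf, idf, longField));
        System.out.println("default:    " + defaultScore(boost, queryNorm, tf, idf, shortField)
                + " vs " + defaultScore(boost, queryNorm, tf, idf, longField));
    }
}
```

In this variant idf still enters once, through fieldWeight, so it only
removes the query-side idf factor rather than idf altogether.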

I have included some example explain output below. The original search
is for 'República'. From that I construct a disjunction query over two
fields (artist and sortname), and for each field I create a fuzzy query
and a wildcard query (the wildcard is not relevant to this question).
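
The per-term boosts in the explain output (0.6, 0.8 and 0.62222224 for
sortname) appear consistent with the usual fuzzy scaling: similarity =
1 - editDistance / min(length), rescaled by the minimum similarity of
0.5 and multiplied by the field boost. The following is a plain-Java
sketch of that arithmetic under this assumption; it does not call
Lucene, and the class and method names are hypothetical.

```java
// Plain-Java sketch; names are hypothetical and not part of Lucene.
final class FuzzyBoostSketch {

    // Classic two-row Levenshtein distance
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Assumed similarity: 1 - distance / length of the shorter term
    static float similarity(String queryTerm, String indexTerm) {
        int shorter = Math.min(queryTerm.length(), indexTerm.length());
        return 1.0f - (float) editDistance(queryTerm, indexTerm) / shorter;
    }

    // Rescale so minSimilarity maps to 0 and an exact match to 1, then apply the field boost
    static float fuzzyBoost(String queryTerm, String indexTerm, float minSimilarity, float fieldBoost) {
        float scaled = (similarity(queryTerm, indexTerm) - minSimilarity) / (1.0f - minSimilarity);
        return scaled * fieldBoost;
    }

    public static void main(String[] args) {
        float minSimilarity = 0.5f; // FUZZY_SIMILARITY in the parser
        float sortnameBoost = 0.8f; // field boost implied by the explain output
        for (String term : new String[] {"republic", "republica", "republice"}) {
            System.out.println(term + " -> " + fuzzyBoost("republica", term, minSimilarity, sortnameBoost));
        }
        // Wildcard clause: field boost * WILDCARD_BOOST_REDUCER, about 0.64
        System.out.println("republica* -> " + sortnameBoost * 0.8f);
    }
}
```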

With New rewrite method:
DocNo:1:0.87149507:22222222-1cf0-4d1f-aca7-2a6f89e34b36:0.7922682 = 
(MATCH) custom((() | () | (ConstantScore(sortname:republic)^0.6 
ConstantScore(sortname:republica)^0.8 
ConstantScore(sortname:republice)^0.62222224) | 
ConstantScore(sortname:republica*^0.64000005)^0.64000005 | 
(ConstantScore(artist:republic)^1.2 ConstantScore(artist:republica)^1.6 
ConstantScore(artist:republice)^1.2444445) | 
ConstantScore(artist:republica*^1.2800001)^1.2800001)~0.1), product of:
   0.7922682 = (MATCH) max plus 0.1 times others of:
     0.33857617 = (MATCH) sum of:
       0.33857617 = (MATCH) ConstantScore(sortname:republica)^0.8, product of:
         0.8 = boost
         0.42322022 = queryNorm
     0.27086097 = (MATCH) ConstantScore(sortname:republica*^0.64000005)^0.64000005, product of:
       0.64000005 = boost
       0.42322022 = queryNorm
     0.67715234 = (MATCH) sum of:
       0.67715234 = (MATCH) ConstantScore(artist:republica)^1.6, product of:
         1.6 = boost
         0.42322022 = queryNorm
     0.54172194 = (MATCH) ConstantScore(artist:republica*^1.2800001)^1.2800001, product of:
       1.2800001 = boost
       0.42322022 = queryNorm
   1.0 = queryBoost
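
The numbers in this explain output can be reproduced by hand: each
ConstantScore clause is boost * queryNorm, and the disjunction takes
the maximum clause score plus 0.1 (the tie breaker) times the sum of
the others. A small plain-Java check of that arithmetic follows; the
class and method names are made up, and nothing here calls Lucene.

```java
// Plain-Java check of the explain arithmetic; names are hypothetical.
final class DisMaxExplainSketch {

    // ConstantScore clause: boost * queryNorm
    static float constantScore(float boost, float queryNorm) {
        return boost * queryNorm;
    }

    // "max plus tie times others": max + tie * (sum - max)
    static float disMax(float tie, float... clauseScores) {
        float max = 0f;
        float sum = 0f;
        for (float s : clauseScores) {
            sum += s;
            max = Math.max(max, s);
        }
        return max + tie * (sum - max);
    }

    public static void main(String[] args) {
        float queryNorm = 0.42322022f;
        float sortnameFuzzy = constantScore(0.8f, queryNorm);
        float sortnameWildcard = constantScore(0.64000005f, queryNorm);
        float artistFuzzy = constantScore(1.6f, queryNorm);
        float artistWildcard = constantScore(1.2800001f, queryNorm);

        // Should come out close to the 0.7922682 reported by explain()
        System.out.println(disMax(0.1f, sortnameFuzzy, sortnameWildcard, artistFuzzy, artistWildcard));
    }
}
```

Nothing in this calculation depends on the matching document, which is
why every document matching the same terms ends up with the same score.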

With Default Rewrite Method:
DocNo:1:1.2145596:22222222-1cf0-4d1f-aca7-2a6f89e34b36:1.104145 = 
(MATCH) custom((() | () | (sortname:republic^0.6 sortname:republica^0.8 
sortname:republice^0.62222224) | 
ConstantScore(sortname:republica*^0.64000005)^0.64000005 | 
(artist:republic^1.2 artist:republica^1.6 artist:republice^1.2444445) | 
ConstantScore(artist:republica*^1.2800001)^1.2800001)~0.1), product of:
   1.104145 = (MATCH) max plus 0.1 times others of:
     0.5056261 = (MATCH) sum of:
       0.5056261 = (MATCH) weight(sortname:republica^0.8 in 1), product of:
         0.29863092 = queryWeight(sortname:republica^0.8), product of:
           0.8 = boost
           1.6931472 = idf(docFreq=2, maxDocs=6)
           0.22047028 = queryNorm
         1.6931472 = (MATCH) fieldWeight(sortname:republica in 1), product of:
           1.0 = tf(termFreq(sortname:republica)=1)
           1.6931472 = idf(docFreq=2, maxDocs=6)
           1.0 = fieldNorm(field=sortname, doc=1)
     0.14110099 = (MATCH) ConstantScore(sortname:republica*^0.64000005)^0.64000005, product of:
       0.64000005 = boost
       0.22047028 = queryNorm
     1.0112522 = (MATCH) sum of:
       1.0112522 = (MATCH) weight(artist:republica^1.6 in 1), product of:
         0.59726185 = queryWeight(artist:republica^1.6), product of:
           1.6 = boost
           1.6931472 = idf(docFreq=2, maxDocs=6)
           0.22047028 = queryNorm
         1.6931472 = (MATCH) fieldWeight(artist:republica in 1), product of:
           1.0 = tf(termFreq(artist:republica)=1)
           1.6931472 = idf(docFreq=2, maxDocs=6)
           1.0 = fieldNorm(field=artist, doc=1)
     0.28220198 = (MATCH) ConstantScore(artist:republica*^1.2800001)^1.2800001, product of:
       1.2800001 = boost
       0.22047028 = queryNorm
   1.0 = queryBoost
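
For comparison, the default-rewrite numbers follow from idf =
1 + ln(maxDocs / (docFreq + 1)), which gives the 1.6931472 shown for
docFreq=2 and maxDocs=6, and from weight = queryWeight * fieldWeight,
so idf enters the score twice. The plain-Java sketch below reproduces
the sortname:republica clause and then shows, with a hypothetical
variant that occurs in only one document, how a rarer fuzzy term can
outscore the exact match. Names are made up; this is not Lucene code.

```java
// Plain-Java check of the TF/IDF explain arithmetic; names are hypothetical.
final class TfIdfExplainSketch {

    // idf as it appears in the explain output: 1 + ln(maxDocs / (docFreq + 1))
    static float idf(int docFreq, int maxDocs) {
        return (float) (Math.log(maxDocs / (double) (docFreq + 1)) + 1.0);
    }

    // weight = queryWeight * fieldWeight, so idf is applied twice
    static float weight(float boost, float idf, float queryNorm, float tf, float fieldNorm) {
        float queryWeight = boost * idf * queryNorm;
        float fieldWeight = tf * idf * fieldNorm;
        return queryWeight * fieldWeight;
    }

    public static void main(String[] args) {
        float queryNorm = 0.22047028f;

        // Exact match from the explain output: boost 0.8, docFreq=2, maxDocs=6
        float exact = weight(0.8f, idf(2, 6), queryNorm, 1.0f, 1.0f);
        System.out.println("exact match:  " + exact); // close to 0.5056261

        // Hypothetical fuzzy variant: lower boost 0.6 but docFreq=1
        float rareVariant = weight(0.6f, idf(1, 6), queryNorm, 1.0f, 1.0f);
        System.out.println("rare variant: " + rareVariant);
        System.out.println("rare variant wins: " + (rareVariant > exact));
    }
}
```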

This is my queryParser Code

package org.musicbrainz.search.servlet;

import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.*;
import org.musicbrainz.search.LuceneVersion;

import java.util.HashMap;
import java.util.Map;

public class DismaxQueryParser {

     public static final String IMPOSSIBLE_FIELD_NAME = "\uFFFC\uFFFC\uFFFC";
     private DisjunctionQueryParser dqp;

     public DismaxQueryParser(org.apache.lucene.analysis.Analyzer analyzer) {
         dqp = new DisjunctionQueryParser(IMPOSSIBLE_FIELD_NAME, analyzer);
     }

     public Query parse(String query) throws org.apache.lucene.queryParser.ParseException {

         //Parse the input once as plain terms and once as a quoted phrase
         Query q0 = dqp.parse(DismaxQueryParser.IMPOSSIBLE_FIELD_NAME + ":(" + query + ")");
         Query phrase = dqp.parse(DismaxQueryParser.IMPOSSIBLE_FIELD_NAME + ":\"" + query + "\"");
         if (phrase instanceof DisjunctionMaxQuery) {
             BooleanQuery bq = new BooleanQuery(true);
             bq.add(q0, BooleanClause.Occur.MUST);
             bq.add(phrase, BooleanClause.Occur.SHOULD);
             return bq;
         }
         else {
             return q0;
         }

     }

     public void addAlias(String field, DismaxAlias dismaxAlias) {
         dqp.addAlias(field, dismaxAlias);
     }

     static class DisjunctionQueryParser extends QueryParser {

         //Only make terms that are this length fuzzy
         private static final int MIN_FIELD_LENGTH_TO_MAKE_FUZZY = 4;
         private static final float FUZZY_SIMILARITY = 0.5f;

         //Reduce boost of wildcard matches compared to fuzzy /exact matches
         private static final float WILDCARD_BOOST_REDUCER = 0.8f;

         public DisjunctionQueryParser(String defaultField, org.apache.lucene.analysis.Analyzer analyzer) {
             super(LuceneVersion.LUCENE_VERSION, defaultField, analyzer);

         }


         protected Map<String, DismaxAlias> aliases = new HashMap<String, DismaxAlias>(3);

         //Field to DismaxAlias
         public void addAlias(String field, DismaxAlias dismaxAlias) {
             aliases.put(field, dismaxAlias);
         }

         protected Query getFuzzyQuery(String field, String termStr, float minSimilarity)
                 throws org.apache.lucene.queryParser.ParseException {
             FuzzyQuery fq = (FuzzyQuery) super.getFuzzyQuery(field, termStr, minSimilarity);
             //So that fuzzy query terms do not get an advantage over exact matches
             //just because the query term is rarer
             //fq.setRewriteMethod(new MultiTermQuery.TopTermsBoostOnlyBooleanQueryRewrite(100));
             return fq;
         }

         protected Query getFieldQuery(String field, String queryText, boolean quoted)
                 throws org.apache.lucene.queryParser.ParseException {
             //If field is an alias
             if (aliases.containsKey(field)) {
                 DismaxAlias a = aliases.get(field);
                 DisjunctionMaxQuery q = new DisjunctionMaxQuery(a.getTie());
                 boolean ok = false;

                 for (String f : a.getFields().keySet()) {

                     //if query can be created for this field and text
                     Query querySub = getFieldQuery(f, queryText, quoted);
                     Query queryWildcard = null;

                     //Only an unquoted single term of sufficient length is made fuzzy; the
                     //instanceof check guards the cast when analysis yields no term or several
                     if (!quoted && queryText.length() >= MIN_FIELD_LENGTH_TO_MAKE_FUZZY
                             && querySub instanceof TermQuery) {
                         TermQuery tq = (TermQuery) querySub;
                         queryWildcard = getWildcardQuery(tq.getTerm().field(), tq.getTerm().text() + '*');
                         querySub = getFuzzyQuery(tq.getTerm().field(), tq.getTerm().text(), FUZZY_SIMILARITY);
                     }

                     if (querySub != null) {
                         //if query was quoted but doesn't generate a phrase query we reject it
                         if (!quoted || querySub instanceof PhraseQuery) {
                             //Reduce phrase boost because it will have matched both parts,
                             //giving far too much score differential
                             if (quoted) {
                                 querySub.setBoost(0.1f);
                             }
                             //Boost as specified
                             else if (a.getFields().get(f) != null) {
                                 querySub.setBoost(a.getFields().get(f));
                             }
                             q.add(querySub);
                             ok = true;
                         }
                     }

                     if (queryWildcard != null) {
                         if (a.getFields().get(f) != null) {
                             queryWildcard.setBoost(a.getFields().get(f) * WILDCARD_BOOST_REDUCER);
                         }
                         q.add(queryWildcard);
                     }
                 }
                 //Something has been added to disjunction query
                 return ok ? q : null;

             } else {
                 //usual Field
                 try {
                     return super.getFieldQuery(field, queryText, quoted);
                 } catch (Exception e) {
                     return null;
                 }
             }
         }
     }

     static class DismaxAlias {
         public DismaxAlias() {

         }

         private float tie;
         //Field Boosts
         private Map<String, Float> fields;

         public float getTie() {
             return tie;
         }

         public void setTie(float tie) {
             this.tie = tie;
         }

         public Map<String, Float> getFields() {
             return fields;
         }

         public void setFields(Map<String, Float> fields) {
             this.fields = fields;
         }
     }
}

Thanks for any help,
Paul
