Posted to java-user@lucene.apache.org by Paul Taylor <pa...@fastmail.fm> on 2012/02/01 13:47:23 UTC
Re: Does Fuzzy Search scores the same as Exact Match
On 28/01/2012 11:22, Uwe Schindler wrote:
>>>>> -----Original Message-----
>>>>> From: Paul Taylor [mailto:paul_t100@fastmail.fm]
>>>>> Sent: Saturday, January 28, 2012 10:33 AM
>>>>> To: 'java-user@lucene.apache.org'
>>>>> Subject: Does Fuzzy Search scores the same as Exact Match
>>>>>
>>>>> All things being equal does a fuzzy match give the same score as an
>>>>> exact match.
>>>>> i.e if I do a search for farmin and it matches two docs one on term
>>>> farmin, the
>>>>> other on term farming, will it score farming higher or score both
>>>>> the same
>>>> ?
>>>>
>>>> YES, depends on the Fuzzy configuration (rewrite method,...), but
>>>> the default does so!
>>>>
>>>> Uwe
>>>>
>>>>
>>> So how do I change it, seems like a funny default to have.
>> Maybe I was not clear, it should score "farming" higher than "farmin" by
>> default, but the default rewrite mode also takes TF/IDF into account (in
>> addition).
> Maybe there was some confusion in your original question, to make it clear:
> If you search for "farming", "farming" (exact match) should score higher
> than "farmin" (distance 1). With default rewrite mode this is correct for
> boosting, but if a typo is more unlikely in the corpus, then based on TF-IDF
> the score can still be different. You can prohibit that by using the right
> rewrite mode that *only* takes levensthein distance as inverse boost and not
> use TF-IDF => http://goo.gl/0eJ47
>
>> You can change that by a different rewrite method:
>>
>> The default is: http://goo.gl/JhHOA (which combines the standard vector
> model
>> with additionally boosting exact matches - we have that for backwards
>> compatibility only, its not what most users expect)
>>
>> The better one is: http://goo.gl/0eJ47, which does not take TF/IDF into
> account
>> and only boosts by levensthein distance.
>>
>> You can disable fuzzy boosting altogether:
>> Additionally http://goo.gl/VWlkW provides two other scoring models (TF/IDF
>> only, no boosting - or constant score at all)
>>
>> Uwe
>>
>>
Hi
Using the rewrite method you suggested for the fuzzy query, new
MultiTermQuery.TopTermsBoostOnlyBooleanQueryRewrite(100), the query idf
is no longer considered, which makes sense because rare query terms are
then not boosted. But it also ignores the idf and field norm of the
matching document, and that seems wrong because these still seem
relevant. The end result is that I get a lot of identical scores when I
normalize the scores, and a match on one term in a two-term field scores
no better than a match on one term in a three-term field, which doesn't
seem right.
In contrast, when I don't change the rewrite I get a better spread of
scores, but unfortunately what clearly seems to be the best document
doesn't always rank first because of the query idf problem.
Isn't there a way to get something in between these two extremes: keep
the fieldWeight part of the calculation that you get with the default,
but multiply it by the ConstantScore value instead of queryWeight?
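To make the request concrete, here is a minimal arithmetic sketch of the two scoring shapes being compared. It is plain Java, not Lucene API: the class and method names are hypothetical, the boost, queryNorm and idf values are borrowed from the explain output in this message, and the length norm is assumed to be the DefaultSimilarity-style 1/sqrt(number of terms) before Lucene's lossy one-byte encoding.

```java
public class InBetweenScoreSketch {

    // Assumed DefaultSimilarity-style length norm: 1 / sqrt(terms in field).
    // Real indexes store this in one lossy byte, so actual values are coarser.
    static double lengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    // Boost-only rewrite: the score depends only on boost and queryNorm,
    // so every document matching the same term gets the same value.
    static double constantScore(double boost, double queryNorm) {
        return boost * queryNorm;
    }

    // The "in between" idea from the question: keep the document-side
    // fieldWeight (tf * idf * fieldNorm) but drop the query-side idf.
    static double inBetweenScore(double boost, double queryNorm,
                                 double tf, double idf, double fieldNorm) {
        return constantScore(boost, queryNorm) * tf * idf * fieldNorm;
    }

    public static void main(String[] args) {
        double boost = 1.6;            // artist:republica boost from the explain output
        double queryNorm = 0.42322022; // queryNorm from the boost-only explain output
        double tf = 1.0;               // term occurs once in the field
        double idf = 1.6931472;        // idf from the default explain output

        double twoTermField = lengthNorm(2);   // hypothetical two-term field
        double threeTermField = lengthNorm(3); // hypothetical three-term field

        System.out.println("constant score (any field length): "
                + constantScore(boost, queryNorm));
        System.out.println("in-between, two-term field:   "
                + inBetweenScore(boost, queryNorm, tf, idf, twoTermField));
        System.out.println("in-between, three-term field: "
                + inBetweenScore(boost, queryNorm, tf, idf, threeTermField));
    }
}
```

With the constant score both hypothetical documents tie, while the in-between form ranks the shorter field higher, which is the behaviour asked for above.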
I have some example explain output below.
The original search is for 'República'. From that I construct a
disjunction query over two fields (artist and sortname), and for each
field we create a fuzzy and a wildcard query (the wildcard is not
relevant to this question).
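The per-term boosts that appear in the explain output (0.6, 0.8, 0.62222224 for sortname) can be reproduced with the sketch below. The formula is my reading of how Lucene 3.x fuzzy queries scale similarity into a boost, with a prefix length of 0 assumed; it is not stated in this thread, and the class name is hypothetical. The field boosts of 0.8 (sortname) and 1.6 (artist) are inferred from the exact-match terms in the explain output.

```java
public class FuzzyBoostSketch {

    // Plain Levenshtein edit distance (insert, delete, substitute all cost 1).
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                        prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Assumed similarity: 1 - distance / length of the shorter term.
    static double similarity(String queryTerm, String indexedTerm) {
        int d = editDistance(queryTerm, indexedTerm);
        return 1.0 - (double) d / Math.min(queryTerm.length(), indexedTerm.length());
    }

    // Assumed boost: similarity rescaled from [minSimilarity, 1] to [0, 1],
    // then multiplied by the field boost.
    static double fuzzyBoost(String queryTerm, String indexedTerm,
                             double minSimilarity, double fieldBoost) {
        double scaled = (similarity(queryTerm, indexedTerm) - minSimilarity)
                / (1.0 - minSimilarity);
        return scaled * fieldBoost;
    }

    public static void main(String[] args) {
        double minSimilarity = 0.5; // FUZZY_SIMILARITY in the parser code
        double sortnameBoost = 0.8; // inferred from the explain output
        for (String term : new String[] {"republic", "republica", "republice"}) {
            System.out.println("sortname:" + term + " boost = "
                    + fuzzyBoost("republica", term, minSimilarity, sortnameBoost));
        }
    }
}
```

The wildcard boosts in the same output (0.64 and 1.28) are simply the field boost multiplied by WILDCARD_BOOST_REDUCER (0.8).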
With New rewrite method:
DocNo:1:0.87149507:22222222-1cf0-4d1f-aca7-2a6f89e34b36:0.7922682 =
(MATCH) custom((() | () | (ConstantScore(sortname:republic)^0.6
ConstantScore(sortname:republica)^0.8
ConstantScore(sortname:republice)^0.62222224) |
ConstantScore(sortname:republica*^0.64000005)^0.64000005 |
(ConstantScore(artist:republic)^1.2 ConstantScore(artist:republica)^1.6
ConstantScore(artist:republice)^1.2444445) |
ConstantScore(artist:republica*^1.2800001)^1.2800001)~0.1), product of:
  0.7922682 = (MATCH) max plus 0.1 times others of:
    0.33857617 = (MATCH) sum of:
      0.33857617 = (MATCH) ConstantScore(sortname:republica)^0.8, product of:
        0.8 = boost
        0.42322022 = queryNorm
    0.27086097 = (MATCH) ConstantScore(sortname:republica*^0.64000005)^0.64000005, product of:
      0.64000005 = boost
      0.42322022 = queryNorm
    0.67715234 = (MATCH) sum of:
      0.67715234 = (MATCH) ConstantScore(artist:republica)^1.6, product of:
        1.6 = boost
        0.42322022 = queryNorm
    0.54172194 = (MATCH) ConstantScore(artist:republica*^1.2800001)^1.2800001, product of:
      1.2800001 = boost
      0.42322022 = queryNorm
  1.0 = queryBoost
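The "max plus 0.1 times others" line in the explain output above is the disjunction-max combination with a tie-breaker of 0.1. As a quick check of the arithmetic, here is a small plain-Java sketch (the class and method names are hypothetical, not Lucene API) that recombines the four clause scores from that output.

```java
public class DisMaxCombineSketch {

    // Disjunction-max combination: the best clause score plus
    // tie * (sum of all the other clause scores).
    static double combine(double tie, double... clauseScores) {
        double max = 0.0;
        double sum = 0.0;
        for (double s : clauseScores) {
            max = Math.max(max, s);
            sum += s;
        }
        return max + tie * (sum - max);
    }

    public static void main(String[] args) {
        // Clause scores copied from the boost-only explain output:
        // sortname fuzzy, sortname wildcard, artist fuzzy, artist wildcard.
        double score = combine(0.1, 0.33857617, 0.27086097, 0.67715234, 0.54172194);
        System.out.println("combined score = " + score);
    }
}
```

Each clause score here is just boost * queryNorm (for example 1.6 * 0.42322022 for artist:republica), so any two documents matching the same terms end up with identical combined scores.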
With Default Rewrite Method:
DocNo:1:1.2145596:22222222-1cf0-4d1f-aca7-2a6f89e34b36:1.104145 =
(MATCH) custom((() | () | (sortname:republic^0.6 sortname:republica^0.8
sortname:republice^0.62222224) |
ConstantScore(sortname:republica*^0.64000005)^0.64000005 |
(artist:republic^1.2 artist:republica^1.6 artist:republice^1.2444445) |
ConstantScore(artist:republica*^1.2800001)^1.2800001)~0.1), product of:
  1.104145 = (MATCH) max plus 0.1 times others of:
    0.5056261 = (MATCH) sum of:
      0.5056261 = (MATCH) weight(sortname:republica^0.8 in 1), product of:
        0.29863092 = queryWeight(sortname:republica^0.8), product of:
          0.8 = boost
          1.6931472 = idf(docFreq=2, maxDocs=6)
          0.22047028 = queryNorm
        1.6931472 = (MATCH) fieldWeight(sortname:republica in 1), product of:
          1.0 = tf(termFreq(sortname:republica)=1)
          1.6931472 = idf(docFreq=2, maxDocs=6)
          1.0 = fieldNorm(field=sortname, doc=1)
    0.14110099 = (MATCH) ConstantScore(sortname:republica*^0.64000005)^0.64000005, product of:
      0.64000005 = boost
      0.22047028 = queryNorm
    1.0112522 = (MATCH) sum of:
      1.0112522 = (MATCH) weight(artist:republica^1.6 in 1), product of:
        0.59726185 = queryWeight(artist:republica^1.6), product of:
          1.6 = boost
          1.6931472 = idf(docFreq=2, maxDocs=6)
          0.22047028 = queryNorm
        1.6931472 = (MATCH) fieldWeight(artist:republica in 1), product of:
          1.0 = tf(termFreq(artist:republica)=1)
          1.6931472 = idf(docFreq=2, maxDocs=6)
          1.0 = fieldNorm(field=artist, doc=1)
    0.28220198 = (MATCH) ConstantScore(artist:republica*^1.2800001)^1.2800001, product of:
      1.2800001 = boost
      0.22047028 = queryNorm
  1.0 = queryBoost
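In the default explain output above, idf appears twice for each term: once inside queryWeight and once inside fieldWeight. That query-side factor is the "query idf problem" mentioned earlier. The plain-Java sketch below reproduces the numbers, assuming the DefaultSimilarity formulas idf = 1 + ln(maxDocs / (docFreq + 1)) and tf = sqrt(termFreq); the class and method names are hypothetical.

```java
public class TfIdfWeightSketch {

    // Assumed DefaultSimilarity idf: 1 + ln(maxDocs / (docFreq + 1)).
    static double idf(int docFreq, int maxDocs) {
        return 1.0 + Math.log((double) maxDocs / (docFreq + 1));
    }

    // Assumed DefaultSimilarity tf: square root of the term frequency.
    static double tf(int termFreq) {
        return Math.sqrt(termFreq);
    }

    // Query-side weight: boost * idf * queryNorm.
    static double queryWeight(double boost, double idf, double queryNorm) {
        return boost * idf * queryNorm;
    }

    // Document-side weight: tf * idf * fieldNorm.
    static double fieldWeight(double tf, double idf, double fieldNorm) {
        return tf * idf * fieldNorm;
    }

    // Term score is the product of the two, so idf is applied twice.
    static double termScore(double boost, double queryNorm, int termFreq,
                            int docFreq, int maxDocs, double fieldNorm) {
        double idf = idf(docFreq, maxDocs);
        return queryWeight(boost, idf, queryNorm)
                * fieldWeight(tf(termFreq), idf, fieldNorm);
    }

    public static void main(String[] args) {
        double queryNorm = 0.22047028; // from the default explain output
        System.out.println("idf(docFreq=2, maxDocs=6) = " + idf(2, 6));
        System.out.println("sortname:republica^0.8 = "
                + termScore(0.8, queryNorm, 1, 2, 6, 1.0));
        System.out.println("artist:republica^1.6   = "
                + termScore(1.6, queryNorm, 1, 2, 6, 1.0));
    }
}
```

Because the score grows with the square of idf, a rare misspelling that matches only fuzzily can outscore a common exact match unless the fuzzy boost compensates.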
This is my queryParser Code
package org.musicbrainz.search.servlet;

import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.*;
import org.musicbrainz.search.LuceneVersion;

import java.util.HashMap;
import java.util.Map;

public class DismaxQueryParser {

    public static String IMPOSSIBLE_FIELD_NAME = "\uFFFC\uFFFC\uFFFC";
    private DisjunctionQueryParser dqp;

    public DismaxQueryParser(org.apache.lucene.analysis.Analyzer analyzer) {
        dqp = new DisjunctionQueryParser(IMPOSSIBLE_FIELD_NAME, analyzer);
    }

    public Query parse(String query) throws org.apache.lucene.queryParser.ParseException {
        Query q0 = dqp.parse(DismaxQueryParser.IMPOSSIBLE_FIELD_NAME + ":(" + query + ")");
        Query phrase = dqp.parse(DismaxQueryParser.IMPOSSIBLE_FIELD_NAME + ":\"" + query + "\"");
        if (phrase instanceof DisjunctionMaxQuery) {
            BooleanQuery bq = new BooleanQuery(true);
            bq.add(q0, BooleanClause.Occur.MUST);
            bq.add(phrase, BooleanClause.Occur.SHOULD);
            return bq;
        }
        else {
            return q0;
        }
    }

    public void addAlias(String field, DismaxAlias dismaxAlias) {
        dqp.addAlias(field, dismaxAlias);
    }

    static class DisjunctionQueryParser extends QueryParser {

        //Only make terms that are this length fuzzy
        private static final int MIN_FIELD_LENGTH_TO_MAKE_FUZZY = 4;
        private static final float FUZZY_SIMILARITY = 0.5f;
        //Reduce boost of wildcard matches compared to fuzzy /exact matches
        private static final float WILDCARD_BOOST_REDUCER = 0.8f;

        public DisjunctionQueryParser(String defaultField, org.apache.lucene.analysis.Analyzer analyzer) {
            super(LuceneVersion.LUCENE_VERSION, defaultField, analyzer);
        }

        protected Map<String, DismaxAlias> aliases = new HashMap<String, DismaxAlias>(3);

        //Field to DismaxAlias
        public void addAlias(String field, DismaxAlias dismaxAlias) {
            aliases.put(field, dismaxAlias);
        }

        protected org.apache.lucene.search.Query getFuzzyQuery(java.lang.String field,
                java.lang.String termStr, float minSimilarity)
                throws org.apache.lucene.queryParser.ParseException {
            FuzzyQuery fq = (FuzzyQuery) super.getFuzzyQuery(field, termStr, minSimilarity);
            //so that fuzzy queries term do not get an advantage over
            //exact matches just because the query term is rarer
            //fq.setRewriteMethod(new MultiTermQuery.TopTermsBoostOnlyBooleanQueryRewrite(100));
            return fq;
        }

        protected Query getFieldQuery(String field, String queryText, boolean quoted)
                throws org.apache.lucene.queryParser.ParseException {
            //If field is an alias
            if (aliases.containsKey(field)) {
                DismaxAlias a = aliases.get(field);
                DisjunctionMaxQuery q = new DisjunctionMaxQuery(a.getTie());
                boolean ok = false;
                for (String f : a.getFields().keySet()) {
                    //if query can be created for this field and text
                    Query querySub;
                    Query queryWildcard = null;
                    if (!quoted && queryText.length() >= MIN_FIELD_LENGTH_TO_MAKE_FUZZY) {
                        querySub = getFieldQuery(f, queryText, quoted);
                        queryWildcard = getWildcardQuery(
                                ((TermQuery) querySub).getTerm().field(),
                                ((TermQuery) querySub).getTerm().text() + '*');
                        querySub = getFuzzyQuery(
                                ((TermQuery) querySub).getTerm().field(),
                                ((TermQuery) querySub).getTerm().text(),
                                FUZZY_SIMILARITY);
                    } else {
                        querySub = getFieldQuery(f, queryText, quoted);
                    }
                    if (querySub != null) {
                        //if query was quoted but doesn't generate a
                        //phrase query we reject it
                        if (
                                (quoted == false) ||
                                (querySub instanceof PhraseQuery)
                        ) {
                            //Reduce phrase because will have matched
                            //both parts giving far too much score differential
                            if (quoted == true) {
                                querySub.setBoost(0.1f);
                            }
                            //Boost as specified
                            else if (a.getFields().get(f) != null) {
                                querySub.setBoost(a.getFields().get(f));
                            }
                            q.add(querySub);
                            ok = true;
                        }
                    }
                    if (queryWildcard != null) {
                        if (a.getFields().get(f) != null) {
                            queryWildcard.setBoost(a.getFields().get(f) * WILDCARD_BOOST_REDUCER);
                        }
                        q.add(queryWildcard);
                    }
                }
                //Something has been added to disjunction query
                return ok ? q : null;
            } else {
                //usual Field
                try {
                    return super.getFieldQuery(field, queryText, quoted);
                } catch (Exception e) {
                    return null;
                }
            }
        }
    }

    static class DismaxAlias {

        public DismaxAlias() {
        }

        private float tie;
        //Field Boosts
        private Map<String, Float> fields;

        public float getTie() {
            return tie;
        }

        public void setTie(float tie) {
            this.tie = tie;
        }

        public Map<String, Float> getFields() {
            return fields;
        }

        public void setFields(Map<String, Float> fields) {
            this.fields = fields;
        }
    }
}
Thanks for any help
Paul