Posted to java-user@lucene.apache.org by Paul Taylor <pa...@fastmail.fm> on 2012/02/03 16:01:42 UTC

Performance improvements for fuzzy queries ?

Using Lucene 3.5, I created a query parser based on the dismax parser,
but in order to get matches on misspellings etc. I additionally do a
fuzzy search and a wildcard search:

http://svn.musicbrainz.org/search_server/trunk/servlet/src/main/java/org/musicbrainz/search/servlet/DismaxQueryParser.java

So a search for 'echo bunneymen' searches over three fields (alias,
sortname, artist) and becomes disjunction searches on these, plus a
phrase search:

custom(+((
alias:echo~0.5^0.71999997 | alias:echo*^0.71999997 | alias:echo^0.9
| sortname:echo~0.5^0.88000005 | sortname:echo*^0.88000005 | sortname:echo^1.1
| artist:echo~0.5^1.04 | artist:echo*^1.04 | artist:echo^1.3)~0.1
  (
alias:bunneymen~0.5^0.71999997 | alias:bunneymen*^0.71999997 | alias:bunneymen^0.9
| sortname:bunneymen~0.5^0.88000005 | sortname:bunneymen*^0.88000005 | sortname:bunneymen^1.1
| artist:bunneymen~0.5^1.04 | artist:bunneymen*^1.04 | artist:bunneymen^1.3)~0.1)
  (alias:"echo bunneymen"^0.2 | sortname:"echo bunneymen"^0.2 | artist:"echo bunneymen"^0.2)~0.1)

and it gives me exactly the results and scoring that I want. The trouble
is that it's TOO SLOW.

I tried using a different rewrite method, as recommended: new
MultiTermQuery.TopTermsBoostOnlyBooleanQueryRewrite(100). But then it
doesn't consider the query idf, which makes sense so that rare query
terms aren't boosted, but neither does it consider the idf or field norm
of the matching document. This seems wrong, because these still seem
relevant. More problematically, the fuzzy query scores are so much
lower than the scores of normal and phrase matches that it doesn't seem
to work when fuzzy queries are mixed in with other queries. Is there a
better option, or even some better documentation on the rewrite method
so I can understand it better?
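
To make the scoring behaviour described above concrete, here is a conceptual
sketch in plain Java of what a "top terms, boost only" rewrite amounts to, as
far as I understand it: keep only the N expanded terms with the highest boost,
then score a match by that boost alone, with no idf or norm. This is not
Lucene code; ScoredTerm, topTerms and boostOnlyScore are hypothetical names
made up for the illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Conceptual sketch only: NOT Lucene's implementation, all names are hypothetical.
class TopTermsSketch {

    // An expanded term plus the boost derived from its similarity to the query term.
    static final class ScoredTerm {
        final String text;
        final float boost;
        ScoredTerm(String text, float boost) {
            this.text = text;
            this.boost = boost;
        }
    }

    static final Comparator<ScoredTerm> BY_BOOST =
            (x, y) -> Float.compare(x.boost, y.boost);

    // Keep the `size` highest-boost terms, using a min-heap bounded to `size` entries.
    static List<ScoredTerm> topTerms(List<ScoredTerm> expanded, int size) {
        PriorityQueue<ScoredTerm> heap = new PriorityQueue<>(Math.max(1, size), BY_BOOST);
        for (ScoredTerm t : expanded) {
            heap.offer(t);
            if (heap.size() > size) {
                heap.poll(); // drop the current lowest-boost term
            }
        }
        List<ScoredTerm> kept = new ArrayList<>(heap);
        Collections.sort(kept, Collections.reverseOrder(BY_BOOST));
        return kept;
    }

    // Boost-only scoring: a matching term contributes its boost and nothing else,
    // so rare and common terms with the same boost score identically.
    static float boostOnlyScore(List<ScoredTerm> kept, String docTerm) {
        for (ScoredTerm t : kept) {
            if (t.text.equals(docTerm)) {
                return t.boost;
            }
        }
        return 0f;
    }
}
```

Because the score is just the boost, such clauses sit on a different scale from
ordinary term and phrase clauses whose scores include idf and norms, which
would be consistent with the mismatch described above.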

Alternatively, is there an analyzer I can use to analyse the fields
using the fuzzy/Levenshtein logic, so I can do this at index time instead
and then just use a normal term query with the same analyzer instead of
a fuzzy query?

Paul



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Performance improvements for fuzzy queries ?

Posted by Paul Taylor <pa...@fastmail.fm>.

FYI, it turns out the performance problems had more to do with the fact
that I hadn't changed prefixLength from zero. Although I only ran fuzzy
queries when the term was at least 4 characters long, I didn't realise
that this does nothing to stop the query term being compared against
index terms shorter than 4 characters; only setting the prefix length
to four prevents that.
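
For anyone hitting the same thing, here is a small sketch in plain Java of why
the prefix length matters. It is not Lucene code: PrefixFilterSketch,
candidates and fuzzyMatches are hypothetical names, and the term list is made
up. The idea is that with a prefix length of zero every term in the field is a
candidate for the edit distance check, while a non-zero prefix restricts the
candidates to terms starting with the same characters as the query term (which
also rules out anything shorter than the prefix).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustration only: NOT Lucene's implementation, all names are hypothetical.
class PrefixFilterSketch {

    // Plain Levenshtein edit distance using two rolling rows.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Terms sharing the first prefixLength characters with the query term.
    // With prefixLength == 0 this is every term in the list.
    static List<String> candidates(List<String> indexTerms, String queryTerm, int prefixLength) {
        String prefix = queryTerm.substring(0, Math.min(prefixLength, queryTerm.length()));
        List<String> out = new ArrayList<String>();
        for (String term : indexTerms) {
            if (term.startsWith(prefix)) {
                out.add(term);
            }
        }
        return out;
    }

    // Candidates within maxEdits of the query term; the expensive distance
    // computation only runs on terms that survived the prefix filter.
    static List<String> fuzzyMatches(List<String> indexTerms, String queryTerm,
                                     int prefixLength, int maxEdits) {
        List<String> out = new ArrayList<String>();
        for (String term : candidates(indexTerms, queryTerm, prefixLength)) {
            if (editDistance(queryTerm, term) <= maxEdits) {
                out.add(term);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList(
                "echo", "ech", "eco", "etch", "bunnymen", "bunny", "bun", "echoes");
        System.out.println("prefixLength 0: " + candidates(terms, "echo", 0).size() + " candidates");
        System.out.println("prefixLength 4: " + candidates(terms, "echo", 4).size() + " candidates");
    }
}
```

In Lucene 3.x itself the prefix length is, as far as I know, the third
argument of the FuzzyQuery(Term, float, int) constructor. The trade-off is
that a misspelling inside the prefix can no longer match.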

Interestingly, I also just came across
http://blog.mikemccandless.com/2011/03/lucenes-fuzzyquery-is-100-times-faster.html
so I'm looking forward to the 4.0 release, whenever that happens.


Paul
