You are viewing a plain text version of this content. The canonical link for it is here.
Posted to java-user@lucene.apache.org by Michael Tobias <mi...@tobias.org.uk> on 2013/08/14 07:00:22 UTC

Fuzzy Searching on Lucene / Solr

My first post so please be gentle with me.

I am about to start 'playing' with Solr to see if it will be the correct
tool for a new searchable database development.  One of my requirements is
the ability to do 'fuzzy' searches and I understand that the latest versions
of Lucene / Solr use an improved version of indexing and the Levenshtein
distance formula (or rather a modified version of Levenshtein if wished for,
treating letter transpositions as a single difference rather than 2).

Levenshtein is precisely what I need, but I also understand that the maximum
distance currently implemented is a distance of just TWO.  That is not
really adequate for my purposes.  I need to be able to handle at least a
distance of 3 and probably 4.

Is the current maximum distance of 2 hard-coded in the system?  Can it be
over-ridden?  How?

I understand that performance (both indexing and querying) may be impaired
significantly by doing this but that might be a price worth paying.  If it
IS possible to change the max distance to 3 or 4 does anybody have any idea
what the performance implications might be?

Many thanks for any/all assistance you can provide.

Regards

Michael


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


RE: Fuzzy Searching on Lucene / Solr

Posted by Uwe Schindler <uw...@thetaphi.de>.
Hi Michael,

It is also a size constraint! The FSA would be horrible huge.
FYI: the different fuzzy distances are not implemented by a simple "parameter" to some algorithm. For every fuzzy distance there is a separate(automatically generated) Java class with huge FSA matrices that handles the fuzzy matching. The classes for distances > 2 are huge (up to megabytes of Java code) and slow.

If you want fuzzy matching for distances > 2 use SlowFuzzyQuery from the "queries" lucene module, see http://lucene.apache.org/core/4_4_0/sandbox/org/apache/lucene/sandbox/queries/SlowFuzzyQuery.html
This one does traditional term dictionary scanning and is not very fast (in fact it will scan the whole term dictionary from start to end to find matching terms).

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de


> -----Original Message-----
> From: Jack Krupansky [mailto:jack@basetechnology.com]
> Sent: Wednesday, August 14, 2013 3:58 PM
> To: java-user@lucene.apache.org
> Subject: Re: Fuzzy Searching on Lucene / Solr
> 
> The limit of 2 is hard-coded precisely because good performance for editing
> distances above 2 cannot be guaranteed.
> 
> -- Jack Krupansky
> 
> -----Original Message-----
> From: Michael Tobias
> Sent: Wednesday, August 14, 2013 1:00 AM
> To: java-user@lucene.apache.org
> Subject: Fuzzy Searching on Lucene / Solr
> 
> My first post so please be gentle with me.
> 
> I am about to start 'playing' with Solr to see if it will be the correct tool for a
> new searchable database development.  One of my requirements is the
> ability to do 'fuzzy' searches and I understand that the latest versions of
> Lucene / Solr use an improved version of indexing and the Levenshtein
> distance formula (or rather a modified version of Levenshtein if wished for,
> treating letter transpositions as a single difference rather than 2).
> 
> Levenshtein is precisely what I need, but I also understand that the
> maximum distance currently implemented is a distance of just TWO.  That is
> not really adequate for my purposes.  I need to be able to handle at least a
> distance of 3 and probably 4.
> 
> Is the current maximum distance of 2 hard-coded in the system?  Can it be
> over-ridden?  How?
> 
> I understand that performance (both indexing and querying) may be
> impaired significantly by doing this but that might be a price worth paying.  If
> it IS possible to change the max distance to 3 or 4 does anybody have any
> idea what the performance implications might be?
> 
> Many thanks for any/all assistance you can provide.
> 
> Regards
> 
> Michael
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Fuzzy Searching on Lucene / Solr

Posted by Jack Krupansky <ja...@basetechnology.com>.
The limit of 2 is hard-coded precisely because good performance for editing 
distances above 2 cannot be guaranteed.

-- Jack Krupansky

-----Original Message----- 
From: Michael Tobias
Sent: Wednesday, August 14, 2013 1:00 AM
To: java-user@lucene.apache.org
Subject: Fuzzy Searching on Lucene / Solr

My first post so please be gentle with me.

I am about to start 'playing' with Solr to see if it will be the correct
tool for a new searchable database development.  One of my requirements is
the ability to do 'fuzzy' searches and I understand that the latest versions
of Lucene / Solr use an improved version of indexing and the Levenshtein
distance formula (or rather a modified version of Levenshtein if wished for,
treating letter transpositions as a single difference rather than 2).

Levenshtein is precisely what I need, but I also understand that the maximum
distance currently implemented is a distance of just TWO.  That is not
really adequate for my purposes.  I need to be able to handle at least a
distance of 3 and probably 4.

Is the current maximum distance of 2 hard-coded in the system?  Can it be
over-ridden?  How?

I understand that performance (both indexing and querying) may be impaired
significantly by doing this but that might be a price worth paying.  If it
IS possible to change the max distance to 3 or 4 does anybody have any idea
what the performance implications might be?

Many thanks for any/all assistance you can provide.

Regards

Michael


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org 


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org