Posted to dev@lucene.apache.org by Jonathan Hager <jh...@gmail.com> on 2004/10/23 01:08:30 UTC

[PATCH] FuzzyTermEnum optimization and refactor

Since fuzzy searching is kind of slow, I took a look at it to see if
it could be improved.  I saw speed improvements of 10% - 60% by making
a couple changes.  Along the way I fixed a potential bug or two that I
saw.

The patch is here:
http://www.hagerfamily.com/patches/FuzzyTermEnumOptimizePatch.txt

I've never submitted a patch before, so don't flame me if I do or say
anything wrong.

What Changed?

Since a word is discarded when its edit distance is above a certain
threshold, I updated the distance algorithm to abort as soon as it can
determine that the best possible final edit distance will still be
above that threshold.  The source code has a great explanation.
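For clarity, here is a minimal sketch of that early-abort idea, written
from the description above rather than taken from the patch itself (the
method name, signature, and threshold handling are mine).  It relies on
the fact that the final edit distance can never be smaller than the
minimum value in any row of the dynamic-programming table:

  // Minimal sketch of the early-abort idea (illustrative, not the patch code).
  // Returns Integer.MAX_VALUE as soon as the best possible final distance is
  // already above maxDistance; otherwise returns the Levenshtein distance.
  static int boundedEditDistance(String a, String b, int maxDistance) {
      int n = a.length(), m = b.length();
      if (Math.abs(n - m) > maxDistance) {
          return Integer.MAX_VALUE;   // the length difference alone is already too big
      }
      int[] prev = new int[m + 1];
      int[] curr = new int[m + 1];
      for (int j = 0; j <= m; j++) {
          prev[j] = j;
      }
      for (int i = 1; i <= n; i++) {
          curr[0] = i;
          int rowMin = curr[0];
          for (int j = 1; j <= m; j++) {
              int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
              curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                 prev[j - 1] + cost);
              rowMin = Math.min(rowMin, curr[j]);
          }
          // The final distance can never drop below the smallest value in the
          // current row, so once that minimum exceeds the threshold we can stop.
          if (rowMin > maxDistance) {
              return Integer.MAX_VALUE;
          }
          int[] tmp = prev;
          prev = curr;
          curr = tmp;
      }
      return prev[m];
  }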

I also reduced the amount of floating-point math, reduced the
potential size of the array's first dimension, removed a potential
divide-by-zero error when one term is an empty string, and fixed a bug
where an IllegalArgumentException was thrown based on how the class
happened to be initialized rather than on the actual arguments.

The behavior is almost identical.  The exception is that similarity is
set to 0.0 when it is guaranteed to be below the minimum similarity.
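
To make the divide-by-zero guard and the 0.0 cutoff concrete, here is a
rough sketch of how the similarity could be computed on top of the
bounded distance above.  It assumes the usual FuzzyTermEnum formula
1 - distance / min(textLen, targetLen), ignores the prefix handling, and
uses method and variable names of my own, not the patch's:

  // Illustrative sketch only; builds on boundedEditDistance() from the
  // previous sketch and ignores the prefix handling in FuzzyTermEnum.
  static float guardedSimilarity(String text, String target, float minSimilarity) {
      int shorter = Math.min(text.length(), target.length());
      if (shorter == 0) {
          // One term is empty: avoid dividing by zero.  Two empty strings are
          // identical; otherwise treat the pair as not similar at all.
          return text.length() == target.length() ? 1.0f : 0.0f;
      }
      // sim = 1 - dist/shorter >= minSimilarity  <=>  dist <= (1 - minSimilarity) * shorter
      int maxDistance = (int) ((1.0f - minSimilarity) * shorter);
      int distance = boundedEditDistance(text, target, maxDistance);
      if (distance == Integer.MAX_VALUE) {
          return 0.0f;  // guaranteed to be below the minimum similarity, as noted above
      }
      return 1.0f - ((float) distance / shorter);
  }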

Results

I saw the biggest improvement for longer words, which makes sense.
My long word was "bridgetown" and I saw a 60% improvement on it.
The biggest improvements are for words that are farthest from the
median length of the words in the index.  Short words (1-3 characters)
saw a 30% improvement, and medium words (5-7 characters) saw a 10%
improvement.  These improvements are with the prefix length set to 0.

Would someone be willing to validate that they see similar results,
especially on large indexes?

Other Questions

I am still wondering about two things.  1) Why can't minimumSimilarity
be less than 0?  The similarity can be negative, especially for short
words.  2) Why doesn't the formula for similarity return a number
between 0.0 (no common characters) and 1.0 (identical), inclusive?
This would be easy to do: just use Math.max() instead of Math.min().
Of course this would change the order of the results.

As an example, the similarity for
  "to"  and "todor" = 1 - 3/2 = -0.5
  "tod" and "for"   = 1 - 2/3 = +0.33

Jonathan



Re: [PATCH] FuzzyTermEnum optimization and refactor

Posted by Daniel Naber <da...@t-online.de>.
On Saturday 23 October 2004 01:08, Jonathan Hager wrote:

> Since fuzzy searching is kind of slow, I took a look at it to see if
> it could be improved.  I saw speed improvements of 10% - 60% by making
> a couple changes.

Thanks for your patch. I have not yet had time to look at the patch in 
detail, but I did some tests on a 12,000-document index and it speeds up 
searching by 20-40%.

Regards
 Daniel

-- 
http://www.danielnaber.de



Re: [PATCH] FuzzyTermEnum optimization and refactor

Posted by Christoph Goller <go...@detego-software.de>.
Jonathan Hager schrieb:
> Since fuzzy searching is kind of slow, I took a look at it to see if
> it could be improved.  I saw speed improvements of 10% - 60% by making
> a couple changes.  Along the way I fixed a potential bug or two that I
> saw.
> 
> The patch is here:
> http://www.hagerfamily.com/patches/FuzzyTermEnumOptimizePatch.txt
> 
> I've never submitted a patch before, so don't flame me if I do or say
> anything wrong.

Thank you for the patch. Everything you say sounds very reasonable.
A committer will certainly look into your patch. Please put it into
Bugzilla so that it does not get lost. Also include the explanation
from your mail.

