Posted to dev@lucene.apache.org by "Fuad Efendi (JIRA)" <ji...@apache.org> on 2011/05/17 21:43:48 UTC
[jira] [Commented] (LUCENE-2230) Lucene Fuzzy Search: BK-Tree can
improve performance 3-20 times.
[ https://issues.apache.org/jira/browse/LUCENE-2230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034999#comment-13034999 ]
Fuad Efendi commented on LUCENE-2230:
-------------------------------------
I believe this issue should be closed, given the significant performance improvements delivered by LUCENE-2089 and LUCENE-2258.
I don't think the community has any interest in continuing with this naive approach (BK-Tree and "Strike a Match"), although some people found it useful. Of course, we might still add a few more distance implementations as a separate improvement.
Please close it.
Thanks
> Lucene Fuzzy Search: BK-Tree can improve performance 3-20 times.
> ----------------------------------------------------------------
>
> Key: LUCENE-2230
> URL: https://issues.apache.org/jira/browse/LUCENE-2230
> Project: Lucene - Java
> Issue Type: Improvement
> Components: core/search
> Affects Versions: 3.0
> Environment: Lucene currently uses a brute-force scan over all terms and calculates the distance for each term. The new BKTree structure improves performance on average 20 times when the distance is 1, and 3 times when the distance is 3. I tested with an index of several million docs and 250,000 terms.
> The new algorithm uses integer distances between objects.
> Reporter: Fuad Efendi
> Attachments: BKTree.java, Distance.java, DistanceImpl.java, FuzzyTermEnumNEW.java, FuzzyTermEnumNEW.java
>
> Original Estimate: 1m
> Remaining Estimate: 1m
>
> W. Burkhard and R. Keller. Some approaches to best-match file searching, CACM, 1973
> http://portal.acm.org/citation.cfm?doid=362003.362025
> I was inspired by http://blog.notdot.net/2007/4/Damn-Cool-Algorithms-Part-1-BK-Trees (Nick Johnson, Google).
> Additionally, the simplified algorithm at http://www.catalysoft.com/articles/StrikeAMatch.html seems to be much more logically correct than Levenshtein distance, and it is 3-5 times faster (in isolated tests).
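For readers unfamiliar with "Strike a Match", here is a minimal sketch of the letter-pair similarity as I understand it from the linked article: twice the number of shared adjacent-letter pairs, divided by the total number of pairs in both strings. The class and method names (StrikeAMatchSketch, letterPairs, similarity) are hypothetical and are not taken from the attached DistanceImpl.java.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Illustrative letter-pair similarity; NOT the attached implementation. */
public class StrikeAMatchSketch {

    /** Collects adjacent letter pairs from each whitespace-separated word. */
    static List<String> letterPairs(String text) {
        List<String> pairs = new ArrayList<>();
        for (String word : text.toUpperCase(Locale.ROOT).trim().split("\\s+")) {
            for (int i = 0; i + 1 < word.length(); i++) {
                pairs.add(word.substring(i, i + 2));
            }
        }
        return pairs;
    }

    /** Returns a similarity in [0, 1]; 1.0 means identical pair multisets. */
    public static double similarity(String a, String b) {
        List<String> pairsA = letterPairs(a);
        List<String> pairsB = letterPairs(b);
        int total = pairsA.size() + pairsB.size();
        if (total == 0) {
            return 0.0; // assumption: no pairs at all counts as "no similarity"
        }
        int shared = 0;
        for (String pair : pairsA) {
            // Remove each matched pair so repeated pairs are not counted twice.
            if (pairsB.remove(pair)) {
                shared++;
            }
        }
        return 2.0 * shared / total;
    }
}
```

For example, similarity("France", "French") shares the pairs FR and NC out of 5 + 5 pairs, giving 0.4. Note that this yields a similarity score rather than an integer distance, so it would need some conversion before it could be used as a BK-tree key.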
> Big list of distance implementations:
> http://www.dcs.shef.ac.uk/~sam/stringmetrics.htm
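To make the BK-tree idea in the description concrete, here is a rough sketch of how such a structure can prune candidates using an integer metric. It assumes plain Levenshtein distance and an in-memory list of terms; the names (BKTreeSketch, add, search) are hypothetical, and this is not the attached BKTree.java or anything in Lucene itself.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative BK-tree over strings; NOT the attached BKTree.java. */
public class BKTreeSketch {

    private static final class Node {
        final String term;
        // Children keyed by their integer distance to this node's term.
        final Map<Integer, Node> children = new HashMap<>();
        Node(String term) { this.term = term; }
    }

    private Node root;

    /** Two-row dynamic-programming Levenshtein distance (an integer metric). */
    public static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    /** Inserts a term by walking down the child whose key equals the distance. */
    public void add(String term) {
        if (root == null) { root = new Node(term); return; }
        Node node = root;
        while (true) {
            int d = levenshtein(term, node.term);
            if (d == 0) return; // term already present
            Node child = node.children.get(d);
            if (child == null) { node.children.put(d, new Node(term)); return; }
            node = child;
        }
    }

    /** Returns all stored terms within maxDistance of the query. */
    public List<String> search(String query, int maxDistance) {
        List<String> hits = new ArrayList<>();
        if (root != null) search(root, query, maxDistance, hits);
        return hits;
    }

    private void search(Node node, String query, int maxDistance, List<String> hits) {
        int d = levenshtein(query, node.term);
        if (d <= maxDistance) hits.add(node.term);
        // Triangle inequality: only subtrees keyed in [d - max, d + max] can match.
        for (int k = Math.max(1, d - maxDistance); k <= d + maxDistance; k++) {
            Node child = node.children.get(k);
            if (child != null) search(child, query, maxDistance, hits);
        }
    }
}
```

The smaller the allowed distance, the narrower the range of child keys visited at each node, which is consistent with the larger speedup reported above for distance 1 than for distance 3.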
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira