You are viewing a plain text version of this content. The canonical link for it is here.
Posted to java-user@lucene.apache.org by Benjamin Patrick Jung <bp...@terreon.de> on 2010/03/29 16:57:44 UTC
Problem / question concerning "Fuzzy Search"
Hi all,
I tried to figure out how the fuzzy search implementation
in Apache Lucene works and I'm kinda stuck here.
--> Version : Apache Lucene 3.0.1 (JAVA)
[What I want / need]
I'm looking for a way to combine a prefix-, fuzzy- and wildcard query.
Q: Is it possible to have a query like "user_input~0.5*" ?
[JavaDoc for org.apache.lucene.search.FuzzyQuery c-tor]
@param minimumSimilarity: a value between 0 and 1 to set
the required similarity between the query term and the
matching terms. For example, for a minimumSimilarity of
0.5 a term of the same length as the query term is
considered similar to the query term if the edit distance
between both terms is less than length(term)*0.5
Q: Mh... what if the query term differs in it's length to the
term in my document?
[Test case]
I have written a small test program (JUnit test case) to
explain my problem / confusion in detail:
--> http://eugeneciurana.com/pastebin/pastebin.php?show=42619
[Examples] Search term --> Subset of expected result
Cinamo~0.5 --> Cinema, Cinnamon [works]
Strawbarr~0.8 --> Strawberry [doesn't work]
-->
As far as I understand, the "Edit distance"
(aka "Levinshtein distance") between "Strawbarr" and "Strawberry"
is 2 (one replacement and one insertion to transform "Strawbarr" into
"Strawberry")
The query "Strawbarr~0.8" in my opinion (and from what I read from
the JavaDocs) should work just fine, because
len(Strawbarr)*0.8 == 9*0.8 == 7.2 ... 7.2 >= 2 ... still --
it doesn't work. Is that, because the length of the search term
and the word in my document differ?
I already searched the wiki, the mailing list archive
and had a look in all the "obvious" places but had no luck
so far.
If I am missing something obvious here I would be glad to receive
some pointers into the right direction.
<--
Regards
-benjamin-
--
Benjamin Jung <bp...@terreon.de>
Terreon, http://terreon.de/
Tel.: +49 (0)69 / 8484 65 37
Fax: +49 (0)6054 / 909 788 2
Mobil +49 (0)1577 / 159 788 3
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: Problem / question concerning "Fuzzy Search"
Posted by Robert Muir <rc...@gmail.com>.
On Mon, Mar 29, 2010 at 10:57 AM, Benjamin Patrick Jung
<bp...@terreon.de>wrote:
>
> [Examples] Search term --> Subset of expected result
> Cinamo~0.5 --> Cinema, Cinnamon [works]
> Strawbarr~0.8 --> Strawberry [doesn't work]
>
> -->
> As far as I understand, the "Edit distance"
> (aka "Levinshtein distance") between "Strawbarr" and "Strawberry"
> is 2 (one replacement and one insertion to transform "Strawbarr" into
> "Strawberry")
>
>
yes you are correct, the scaling is a bit strange in my opinion. you can see
it in FuzzyTermsEnum's javadocs (if you look at the code):
Similarity returns a number that is 1.0f or less (including negative
numbers) based on how similar the Term is compared to a target term. It
returns
exactly 0.0f when
editDistance > maximumEditDistance
Otherwise it returns:
1 - (editDistance / length)
where length is the length of the shortest term (text or target) including a
prefix that are identical and editDistance is the Levenshtein distance for
the two words.
I think other implementations instead tend to use 1 - (editDistance /
length) for scaling, where length is the length of the longest term.
--
Robert Muir
rcmuir@gmail.com