Posted to java-user@lucene.apache.org by Benjamin Patrick Jung <bp...@terreon.de> on 2010/03/29 16:57:44 UTC

Problem / question concerning "Fuzzy Search"

Hi all,


I tried to figure out how the fuzzy search implementation
in Apache Lucene works and I'm kinda stuck here.
--> Version : Apache Lucene 3.0.1 (JAVA)


[What I want / need]
I'm looking for a way to combine a prefix-, fuzzy- and wildcard query.

Q: Is it possible to have a query like "user_input~0.5*" ?


[JavaDoc for org.apache.lucene.search.FuzzyQuery c-tor]
  @param minimumSimilarity: a value between 0 and 1 to set
   the required similarity between the query term and the
   matching terms. For example, for a minimumSimilarity of
   0.5 a term of the same length as the query term is
   considered similar to the query term if the edit distance
   between both terms is less than length(term)*0.5

Q: Mh... what if the query term differs in length from the
   term in my document?


[Test case]
I have written a small test program (JUnit test case) to
explain my problem / confusion in detail:

--> http://eugeneciurana.com/pastebin/pastebin.php?show=42619



[Examples] Search term --> Subset of expected result
  Cinamo~0.5 --> Cinema, Cinnamon [works]
  Strawbarr~0.8 --> Strawberry    [doesn't work]
  
-->
As far as I understand, the "edit distance"
(aka "Levenshtein distance") between "Strawbarr" and "Strawberry"
is 2 (one replacement and one insertion to transform "Strawbarr" into
"Strawberry").

The query "Strawbarr~0.8", in my opinion (and from what I read in
the JavaDocs), should work just fine, because
  len(Strawbarr)*0.8 == 9*0.8 == 7.2 ... 7.2 >= 2 ... still --
it doesn't work. Is that because the length of the search term
and the word in my document differ?

I already searched the wiki, the mailing list archive
and had a look in all the "obvious" places but had no luck
so far.

If I am missing something obvious here, I would be glad to receive
some pointers in the right direction.
<--




Regards
-benjamin-

-- 
Benjamin Jung <bp...@terreon.de>
Terreon, http://terreon.de/
Tel.: +49 (0)69 / 8484 65 37
Fax: +49 (0)6054 / 909 788 2
Mobil +49 (0)1577 / 159 788 3


Re: Problem / question concerning "Fuzzy Search"

Posted by Robert Muir <rc...@gmail.com>.

On Mon, Mar 29, 2010 at 10:57 AM, Benjamin Patrick Jung
<bp...@terreon.de> wrote:

>
> [Examples] Search term --> Subset of expected result
>  Cinamo~0.5 --> Cinema, Cinnamon [works]
>  Strawbarr~0.8 --> Strawberry    [doesn't work]
>
> -->
> As far as I understand, the "Edit distance"
> (aka "Levenshtein distance") between "Strawbarr" and "Strawberry"
> is 2 (one replacement and one insertion to transform "Strawbarr" into
> "Strawberry")
>
>
Yes, you are correct about the edit distance. The scaling is a bit strange
in my opinion. You can see it in FuzzyTermsEnum's javadocs (if you look at
the code):

Similarity returns a number that is 1.0f or less (including negative
numbers) based on how similar the Term is compared to a target term.  It
returns exactly 0.0f when

    editDistance > maximumEditDistance

Otherwise it returns:

    1 - (editDistance / length)

where length is the length of the shorter term (text or target), including
any identical prefix, and editDistance is the Levenshtein distance between
the two words.

I think other implementations instead tend to use 1 - (editDistance /
length) for scaling, where length is the length of the longest term.
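
To make the arithmetic concrete, here is a minimal sketch of both scalings.
It only mirrors the formula quoted above and is not Lucene's actual code;
the class and method names are made up for illustration, the terms are
assumed to be non-empty and already lowercased, and no prefix handling is
included. With the shorter-term scaling, "strawbarr" vs. "strawberry" gives
1 - 2/9, about 0.78, which is below the requested 0.8. With the longer-term
scaling it would be 1 - 2/10, i.e. right at 0.8.

```java
// Sketch only: mirrors the formula quoted above, not Lucene's source code.
// Class and method names are hypothetical.
class FuzzyScalingSketch {

    // Classic dynamic-programming Levenshtein distance
    // (insertions, deletions and substitutions each cost 1).
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1, prev[j] + 1),
                        prev[j - 1] + cost);
            }
            int[] tmp = prev; // swap the two rows for the next iteration
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // 1 - (editDistance / length of the SHORTER term), as in the quoted javadoc.
    // Assumes both terms are non-empty.
    static float similarityShortest(String a, String b) {
        return 1.0f - (float) levenshtein(a, b) / Math.min(a.length(), b.length());
    }

    // 1 - (editDistance / length of the LONGER term), the alternative scaling.
    // Assumes both terms are non-empty.
    static float similarityLongest(String a, String b) {
        return 1.0f - (float) levenshtein(a, b) / Math.max(a.length(), b.length());
    }

    public static void main(String[] args) {
        String[][] pairs = {
            {"cinamo", "cinema"},
            {"cinamo", "cinnamon"},
            {"strawbarr", "strawberry"},
        };
        for (String[] p : pairs) {
            System.out.println(p[0] + " vs " + p[1]
                    + ": distance=" + levenshtein(p[0], p[1])
                    + ", shortest-scaled=" + similarityShortest(p[0], p[1])
                    + ", longest-scaled=" + similarityLongest(p[0], p[1]));
        }
    }
}
```

Both "cinamo" pairs come out at 1 - 2/6, about 0.67, under the shorter-term
scaling, which is why Cinamo~0.5 matches while Strawbarr~0.8 does not.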

-- 
Robert Muir
rcmuir@gmail.com