Posted to user@lucenenet.apache.org by Michael Barbarelli <mb...@gmail.com> on 2007/04/11 08:07:07 UTC

How to access Levenshtein distance number?

Hello.

I am using Lucene to submit fuzzy queries against an index. I have noticed
that relevant matches are often retrieved, but the scoring is not at all
what I expected.

For example, if my query is "rightches~", a reference to a text file with
the single word "righteous" is returned with a score of 100 percent.
However, I think the actual score should be somewhere in the neighborhood of
.66, not 1. Does anyone follow me?

But the Lucene score does not take into account how well a term matches a
FuzzyQuery. That just seems to be the way Lucene is built currently. The
score is based on the term frequency of the actual matching term. A
FuzzyQuery gets rewritten as a BooleanQuery with all matching terms OR'd.

Degree of similarity is what I want in this case.  When "rightches~" matches
"righteous", I should get a similarity score of about .66.

What I want is to get at the raw difference that Lucene uses:  the
Levenshtein distance algorithm.  I think I'll need to use the code in
FuzzyTermEnum.java (or .cs) as a starting point. I figure I can probably
use that code directly somehow, or at least borrow the similarity
computation.
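
For illustration, here is a minimal Java sketch of the kind of similarity
computation described above. It is not Lucene code: the class FuzzySimilarity
and its methods are hypothetical names, and dividing the edit distance by the
length of the shorter string is an assumption made for this example, so
FuzzyTermEnum's exact normalization may differ.

```java
// Hypothetical helper, not part of Lucene: plain Levenshtein distance plus a
// similarity value normalized into the range [0, 1].
final class FuzzySimilarity {

    private FuzzySimilarity() {}

    /**
     * Returns the number of single-character insertions, deletions, or
     * substitutions needed to turn a into b.
     */
    static int distance(String a, String b) {
        // Two rolling rows of the dynamic-programming table.
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];

        // Turning the empty prefix of a into the first j chars of b costs j inserts.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }

        for (int i = 1; i <= a.length(); i++) {
            // Turning the first i chars of a into the empty string costs i deletes.
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                int insert = curr[j - 1] + 1;
                int delete = prev[j] + 1;
                int substitute = prev[j - 1] + cost;
                curr[j] = Math.min(Math.min(insert, delete), substitute);
            }
            // Swap rows so that prev always holds the most recently completed row.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /**
     * Assumed normalization: 1 - distance / length of the shorter string,
     * clamped at 0 so that very different strings do not go negative.
     */
    static float similarity(String a, String b) {
        int shorter = Math.min(a.length(), b.length());
        if (shorter == 0) {
            // Two empty strings are identical; otherwise nothing is shared.
            return a.length() == b.length() ? 1.0f : 0.0f;
        }
        float raw = 1.0f - (float) distance(a, b) / shorter;
        return Math.max(0.0f, raw);
    }
}
```

Under these assumptions, FuzzySimilarity.distance("rightches", "righteous")
is 3 (the shared prefix "right" is kept and "che" is replaced by "eou"), and
the similarity works out to 1 - 3/9, or roughly .67, which is in line with
the figure expected above.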

Frankly, though, I'm not sure I'm treading down the right path on this.  Can
anyone help with specifics, past experience, or examples?

Cheers,
Mike