You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucene.apache.org by ka...@nokia.com on 2010/07/23 16:24:36 UTC

LevenshteinFilter proposal

Hi Folks,

I'm very interested in using (or developing!) a Levenshtein Filter within the family of Solr Filter objects. I don't see such a class today anywhere. I see how the AutomatonQuery object would permit such a thing to be built, but to date I don't know of anyone who has built one. Do you?  If not, I'm willing to give it a whirl.  Also, AutomatonQuery doesn't seem to come up when I look for it in the javadocs for Lucene - can you point me in the correct direction?
Thanks!
Karl



RE: LevenshteinFilter proposal

Posted by Uwe Schindler <uw...@thetaphi.de>.
Yes! See TermRangeFilter for an example, its wrapping TermRangeQuery like
this J

 

-----

Uwe Schindler

H.-H.-Meier-Allee 63, D-28213 Bremen

 <http://www.thetaphi.de/> http://www.thetaphi.de

eMail: uwe@thetaphi.de

 

From: karl.wright@nokia.com [mailto:karl.wright@nokia.com] 
Sent: Friday, July 23, 2010 6:26 PM
To: dev@lucene.apache.org
Subject: RE: LevenshteinFilter proposal

 

Surprise!

 


protected 

MultiTermQueryWrapperFilter
<http://lucene.apache.org/java/2_9_1/api/all/org/apache/lucene/search/MultiT
ermQueryWrapperFilter.html#MultiTermQueryWrapperFilter(org.apache.lucene.sea
rch.MultiTermQuery)> (MultiTermQuery
<http://lucene.apache.org/java/2_9_1/api/all/org/apache/lucene/search/MultiT
ermQuery.html>  query) 

 

I guess I can write a class that extends MultiTermQueryWrapperFilter.  Is
this the intended usage?

 

Karl

 

From: ext Uwe Schindler [mailto:uwe@thetaphi.de] 
Sent: Friday, July 23, 2010 10:45 AM
To: dev@lucene.apache.org
Subject: RE: LevenshteinFilter proposal

 

Automaton is only in Lucene/Solr Trunk. To get a filter out of FuzzyQuery,
use MultiTermQueryWrapperFilter(new FuzzyQuery(.))

 

-----

Uwe Schindler

H.-H.-Meier-Allee 63, D-28213 Bremen

http://www.thetaphi.de <http://www.thetaphi.de/> 

eMail: uwe@thetaphi.de

 

From: karl.wright@nokia.com [mailto:karl.wright@nokia.com] 
Sent: Friday, July 23, 2010 4:25 PM
To: dev@lucene.apache.org
Subject: LevenshteinFilter proposal

 

Hi Folks,


I'm very interested in using (or developing!) a Levenshtein Filter within
the family of Solr Filter objects. I don't see such a class today anywhere.
I see how the AutomatonQuery object would permit such a thing to be built,
but to date I don't know of anyone who has built one. Do you?  If not, I'm
willing to give it a whirl.  Also, AutomatonQuery doesn't seem to come up
when I look for it in the javadocs for Lucene - can you point me in the
correct direction?

Thanks!
Karl

 

 


RE: LevenshteinFilter proposal

Posted by ka...@nokia.com.
Surprise!

protected

MultiTermQueryWrapperFilter<http://lucene.apache.org/java/2_9_1/api/all/org/apache/lucene/search/MultiTermQueryWrapperFilter.html#MultiTermQueryWrapperFilter(org.apache.lucene.search.MultiTermQuery)>(MultiTermQuery<http://lucene.apache.org/java/2_9_1/api/all/org/apache/lucene/search/MultiTermQuery.html> query)


I guess I can write a class that extends MultiTermQueryWrapperFilter.  Is this the intended usage?

Karl

From: ext Uwe Schindler [mailto:uwe@thetaphi.de]
Sent: Friday, July 23, 2010 10:45 AM
To: dev@lucene.apache.org
Subject: RE: LevenshteinFilter proposal

Automaton is only in Lucene/Solr Trunk. To get a filter out of FuzzyQuery, use MultiTermQueryWrapperFilter(new FuzzyQuery(...))

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de<http://www.thetaphi.de/>
eMail: uwe@thetaphi.de

From: karl.wright@nokia.com [mailto:karl.wright@nokia.com]
Sent: Friday, July 23, 2010 4:25 PM
To: dev@lucene.apache.org
Subject: LevenshteinFilter proposal

Hi Folks,

I'm very interested in using (or developing!) a Levenshtein Filter within the family of Solr Filter objects. I don't see such a class today anywhere. I see how the AutomatonQuery object would permit such a thing to be built, but to date I don't know of anyone who has built one. Do you?  If not, I'm willing to give it a whirl.  Also, AutomatonQuery doesn't seem to come up when I look for it in the javadocs for Lucene - can you point me in the correct direction?
Thanks!
Karl



Re: LevenshteinFilter proposal

Posted by Itamar Syn-Hershko <it...@code972.com>.
Just a thought: edit distance is meant for overcoming spelling errors in 
form of assimilations or mistype. In your case there is a limited number 
of cases that need special care, and you can actually define most of 
them pretty well - hence edit distance is by definition much more than 
you actually need.

Since all that is necessary here is to identify known spelling 
differences (or errors), perhaps you could use some of what I call 
"tolerated lookup" using autometa on a Lucene index. Such a lookup is 
what I use in HebMorph[1] to find words in a dictionary of Hebrew words, 
where spelling is ALWAYS like what you're having with street names. I 
use that on a radix, but the idea could be adapted for what you're 
looking for fairly easily.

The idea is quite simple - first you do an exact lookup on your 
dictionary (TermQuery in your case). If you hit no results, you do a 
tolerant lookup, where a "tolerator function" is being consulted with 
before going on to the next leaf. Those functions decide whether or not 
to allow this move based on a set of rules (position, characters before 
and after, etc).

You can see some examples for the Hebrew language here (dot-net, but 
should still be readable for Java ppl):

http://github.com/synhershko/HebMorph/blob/master/dotNet/HebMorph/DataStructures/DictRadix.cs 
-- radix implementation with tolerant lookup (TolerantLookupCrawler)
http://github.com/synhershko/HebMorph/blob/master/dotNet/HebMorph/LookupTolerators.cs 
-- tolerator functions

Tolerant lookup on my radix is very fast, should be the same for a 
Lucene index.

Itamar.

[1] www.code972.com/blog/hebmorph/ <http://www.code972.com/blog/hebmorph/>

On 26/7/2010 8:56 PM, Robert Muir wrote:
> Nah, its an analyzer. so you can just use termquery (fast: exact match).
> at query and index time it just maps stuff to a key... typically you 
> would just put this in a separate field.
>
> you can combine this with your edit distance query with a 
> booleanquery, for example the edit distance can handle your 
> le[o]minster just fine.
>
> I think this would be much better for you, i wouldnt abuse levenshtein 
> for phonetics stuff, its not designed for that.
>
> On Mon, Jul 26, 2010 at 1:44 PM, <karl.wright@nokia.com 
> <ma...@nokia.com>> wrote:
>
>     Clearly you haven’t been in the Northeast much.  Try “Worcester”
>     vs. “wuster”, or “Leominster” vs. “leminster”.  It’s also likely
>     to be a challenge to come up with the right phonetics for any
>     given proper location name.   It’s even worse in Britain, or
>     countries where the phonetic rules may be a hodgepodge of
>     different colonial influences.
>
>     That having been said, if there exists a “PhoneticQuery” object
>     that does all this using the automaton logic under the covers, I
>     think it would be  worth a serious look.
>
>     Karl
>
>     *From:* ext Robert Muir [mailto:rcmuir@gmail.com
>     <ma...@gmail.com>]
>     *Sent:* Monday, July 26, 2010 1:24 PM
>
>
>     *To:* dev@lucene.apache.org <ma...@lucene.apache.org>
>     *Subject:* Re: LevenshteinFilter proposal
>
>     On Mon, Jul 26, 2010 at 1:13 PM, <karl.wright@nokia.com
>     <ma...@nokia.com>> wrote:
>
>     What I want to capture is situations where people misspell things
>     in roughly a phonetic way.  For example, “Tchaikovsky Avenue”
>     might be misspelled as “Chicovsky Avenue”.  Modules that do
>     phonetic mapping are possible but you’d have to somehow generate a
>     phonetic database of (say) streetnames, worldwide.  Good luck on
>     getting hold of that kind of data anywhere. ;-)  In the absence of
>     such data, an LD distance will have to do – but it will almost
>     certainly need to be greater than 2.
>
>     I added this to 'TestPhoneticFilter' and it
>     passes:  assertAlgorithm(new DoubleMetaphone(), false,
>     "Tchaikovsky Chicovsky", new String[] { "XKFS", "XKFS" });
>
>     So if you want to give me all your street names, i can sell you a
>     phonetic database, or you can use the filters in
>     modules/analyzers/phonetic, which have a bunch of different
>     configurable algorithms :)
>
>
>     -- 
>     Robert Muir
>     rcmuir@gmail.com <ma...@gmail.com>
>
>
>
>
> -- 
> Robert Muir
> rcmuir@gmail.com <ma...@gmail.com>

Re: LevenshteinFilter proposal

Posted by Robert Muir <rc...@gmail.com>.
or Washington / District of Columbia.

my point is i wouldnt do anything complicated and slow if you can get away
with analyzers/phonetics and maybe some synonyms, or maybe even just the
spellchecker.
theres always crazy cases none of these algorithms will work for.

On Mon, Jul 26, 2010 at 2:09 PM, Walter Underwood <wu...@wunderwood.org>wrote:

> Try mixing colonial influences with Native American names.
>
> When my parents moved to Baton Rouge, LA years ago, they got a
> recommendation for a "Dr. Kyto". They couldn't find him. Years later, they
> met Dr. Cailleteaux. Nice man.
>
> And the French don't pronounce Baton Rouge as "batten wrooj".
>
> Also: Natchitoches is "NACK-a-tish".
>
> wunder
>
> On Jul 26, 2010, at 10:44 AM, <ka...@nokia.com> wrote:
>
> Clearly you haven’t been in the Northeast much.  Try “Worcester” vs.
> “wuster”, or “Leominster” vs. “leminster”.  It’s also likely to be a
> challenge to come up with the right phonetics for any given proper location
> name.   It’s even worse in Britain, or countries where the phonetic rules
> may be a hodgepodge of different colonial influences.
>
> That having been said, if there exists a “PhoneticQuery” object that does
> all this using the automaton logic under the covers, I think it would be
> worth a serious look.
>
> Karl
>
>
>  *From:* ext Robert Muir [mailto:rcmuir@gmail.com]
> *Sent:* Monday, July 26, 2010 1:24 PM
> *To:* dev@lucene.apache.org
> *Subject:* Re: LevenshteinFilter proposal
>
>
>
> On Mon, Jul 26, 2010 at 1:13 PM, <ka...@nokia.com> wrote:
> What I want to capture is situations where people misspell things in
> roughly a phonetic way.  For example, “Tchaikovsky Avenue” might be
> misspelled as “Chicovsky Avenue”.  Modules that do phonetic mapping are
> possible but you’d have to somehow generate a phonetic database of (say)
> streetnames, worldwide.  Good luck on getting hold of that kind of data
> anywhere. ;-)  In the absence of such data, an LD distance will have to do –
> but it will almost certainly need to be greater than 2.
> I added this to 'TestPhoneticFilter' and it passes:  assertAlgorithm(new
> DoubleMetaphone(), false, "Tchaikovsky Chicovsky", new String[] { "XKFS",
> "XKFS" });
>
> So if you want to give me all your street names, i can sell you a phonetic
> database, or you can use the filters in modules/analyzers/phonetic, which
> have a bunch of different configurable algorithms :)
>
> --
> Robert Muir
> rcmuir@gmail.com
>
>
>
>
>


-- 
Robert Muir
rcmuir@gmail.com

Re: LevenshteinFilter proposal

Posted by Walter Underwood <wu...@wunderwood.org>.
Try mixing colonial influences with Native American names.

When my parents moved to Baton Rouge, LA years ago, they got a recommendation for a "Dr. Kyto". They couldn't find him. Years later, they met Dr. Cailleteaux. Nice man.

And the French don't pronounce Baton Rouge as "batten wrooj".

Also: Natchitoches is "NACK-a-tish".

wunder

On Jul 26, 2010, at 10:44 AM, <ka...@nokia.com> wrote:

> Clearly you haven’t been in the Northeast much.  Try “Worcester” vs. “wuster”, or “Leominster” vs. “leminster”.  It’s also likely to be a challenge to come up with the right phonetics for any given proper location name.   It’s even worse in Britain, or countries where the phonetic rules may be a hodgepodge of different colonial influences.
>  
> That having been said, if there exists a “PhoneticQuery” object that does all this using the automaton logic under the covers, I think it would be  worth a serious look.
>  
> Karl
>  
>  
> From: ext Robert Muir [mailto:rcmuir@gmail.com] 
> Sent: Monday, July 26, 2010 1:24 PM
> To: dev@lucene.apache.org
> Subject: Re: LevenshteinFilter proposal
>  
>  
> 
> On Mon, Jul 26, 2010 at 1:13 PM, <ka...@nokia.com> wrote:
> What I want to capture is situations where people misspell things in roughly a phonetic way.  For example, “Tchaikovsky Avenue” might be misspelled as “Chicovsky Avenue”.  Modules that do phonetic mapping are possible but you’d have to somehow generate a phonetic database of (say) streetnames, worldwide.  Good luck on getting hold of that kind of data anywhere. ;-)  In the absence of such data, an LD distance will have to do – but it will almost certainly need to be greater than 2.
> I added this to 'TestPhoneticFilter' and it passes:  assertAlgorithm(new DoubleMetaphone(), false, "Tchaikovsky Chicovsky", new String[] { "XKFS", "XKFS" });
>  
> So if you want to give me all your street names, i can sell you a phonetic database, or you can use the filters in modules/analyzers/phonetic, which have a bunch of different configurable algorithms :)
> 
> -- 
> Robert Muir
> rcmuir@gmail.com





Re: LevenshteinFilter proposal

Posted by Robert Muir <rc...@gmail.com>.
Nah, its an analyzer. so you can just use termquery (fast: exact match).
at query and index time it just maps stuff to a key... typically you would
just put this in a separate field.

you can combine this with your edit distance query with a booleanquery, for
example the edit distance can handle your le[o]minster just fine.

I think this would be much better for you, i wouldnt abuse levenshtein for
phonetics stuff, its not designed for that.

On Mon, Jul 26, 2010 at 1:44 PM, <ka...@nokia.com> wrote:

>  Clearly you haven’t been in the Northeast much.  Try “Worcester” vs.
> “wuster”, or “Leominster” vs. “leminster”.  It’s also likely to be a
> challenge to come up with the right phonetics for any given proper location
> name.   It’s even worse in Britain, or countries where the phonetic rules
> may be a hodgepodge of different colonial influences.
>
>
>
> That having been said, if there exists a “PhoneticQuery” object that does
> all this using the automaton logic under the covers, I think it would be
> worth a serious look.
>
>
>
> Karl
>
>
>
>
>
> *From:* ext Robert Muir [mailto:rcmuir@gmail.com]
> *Sent:* Monday, July 26, 2010 1:24 PM
>
> *To:* dev@lucene.apache.org
> *Subject:* Re: LevenshteinFilter proposal
>
>
>
>
>
> On Mon, Jul 26, 2010 at 1:13 PM, <ka...@nokia.com> wrote:
>
> What I want to capture is situations where people misspell things in
> roughly a phonetic way.  For example, “Tchaikovsky Avenue” might be
> misspelled as “Chicovsky Avenue”.  Modules that do phonetic mapping are
> possible but you’d have to somehow generate a phonetic database of (say)
> streetnames, worldwide.  Good luck on getting hold of that kind of data
> anywhere. ;-)  In the absence of such data, an LD distance will have to do –
> but it will almost certainly need to be greater than 2.
>
> I added this to 'TestPhoneticFilter' and it passes:  assertAlgorithm(new
> DoubleMetaphone(), false, "Tchaikovsky Chicovsky", new String[] { "XKFS",
> "XKFS" });
>
>
>
> So if you want to give me all your street names, i can sell you a phonetic
> database, or you can use the filters in modules/analyzers/phonetic, which
> have a bunch of different configurable algorithms :)
>
>
> --
> Robert Muir
> rcmuir@gmail.com
>



-- 
Robert Muir
rcmuir@gmail.com

RE: LevenshteinFilter proposal

Posted by ka...@nokia.com.
Clearly you haven’t been in the Northeast much.  Try “Worcester” vs. “wuster”, or “Leominster” vs. “leminster”.  It’s also likely to be a challenge to come up with the right phonetics for any given proper location name.   It’s even worse in Britain, or countries where the phonetic rules may be a hodgepodge of different colonial influences.

That having been said, if there exists a “PhoneticQuery” object that does all this using the automaton logic under the covers, I think it would be  worth a serious look.

Karl


From: ext Robert Muir [mailto:rcmuir@gmail.com]
Sent: Monday, July 26, 2010 1:24 PM
To: dev@lucene.apache.org
Subject: Re: LevenshteinFilter proposal


On Mon, Jul 26, 2010 at 1:13 PM, <ka...@nokia.com>> wrote:
What I want to capture is situations where people misspell things in roughly a phonetic way.  For example, “Tchaikovsky Avenue” might be misspelled as “Chicovsky Avenue”.  Modules that do phonetic mapping are possible but you’d have to somehow generate a phonetic database of (say) streetnames, worldwide.  Good luck on getting hold of that kind of data anywhere. ;-)  In the absence of such data, an LD distance will have to do – but it will almost certainly need to be greater than 2.
I added this to 'TestPhoneticFilter' and it passes:  assertAlgorithm(new DoubleMetaphone(), false, "Tchaikovsky Chicovsky", new String[] { "XKFS", "XKFS" });

So if you want to give me all your street names, i can sell you a phonetic database, or you can use the filters in modules/analyzers/phonetic, which have a bunch of different configurable algorithms :)

--
Robert Muir
rcmuir@gmail.com<ma...@gmail.com>

Re: LevenshteinFilter proposal

Posted by Robert Muir <rc...@gmail.com>.
On Mon, Jul 26, 2010 at 1:13 PM, <ka...@nokia.com> wrote:
>
> What I want to capture is situations where people misspell things in
> roughly a phonetic way.  For example, “Tchaikovsky Avenue” might be
> misspelled as “Chicovsky Avenue”.  Modules that do phonetic mapping are
> possible but you’d have to somehow generate a phonetic database of (say)
> streetnames, worldwide.  Good luck on getting hold of that kind of data
> anywhere. ;-)  In the absence of such data, an LD distance will have to do –
> but it will almost certainly need to be greater than 2.
>
I added this to 'TestPhoneticFilter' and it passes:  assertAlgorithm(new
DoubleMetaphone(), false, "Tchaikovsky Chicovsky", new String[] { "XKFS",
"XKFS" });

So if you want to give me all your street names, i can sell you a phonetic
database, or you can use the filters in modules/analyzers/phonetic, which
have a bunch of different configurable algorithms :)

-- 
Robert Muir
rcmuir@gmail.com

RE: LevenshteinFilter proposal

Posted by ka...@nokia.com.
>>>>>>
I really wouldnt recommend this (though i am not sure what exactly you are trying to do), instead i would recommend analyzers/phonetic to deal with the phonetic stuff.
<<<<<<

What I want to capture is situations where people misspell things in roughly a phonetic way.  For example, “Tchaikovsky Avenue” might be misspelled as “Chicovsky Avenue”.  Modules that do phonetic mapping are possible but you’d have to somehow generate a phonetic database of (say) streetnames, worldwide.  Good luck on getting hold of that kind of data anywhere. ;-)  In the absence of such data, an LD distance will have to do – but it will almost certainly need to be greater than 2.

Karl


From: ext Robert Muir [mailto:rcmuir@gmail.com]
Sent: Monday, July 26, 2010 1:07 PM
To: dev@lucene.apache.org
Subject: Re: LevenshteinFilter proposal


On Mon, Jul 26, 2010 at 12:51 PM, <ka...@nokia.com>> wrote:
Is the automaton creation code checked in anywhere?

yes, its generated from lucene/build.xml (createLevAutomata task)
Also, if it’s not possible to generate the automaton table on the fly, perhaps it could be included in an optional jar (whose existence would be queried by reflection).  The reason I’m interested in LD’s of three, especially, is because I can well imagine phonetic misspellings that would have distances on that order.

I really wouldnt recommend this (though i am not sure what exactly you are trying to do), instead i would recommend analyzers/phonetic to deal with the phonetic stuff.

Karl


From: ext Robert Muir [mailto:rcmuir@gmail.com<ma...@gmail.com>]
Sent: Monday, July 26, 2010 11:49 AM

To: dev@lucene.apache.org<ma...@lucene.apache.org>
Subject: Re: LevenshteinFilter proposal

I wanted to show you what i mean, just so you know:

here is what 'ant createLevAutomata' does now:
createLevAutomata:
     [exec] Wrote Lev1ParametricDescription.java [102 lines; 3.7 KB]
     [exec] Wrote Lev2ParametricDescription.java [147 lines; 9.6 KB]

BUILD SUCCESSFUL
Total time: 4 seconds

I modified it with the patch below to generate 3 and 4, just for kicks, and ran it again... it took it 7 minutes to generate lev3, its still running on lev4 (might take hours or days, i dont actually know)

createLevAutomata:
     [exec] Wrote Lev1ParametricDescription.java [102 lines; 3.7 KB]
     [exec] Wrote Lev2ParametricDescription.java [147 lines; 9.6 KB]
     [exec] Wrote Lev3ParametricDescription.java [333 lines; 167.2 KB]
...

Index: build.xml
===================================================================
--- build.xml   (revision 959288)
+++ build.xml   (working copy)
@@ -533,6 +533,8 @@
   <target name="createLevAutomata" depends="check-moman,clone-moman,pull-moman">
        <createLevAutomaton n="1"/>
        <createLevAutomaton n="2"/>
+       <createLevAutomaton n="3"/>
+       <createLevAutomaton n="4"/>
   </target>

   <target name="check-moman">

On Mon, Jul 26, 2010 at 11:39 AM, Michael McCandless <lu...@mikemccandless.com>> wrote:
On Mon, Jul 26, 2010 at 11:35 AM, Robert Muir <rc...@gmail.com>> wrote:
> On Mon, Jul 26, 2010 at 11:30 AM, <ka...@nokia.com>> wrote:
>>
>> It occurs to me that the proper static initializer code might well be able
>> to generate distances of 3, 4 or whatever, without bloating the jar.
>
> I disagree: generating these distances is a non-trivial algorithm and takes
> minutes or hours to compute...
Furthermore, our own implementation for this algo is in Python :)

>> Nevertheless, the real question of import to me right now is: what
>> “minimumDistance” value corresponds to a Levenshtein distance of 1?  2?  The
>> maximum “minimumDistance” the query accepts is 1.0.
>
> right, fuzzyquery uses a strange formula that doesnt tie directly to edit
> distance (unless you factor in length).
> you can do this easier like this:
> LevenshteinAutomata builder = new LevenshteinAutomata("foobar");
> Automaton ed2 = builder.toAutomaton(2);
> // the term doesnt really matter, just for the field name and toString, etc.
> AutomatonQuery query = new AutomatonQuery(new Term("field", "foobar~2"),
> ed2);
>
>>
>>
>>
>> Karl
>>
>>
>>
>>
>>
>> From: ext Robert Muir [mailto:rcmuir@gmail.com<ma...@gmail.com>]
>> Sent: Friday, July 23, 2010 11:22 AM
>>
>> To: dev@lucene.apache.org<ma...@lucene.apache.org>
>> Subject: Re: LevenshteinFilter proposal
>>
>>
>>
>> Well, there are two main things involved:
>>
>> 1. number of terms seeked to in the term enum
>>
>> 2. expense of the comparison itself.
>>
>>
>>
>> one challenge is the construction of a DFA LevK(x) that recognizes all
>> words within edit distance <= k of x is an expensive operation.
>>
>> This is because of the nature of the computation (traditionally its
>> dynamic programming with nasty runtime).
>>
>>
>>
>> We use http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.26.2940 to
>> offline the nasty part so its linear time: O(n) where n is the length of the
>> query term.
>>
>> But these offline-computated tables get pretty large quick, and so we only
>> include 1,2 to keep the lucene jar file down.
>>
>>
>>
>> additionally for n > 2 it starts to be a ton of termenum seeks... at some
>> of these high distances its faster to just next() thru the terms and
>> compare.
>>
>>
>>
>> we might be able to speed it up more by including say a table for n=3 (at
>> the expense of increased jar size), but not using this n=3 table for #1,
>> only to speed up #2. but its diminishing returns here...
>>
>>
>>
>> On Fri, Jul 23, 2010 at 11:09 AM, <ka...@nokia.com>> wrote:
>>
>> Glad I asked.
>>
>>
>>
>> I would think that the automaton would be superior even for larger edit
>> distances than 1 or 2 than the equivalent “crappy” algorithm.  But maybe I
>> don’t understand something. ;-)
>>
>>
>>
>> Karl
>>
>>
>>
>>
>>
>> From: ext Robert Muir [mailto:rcmuir@gmail.com<ma...@gmail.com>]
>> Sent: Friday, July 23, 2010 11:05 AM
>>
>> To: dev@lucene.apache.org<ma...@lucene.apache.org>
>>
>> Subject: Re: LevenshteinFilter proposal
>>
>>
>>
>> this is actually done in trunk.
>>
>>
>>
>> In trunk fuzzy's enum is a "proxy". for low distances (ed=1,2) it uses
>> automaton.
>>
>>
>>
>> for higher distances it uses the crappy "brute force" method.
>>
>> but, higher distances still get accelerated if you use a reasonable
>> 'maxExpansions' to FuzzyQuery... the default is quite bad (1024).
>>
>>
>>
>>
>>
>> On Fri, Jul 23, 2010 at 10:59 AM, <ka...@nokia.com>> wrote:
>>
>> Thanks!
>>
>>
>>
>> FuzzyQuery will do for my purposes, for the interim.  But I suspect that
>> FuzzyQuery could be made a lot more efficient if it were rebuilt on top of
>> Automaton, no?  I understand that this would be a trunk project.
>>
>>
>>
>> Karl
>>
>>
>>
>>
>>
>> From: ext Uwe Schindler [mailto:uwe@thetaphi.de<ma...@thetaphi.de>]
>> Sent: Friday, July 23, 2010 10:45 AM
>>
>> To: dev@lucene.apache.org<ma...@lucene.apache.org>
>>
>> Subject: RE: LevenshteinFilter proposal
>>
>>
>>
>> Automaton is only in Lucene/Solr Trunk. To get a filter out of FuzzyQuery,
>> use MultiTermQueryWrapperFilter(new FuzzyQuery(…))
>>
>>
>>
>> -----
>>
>> Uwe Schindler
>>
>> H.-H.-Meier-Allee 63, D-28213 Bremen
>>
>> http://www.thetaphi.de
>>
>> eMail: uwe@thetaphi.de<ma...@thetaphi.de>
>>
>>
>>
>> From: karl.wright@nokia.com<ma...@nokia.com> [mailto:karl.wright@nokia.com<ma...@nokia.com>]
>> Sent: Friday, July 23, 2010 4:25 PM
>> To: dev@lucene.apache.org<ma...@lucene.apache.org>
>> Subject: LevenshteinFilter proposal
>>
>>
>>
>> Hi Folks,
>>
>> I’m very interested in using (or developing!) a Levenshtein Filter within
>> the family of Solr Filter objects. I don’t see such a class today anywhere.
>> I see how the AutomatonQuery object would permit such a thing to be built,
>> but to date I don’t know of anyone who has built one. Do you?  If not, I’m
>> willing to give it a whirl.  Also, AutomatonQuery doesn’t seem to come up
>> when I look for it in the javadocs for Lucene – can you point me in the
>> correct direction?
>>
>> Thanks!
>> Karl
>>
>>
>>
>>
>>
>>
>> --
>> Robert Muir
>> rcmuir@gmail.com<ma...@gmail.com>
>>
>>
>> --
>> Robert Muir
>> rcmuir@gmail.com<ma...@gmail.com>
>
>
> --
> Robert Muir
> rcmuir@gmail.com<ma...@gmail.com>
>
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org<ma...@lucene.apache.org>
For additional commands, e-mail: dev-help@lucene.apache.org<ma...@lucene.apache.org>



--
Robert Muir
rcmuir@gmail.com<ma...@gmail.com>



--
Robert Muir
rcmuir@gmail.com<ma...@gmail.com>

Re: LevenshteinFilter proposal

Posted by Robert Muir <rc...@gmail.com>.
On Mon, Jul 26, 2010 at 12:51 PM, <ka...@nokia.com> wrote:

>  Is the automaton creation code checked in anywhere?
>

yes, its generated from lucene/build.xml (createLevAutomata task)

> Also, if it’s not possible to generate the automaton table on the fly,
> perhaps it could be included in an optional jar (whose existence would be
> queried by reflection).  The reason I’m interested in LD’s of three,
> especially, is because I can well imagine phonetic misspellings that would
> have distances on that order.
>

I really wouldnt recommend this (though i am not sure what exactly you are
trying to do), instead i would recommend analyzers/phonetic to deal with the
phonetic stuff.

>
>
> Karl
>
>
>
>
>
> *From:* ext Robert Muir [mailto:rcmuir@gmail.com]
> *Sent:* Monday, July 26, 2010 11:49 AM
>
> *To:* dev@lucene.apache.org
> *Subject:* Re: LevenshteinFilter proposal
>
>
>
> I wanted to show you what i mean, just so you know:
>
>
>
> here is what 'ant createLevAutomata' does now:
>
> createLevAutomata:
>
>      [exec] Wrote Lev1ParametricDescription.java [102 lines; 3.7 KB]
>
>      [exec] Wrote Lev2ParametricDescription.java [147 lines; 9.6 KB]
>
>
>
> BUILD SUCCESSFUL
>
> Total time: 4 seconds
>
>
>
> I modified it with the patch below to generate 3 and 4, just for kicks, and
> ran it again... it took it 7 minutes to generate lev3, its still running on
> lev4 (might take hours or days, i dont actually know)
>
>
>
> createLevAutomata:
>
>      [exec] Wrote Lev1ParametricDescription.java [102 lines; 3.7 KB]
>
>      [exec] Wrote Lev2ParametricDescription.java [147 lines; 9.6 KB]
>
>      [exec] Wrote Lev3ParametricDescription.java [333 lines; 167.2 KB]
>
> ...
>
>
>
> Index: build.xml
>
> ===================================================================
>
> --- build.xml   (revision 959288)
>
> +++ build.xml   (working copy)
>
> @@ -533,6 +533,8 @@
>
>    <target name="createLevAutomata"
> depends="check-moman,clone-moman,pull-moman">
>
>         <createLevAutomaton n="1"/>
>
>         <createLevAutomaton n="2"/>
>
> +       <createLevAutomaton n="3"/>
>
> +       <createLevAutomaton n="4"/>
>
>    </target>
>
>
>
>    <target name="check-moman">
>
>
>
> On Mon, Jul 26, 2010 at 11:39 AM, Michael McCandless <
> lucene@mikemccandless.com> wrote:
>
> On Mon, Jul 26, 2010 at 11:35 AM, Robert Muir <rc...@gmail.com> wrote:
> > On Mon, Jul 26, 2010 at 11:30 AM, <ka...@nokia.com> wrote:
> >>
> >> It occurs to me that the proper static initializer code might well be
> able
> >> to generate distances of 3, 4 or whatever, without bloating the jar.
> >
> > I disagree: generating these distances is a non-trivial algorithm and
> takes
> > minutes or hours to compute...
>
> Furthermore, our own implementation for this algo is in Python :)
>
>
> >> Nevertheless, the real question of import to me right now is: what
> >> “minimumDistance” value corresponds to a Levenshtein distance of 1?  2?
> The
> >> maximum “minimumDistance” the query accepts is 1.0.
> >
> > right, fuzzyquery uses a strange formula that doesnt tie directly to edit
> > distance (unless you factor in length).
> > you can do this easier like this:
> > LevenshteinAutomata builder = new LevenshteinAutomata("foobar");
> > Automaton ed2 = builder.toAutomaton(2);
> > // the term doesnt really matter, just for the field name and toString,
> etc.
> > AutomatonQuery query = new AutomatonQuery(new Term("field", "foobar~2"),
> > ed2);
> >
> >>
> >>
> >>
> >> Karl
> >>
> >>
> >>
> >>
> >>
> >> From: ext Robert Muir [mailto:rcmuir@gmail.com]
> >> Sent: Friday, July 23, 2010 11:22 AM
> >>
> >> To: dev@lucene.apache.org
> >> Subject: Re: LevenshteinFilter proposal
> >>
> >>
> >>
> >> Well, there are two main things involved:
> >>
> >> 1. number of terms seeked to in the term enum
> >>
> >> 2. expense of the comparison itself.
> >>
> >>
> >>
> >> one challenge is the construction of a DFA LevK(x) that recognizes all
> >> words within edit distance <= k of x is an expensive operation.
> >>
> >> This is because of the nature of the computation (traditionally its
> >> dynamic programming with nasty runtime).
> >>
> >>
> >>
> >> We use http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.26.2940
>  to
> >> offline the nasty part so its linear time: O(n) where n is the length of
> the
> >> query term.
> >>
> >> But these offline-computated tables get pretty large quick, and so we
> only
> >> include 1,2 to keep the lucene jar file down.
> >>
> >>
> >>
> >> additionally for n > 2 it starts to be a ton of termenum seeks... at
> some
> >> of these high distances its faster to just next() thru the terms and
> >> compare.
> >>
> >>
> >>
> >> we might be able to speed it up more by including say a table for n=3
> (at
> >> the expense of increased jar size), but not using this n=3 table for #1,
> >> only to speed up #2. but its diminishing returns here...
> >>
> >>
> >>
> >> On Fri, Jul 23, 2010 at 11:09 AM, <ka...@nokia.com> wrote:
> >>
> >> Glad I asked.
> >>
> >>
> >>
> >> I would think that the automaton would be superior even for larger edit
> >> distances than 1 or 2 than the equivalent “crappy” algorithm.  But maybe
> I
> >> don’t understand something. ;-)
> >>
> >>
> >>
> >> Karl
> >>
> >>
> >>
> >>
> >>
> >> From: ext Robert Muir [mailto:rcmuir@gmail.com]
> >> Sent: Friday, July 23, 2010 11:05 AM
> >>
> >> To: dev@lucene.apache.org
> >>
> >> Subject: Re: LevenshteinFilter proposal
> >>
> >>
> >>
> >> this is actually done in trunk.
> >>
> >>
> >>
> >> In trunk fuzzy's enum is a "proxy". for low distances (ed=1,2) it uses
> >> automaton.
> >>
> >>
> >>
> >> for higher distances it uses the crappy "brute force" method.
> >>
> >> but, higher distances still get accelerated if you use a reasonable
> >> 'maxExpansions' to FuzzyQuery... the default is quite bad (1024).
> >>
> >>
> >>
> >>
> >>
> >> On Fri, Jul 23, 2010 at 10:59 AM, <ka...@nokia.com> wrote:
> >>
> >> Thanks!
> >>
> >>
> >>
> >> FuzzyQuery will do for my purposes, for the interim.  But I suspect that
> >> FuzzyQuery could be made a lot more efficient if it were rebuilt on top
> of
> >> Automaton, no?  I understand that this would be a trunk project.
> >>
> >>
> >>
> >> Karl
> >>
> >>
> >>
> >>
> >>
> >> From: ext Uwe Schindler [mailto:uwe@thetaphi.de]
> >> Sent: Friday, July 23, 2010 10:45 AM
> >>
> >> To: dev@lucene.apache.org
> >>
> >> Subject: RE: LevenshteinFilter proposal
> >>
> >>
> >>
> >> Automaton is only in Lucene/Solr Trunk. To get a filter out of
> FuzzyQuery,
> >> use MultiTermQueryWrapperFilter(new FuzzyQuery(…))
> >>
> >>
> >>
> >> -----
> >>
> >> Uwe Schindler
> >>
> >> H.-H.-Meier-Allee 63, D-28213 Bremen
> >>
> >> http://www.thetaphi.de
> >>
> >> eMail: uwe@thetaphi.de
> >>
> >>
> >>
> >> From: karl.wright@nokia.com [mailto:karl.wright@nokia.com]
> >> Sent: Friday, July 23, 2010 4:25 PM
> >> To: dev@lucene.apache.org
> >> Subject: LevenshteinFilter proposal
> >>
> >>
> >>
> >> Hi Folks,
> >>
> >> I’m very interested in using (or developing!) a Levenshtein Filter
> within
> >> the family of Solr Filter objects. I don’t see such a class today
> anywhere.
> >> I see how the AutomatonQuery object would permit such a thing to be
> built,
> >> but to date I don’t know of anyone who has built one. Do you?  If not,
> I’m
> >> willing to give it a whirl.  Also, AutomatonQuery doesn’t seem to come
> up
> >> when I look for it in the javadocs for Lucene – can you point me in the
> >> correct direction?
> >>
> >> Thanks!
> >> Karl
> >>
> >>
> >>
> >>
> >>
> >>
> >> --
> >> Robert Muir
> >> rcmuir@gmail.com
> >>
> >>
> >> --
> >> Robert Muir
> >> rcmuir@gmail.com
> >
> >
> > --
> > Robert Muir
> > rcmuir@gmail.com
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>
>
>
> --
> Robert Muir
> rcmuir@gmail.com
>



-- 
Robert Muir
rcmuir@gmail.com

RE: LevenshteinFilter proposal

Posted by ka...@nokia.com.
Is the automaton creation code checked in anywhere?
Also, if it’s not possible to generate the automaton table on the fly, perhaps it could be included in an optional jar (whose existence would be queried by reflection).  The reason I’m interested in LD’s of three, especially, is because I can well imagine phonetic misspellings that would have distances on that order.

Karl


From: ext Robert Muir [mailto:rcmuir@gmail.com]
Sent: Monday, July 26, 2010 11:49 AM
To: dev@lucene.apache.org
Subject: Re: LevenshteinFilter proposal

I wanted to show you what i mean, just so you know:

here is what 'ant createLevAutomata' does now:
createLevAutomata:
     [exec] Wrote Lev1ParametricDescription.java [102 lines; 3.7 KB]
     [exec] Wrote Lev2ParametricDescription.java [147 lines; 9.6 KB]

BUILD SUCCESSFUL
Total time: 4 seconds

I modified it with the patch below to generate 3 and 4, just for kicks, and ran it again... it took it 7 minutes to generate lev3, its still running on lev4 (might take hours or days, i dont actually know)

createLevAutomata:
     [exec] Wrote Lev1ParametricDescription.java [102 lines; 3.7 KB]
     [exec] Wrote Lev2ParametricDescription.java [147 lines; 9.6 KB]
     [exec] Wrote Lev3ParametricDescription.java [333 lines; 167.2 KB]
...

Index: build.xml
===================================================================
--- build.xml   (revision 959288)
+++ build.xml   (working copy)
@@ -533,6 +533,8 @@
   <target name="createLevAutomata" depends="check-moman,clone-moman,pull-moman">
        <createLevAutomaton n="1"/>
        <createLevAutomaton n="2"/>
+       <createLevAutomaton n="3"/>
+       <createLevAutomaton n="4"/>
   </target>

   <target name="check-moman">

On Mon, Jul 26, 2010 at 11:39 AM, Michael McCandless <lu...@mikemccandless.com>> wrote:
On Mon, Jul 26, 2010 at 11:35 AM, Robert Muir <rc...@gmail.com>> wrote:
> On Mon, Jul 26, 2010 at 11:30 AM, <ka...@nokia.com>> wrote:
>>
>> It occurs to me that the proper static initializer code might well be able
>> to generate distances of 3, 4 or whatever, without bloating the jar.
>
> I disagree: generating these distances is a non-trivial algorithm and takes
> minutes or hours to compute...
Furthermore, our own implementation for this algo is in Python :)

>> Nevertheless, the real question of import to me right now is: what
>> “minimumDistance” value corresponds to a Levenshtein distance of 1?  2?  The
>> maximum “minimumDistance” the query accepts is 1.0.
>
> right, fuzzyquery uses a strange formula that doesnt tie directly to edit
> distance (unless you factor in length).
> you can do this easier like this:
> LevenshteinAutomata builder = new LevenshteinAutomata("foobar");
> Automaton ed2 = builder.toAutomaton(2);
> // the term doesnt really matter, just for the field name and toString, etc.
> AutomatonQuery query = new AutomatonQuery(new Term("field", "foobar~2"),
> ed2);
>
>>
>>
>>
>> Karl
>>
>>
>>
>>
>>
>> From: ext Robert Muir [mailto:rcmuir@gmail.com<ma...@gmail.com>]
>> Sent: Friday, July 23, 2010 11:22 AM
>>
>> To: dev@lucene.apache.org<ma...@lucene.apache.org>
>> Subject: Re: LevenshteinFilter proposal
>>
>>
>>
>> Well, there are two main things involved:
>>
>> 1. number of terms seeked to in the term enum
>>
>> 2. expense of the comparison itself.
>>
>>
>>
>> one challenge is the construction of a DFA LevK(x) that recognizes all
>> words within edit distance <= k of x is an expensive operation.
>>
>> This is because of the nature of the computation (traditionally its
>> dynamic programming with nasty runtime).
>>
>>
>>
>> We use http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.26.2940 to
>> offline the nasty part so its linear time: O(n) where n is the length of the
>> query term.
>>
>> But these offline-computated tables get pretty large quick, and so we only
>> include 1,2 to keep the lucene jar file down.
>>
>>
>>
>> additionally for n > 2 it starts to be a ton of termenum seeks... at some
>> of these high distances its faster to just next() thru the terms and
>> compare.
>>
>>
>>
>> we might be able to speed it up more by including say a table for n=3 (at
>> the expense of increased jar size), but not using this n=3 table for #1,
>> only to speed up #2. but its diminishing returns here...
>>
>>
>>
>> On Fri, Jul 23, 2010 at 11:09 AM, <ka...@nokia.com>> wrote:
>>
>> Glad I asked.
>>
>>
>>
>> I would think that the automaton would be superior even for larger edit
>> distances than 1 or 2 than the equivalent “crappy” algorithm.  But maybe I
>> don’t understand something. ;-)
>>
>>
>>
>> Karl
>>
>>
>>
>>
>>
>> From: ext Robert Muir [mailto:rcmuir@gmail.com<ma...@gmail.com>]
>> Sent: Friday, July 23, 2010 11:05 AM
>>
>> To: dev@lucene.apache.org<ma...@lucene.apache.org>
>>
>> Subject: Re: LevenshteinFilter proposal
>>
>>
>>
>> this is actually done in trunk.
>>
>>
>>
>> In trunk fuzzy's enum is a "proxy". for low distances (ed=1,2) it uses
>> automaton.
>>
>>
>>
>> for higher distances it uses the crappy "brute force" method.
>>
>> but, higher distances still get accelerated if you use a reasonable
>> 'maxExpansions' to FuzzyQuery... the default is quite bad (1024).
>>
>>
>>
>>
>>
>> On Fri, Jul 23, 2010 at 10:59 AM, <ka...@nokia.com>> wrote:
>>
>> Thanks!
>>
>>
>>
>> FuzzyQuery will do for my purposes, for the interim.  But I suspect that
>> FuzzyQuery could be made a lot more efficient if it were rebuilt on top of
>> Automaton, no?  I understand that this would be a trunk project.
>>
>>
>>
>> Karl
>>
>>
>>
>>
>>
>> From: ext Uwe Schindler [mailto:uwe@thetaphi.de<ma...@thetaphi.de>]
>> Sent: Friday, July 23, 2010 10:45 AM
>>
>> To: dev@lucene.apache.org<ma...@lucene.apache.org>
>>
>> Subject: RE: LevenshteinFilter proposal
>>
>>
>>
>> Automaton is only in Lucene/Solr Trunk. To get a filter out of FuzzyQuery,
>> use MultiTermQueryWrapperFilter(new FuzzyQuery(…))
>>
>>
>>
>> -----
>>
>> Uwe Schindler
>>
>> H.-H.-Meier-Allee 63, D-28213 Bremen
>>
>> http://www.thetaphi.de
>>
>> eMail: uwe@thetaphi.de<ma...@thetaphi.de>
>>
>>
>>
>> From: karl.wright@nokia.com<ma...@nokia.com> [mailto:karl.wright@nokia.com<ma...@nokia.com>]
>> Sent: Friday, July 23, 2010 4:25 PM
>> To: dev@lucene.apache.org<ma...@lucene.apache.org>
>> Subject: LevenshteinFilter proposal
>>
>>
>>
>> Hi Folks,
>>
>> I’m very interested in using (or developing!) a Levenshtein Filter within
>> the family of Solr Filter objects. I don’t see such a class today anywhere.
>> I see how the AutomatonQuery object would permit such a thing to be built,
>> but to date I don’t know of anyone who has built one. Do you?  If not, I’m
>> willing to give it a whirl.  Also, AutomatonQuery doesn’t seem to come up
>> when I look for it in the javadocs for Lucene – can you point me in the
>> correct direction?
>>
>> Thanks!
>> Karl
>>
>>
>>
>>
>>
>>
>> --
>> Robert Muir
>> rcmuir@gmail.com<ma...@gmail.com>
>>
>>
>> --
>> Robert Muir
>> rcmuir@gmail.com<ma...@gmail.com>
>
>
> --
> Robert Muir
> rcmuir@gmail.com<ma...@gmail.com>
>
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org<ma...@lucene.apache.org>
For additional commands, e-mail: dev-help@lucene.apache.org<ma...@lucene.apache.org>



--
Robert Muir
rcmuir@gmail.com<ma...@gmail.com>

Re: LevenshteinFilter proposal

Posted by Robert Muir <rc...@gmail.com>.
I wanted to show you what i mean, just so you know:

here is what 'ant createLevAutomata' does now:
createLevAutomata:
     [exec] Wrote Lev1ParametricDescription.java [102 lines; 3.7 KB]
     [exec] Wrote Lev2ParametricDescription.java [147 lines; 9.6 KB]

BUILD SUCCESSFUL
Total time: 4 seconds

I modified it with the patch below to generate 3 and 4, just for kicks, and
ran it again... it took it 7 minutes to generate lev3, its still running on
lev4 (might take hours or days, i dont actually know)

createLevAutomata:
     [exec] Wrote Lev1ParametricDescription.java [102 lines; 3.7 KB]
     [exec] Wrote Lev2ParametricDescription.java [147 lines; 9.6 KB]
     [exec] Wrote Lev3ParametricDescription.java [333 lines; 167.2 KB]
...

Index: build.xml
===================================================================
--- build.xml   (revision 959288)
+++ build.xml   (working copy)
@@ -533,6 +533,8 @@
   <target name="createLevAutomata"
depends="check-moman,clone-moman,pull-moman">
        <createLevAutomaton n="1"/>
        <createLevAutomaton n="2"/>
+       <createLevAutomaton n="3"/>
+       <createLevAutomaton n="4"/>
   </target>

   <target name="check-moman">

On Mon, Jul 26, 2010 at 11:39 AM, Michael McCandless <
lucene@mikemccandless.com> wrote:

> On Mon, Jul 26, 2010 at 11:35 AM, Robert Muir <rc...@gmail.com> wrote:
> > On Mon, Jul 26, 2010 at 11:30 AM, <ka...@nokia.com> wrote:
> >>
> >> It occurs to me that the proper static initializer code might well be
> able
> >> to generate distances of 3, 4 or whatever, without bloating the jar.
> >
> > I disagree: generating these distances is a non-trivial algorithm and
> takes
> > minutes or hours to compute...
>
> Furthermore, our own implementation for this algo is in Python :)
>
> >> Nevertheless, the real question of import to me right now is: what
> >> “minimumDistance” value corresponds to a Levenshtein distance of 1?  2?
> The
> >> maximum “minimumDistance” the query accepts is 1.0.
> >
> > right, fuzzyquery uses a strange formula that doesnt tie directly to edit
> > distance (unless you factor in length).
> > you can do this easier like this:
> > LevenshteinAutomata builder = new LevenshteinAutomata("foobar");
> > Automaton ed2 = builder.toAutomaton(2);
> > // the term doesnt really matter, just for the field name and toString,
> etc.
> > AutomatonQuery query = new AutomatonQuery(new Term("field", "foobar~2"),
> > ed2);
> >
> >>
> >>
> >>
> >> Karl
> >>
> >>
> >>
> >>
> >>
> >> From: ext Robert Muir [mailto:rcmuir@gmail.com]
> >> Sent: Friday, July 23, 2010 11:22 AM
> >>
> >> To: dev@lucene.apache.org
> >> Subject: Re: LevenshteinFilter proposal
> >>
> >>
> >>
> >> Well, there are two main things involved:
> >>
> >> 1. number of terms seeked to in the term enum
> >>
> >> 2. expense of the comparison itself.
> >>
> >>
> >>
> >> one challenge is the construction of a DFA LevK(x) that recognizes all
> >> words within edit distance <= k of x is an expensive operation.
> >>
> >> This is because of the nature of the computation (traditionally its
> >> dynamic programming with nasty runtime).
> >>
> >>
> >>
> >> We use http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.26.2940
>  to
> >> offline the nasty part so its linear time: O(n) where n is the length of
> the
> >> query term.
> >>
> >> But these offline-computated tables get pretty large quick, and so we
> only
> >> include 1,2 to keep the lucene jar file down.
> >>
> >>
> >>
> >> additionally for n > 2 it starts to be a ton of termenum seeks... at
> some
> >> of these high distances its faster to just next() thru the terms and
> >> compare.
> >>
> >>
> >>
> >> we might be able to speed it up more by including say a table for n=3
> (at
> >> the expense of increased jar size), but not using this n=3 table for #1,
> >> only to speed up #2. but its diminishing returns here...
> >>
> >>
> >>
> >> On Fri, Jul 23, 2010 at 11:09 AM, <ka...@nokia.com> wrote:
> >>
> >> Glad I asked.
> >>
> >>
> >>
> >> I would think that the automaton would be superior even for larger edit
> >> distances than 1 or 2 than the equivalent “crappy” algorithm.  But maybe
> I
> >> don’t understand something. ;-)
> >>
> >>
> >>
> >> Karl
> >>
> >>
> >>
> >>
> >>
> >> From: ext Robert Muir [mailto:rcmuir@gmail.com]
> >> Sent: Friday, July 23, 2010 11:05 AM
> >>
> >> To: dev@lucene.apache.org
> >>
> >> Subject: Re: LevenshteinFilter proposal
> >>
> >>
> >>
> >> this is actually done in trunk.
> >>
> >>
> >>
> >> In trunk fuzzy's enum is a "proxy". for low distances (ed=1,2) it uses
> >> automaton.
> >>
> >>
> >>
> >> for higher distances it uses the crappy "brute force" method.
> >>
> >> but, higher distances still get accelerated if you use a reasonable
> >> 'maxExpansions' to FuzzyQuery... the default is quite bad (1024).
> >>
> >>
> >>
> >>
> >>
> >> On Fri, Jul 23, 2010 at 10:59 AM, <ka...@nokia.com> wrote:
> >>
> >> Thanks!
> >>
> >>
> >>
> >> FuzzyQuery will do for my purposes, for the interim.  But I suspect that
> >> FuzzyQuery could be made a lot more efficient if it were rebuilt on top
> of
> >> Automaton, no?  I understand that this would be a trunk project.
> >>
> >>
> >>
> >> Karl
> >>
> >>
> >>
> >>
> >>
> >> From: ext Uwe Schindler [mailto:uwe@thetaphi.de]
> >> Sent: Friday, July 23, 2010 10:45 AM
> >>
> >> To: dev@lucene.apache.org
> >>
> >> Subject: RE: LevenshteinFilter proposal
> >>
> >>
> >>
> >> Automaton is only in Lucene/Solr Trunk. To get a filter out of
> FuzzyQuery,
> >> use MultiTermQueryWrapperFilter(new FuzzyQuery(…))
> >>
> >>
> >>
> >> -----
> >>
> >> Uwe Schindler
> >>
> >> H.-H.-Meier-Allee 63, D-28213 Bremen
> >>
> >> http://www.thetaphi.de
> >>
> >> eMail: uwe@thetaphi.de
> >>
> >>
> >>
> >> From: karl.wright@nokia.com [mailto:karl.wright@nokia.com]
> >> Sent: Friday, July 23, 2010 4:25 PM
> >> To: dev@lucene.apache.org
> >> Subject: LevenshteinFilter proposal
> >>
> >>
> >>
> >> Hi Folks,
> >>
> >> I’m very interested in using (or developing!) a Levenshtein Filter
> within
> >> the family of Solr Filter objects. I don’t see such a class today
> anywhere.
> >> I see how the AutomatonQuery object would permit such a thing to be
> built,
> >> but to date I don’t know of anyone who has built one. Do you?  If not,
> I’m
> >> willing to give it a whirl.  Also, AutomatonQuery doesn’t seem to come
> up
> >> when I look for it in the javadocs for Lucene – can you point me in the
> >> correct direction?
> >>
> >> Thanks!
> >> Karl
> >>
> >>
> >>
> >>
> >>
> >>
> >> --
> >> Robert Muir
> >> rcmuir@gmail.com
> >>
> >>
> >> --
> >> Robert Muir
> >> rcmuir@gmail.com
> >
> >
> > --
> > Robert Muir
> > rcmuir@gmail.com
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>


-- 
Robert Muir
rcmuir@gmail.com

Re: LevenshteinFilter proposal

Posted by Michael McCandless <lu...@mikemccandless.com>.
On Mon, Jul 26, 2010 at 11:35 AM, Robert Muir <rc...@gmail.com> wrote:
> On Mon, Jul 26, 2010 at 11:30 AM, <ka...@nokia.com> wrote:
>>
>> It occurs to me that the proper static initializer code might well be able
>> to generate distances of 3, 4 or whatever, without bloating the jar.
>
> I disagree: generating these distances is a non-trivial algorithm and takes
> minutes or hours to compute...

Furthermore, our own implementation for this algo is in Python :)

>> Nevertheless, the real question of import to me right now is: what
>> “minimumDistance” value corresponds to a Levenshtein distance of 1?  2?  The
>> maximum “minimumDistance” the query accepts is 1.0.
>
> right, fuzzyquery uses a strange formula that doesnt tie directly to edit
> distance (unless you factor in length).
> you can do this easier like this:
> LevenshteinAutomata builder = new LevenshteinAutomata("foobar");
> Automaton ed2 = builder.toAutomaton(2);
> // the term doesnt really matter, just for the field name and toString, etc.
> AutomatonQuery query = new AutomatonQuery(new Term("field", "foobar~2"),
> ed2);
>
>>
>>
>>
>> Karl
>>
>>
>>
>>
>>
>> From: ext Robert Muir [mailto:rcmuir@gmail.com]
>> Sent: Friday, July 23, 2010 11:22 AM
>>
>> To: dev@lucene.apache.org
>> Subject: Re: LevenshteinFilter proposal
>>
>>
>>
>> Well, there are two main things involved:
>>
>> 1. number of terms seeked to in the term enum
>>
>> 2. expense of the comparison itself.
>>
>>
>>
>> one challenge is the construction of a DFA LevK(x) that recognizes all
>> words within edit distance <= k of x is an expensive operation.
>>
>> This is because of the nature of the computation (traditionally its
>> dynamic programming with nasty runtime).
>>
>>
>>
>> We use http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.26.2940 to
>> offline the nasty part so its linear time: O(n) where n is the length of the
>> query term.
>>
>> But these offline-computated tables get pretty large quick, and so we only
>> include 1,2 to keep the lucene jar file down.
>>
>>
>>
>> additionally for n > 2 it starts to be a ton of termenum seeks... at some
>> of these high distances its faster to just next() thru the terms and
>> compare.
>>
>>
>>
>> we might be able to speed it up more by including say a table for n=3 (at
>> the expense of increased jar size), but not using this n=3 table for #1,
>> only to speed up #2. but its diminishing returns here...
>>
>>
>>
>> On Fri, Jul 23, 2010 at 11:09 AM, <ka...@nokia.com> wrote:
>>
>> Glad I asked.
>>
>>
>>
>> I would think that the automaton would be superior even for larger edit
>> distances than 1 or 2 than the equivalent “crappy” algorithm.  But maybe I
>> don’t understand something. ;-)
>>
>>
>>
>> Karl
>>
>>
>>
>>
>>
>> From: ext Robert Muir [mailto:rcmuir@gmail.com]
>> Sent: Friday, July 23, 2010 11:05 AM
>>
>> To: dev@lucene.apache.org
>>
>> Subject: Re: LevenshteinFilter proposal
>>
>>
>>
>> this is actually done in trunk.
>>
>>
>>
>> In trunk fuzzy's enum is a "proxy". for low distances (ed=1,2) it uses
>> automaton.
>>
>>
>>
>> for higher distances it uses the crappy "brute force" method.
>>
>> but, higher distances still get accelerated if you use a reasonable
>> 'maxExpansions' to FuzzyQuery... the default is quite bad (1024).
>>
>>
>>
>>
>>
>> On Fri, Jul 23, 2010 at 10:59 AM, <ka...@nokia.com> wrote:
>>
>> Thanks!
>>
>>
>>
>> FuzzyQuery will do for my purposes, for the interim.  But I suspect that
>> FuzzyQuery could be made a lot more efficient if it were rebuilt on top of
>> Automaton, no?  I understand that this would be a trunk project.
>>
>>
>>
>> Karl
>>
>>
>>
>>
>>
>> From: ext Uwe Schindler [mailto:uwe@thetaphi.de]
>> Sent: Friday, July 23, 2010 10:45 AM
>>
>> To: dev@lucene.apache.org
>>
>> Subject: RE: LevenshteinFilter proposal
>>
>>
>>
>> Automaton is only in Lucene/Solr Trunk. To get a filter out of FuzzyQuery,
>> use MultiTermQueryWrapperFilter(new FuzzyQuery(…))
>>
>>
>>
>> -----
>>
>> Uwe Schindler
>>
>> H.-H.-Meier-Allee 63, D-28213 Bremen
>>
>> http://www.thetaphi.de
>>
>> eMail: uwe@thetaphi.de
>>
>>
>>
>> From: karl.wright@nokia.com [mailto:karl.wright@nokia.com]
>> Sent: Friday, July 23, 2010 4:25 PM
>> To: dev@lucene.apache.org
>> Subject: LevenshteinFilter proposal
>>
>>
>>
>> Hi Folks,
>>
>> I’m very interested in using (or developing!) a Levenshtein Filter within
>> the family of Solr Filter objects. I don’t see such a class today anywhere.
>> I see how the AutomatonQuery object would permit such a thing to be built,
>> but to date I don’t know of anyone who has built one. Do you?  If not, I’m
>> willing to give it a whirl.  Also, AutomatonQuery doesn’t seem to come up
>> when I look for it in the javadocs for Lucene – can you point me in the
>> correct direction?
>>
>> Thanks!
>> Karl
>>
>>
>>
>>
>>
>>
>> --
>> Robert Muir
>> rcmuir@gmail.com
>>
>>
>> --
>> Robert Muir
>> rcmuir@gmail.com
>
>
> --
> Robert Muir
> rcmuir@gmail.com
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Re: LevenshteinFilter proposal

Posted by Robert Muir <rc...@gmail.com>.
On Mon, Jul 26, 2010 at 11:30 AM, <ka...@nokia.com> wrote:

>  It occurs to me that the proper static initializer code might well be
> able to generate distances of 3, 4 or whatever, without bloating the jar.
>
> I disagree: generating these distances is a non-trivial algorithm and takes
minutes or hours to compute...

> Nevertheless, the real question of import to me right now is: what
> “minimumDistance” value corresponds to a Levenshtein distance of 1?  2?  The
> maximum “minimumDistance” the query accepts is 1.0.
>
right, fuzzyquery uses a strange formula that doesnt tie directly to edit
distance (unless you factor in length).

you can do this easier like this:
LevenshteinAutomata builder = new LevenshteinAutomata("foobar");
Automaton ed2 = builder.toAutomaton(2);
// the term doesnt really matter, just for the field name and toString, etc.
AutomatonQuery query = new AutomatonQuery(new Term("field", "foobar~2"),
ed2);


>
>
> Karl
>
>
>
>
>
> *From:* ext Robert Muir [mailto:rcmuir@gmail.com]
> *Sent:* Friday, July 23, 2010 11:22 AM
>
> *To:* dev@lucene.apache.org
> *Subject:* Re: LevenshteinFilter proposal
>
>
>
> Well, there are two main things involved:
>
> 1. number of terms seeked to in the term enum
>
> 2. expense of the comparison itself.
>
>
>
> one challenge is the construction of a DFA LevK(x) that recognizes all
> words within edit distance <= k of x is an expensive operation.
>
> This is because of the nature of the computation (traditionally its dynamic
> programming with nasty runtime).
>
>
>
> We use http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.26.2940 to
> offline the nasty part so its linear time: O(n) where n is the length of the
> query term.
>
> But these offline-computated tables get pretty large quick, and so we only
> include 1,2 to keep the lucene jar file down.
>
>
>
> additionally for n > 2 it starts to be a ton of termenum seeks... at some
> of these high distances its faster to just next() thru the terms and
> compare.
>
>
>
> we might be able to speed it up more by including say a table for n=3 (at
> the expense of increased jar size), but not using this n=3 table for #1,
> only to speed up #2. but its diminishing returns here...
>
>
>
> On Fri, Jul 23, 2010 at 11:09 AM, <ka...@nokia.com> wrote:
>
> Glad I asked.
>
>
>
> I would think that the automaton would be superior even for larger edit
> distances than 1 or 2 than the equivalent “crappy” algorithm.  But maybe I
> don’t understand something. ;-)
>
>
>
> Karl
>
>
>
>
>
> *From:* ext Robert Muir [mailto:rcmuir@gmail.com]
> *Sent:* Friday, July 23, 2010 11:05 AM
>
>
> *To:* dev@lucene.apache.org
>
> *Subject:* Re: LevenshteinFilter proposal
>
>
>
> this is actually done in trunk.
>
>
>
> In trunk fuzzy's enum is a "proxy". for low distances (ed=1,2) it uses
> automaton.
>
>
>
> for higher distances it uses the crappy "brute force" method.
>
> but, higher distances still get accelerated if you use a reasonable
> 'maxExpansions' to FuzzyQuery... the default is quite bad (1024).
>
>
>
>
>
> On Fri, Jul 23, 2010 at 10:59 AM, <ka...@nokia.com> wrote:
>
> Thanks!
>
>
>
> FuzzyQuery will do for my purposes, for the interim.  But I suspect that
> FuzzyQuery could be made a lot more efficient if it were rebuilt on top of
> Automaton, no?  I understand that this would be a trunk project.
>
>
>
> Karl
>
>
>
>
>
> *From:* ext Uwe Schindler [mailto:uwe@thetaphi.de]
> *Sent:* Friday, July 23, 2010 10:45 AM
>
>
> *To:* dev@lucene.apache.org
>
> *Subject:* RE: LevenshteinFilter proposal
>
>
>
> Automaton is only in Lucene/Solr Trunk. To get a filter out of FuzzyQuery,
> use MultiTermQueryWrapperFilter(new FuzzyQuery(…))
>
>
>
> -----
>
> Uwe Schindler
>
> H.-H.-Meier-Allee 63, D-28213 Bremen
>
> http://www.thetaphi.de
>
> eMail: uwe@thetaphi.de
>
>
>
> *From:* karl.wright@nokia.com [mailto:karl.wright@nokia.com]
> *Sent:* Friday, July 23, 2010 4:25 PM
> *To:* dev@lucene.apache.org
> *Subject:* LevenshteinFilter proposal
>
>
>
> Hi Folks,
>
>
> I’m very interested in using (or developing!) a Levenshtein Filter within
> the family of Solr Filter objects. I don’t see such a class today anywhere.
> I see how the AutomatonQuery object would permit such a thing to be built,
> but to date I don’t know of anyone who has built one. Do you?  If not, I’m
> willing to give it a whirl.  Also, AutomatonQuery doesn’t seem to come up
> when I look for it in the javadocs for Lucene – can you point me in the
> correct direction?
>
> Thanks!
> Karl
>
>
>
>
>
>
>
>
> --
> Robert Muir
> rcmuir@gmail.com
>
>
>
>
> --
> Robert Muir
> rcmuir@gmail.com
>



-- 
Robert Muir
rcmuir@gmail.com

RE: LevenshteinFilter proposal

Posted by ka...@Nokia.com.
It occurs to me that the proper static initializer code might well be able to generate distances of 3, 4 or whatever, without bloating the jar.

Nevertheless, the real question of import to me right now is: what “minimumDistance” value corresponds to a Levenshtein distance of 1?  2?  The maximum “minimumDistance” the query accepts is 1.0.

Karl


From: ext Robert Muir [mailto:rcmuir@gmail.com]
Sent: Friday, July 23, 2010 11:22 AM
To: dev@lucene.apache.org
Subject: Re: LevenshteinFilter proposal

Well, there are two main things involved:
1. number of terms seeked to in the term enum
2. expense of the comparison itself.

one challenge is the construction of a DFA LevK(x) that recognizes all words within edit distance <= k of x is an expensive operation.
This is because of the nature of the computation (traditionally its dynamic programming with nasty runtime).

We use http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.26.2940 to offline the nasty part so its linear time: O(n) where n is the length of the query term.
But these offline-computated tables get pretty large quick, and so we only include 1,2 to keep the lucene jar file down.

additionally for n > 2 it starts to be a ton of termenum seeks... at some of these high distances its faster to just next() thru the terms and compare.

we might be able to speed it up more by including say a table for n=3 (at the expense of increased jar size), but not using this n=3 table for #1, only to speed up #2. but its diminishing returns here...

On Fri, Jul 23, 2010 at 11:09 AM, <ka...@nokia.com>> wrote:
Glad I asked.

I would think that the automaton would be superior even for larger edit distances than 1 or 2 than the equivalent “crappy” algorithm.  But maybe I don’t understand something. ;-)

Karl


From: ext Robert Muir [mailto:rcmuir@gmail.com<ma...@gmail.com>]
Sent: Friday, July 23, 2010 11:05 AM

To: dev@lucene.apache.org<ma...@lucene.apache.org>
Subject: Re: LevenshteinFilter proposal

this is actually done in trunk.

In trunk fuzzy's enum is a "proxy". for low distances (ed=1,2) it uses automaton.

for higher distances it uses the crappy "brute force" method.
but, higher distances still get accelerated if you use a reasonable 'maxExpansions' to FuzzyQuery... the default is quite bad (1024).


On Fri, Jul 23, 2010 at 10:59 AM, <ka...@nokia.com>> wrote:
Thanks!

FuzzyQuery will do for my purposes, for the interim.  But I suspect that FuzzyQuery could be made a lot more efficient if it were rebuilt on top of Automaton, no?  I understand that this would be a trunk project.

Karl


From: ext Uwe Schindler [mailto:uwe@thetaphi.de<ma...@thetaphi.de>]
Sent: Friday, July 23, 2010 10:45 AM

To: dev@lucene.apache.org<ma...@lucene.apache.org>
Subject: RE: LevenshteinFilter proposal

Automaton is only in Lucene/Solr Trunk. To get a filter out of FuzzyQuery, use MultiTermQueryWrapperFilter(new FuzzyQuery(…))

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de<http://www.thetaphi.de/>
eMail: uwe@thetaphi.de<ma...@thetaphi.de>

From: karl.wright@nokia.com<ma...@nokia.com> [mailto:karl.wright@nokia.com<ma...@nokia.com>]
Sent: Friday, July 23, 2010 4:25 PM
To: dev@lucene.apache.org<ma...@lucene.apache.org>
Subject: LevenshteinFilter proposal

Hi Folks,

I’m very interested in using (or developing!) a Levenshtein Filter within the family of Solr Filter objects. I don’t see such a class today anywhere. I see how the AutomatonQuery object would permit such a thing to be built, but to date I don’t know of anyone who has built one. Do you?  If not, I’m willing to give it a whirl.  Also, AutomatonQuery doesn’t seem to come up when I look for it in the javadocs for Lucene – can you point me in the correct direction?
Thanks!
Karl





--
Robert Muir
rcmuir@gmail.com<ma...@gmail.com>



--
Robert Muir
rcmuir@gmail.com<ma...@gmail.com>

Re: LevenshteinFilter proposal

Posted by Robert Muir <rc...@gmail.com>.
Well, there are two main things involved:
1. number of terms seeked to in the term enum
2. expense of the comparison itself.

one challenge is the construction of a DFA LevK(x) that recognizes all words
within edit distance <= k of x is an expensive operation.
This is because of the nature of the computation (traditionally its dynamic
programming with nasty runtime).

We use http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.26.2940 to
offline the nasty part so its linear time: O(n) where n is the length of the
query term.
But these offline-computated tables get pretty large quick, and so we only
include 1,2 to keep the lucene jar file down.

additionally for n > 2 it starts to be a ton of termenum seeks... at some of
these high distances its faster to just next() thru the terms and compare.

we might be able to speed it up more by including say a table for n=3 (at
the expense of increased jar size), but not using this n=3 table for #1,
only to speed up #2. but its diminishing returns here...

On Fri, Jul 23, 2010 at 11:09 AM, <ka...@nokia.com> wrote:

>  Glad I asked.
>
>
>
> I would think that the automaton would be superior even for larger edit
> distances than 1 or 2 than the equivalent “crappy” algorithm.  But maybe I
> don’t understand something. ;-)
>
>
>
> Karl
>
>
>
>
>
> *From:* ext Robert Muir [mailto:rcmuir@gmail.com]
> *Sent:* Friday, July 23, 2010 11:05 AM
>
> *To:* dev@lucene.apache.org
> *Subject:* Re: LevenshteinFilter proposal
>
>
>
> this is actually done in trunk.
>
>
>
> In trunk fuzzy's enum is a "proxy". for low distances (ed=1,2) it uses
> automaton.
>
>
>
> for higher distances it uses the crappy "brute force" method.
>
> but, higher distances still get accelerated if you use a reasonable
> 'maxExpansions' to FuzzyQuery... the default is quite bad (1024).
>
>
>
>
>
> On Fri, Jul 23, 2010 at 10:59 AM, <ka...@nokia.com> wrote:
>
> Thanks!
>
>
>
> FuzzyQuery will do for my purposes, for the interim.  But I suspect that
> FuzzyQuery could be made a lot more efficient if it were rebuilt on top of
> Automaton, no?  I understand that this would be a trunk project.
>
>
>
> Karl
>
>
>
>
>
> *From:* ext Uwe Schindler [mailto:uwe@thetaphi.de]
> *Sent:* Friday, July 23, 2010 10:45 AM
>
>
> *To:* dev@lucene.apache.org
>
> *Subject:* RE: LevenshteinFilter proposal
>
>
>
> Automaton is only in Lucene/Solr Trunk. To get a filter out of FuzzyQuery,
> use MultiTermQueryWrapperFilter(new FuzzyQuery(…))
>
>
>
> -----
>
> Uwe Schindler
>
> H.-H.-Meier-Allee 63, D-28213 Bremen
>
> http://www.thetaphi.de
>
> eMail: uwe@thetaphi.de
>
>
>
> *From:* karl.wright@nokia.com [mailto:karl.wright@nokia.com]
> *Sent:* Friday, July 23, 2010 4:25 PM
> *To:* dev@lucene.apache.org
> *Subject:* LevenshteinFilter proposal
>
>
>
> Hi Folks,
>
>
> I’m very interested in using (or developing!) a Levenshtein Filter within
> the family of Solr Filter objects. I don’t see such a class today anywhere.
> I see how the AutomatonQuery object would permit such a thing to be built,
> but to date I don’t know of anyone who has built one. Do you?  If not, I’m
> willing to give it a whirl.  Also, AutomatonQuery doesn’t seem to come up
> when I look for it in the javadocs for Lucene – can you point me in the
> correct direction?
>
> Thanks!
> Karl
>
>
>
>
>
>
>
>
> --
> Robert Muir
> rcmuir@gmail.com
>



-- 
Robert Muir
rcmuir@gmail.com

RE: LevenshteinFilter proposal

Posted by ka...@nokia.com.
Glad I asked.

I would think that the automaton would be superior even for larger edit distances than 1 or 2 than the equivalent “crappy” algorithm.  But maybe I don’t understand something. ;-)

Karl


From: ext Robert Muir [mailto:rcmuir@gmail.com]
Sent: Friday, July 23, 2010 11:05 AM
To: dev@lucene.apache.org
Subject: Re: LevenshteinFilter proposal

this is actually done in trunk.

In trunk fuzzy's enum is a "proxy". for low distances (ed=1,2) it uses automaton.

for higher distances it uses the crappy "brute force" method.
but, higher distances still get accelerated if you use a reasonable 'maxExpansions' to FuzzyQuery... the default is quite bad (1024).


On Fri, Jul 23, 2010 at 10:59 AM, <ka...@nokia.com>> wrote:
Thanks!

FuzzyQuery will do for my purposes, for the interim.  But I suspect that FuzzyQuery could be made a lot more efficient if it were rebuilt on top of Automaton, no?  I understand that this would be a trunk project.

Karl


From: ext Uwe Schindler [mailto:uwe@thetaphi.de<ma...@thetaphi.de>]
Sent: Friday, July 23, 2010 10:45 AM

To: dev@lucene.apache.org<ma...@lucene.apache.org>
Subject: RE: LevenshteinFilter proposal

Automaton is only in Lucene/Solr Trunk. To get a filter out of FuzzyQuery, use MultiTermQueryWrapperFilter(new FuzzyQuery(…))

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de<http://www.thetaphi.de/>
eMail: uwe@thetaphi.de<ma...@thetaphi.de>

From: karl.wright@nokia.com<ma...@nokia.com> [mailto:karl.wright@nokia.com<ma...@nokia.com>]
Sent: Friday, July 23, 2010 4:25 PM
To: dev@lucene.apache.org<ma...@lucene.apache.org>
Subject: LevenshteinFilter proposal

Hi Folks,

I’m very interested in using (or developing!) a Levenshtein Filter within the family of Solr Filter objects. I don’t see such a class today anywhere. I see how the AutomatonQuery object would permit such a thing to be built, but to date I don’t know of anyone who has built one. Do you?  If not, I’m willing to give it a whirl.  Also, AutomatonQuery doesn’t seem to come up when I look for it in the javadocs for Lucene – can you point me in the correct direction?
Thanks!
Karl





--
Robert Muir
rcmuir@gmail.com<ma...@gmail.com>

Re: LevenshteinFilter proposal

Posted by Robert Muir <rc...@gmail.com>.
this is actually done in trunk.

In trunk fuzzy's enum is a "proxy". for low distances (ed=1,2) it uses
automaton.

for higher distances it uses the crappy "brute force" method.
but, higher distances still get accelerated if you use a reasonable
'maxExpansions' to FuzzyQuery... the default is quite bad (1024).


On Fri, Jul 23, 2010 at 10:59 AM, <ka...@nokia.com> wrote:

>  Thanks!
>
>
>
> FuzzyQuery will do for my purposes, for the interim.  But I suspect that
> FuzzyQuery could be made a lot more efficient if it were rebuilt on top of
> Automaton, no?  I understand that this would be a trunk project.
>
>
>
> Karl
>
>
>
>
>
> *From:* ext Uwe Schindler [mailto:uwe@thetaphi.de]
> *Sent:* Friday, July 23, 2010 10:45 AM
>
> *To:* dev@lucene.apache.org
> *Subject:* RE: LevenshteinFilter proposal
>
>
>
> Automaton is only in Lucene/Solr Trunk. To get a filter out of FuzzyQuery,
> use MultiTermQueryWrapperFilter(new FuzzyQuery(…))
>
>
>
> -----
>
> Uwe Schindler
>
> H.-H.-Meier-Allee 63, D-28213 Bremen
>
> http://www.thetaphi.de
>
> eMail: uwe@thetaphi.de
>
>
>
> *From:* karl.wright@nokia.com [mailto:karl.wright@nokia.com]
> *Sent:* Friday, July 23, 2010 4:25 PM
> *To:* dev@lucene.apache.org
> *Subject:* LevenshteinFilter proposal
>
>
>
> Hi Folks,
>
>
> I’m very interested in using (or developing!) a Levenshtein Filter within
> the family of Solr Filter objects. I don’t see such a class today anywhere.
> I see how the AutomatonQuery object would permit such a thing to be built,
> but to date I don’t know of anyone who has built one. Do you?  If not, I’m
> willing to give it a whirl.  Also, AutomatonQuery doesn’t seem to come up
> when I look for it in the javadocs for Lucene – can you point me in the
> correct direction?
>
> Thanks!
> Karl
>
>
>
>
>



-- 
Robert Muir
rcmuir@gmail.com

RE: LevenshteinFilter proposal

Posted by ka...@nokia.com.
Thanks!

FuzzyQuery will do for my purposes, for the interim.  But I suspect that FuzzyQuery could be made a lot more efficient if it were rebuilt on top of Automaton, no?  I understand that this would be a trunk project.

Karl


From: ext Uwe Schindler [mailto:uwe@thetaphi.de]
Sent: Friday, July 23, 2010 10:45 AM
To: dev@lucene.apache.org
Subject: RE: LevenshteinFilter proposal

Automaton is only in Lucene/Solr Trunk. To get a filter out of FuzzyQuery, use MultiTermQueryWrapperFilter(new FuzzyQuery(...))

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de<http://www.thetaphi.de/>
eMail: uwe@thetaphi.de

From: karl.wright@nokia.com [mailto:karl.wright@nokia.com]
Sent: Friday, July 23, 2010 4:25 PM
To: dev@lucene.apache.org
Subject: LevenshteinFilter proposal

Hi Folks,

I'm very interested in using (or developing!) a Levenshtein Filter within the family of Solr Filter objects. I don't see such a class today anywhere. I see how the AutomatonQuery object would permit such a thing to be built, but to date I don't know of anyone who has built one. Do you?  If not, I'm willing to give it a whirl.  Also, AutomatonQuery doesn't seem to come up when I look for it in the javadocs for Lucene - can you point me in the correct direction?
Thanks!
Karl



RE: LevenshteinFilter proposal

Posted by Uwe Schindler <uw...@thetaphi.de>.
Automaton is only in Lucene/Solr Trunk. To get a filter out of FuzzyQuery,
use MultiTermQueryWrapperFilter(new FuzzyQuery(.))

 

-----

Uwe Schindler

H.-H.-Meier-Allee 63, D-28213 Bremen

http://www.thetaphi.de <http://www.thetaphi.de/> 

eMail: uwe@thetaphi.de

 

From: karl.wright@nokia.com [mailto:karl.wright@nokia.com] 
Sent: Friday, July 23, 2010 4:25 PM
To: dev@lucene.apache.org
Subject: LevenshteinFilter proposal

 

Hi Folks,


I'm very interested in using (or developing!) a Levenshtein Filter within
the family of Solr Filter objects. I don't see such a class today anywhere.
I see how the AutomatonQuery object would permit such a thing to be built,
but to date I don't know of anyone who has built one. Do you?  If not, I'm
willing to give it a whirl.  Also, AutomatonQuery doesn't seem to come up
when I look for it in the javadocs for Lucene - can you point me in the
correct direction?

Thanks!
Karl