Posted to dev@lucene.apache.org by ka...@nokia.com on 2010/07/23 16:24:36 UTC
LevenshteinFilter proposal
Hi Folks,
I'm very interested in using (or developing!) a Levenshtein Filter within the family of Solr Filter objects. I don't see such a class today anywhere. I see how the AutomatonQuery object would permit such a thing to be built, but to date I don't know of anyone who has built one. Do you? If not, I'm willing to give it a whirl. Also, AutomatonQuery doesn't seem to come up when I look for it in the javadocs for Lucene - can you point me in the correct direction?
Thanks!
Karl
RE: LevenshteinFilter proposal
Posted by Uwe Schindler <uw...@thetaphi.de>.
Yes! See TermRangeFilter for an example; it wraps TermRangeQuery like
this :-)
-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de
RE: LevenshteinFilter proposal
Posted by ka...@nokia.com.
Surprise!
protected MultiTermQueryWrapperFilter(MultiTermQuery query)
(javadoc: http://lucene.apache.org/java/2_9_1/api/all/org/apache/lucene/search/MultiTermQueryWrapperFilter.html)
I guess I can write a class that extends MultiTermQueryWrapperFilter. Is this the intended usage?
Karl
From: ext Uwe Schindler [mailto:uwe@thetaphi.de]
Sent: Friday, July 23, 2010 10:45 AM
To: dev@lucene.apache.org
Subject: RE: LevenshteinFilter proposal
Automaton is only in Lucene/Solr Trunk. To get a filter out of FuzzyQuery, use MultiTermQueryWrapperFilter(new FuzzyQuery(...))
-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de
Re: LevenshteinFilter proposal
Posted by Itamar Syn-Hershko <it...@code972.com>.
Just a thought: edit distance is meant for overcoming spelling errors in
the form of assimilations or mistypes. In your case there is a limited number
of cases that need special care, and you can actually define most of
them pretty well - hence edit distance is by definition much more than
you actually need.
Since all that is necessary here is to identify known spelling
differences (or errors), perhaps you could use some of what I call
"tolerated lookup" using automata on a Lucene index. Such a lookup is
what I use in HebMorph[1] to find words in a dictionary of Hebrew words,
where spelling is ALWAYS an issue, like what you're having with street names. I
use that on a radix, but the idea could be adapted for what you're
looking for fairly easily.
The idea is quite simple - first you do an exact lookup on your
dictionary (TermQuery in your case). If you hit no results, you do a
tolerant lookup, where a "tolerator function" is consulted
before going on to the next leaf. Those functions decide whether or not
to allow this move based on a set of rules (position, characters before
and after, etc).
You can see some examples for the Hebrew language here (dot-net, but
should still be readable for Java ppl):
http://github.com/synhershko/HebMorph/blob/master/dotNet/HebMorph/DataStructures/DictRadix.cs
-- radix implementation with tolerant lookup (TolerantLookupCrawler)
http://github.com/synhershko/HebMorph/blob/master/dotNet/HebMorph/LookupTolerators.cs
-- tolerator functions
Tolerant lookup on my radix is very fast, should be the same for a
Lucene index.
Itamar.
[1] http://www.code972.com/blog/hebmorph/
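[Editor's note] A much-simplified sketch of the tolerant-lookup idea in plain Java; the class and method names here are invented for illustration, and HebMorph's real TolerantLookupCrawler works on a radix and is considerably more involved:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiPredicate;

// A minimal trie with "tolerant lookup": on a mismatch, a tolerator
// function is consulted before a branch is abandoned.
class TolerantTrie {
    private final Map<Character, TolerantTrie> children = new HashMap<>();
    private boolean isWord;

    void add(String word) {
        TolerantTrie node = this;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new TolerantTrie());
        }
        node.isWord = true;
    }

    boolean exactLookup(String word) {
        TolerantTrie node = this;
        for (char c : word.toCharArray()) {
            node = node.children.get(c);
            if (node == null) return false;
        }
        return node.isWord;
    }

    // tolerator.test(dictChar, pos) == true means: allow consuming this
    // dictionary character even though the query has nothing for it at pos,
    // i.e. tolerate a character the user omitted.
    boolean tolerantLookup(String query, BiPredicate<Character, Integer> tolerator) {
        return crawl(query, 0, tolerator);
    }

    private boolean crawl(String q, int i, BiPredicate<Character, Integer> tol) {
        if (i == q.length() && isWord) return true;
        for (Map.Entry<Character, TolerantTrie> e : children.entrySet()) {
            // normal move: dictionary char matches the next query char
            if (i < q.length() && e.getKey() == q.charAt(i)
                    && e.getValue().crawl(q, i + 1, tol)) return true;
            // tolerated move: consult the tolerator before giving up
            if (tol.test(e.getKey(), i) && e.getValue().crawl(q, i, tol)) return true;
        }
        return false;
    }
}
```

With `add("leominster")` and a tolerator that accepts skipped vowels, an exact lookup for "leminster" fails while the tolerant lookup still reaches "leominster".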
Re: LevenshteinFilter proposal
Posted by Robert Muir <rc...@gmail.com>.
or Washington / District of Columbia.
My point is I wouldn't do anything complicated and slow if you can get away
with analyzers/phonetics and maybe some synonyms, or maybe even just the
spellchecker.
There are always crazy cases none of these algorithms will work for.
Re: LevenshteinFilter proposal
Posted by Walter Underwood <wu...@wunderwood.org>.
Try mixing colonial influences with Native American names.
When my parents moved to Baton Rouge, LA years ago, they got a recommendation for a "Dr. Kyto". They couldn't find him. Years later, they met Dr. Cailleteaux. Nice man.
And the French don't pronounce Baton Rouge as "batten wrooj".
Also: Natchitoches is "NACK-a-tish".
wunder
Re: LevenshteinFilter proposal
Posted by Robert Muir <rc...@gmail.com>.
Nah, it's an analyzer, so you can just use TermQuery (fast: exact match).
At query and index time it just maps stuff to a key... typically you would
just put this in a separate field.
You can combine this with your edit distance query with a BooleanQuery; for
example, the edit distance can handle your le[o]minster just fine.
I think this would be much better for you; I wouldn't abuse Levenshtein for
phonetics stuff, it's not designed for that.
RE: LevenshteinFilter proposal
Posted by ka...@nokia.com.
Clearly you haven’t been in the Northeast much. Try “Worcester” vs. “wuster”, or “Leominster” vs. “leminster”. It’s also likely to be a challenge to come up with the right phonetics for any given proper location name. It’s even worse in Britain, or countries where the phonetic rules may be a hodgepodge of different colonial influences.
That having been said, if there exists a “PhoneticQuery” object that does all this using the automaton logic under the covers, I think it would be worth a serious look.
Karl
Re: LevenshteinFilter proposal
Posted by Robert Muir <rc...@gmail.com>.
I added this to 'TestPhoneticFilter' and it passes: assertAlgorithm(new
DoubleMetaphone(), false, "Tchaikovsky Chicovsky", new String[] { "XKFS",
"XKFS" });
So if you want to give me all your street names, I can sell you a phonetic
database, or you can use the filters in modules/analyzers/phonetic, which
have a bunch of different configurable algorithms :)
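[Editor's note] For readers who have not seen a phonetic codec before, here is a self-contained sketch of the idea in plain Java using classic American Soundex. This is not the codec used above (Lucene's phonetic filters delegate to implementations such as DoubleMetaphone from Apache Commons Codec), and this version omits Soundex's H/W adjacency rule:

```java
// Simplified American Soundex: keep the first letter, encode the rest as
// digits by place of articulation, collapse adjacent duplicate codes, drop
// vowels, and pad/truncate to four characters.
public class Soundex {
    private static int code(char c) {
        switch (Character.toLowerCase(c)) {
            case 'b': case 'f': case 'p': case 'v': return 1;
            case 'c': case 'g': case 'j': case 'k':
            case 'q': case 's': case 'x': case 'z': return 2;
            case 'd': case 't': return 3;
            case 'l': return 4;
            case 'm': case 'n': return 5;
            case 'r': return 6;
            default: return 0; // vowels, h, w, y: not encoded
        }
    }

    public static String encode(String word) {
        StringBuilder sb = new StringBuilder();
        sb.append(Character.toUpperCase(word.charAt(0)));
        int prev = code(word.charAt(0));
        for (int i = 1; i < word.length() && sb.length() < 4; i++) {
            int c = code(word.charAt(i));
            if (c != 0 && c != prev) sb.append(c);
            prev = c;
        }
        while (sb.length() < 4) sb.append('0');
        return sb.toString();
    }
}
```

Note that Soundex keys "Tchaikovsky" and "Chicovsky" differently because it always keeps the first letter, which is one reason the test above uses DoubleMetaphone (it collapses both to "XKFS").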
--
Robert Muir
rcmuir@gmail.com
RE: LevenshteinFilter proposal
Posted by ka...@nokia.com.
>>>>>>
I really wouldnt recommend this (though i am not sure what exactly you are trying to do), instead i would recommend analyzers/phonetic to deal with the phonetic stuff.
<<<<<<
What I want to capture is situations where people misspell things in roughly a phonetic way. For example, “Tchaikovsky Avenue” might be misspelled as “Chicovsky Avenue”. Modules that do phonetic mapping are possible but you’d have to somehow generate a phonetic database of (say) streetnames, worldwide. Good luck on getting hold of that kind of data anywhere. ;-) In the absence of such data, an LD distance will have to do – but it will almost certainly need to be greater than 2.
Karl
From: ext Robert Muir [mailto:rcmuir@gmail.com]
Sent: Monday, July 26, 2010 1:07 PM
To: dev@lucene.apache.org
Subject: Re: LevenshteinFilter proposal
On Mon, Jul 26, 2010 at 12:51 PM, <ka...@nokia.com>> wrote:
Is the automaton creation code checked in anywhere?
yes, its generated from lucene/build.xml (createLevAutomata task)
Also, if it’s not possible to generate the automaton table on the fly, perhaps it could be included in an optional jar (whose existence would be queried by reflection). The reason I’m interested in LD’s of three, especially, is because I can well imagine phonetic misspellings that would have distances on that order.
I really wouldnt recommend this (though i am not sure what exactly you are trying to do), instead i would recommend analyzers/phonetic to deal with the phonetic stuff.
Karl
From: ext Robert Muir [mailto:rcmuir@gmail.com<ma...@gmail.com>]
Sent: Monday, July 26, 2010 11:49 AM
To: dev@lucene.apache.org<ma...@lucene.apache.org>
Subject: Re: LevenshteinFilter proposal
I wanted to show you what i mean, just so you know:
here is what 'ant createLevAutomata' does now:
createLevAutomata:
[exec] Wrote Lev1ParametricDescription.java [102 lines; 3.7 KB]
[exec] Wrote Lev2ParametricDescription.java [147 lines; 9.6 KB]
BUILD SUCCESSFUL
Total time: 4 seconds
I modified it with the patch below to generate 3 and 4, just for kicks, and ran it again... it took 7 minutes to generate lev3, and it's still running on lev4 (might take hours or days; I don't actually know)
createLevAutomata:
[exec] Wrote Lev1ParametricDescription.java [102 lines; 3.7 KB]
[exec] Wrote Lev2ParametricDescription.java [147 lines; 9.6 KB]
[exec] Wrote Lev3ParametricDescription.java [333 lines; 167.2 KB]
...
Index: build.xml
===================================================================
--- build.xml (revision 959288)
+++ build.xml (working copy)
@@ -533,6 +533,8 @@
<target name="createLevAutomata" depends="check-moman,clone-moman,pull-moman">
<createLevAutomaton n="1"/>
<createLevAutomaton n="2"/>
+ <createLevAutomaton n="3"/>
+ <createLevAutomaton n="4"/>
</target>
<target name="check-moman">
On Mon, Jul 26, 2010 at 11:39 AM, Michael McCandless <lu...@mikemccandless.com>> wrote:
On Mon, Jul 26, 2010 at 11:35 AM, Robert Muir <rc...@gmail.com>> wrote:
> On Mon, Jul 26, 2010 at 11:30 AM, <ka...@nokia.com>> wrote:
>>
>> It occurs to me that the proper static initializer code might well be able
>> to generate distances of 3, 4 or whatever, without bloating the jar.
>
> I disagree: generating these distances is a non-trivial algorithm and takes
> minutes or hours to compute...
Furthermore, our own implementation for this algo is in Python :)
>> Nevertheless, the real question of import to me right now is: what
>> “minimumDistance” value corresponds to a Levenshtein distance of 1? 2? The
>> maximum “minimumDistance” the query accepts is 1.0.
>
> right, fuzzyquery uses a strange formula that doesnt tie directly to edit
> distance (unless you factor in length).
> you can do this easier like this:
> LevenshteinAutomata builder = new LevenshteinAutomata("foobar");
> Automaton ed2 = builder.toAutomaton(2);
> // the term doesnt really matter, just for the field name and toString, etc.
> AutomatonQuery query = new AutomatonQuery(new Term("field", "foobar~2"),
> ed2);
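[Editor's note] The "strange formula" is, if memory serves from the classic FuzzyTermEnum, similarity = 1 - distance / min(|query|, |term|), and the parameter Karl is looking for is actually called minimumSimilarity. A small self-contained helper (hypothetical names, plain Java) shows how a desired edit distance k maps back to a threshold:

```java
// Inverts the classic FuzzyQuery similarity formula (assumed here to be
// similarity = 1 - distance / min(|query|, |term|)) so that a desired edit
// distance k on a query of length n maps to a minimumSimilarity setting.
public class FuzzySimilarity {
    public static double similarity(int distance, int queryLen, int termLen) {
        return 1.0 - distance / (double) Math.min(queryLen, termLen);
    }

    // Largest minimumSimilarity that still admits distance-k matches for
    // terms at least as long as the query.
    public static double minSimilarityFor(int k, int queryLen) {
        return similarity(k, queryLen, queryLen);
    }
}
```

So for the 6-letter "foobar" above, admitting distance-2 matches means a minimumSimilarity of at most 2/3, which is why the raw AutomatonQuery route is easier to reason about.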
>
>>
>>
>>
>> Karl
>>
>>
>>
>>
>>
>> From: ext Robert Muir [mailto:rcmuir@gmail.com<ma...@gmail.com>]
>> Sent: Friday, July 23, 2010 11:22 AM
>>
>> To: dev@lucene.apache.org<ma...@lucene.apache.org>
>> Subject: Re: LevenshteinFilter proposal
>>
>>
>>
>> Well, there are two main things involved:
>>
>> 1. number of terms seeked to in the term enum
>>
>> 2. expense of the comparison itself.
>>
>>
>>
>> one challenge is that the construction of a DFA LevK(x), recognizing all
>> words within edit distance <= k of x, is an expensive operation.
>>
>> This is because of the nature of the computation (traditionally its
>> dynamic programming with nasty runtime).
>>
>>
>>
>> We use http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.26.2940 to
>> offline the nasty part so its linear time: O(n) where n is the length of the
>> query term.
>>
>> But these offline-computed tables get pretty large quickly, and so we only
>> include 1,2 to keep the Lucene jar file down.
>>
>>
>>
>> additionally for n > 2 it starts to be a ton of termenum seeks... at some
>> of these high distances its faster to just next() thru the terms and
>> compare.
>>
>>
>>
>> we might be able to speed it up more by including say a table for n=3 (at
>> the expense of increased jar size), but not using this n=3 table for #1,
>> only to speed up #2. but its diminishing returns here...
>>
>>
>>
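[Editor's note] For reference, the "nasty" dynamic programming being avoided at query time is the textbook O(|a|·|b|) edit-distance computation (plain Java, not Lucene code):

```java
// Textbook dynamic-programming Levenshtein distance, rolled row by row:
// dp[j] holds the distance between the current prefix of a and the
// length-j prefix of b.
public class Levenshtein {
    public static int distance(String a, String b) {
        int[] dp = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) dp[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            int prevDiag = dp[0];              // dp[i-1][j-1]
            dp[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int tmp = dp[j];               // dp[i-1][j]
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                dp[j] = Math.min(Math.min(dp[j] + 1,        // deletion
                                          dp[j - 1] + 1),   // insertion
                                 prevDiag + cost);          // substitution
                prevDiag = tmp;
            }
        }
        return dp[b.length()];
    }
}
```

A precomputed Lev_k(x) automaton instead answers "is distance(x, t) <= k" in time linear in |t| per term, which is exactly what the parametric description tables buy.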
>> On Fri, Jul 23, 2010 at 11:09 AM, <ka...@nokia.com>> wrote:
>>
>> Glad I asked.
>>
>>
>>
>> I would think that the automaton would be superior even for larger edit
>> distances than 1 or 2 than the equivalent “crappy” algorithm. But maybe I
>> don’t understand something. ;-)
>>
>>
>>
>> Karl
>>
>>
>>
>>
>>
>> From: ext Robert Muir [mailto:rcmuir@gmail.com<ma...@gmail.com>]
>> Sent: Friday, July 23, 2010 11:05 AM
>>
>> To: dev@lucene.apache.org<ma...@lucene.apache.org>
>>
>> Subject: Re: LevenshteinFilter proposal
>>
>>
>>
>> this is actually done in trunk.
>>
>>
>>
>> In trunk fuzzy's enum is a "proxy". for low distances (ed=1,2) it uses
>> automaton.
>>
>>
>>
>> for higher distances it uses the crappy "brute force" method.
>>
>> but, higher distances still get accelerated if you use a reasonable
>> 'maxExpansions' to FuzzyQuery... the default is quite bad (1024).
>>
>>
>>
>>
>>
>> On Fri, Jul 23, 2010 at 10:59 AM, <ka...@nokia.com>> wrote:
>>
>> Thanks!
>>
>>
>>
>> FuzzyQuery will do for my purposes, for the interim. But I suspect that
>> FuzzyQuery could be made a lot more efficient if it were rebuilt on top of
>> Automaton, no? I understand that this would be a trunk project.
>>
>>
>>
>> Karl
>>
>>
>>
>>
>>
Re: LevenshteinFilter proposal
Posted by Robert Muir <rc...@gmail.com>.
On Mon, Jul 26, 2010 at 12:51 PM, <ka...@nokia.com> wrote:
> Is the automaton creation code checked in anywhere?
>
Yes, it's generated from lucene/build.xml (the createLevAutomata task).
> Also, if it’s not possible to generate the automaton table on the fly,
> perhaps it could be included in an optional jar (whose existence would be
> queried by reflection). The reason I’m interested in LD’s of three,
> especially, is because I can well imagine phonetic misspellings that would
> have distances on that order.
>
I really wouldn't recommend this (though I am not sure what exactly you are
trying to do); instead I would recommend analyzers/phonetic to deal with the
phonetic stuff.
> >> Robert Muir
> >> rcmuir@gmail.com
> >
> >
> > --
> > Robert Muir
> > rcmuir@gmail.com
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>
>
>
> --
> Robert Muir
> rcmuir@gmail.com
>
--
Robert Muir
rcmuir@gmail.com
RE: LevenshteinFilter proposal
Posted by ka...@nokia.com.
Is the automaton creation code checked in anywhere?
Also, if it’s not possible to generate the automaton table on the fly, perhaps it could be included in an optional jar (whose existence would be queried by reflection). The reason I’m especially interested in LDs of three is that I can well imagine phonetic misspellings with distances on that order.
Karl
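The reflection probe suggested above could be sketched roughly as below. This is hypothetical wiring, not an existing Lucene extension point; the table class name is illustrative, borrowed from the generated-file names that come up later in the thread.

```java
// Hypothetical sketch: probe for an optionally-shipped parametric table
// class by reflection, falling back when the jar isn't on the classpath.
// The probed class name is illustrative, not a real Lucene extension point.
public class OptionalTableProbe {
    static boolean tableAvailable(String className) {
        try {
            Class.forName(className);
            return true; // optional jar present
        } catch (ClassNotFoundException e) {
            return false; // fall back to the brute-force term enumeration
        }
    }

    public static void main(String[] args) {
        // Always present:
        System.out.println(tableAvailable("java.lang.String"));
        // Present only if the hypothetical optional jar is installed:
        System.out.println(tableAvailable(
            "org.apache.lucene.util.automaton.Lev3ParametricDescription"));
    }
}
```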
Re: LevenshteinFilter proposal
Posted by Robert Muir <rc...@gmail.com>.
I wanted to show you what I mean, just so you know:
here is what 'ant createLevAutomata' does now:
createLevAutomata:
[exec] Wrote Lev1ParametricDescription.java [102 lines; 3.7 KB]
[exec] Wrote Lev2ParametricDescription.java [147 lines; 9.6 KB]
BUILD SUCCESSFUL
Total time: 4 seconds
I modified it with the patch below to generate 3 and 4, just for kicks, and
ran it again... it took 7 minutes to generate lev3, and it's still running on
lev4 (might take hours or days, I don't actually know).
createLevAutomata:
[exec] Wrote Lev1ParametricDescription.java [102 lines; 3.7 KB]
[exec] Wrote Lev2ParametricDescription.java [147 lines; 9.6 KB]
[exec] Wrote Lev3ParametricDescription.java [333 lines; 167.2 KB]
...
Index: build.xml
===================================================================
--- build.xml (revision 959288)
+++ build.xml (working copy)
@@ -533,6 +533,8 @@
<target name="createLevAutomata"
depends="check-moman,clone-moman,pull-moman">
<createLevAutomaton n="1"/>
<createLevAutomaton n="2"/>
+ <createLevAutomaton n="3"/>
+ <createLevAutomaton n="4"/>
</target>
<target name="check-moman">
On Mon, Jul 26, 2010 at 11:39 AM, Michael McCandless <
lucene@mikemccandless.com> wrote:
> On Mon, Jul 26, 2010 at 11:35 AM, Robert Muir <rc...@gmail.com> wrote:
> > On Mon, Jul 26, 2010 at 11:30 AM, <ka...@nokia.com> wrote:
> >>
> >> It occurs to me that the proper static initializer code might well be
> able
> >> to generate distances of 3, 4 or whatever, without bloating the jar.
> >
> > I disagree: generating these distances is a non-trivial algorithm and
> takes
> > minutes or hours to compute...
>
> Furthermore, our own implementation for this algo is in Python :)
>
> >> Nevertheless, the real question of import to me right now is: what
> >> “minimumDistance” value corresponds to a Levenshtein distance of 1? 2?
> The
> >> maximum “minimumDistance” the query accepts is 1.0.
> >
> > right, fuzzyquery uses a strange formula that doesnt tie directly to edit
> > distance (unless you factor in length).
> > you can do this easier like this:
> > LevenshteinAutomata builder = new LevenshteinAutomata("foobar");
> > Automaton ed2 = builder.toAutomaton(2);
> > // the term doesnt really matter, just for the field name and toString,
> etc.
> > AutomatonQuery query = new AutomatonQuery(new Term("field", "foobar~2"),
> > ed2);
> >
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>
--
Robert Muir
rcmuir@gmail.com
Re: LevenshteinFilter proposal
Posted by Michael McCandless <lu...@mikemccandless.com>.
On Mon, Jul 26, 2010 at 11:35 AM, Robert Muir <rc...@gmail.com> wrote:
> On Mon, Jul 26, 2010 at 11:30 AM, <ka...@nokia.com> wrote:
>>
>> It occurs to me that the proper static initializer code might well be able
>> to generate distances of 3, 4 or whatever, without bloating the jar.
>
> I disagree: generating these distances is a non-trivial algorithm and takes
> minutes or hours to compute...
Furthermore, our own implementation for this algo is in Python :)
>> Nevertheless, the real question of import to me right now is: what
>> “minimumDistance” value corresponds to a Levenshtein distance of 1? 2? The
>> maximum “minimumDistance” the query accepts is 1.0.
>
> right, fuzzyquery uses a strange formula that doesnt tie directly to edit
> distance (unless you factor in length).
> you can do this easier like this:
> LevenshteinAutomata builder = new LevenshteinAutomata("foobar");
> Automaton ed2 = builder.toAutomaton(2);
> // the term doesnt really matter, just for the field name and toString, etc.
> AutomatonQuery query = new AutomatonQuery(new Term("field", "foobar~2"),
> ed2);
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org
Re: LevenshteinFilter proposal
Posted by Robert Muir <rc...@gmail.com>.
On Mon, Jul 26, 2010 at 11:30 AM, <ka...@nokia.com> wrote:
> It occurs to me that the proper static initializer code might well be
> able to generate distances of 3, 4 or whatever, without bloating the jar.
>
I disagree: generating these distances is a non-trivial algorithm and takes
minutes or hours to compute...
> Nevertheless, the real question of import to me right now is: what
> “minimumDistance” value corresponds to a Levenshtein distance of 1? 2? The
> maximum “minimumDistance” the query accepts is 1.0.
>
Right — FuzzyQuery uses a strange formula that doesn't tie directly to edit
distance (unless you factor in length).
You can do this more easily like this:

LevenshteinAutomata builder = new LevenshteinAutomata("foobar");
Automaton ed2 = builder.toAutomaton(2);
// the term doesn't really matter; it's just for the field name, toString(), etc.
AutomatonQuery query = new AutomatonQuery(new Term("field", "foobar~2"), ed2);
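To make the length coupling concrete: assuming the similarity formula at the time was roughly 1 - distance / min(|query|, |term|) (my reading of FuzzyTermEnum, not something stated in this thread — verify against the actual source), the thresholds for a given term length can be sketched as:

```java
// Sketch of how FuzzyQuery's minimumSimilarity relates to edit distance,
// ASSUMING similarity = 1 - distance / termLength (an approximation of
// FuzzyTermEnum's formula; check the real source before relying on it).
public class FuzzySimilaritySketch {
    static float minimumSimilarity(int editDistance, int termLength) {
        return 1.0f - (float) editDistance / termLength;
    }

    public static void main(String[] args) {
        // For a 6-character term like "foobar":
        for (int k = 1; k <= 3; k++) {
            System.out.println("edit distance " + k
                    + " -> similarity threshold " + minimumSimilarity(k, 6));
        }
    }
}
```

So on a 6-character term, a threshold of about 0.83 admits distance 1 and about 0.67 admits distance 2 — but the same threshold admits larger distances on longer terms, which is the length dependence mentioned above.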
--
Robert Muir
rcmuir@gmail.com
RE: LevenshteinFilter proposal
Posted by ka...@nokia.com.
It occurs to me that the proper static initializer code might well be able to generate distances of 3, 4 or whatever, without bloating the jar.
Nevertheless, the real question of import to me right now is: what “minimumDistance” value corresponds to a Levenshtein distance of 1? 2? The maximum “minimumDistance” the query accepts is 1.0.
Karl
Re: LevenshteinFilter proposal
Posted by Robert Muir <rc...@gmail.com>.
Well, there are two main things involved:
1. the number of terms sought in the term enum
2. the expense of the comparison itself.

One challenge is that constructing a DFA Lev_k(x), which recognizes all words
within edit distance <= k of x, is an expensive operation.
This is because of the nature of the computation (traditionally it's dynamic
programming with a nasty runtime).

We use http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.26.2940 to
move the nasty part offline, so it's linear time: O(n), where n is the length
of the query term.
But these offline-computed tables get pretty large quickly, so we only
include n=1 and n=2 to keep the Lucene jar size down.

Additionally, for n > 2 it starts to be a ton of term-enum seeks... at some of
these high distances it's faster to just next() through the terms and compare.

We might be able to speed it up more by including, say, a table for n=3 (at
the expense of increased jar size), but not using this n=3 table for #1, only
to speed up #2. But it's diminishing returns here...
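For reference, the "dynamic programming with nasty runtime" comparison is the textbook O(|s| * |t|) table below — a self-contained illustration, not Lucene's actual code. This per-comparison cost is what the precomputed parametric tables sidestep:

```java
// Classic dynamic-programming Levenshtein distance: the O(|s| * |t|)
// per-comparison cost discussed above. Illustration only; not Lucene's code.
public class Levenshtein {
    static int distance(String s, String t) {
        int[] prev = new int[t.length() + 1];
        int[] curr = new int[t.length() + 1];
        for (int j = 0; j <= t.length(); j++) prev[j] = j; // edits from empty prefix
        for (int i = 1; i <= s.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= t.length(); j++) {
                int cost = (s.charAt(i - 1) == t.charAt(j - 1)) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1,   // insertion
                                            prev[j] + 1),      // deletion
                                   prev[j - 1] + cost);        // substitution
            }
            int[] tmp = prev; prev = curr; curr = tmp; // reuse the two rows
        }
        return prev[t.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("foobar", "foobaz"));    // 1
        System.out.println(distance("phonetic", "fonetik")); // 3
    }
}
```

Running this against every term in the dictionary is the "brute force" path; the offline-computed tables instead build, in time linear in the query term, a DFA that accepts exactly the strings within distance k.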
--
Robert Muir
rcmuir@gmail.com
RE: LevenshteinFilter proposal
Posted by ka...@nokia.com.
Glad I asked.
I would think that the automaton would be superior to the equivalent “crappy” algorithm even for edit distances larger than 1 or 2. But maybe I don’t understand something. ;-)
Karl
From: ext Robert Muir [mailto:rcmuir@gmail.com]
Sent: Friday, July 23, 2010 11:05 AM
To: dev@lucene.apache.org
Subject: Re: LevenshteinFilter proposal
this is actually done in trunk.
In trunk fuzzy's enum is a "proxy". for low distances (ed=1,2) it uses automaton.
for higher distances it uses the crappy "brute force" method.
but, higher distances still get accelerated if you use a reasonable 'maxExpansions' to FuzzyQuery... the default is quite bad (1024).
Re: LevenshteinFilter proposal
Posted by Robert Muir <rc...@gmail.com>.
This is actually done in trunk.
In trunk, fuzzy's enum is a "proxy": for low distances (ed = 1, 2) it uses
the automaton; for higher distances it uses the crappy "brute force" method.
But higher distances still get accelerated if you pass a reasonable
'maxExpansions' to FuzzyQuery... the default is quite bad (1024).
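To make the cap concrete, here is a toy model of a scan-everything fuzzy enum. The class, method names, and cap semantics are illustrative assumptions of mine, not FuzzyQuery's actual internals:

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the "brute force" fuzzy term enum: visit every term, run a
// full edit-distance computation on each, and stop collecting once
// maxExpansions matches have been found.  Names and semantics here are
// illustrative assumptions, not Lucene's actual FuzzyQuery code.
public class BruteForceFuzzy {
    public static List<String> expand(List<String> terms, String query,
                                      int maxDistance, int maxExpansions) {
        List<String> matches = new ArrayList<>();
        for (String t : terms) {                      // every term is scanned
            if (distance(t, query) <= maxDistance) {
                matches.add(t);
                if (matches.size() >= maxExpansions) {
                    break;                            // the cap limits the work
                }
            }
        }
        return matches;
    }

    // Plain O(n*m) dynamic-programming edit distance.
    static int distance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
            }
        }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        List<String> terms = List.of("cat", "cart", "cot", "dog");
        System.out.println(expand(terms, "cat", 1, 10));  // [cat, cart, cot]
        System.out.println(expand(terms, "cat", 1, 2));   // [cat, cart]
    }
}
```

The automaton path sidesteps both costs at once: it seeks directly to candidate terms instead of scanning everything, and the DFA accepts or rejects each candidate in time linear in its length.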
On Fri, Jul 23, 2010 at 10:59 AM, <ka...@nokia.com> wrote:
> Thanks!
>
> FuzzyQuery will do for my purposes, for the interim. But I suspect that
> FuzzyQuery could be made a lot more efficient if it were rebuilt on top of
> Automaton, no? I understand that this would be a trunk project.
>
> Karl
>
--
Robert Muir
rcmuir@gmail.com
RE: LevenshteinFilter proposal
Posted by ka...@nokia.com.
Thanks!
FuzzyQuery will do for my purposes, for the interim. But I suspect that FuzzyQuery could be made a lot more efficient if it were rebuilt on top of Automaton, no? I understand that this would be a trunk project.
Karl
From: ext Uwe Schindler [mailto:uwe@thetaphi.de]
Sent: Friday, July 23, 2010 10:45 AM
To: dev@lucene.apache.org
Subject: RE: LevenshteinFilter proposal
Automaton is only in Lucene/Solr Trunk. To get a filter out of FuzzyQuery, use MultiTermQueryWrapperFilter(new FuzzyQuery(...))
-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de
RE: LevenshteinFilter proposal
Posted by Uwe Schindler <uw...@thetaphi.de>.
Automaton is only in Lucene/Solr Trunk. To get a filter out of FuzzyQuery,
use MultiTermQueryWrapperFilter(new FuzzyQuery(…))
-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de
From: karl.wright@nokia.com [mailto:karl.wright@nokia.com]
Sent: Friday, July 23, 2010 4:25 PM
To: dev@lucene.apache.org
Subject: LevenshteinFilter proposal
Hi Folks,
I'm very interested in using (or developing!) a Levenshtein Filter within
the family of Solr Filter objects. I don't see such a class today anywhere.
I see how the AutomatonQuery object would permit such a thing to be built,
but to date I don't know of anyone who has built one. Do you? If not, I'm
willing to give it a whirl. Also, AutomatonQuery doesn't seem to come up
when I look for it in the javadocs for Lucene - can you point me in the
correct direction?
Thanks!
Karl