You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucene.apache.org by "Robert Muir (JIRA)" <ji...@apache.org> on 2009/01/06 19:02:44 UTC

[jira] Created: (LUCENE-1513) fastss fuzzyquery

fastss fuzzyquery
-----------------

                 Key: LUCENE-1513
                 URL: https://issues.apache.org/jira/browse/LUCENE-1513
             Project: Lucene - Java
          Issue Type: New Feature
          Components: contrib/*
            Reporter: Robert Muir
            Priority: Minor


code for doing fuzzyqueries with fastssWC algorithm.

FuzzyIndexer: given a lucene field, it enumerates all terms and creates an auxiliary offline index for fuzzy queries.
FastFuzzyQuery: similar to fuzzy query except it queries the auxiliary index to retrieve a candidate list. this list is then verified with levenstein algorithm.

sorry but the code is a bit messy... what I'm actually using is very different from this so its pretty much untested. but at least you can see whats going on or fix it up.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Closed: (LUCENE-1513) fastss fuzzyquery

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir closed LUCENE-1513.
-------------------------------

    Resolution: Not A Problem

For Lucene, LUCENE-2089 will always be faster than even FastSS, as our FuzzyQuery is really a top-N query, and we can exploit properties of the priority queue to make it even faster.

LUCENE-2089 also works without any auxiliary index or data structures, just solely on lucene's terms dict, so it works great for updates/NRT/whatever, no back compat problems.

I'm cancelling this issue as the alternative is superior in every aspect.

> fastss fuzzyquery
> -----------------
>
>                 Key: LUCENE-1513
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1513
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/*
>            Reporter: Robert Muir
>            Priority: Minor
>         Attachments: fastSSfuzzy.zip
>
>
> code for doing fuzzyqueries with fastssWC algorithm.
> FuzzyIndexer: given a lucene field, it enumerates all terms and creates an auxiliary offline index for fuzzy queries.
> FastFuzzyQuery: similar to fuzzy query except it queries the auxiliary index to retrieve a candidate list. this list is then verified with levenstein algorithm.
> sorry but the code is a bit messy... what I'm actually using is very different from this so its pretty much untested. but at least you can see whats going on or fix it up.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1513) fastss fuzzyquery

Posted by "Otis Gospodnetic (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12661302#action_12661302 ] 

Otis Gospodnetic commented on LUCENE-1513:
------------------------------------------

I feel like I missed some FastSS discussion on the list.... was there one?

I took a quick look at the paper and the code.  Is the following the general idea:
# index "fuzzy"/"misspelled" terms in addition to the normal terms (=> larger index, slower indexing).  How much fuzziness one wants to allow or handle is decided at index time.
# rewrite the query to include variations/misspellings of each terms and use that to search (=> more clauses, slower than normal search, but faster than the "normal" fuzzy query whose speed depends on the number of indexed terms)
?

Quick code comments:
* Need to add ASL
* Need to replace tabs with 2 spaces and formatting in FuzzyHitCollector
* No @author
* Unit test if possible
* Should FastSSwC not be able to take a variable K?
* Should variables named after types (e.g. "set" in public static String getNeighborhoodString(Set<String> set) { ) be renamed, so they describe what's in them instead? (easier to understand API?)


> fastss fuzzyquery
> -----------------
>
>                 Key: LUCENE-1513
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1513
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/*
>            Reporter: Robert Muir
>            Priority: Minor
>         Attachments: fastSSfuzzy.zip
>
>
> code for doing fuzzyqueries with fastssWC algorithm.
> FuzzyIndexer: given a lucene field, it enumerates all terms and creates an auxiliary offline index for fuzzy queries.
> FastFuzzyQuery: similar to fuzzy query except it queries the auxiliary index to retrieve a candidate list. this list is then verified with levenstein algorithm.
> sorry but the code is a bit messy... what I'm actually using is very different from this so its pretty much untested. but at least you can see whats going on or fix it up.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: [jira] Commented: (LUCENE-1513) fastss fuzzyquery

Posted by Robert Muir <rc...@gmail.com>.
for the k=1 case in my mind your last comment might not really be that much
slower than storing the additional data... sounds worth investigating

On Tue, Jan 6, 2009 at 8:04 PM, robert engels <re...@ix.netcom.com> wrote:

> I think you would need to store the position in the stream using position
> == to the k factor. Pretty straightforward, both for indexing and for
> searching.
> I think if you want the utmost in performance this is the way to go.
>
> If you don't want to store all of the additional data, I still think a
> better fuzzy search can be done without the external index entirely. As I
> see it, the external index's sole purpose (in your case) is to provide
> "indexed" words at a certain edit distance given a certain source word.
>  Using a combination of inverting the alg, and a binary selective search on
> the term index.
>
> On Jan 6, 2009, at 5:44 PM, Robert Muir wrote:
>
> robert theres only one problem i see: i don't see how you can do a single
> search since fastssWC returns some false positives (with k=1 it will still
> return some things with ED of 2). maybe if you store the deletion position
> information as a payload (thus using original fastss where there are no
> false positives) it would work though. i looked at storing position
> information but it appeared like it might be complex and the api was (is)
> still marked experimental so i didn't go that route.
>
> i also agree lucene index might not be the best possible data structure...
> just convenient thats all. i used it because i store other things related to
> the term besides deletion neighborhoods for my fuzzy matching.
>
> i guess i'll also mention that i do think storage size should be a big
> consideration. you really don't need this kind of stuff unless you are
> searching pretty big indexes in the first place (for <= few million docs the
> default fuzzy is probably just fine for a lot of people).
>
> for me, the whole thing was about turning 30second queries into 1 second
> queries by removing a linear algorithm, i didn't really optimize much beyond
> that because i was just very happy to have reasonable performance..
>
> On Tue, Jan 6, 2009 at 6:26 PM, robert engels <re...@ix.netcom.com>wrote:
>
>> I understand now.
>> The index in my case would definitely be MUCH larger, but I think it would
>> perform better, as you only need to do a single search - for obert (if you
>> assume it was a misspelling).
>>
>> In your case you would eventually do an OR search in the lucene index for
>> all possible matches (robert, roberta, roberto, ...) which could be much
>> larger with some commonly prefixed/postfixed words).
>>
>> Classic performance vs. size trade-off.  In your case where it is not for
>> misspellings, the performance difference might be worthwhile.
>>
>> Still, in your case, I am not sure using a Lucene index as the external
>> index is appropriate. Maybe a simple BTREE (Derby?) index of (word,edit
>> permutation) with a a key on both would allow easy search and update. If
>> implemented as a service, some intelligent caching of common misspellings
>> could really improve the performance.
>>
>> On Jan 6, 2009, at 4:29 PM, Robert Muir wrote:
>>
>>
>>
>> On Tue, Jan 6, 2009 at 5:15 PM, robert engels <re...@ix.netcom.com>wrote:
>>
>>>  It is definitely going to increase the index size, but not any more than
>>> than the external one would (if my understanding is correct).
>>> The nice thing is that you don't have to try and keep documents numbers
>>> in sync - it will be automatic.
>>>
>>> Maybe I don't understand what your external index is storing. Given that
>>> the document contains 'robert' but the user enters' obert', what is the
>>> process to find the matching documents?
>>>
>>
>> heres a simple example. neighborhood stored for robert is 'robert obert
>> rbert roert ...'  this is indexed in a tokenized field.
>>
>> at query time user typoes robert and enters 'tobert'. again neighborhood
>> is generated 'tobert obert tbert ...'
>> the system does a query on tobert OR obert OR tbert ... and robert is
>> returned because 'obert' is present in both neighborhoods.
>> in this example, by storing k=1 deletions you guarantee to satisfy all
>> edit distance matches <= 1 without linear scan.
>> you get some false positives too with this approach, thats why what comes
>> back is only a CANDIDATE and true edit distance must be used to verify. this
>> might be tricky to do with your method, i don't know.
>>
>>
>>
>>
>>>
>>> Is the external index essentially a constant list, that given obert, the
>>> source words COULD BE robert, tobert, reobert etc., and it contains no
>>> document information so:
>>>
>>
>> no. see above, you generate all possible 1-character deletions of the
>> index term and store them, then at query time you generate all possible
>> 1-character deletions of the query term. basically, LUCENE and LUBENE are 1
>> character different, but they are the same (LUENE) if you delete 1 character
>> from both of them. so you dont need to store LUCENE LUBENE LUDENE, you just
>> store LUENE.
>>
>>>
>>> given the source word X, and an edit distance k, you ask the external
>>> dictionary for possible indexed words, and it returns the list, and then use
>>> search lucene using each of those words?
>>>
>>> If the above is the case, it certainly seems you could generate this list
>>> in real-time rather efficiently with no IO (unless the external index only
>>> stores words which HAVE BEEN indexed).
>>>
>>> I think the confusion may be because I understand Otis's comments, but
>>> they don't seem to match what you are stating.
>>>
>>> Essentially performing any term match requires efficient
>>> searching/matching of the term index. If this is efficient enough, I don't
>>> think either process is needed - just an improved real-time fuzzy
>>> possibilities word generator.
>>>
>>> On Jan 6, 2009, at 3:58 PM, Robert Muir wrote:
>>>
>>> i see, your idea would definitely simplify some things.
>>>
>>> What about the index size difference between this approach and using
>>> separate index? Would this separate field increase index size?
>>>
>>> I guess my line of thinking is if you have 10 docs with robert, with
>>> separate index you just have robert, and its deletion neighborhood one time.
>>> with this approach you have the same thing, but at least you must have
>>> document numbers and the other inverted index stuff with each neighborhood
>>> term. would this be a significant change to size and/or performance? and
>>> since the documents have multiple terms there is additional positional
>>> information for slop factor for each neighborhood term...
>>>
>>> i think its worth investigating, maybe performance would actually be
>>> better, just curious. i think i boxed myself in to auxiliary index because
>>> of some other irrelevant thigns i am doing.
>>>
>>> On Tue, Jan 6, 2009 at 4:42 PM, robert engels <re...@ix.netcom.com>wrote:
>>>
>>>>  I don't think that is the case. You will have single deletion
>>>> neighborhood. The number of unique terms in the field is going to be the
>>>> union of the deletion dictionaries of each source term.
>>>> For example, given the following documents A which have field 'X' with
>>>> value best, and document B with value jest (and k == 1).
>>>>
>>>> A will generate est bst, bet, bes, B will generate est, jest, jst, jes
>>>>
>>>> so field FieldXFuzzy contains
>>>> (est:AB,bst:A,bet:A,bes:A,jest:B,jst:B,jes)
>>>>
>>>> I don't think the storage requirement is any greater doing it this way.
>>>>
>>>>
>>>> 3.2.1 Indexing
>>>> For all words in a dictionary, and a given number of edit operations k,
>>>> FastSS
>>>> generates all variant spellings recursively and save them as tuples of
>>>> type
>>>> v′ ∈ Ud (v, k) → (v, x) where v is a dictionary word and x a list of
>>>> deletion
>>>> positions.
>>>>
>>>> Theorem 5. Index uses O(nmk+1) space, as it stores al l the variants
>>>> for n
>>>> dictionary words of length m with k mismatches.
>>>>
>>>>
>>>> 3.2.2 Retrieval
>>>> For a query p and edit distance k, first generate the neighborhood Ud (p,
>>>> k).
>>>> Then compare the words in the neighborhood with the index, and find
>>>> matching candidates. Compare deletion positions for each candidate with
>>>> the deletion positions in U(p, k), using Theorem 4.
>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Robert Muir
>>> rcmuir@gmail.com
>>>
>>>
>>>
>>
>>
>> --
>> Robert Muir
>> rcmuir@gmail.com
>>
>>
>>
>
>
> --
> Robert Muir
> rcmuir@gmail.com
>
>
>


-- 
Robert Muir
rcmuir@gmail.com

Re: [jira] Commented: (LUCENE-1513) fastss fuzzyquery

Posted by robert engels <re...@ix.netcom.com>.
I think you would need to store the position in the stream using  
position == to the k factor. Pretty straightforward, both for  
indexing and for searching.

I think if you want the utmost in performance this is the way to go.

If you don't want to store all of the additional data, I still think  
a better fuzzy search can be done without the external index  
entirely. As I see it, the external index's sole purpose (in your  
case) is to provide "indexed" words at a certain edit distance given  
a certain source word.  Using a combination of inverting the alg, and  
a binary selective search on the term index.

On Jan 6, 2009, at 5:44 PM, Robert Muir wrote:

> robert theres only one problem i see: i don't see how you can do a  
> single search since fastssWC returns some false positives (with k=1  
> it will still return some things with ED of 2). maybe if you store  
> the deletion position information as a payload (thus using original  
> fastss where there are no false positives) it would work though. i  
> looked at storing position information but it appeared like it  
> might be complex and the api was (is) still marked experimental so  
> i didn't go that route.
>
> i also agree lucene index might not be the best possible data  
> structure... just convenient thats all. i used it because i store  
> other things related to the term besides deletion neighborhoods for  
> my fuzzy matching.
>
> i guess i'll also mention that i do think storage size should be a  
> big consideration. you really don't need this kind of stuff unless  
> you are searching pretty big indexes in the first place (for <= few  
> million docs the default fuzzy is probably just fine for a lot of  
> people).
>
> for me, the whole thing was about turning 30second queries into 1  
> second queries by removing a linear algorithm, i didn't really  
> optimize much beyond that because i was just very happy to have  
> reasonable performance..
>
> On Tue, Jan 6, 2009 at 6:26 PM, robert engels  
> <re...@ix.netcom.com> wrote:
> I understand now.
>
> The index in my case would definitely be MUCH larger, but I think  
> it would perform better, as you only need to do a single search -  
> for obert (if you assume it was a misspelling).
>
> In your case you would eventually do an OR search in the lucene  
> index for all possible matches (robert, roberta, roberto, ...)  
> which could be much larger with some commonly prefixed/postfixed  
> words).
>
> Classic performance vs. size trade-off.  In your case where it is  
> not for misspellings, the performance difference might be worthwhile.
>
> Still, in your case, I am not sure using a Lucene index as the  
> external index is appropriate. Maybe a simple BTREE (Derby?) index  
> of (word,edit permutation) with a a key on both would allow easy  
> search and update. If implemented as a service, some intelligent  
> caching of common misspellings could really improve the performance.
>
> On Jan 6, 2009, at 4:29 PM, Robert Muir wrote:
>
>>
>>
>> On Tue, Jan 6, 2009 at 5:15 PM, robert engels  
>> <re...@ix.netcom.com> wrote:
>> It is definitely going to increase the index size, but not any  
>> more than than the external one would (if my understanding is  
>> correct).
>>
>> The nice thing is that you don't have to try and keep documents  
>> numbers in sync - it will be automatic.
>>
>> Maybe I don't understand what your external index is storing.  
>> Given that the document contains 'robert' but the user enters'  
>> obert', what is the process to find the matching documents?
>>
>> heres a simple example. neighborhood stored for robert is 'robert  
>> obert rbert roert ...'  this is indexed in a tokenized field.
>>
>> at query time user typoes robert and enters 'tobert'. again  
>> neighborhood is generated 'tobert obert tbert ...'
>> the system does a query on tobert OR obert OR tbert ... and robert  
>> is returned because 'obert' is present in both neighborhoods.
>> in this example, by storing k=1 deletions you guarantee to satisfy  
>> all edit distance matches <= 1 without linear scan.
>> you get some false positives too with this approach, thats why  
>> what comes back is only a CANDIDATE and true edit distance must be  
>> used to verify. this might be tricky to do with your method, i  
>> don't know.
>>
>>
>>
>>
>> Is the external index essentially a constant list, that given  
>> obert, the source words COULD BE robert, tobert, reobert etc., and  
>> it contains no document information so:
>>
>> no. see above, you generate all possible 1-character deletions of  
>> the index term and store them, then at query time you generate all  
>> possible 1-character deletions of the query term. basically,  
>> LUCENE and LUBENE are 1 character different, but they are the same  
>> (LUENE) if you delete 1 character from both of them. so you dont  
>> need to store LUCENE LUBENE LUDENE, you just store LUENE.
>>
>> given the source word X, and an edit distance k, you ask the  
>> external dictionary for possible indexed words, and it returns the  
>> list, and then use search lucene using each of those words?
>>
>> If the above is the case, it certainly seems you could generate  
>> this list in real-time rather efficiently with no IO (unless the  
>> external index only stores words which HAVE BEEN indexed).
>>
>> I think the confusion may be because I understand Otis's comments,  
>> but they don't seem to match what you are stating.
>>
>> Essentially performing any term match requires efficient searching/ 
>> matching of the term index. If this is efficient enough, I don't  
>> think either process is needed - just an improved real-time fuzzy  
>> possibilities word generator.
>>
>> On Jan 6, 2009, at 3:58 PM, Robert Muir wrote:
>>
>>> i see, your idea would definitely simplify some things.
>>>
>>> What about the index size difference between this approach and  
>>> using separate index? Would this separate field increase index size?
>>>
>>> I guess my line of thinking is if you have 10 docs with robert,  
>>> with separate index you just have robert, and its deletion  
>>> neighborhood one time. with this approach you have the same  
>>> thing, but at least you must have document numbers and the other  
>>> inverted index stuff with each neighborhood term. would this be a  
>>> significant change to size and/or performance? and since the  
>>> documents have multiple terms there is additional positional  
>>> information for slop factor for each neighborhood term...
>>>
>>> i think its worth investigating, maybe performance would actually  
>>> be better, just curious. i think i boxed myself in to auxiliary  
>>> index because of some other irrelevant thigns i am doing.
>>>
>>> On Tue, Jan 6, 2009 at 4:42 PM, robert engels  
>>> <re...@ix.netcom.com> wrote:
>>> I don't think that is the case. You will have single deletion  
>>> neighborhood. The number of unique terms in the field is going to  
>>> be the union of the deletion dictionaries of each source term.
>>>
>>> For example, given the following documents A which have field 'X'  
>>> with value best, and document B with value jest (and k == 1).
>>>
>>> A will generate est bst, bet, bes, B will generate est, jest,  
>>> jst, jes
>>>
>>> so field FieldXFuzzy contains  
>>> (est:AB,bst:A,bet:A,bes:A,jest:B,jst:B,jes)
>>>
>>> I don't think the storage requirement is any greater doing it  
>>> this way.
>>>
>>>
>>> 3.2.1 Indexing
>>> For all words in a dictionary, and a given number of edit  
>>> operations k, FastSS
>>> generates all variant spellings recursively and save them as  
>>> tuples of type
>>> v′ ∈ Ud (v, k) → (v, x) where v is a dictionary word and x a  
>>> list of deletion
>>> positions.
>>>
>>> Theorem 5. Index uses O(nmk+1) space, as it stores al l the  
>>> variants for n
>>> dictionary words of length m with k mismatches.
>>>
>>>
>>> 3.2.2 Retrieval
>>> For a query p and edit distance k, first generate the  
>>> neighborhood Ud (p, k).
>>> Then compare the words in the neighborhood with the index, and find
>>> matching candidates. Compare deletion positions for each  
>>> candidate with
>>> the deletion positions in U(p, k), using Theorem 4.
>>>
>>>
>>>
>>>
>>>
>>> -- 
>>> Robert Muir
>>> rcmuir@gmail.com
>>
>>
>>
>>
>> -- 
>> Robert Muir
>> rcmuir@gmail.com
>
>
>
>
> -- 
> Robert Muir
> rcmuir@gmail.com


Re: [jira] Commented: (LUCENE-1513) fastss fuzzyquery

Posted by Robert Muir <rc...@gmail.com>.
robert theres only one problem i see: i don't see how you can do a single
search since fastssWC returns some false positives (with k=1 it will still
return some things with ED of 2). maybe if you store the deletion position
information as a payload (thus using original fastss where there are no
false positives) it would work though. i looked at storing position
information but it appeared like it might be complex and the api was (is)
still marked experimental so i didn't go that route.

i also agree lucene index might not be the best possible data structure...
just convenient thats all. i used it because i store other things related to
the term besides deletion neighborhoods for my fuzzy matching.

i guess i'll also mention that i do think storage size should be a big
consideration. you really don't need this kind of stuff unless you are
searching pretty big indexes in the first place (for <= few million docs the
default fuzzy is probably just fine for a lot of people).

for me, the whole thing was about turning 30second queries into 1 second
queries by removing a linear algorithm, i didn't really optimize much beyond
that because i was just very happy to have reasonable performance..

On Tue, Jan 6, 2009 at 6:26 PM, robert engels <re...@ix.netcom.com> wrote:

> I understand now.
> The index in my case would definitely be MUCH larger, but I think it would
> perform better, as you only need to do a single search - for obert (if you
> assume it was a misspelling).
>
> In your case you would eventually do an OR search in the lucene index for
> all possible matches (robert, roberta, roberto, ...) which could be much
> larger with some commonly prefixed/postfixed words).
>
> Classic performance vs. size trade-off.  In your case where it is not for
> misspellings, the performance difference might be worthwhile.
>
> Still, in your case, I am not sure using a Lucene index as the external
> index is appropriate. Maybe a simple BTREE (Derby?) index of (word,edit
> permutation) with a a key on both would allow easy search and update. If
> implemented as a service, some intelligent caching of common misspellings
> could really improve the performance.
>
> On Jan 6, 2009, at 4:29 PM, Robert Muir wrote:
>
>
>
> On Tue, Jan 6, 2009 at 5:15 PM, robert engels <re...@ix.netcom.com>wrote:
>
>>  It is definitely going to increase the index size, but not any more than
>> than the external one would (if my understanding is correct).
>> The nice thing is that you don't have to try and keep documents numbers in
>> sync - it will be automatic.
>>
>> Maybe I don't understand what your external index is storing. Given that
>> the document contains 'robert' but the user enters' obert', what is the
>> process to find the matching documents?
>>
>
> heres a simple example. neighborhood stored for robert is 'robert obert
> rbert roert ...'  this is indexed in a tokenized field.
>
> at query time user typoes robert and enters 'tobert'. again neighborhood is
> generated 'tobert obert tbert ...'
> the system does a query on tobert OR obert OR tbert ... and robert is
> returned because 'obert' is present in both neighborhoods.
> in this example, by storing k=1 deletions you guarantee to satisfy all edit
> distance matches <= 1 without linear scan.
> you get some false positives too with this approach, thats why what comes
> back is only a CANDIDATE and true edit distance must be used to verify. this
> might be tricky to do with your method, i don't know.
>
>
>
>
>>
>> Is the external index essentially a constant list, that given obert, the
>> source words COULD BE robert, tobert, reobert etc., and it contains no
>> document information so:
>>
>
> no. see above, you generate all possible 1-character deletions of the index
> term and store them, then at query time you generate all possible
> 1-character deletions of the query term. basically, LUCENE and LUBENE are 1
> character different, but they are the same (LUENE) if you delete 1 character
> from both of them. so you dont need to store LUCENE LUBENE LUDENE, you just
> store LUENE.
>
>>
>> given the source word X, and an edit distance k, you ask the external
>> dictionary for possible indexed words, and it returns the list, and then use
>> search lucene using each of those words?
>>
>> If the above is the case, it certainly seems you could generate this list
>> in real-time rather efficiently with no IO (unless the external index only
>> stores words which HAVE BEEN indexed).
>>
>> I think the confusion may be because I understand Otis's comments, but
>> they don't seem to match what you are stating.
>>
>> Essentially performing any term match requires efficient
>> searching/matching of the term index. If this is efficient enough, I don't
>> think either process is needed - just an improved real-time fuzzy
>> possibilities word generator.
>>
>> On Jan 6, 2009, at 3:58 PM, Robert Muir wrote:
>>
>> i see, your idea would definitely simplify some things.
>>
>> What about the index size difference between this approach and using
>> separate index? Would this separate field increase index size?
>>
>> I guess my line of thinking is if you have 10 docs with robert, with
>> separate index you just have robert, and its deletion neighborhood one time.
>> with this approach you have the same thing, but at least you must have
>> document numbers and the other inverted index stuff with each neighborhood
>> term. would this be a significant change to size and/or performance? and
>> since the documents have multiple terms there is additional positional
>> information for slop factor for each neighborhood term...
>>
>> i think its worth investigating, maybe performance would actually be
>> better, just curious. i think i boxed myself in to auxiliary index because
>> of some other irrelevant thigns i am doing.
>>
>> On Tue, Jan 6, 2009 at 4:42 PM, robert engels <re...@ix.netcom.com>wrote:
>>
>>>  I don't think that is the case. You will have single deletion
>>> neighborhood. The number of unique terms in the field is going to be the
>>> union of the deletion dictionaries of each source term.
>>> For example, given the following documents A which have field 'X' with
>>> value best, and document B with value jest (and k == 1).
>>>
>>> A will generate est bst, bet, bes, B will generate est, jest, jst, jes
>>>
>>> so field FieldXFuzzy contains (est:AB,bst:A,bet:A,bes:A,jest:B,jst:B,jes)
>>>
>>> I don't think the storage requirement is any greater doing it this way.
>>>
>>>
>>> 3.2.1 Indexing
>>> For all words in a dictionary, and a given number of edit operations k,
>>> FastSS
>>> generates all variant spellings recursively and save them as tuples of
>>> type
>>> v′ ∈ Ud (v, k) → (v, x) where v is a dictionary word and x a list of
>>> deletion
>>> positions.
>>>
>>> Theorem 5. Index uses O(nmk+1) space, as it stores al l the variants for
>>> n
>>> dictionary words of length m with k mismatches.
>>>
>>>
>>> 3.2.2 Retrieval
>>> For a query p and edit distance k, first generate the neighborhood Ud (p,
>>> k).
>>> Then compare the words in the neighborhood with the index, and find
>>> matching candidates. Compare deletion positions for each candidate with
>>> the deletion positions in U(p, k), using Theorem 4.
>>>
>>>
>>>
>>
>>
>> --
>> Robert Muir
>> rcmuir@gmail.com
>>
>>
>>
>
>
> --
> Robert Muir
> rcmuir@gmail.com
>
>
>


-- 
Robert Muir
rcmuir@gmail.com

Re: [jira] Commented: (LUCENE-1513) fastss fuzzyquery

Posted by robert engels <re...@ix.netcom.com>.
I understand now.

The index in my case would definitely be MUCH larger, but I think it  
would perform better, as you only need to do a single search - for  
obert (if you assume it was a misspelling).

In your case you would eventually do an OR search in the lucene index  
for all possible matches (robert, roberta, roberto, ...) which could  
be much larger with some commonly prefixed/postfixed words).

Classic performance vs. size trade-off.  In your case where it is not  
for misspellings, the performance difference might be worthwhile.

Still, in your case, I am not sure using a Lucene index as the  
external index is appropriate. Maybe a simple BTREE (Derby?) index of  
(word,edit permutation) with a a key on both would allow easy search  
and update. If implemented as a service, some intelligent caching of  
common misspellings could really improve the performance.

On Jan 6, 2009, at 4:29 PM, Robert Muir wrote:

>
>
> On Tue, Jan 6, 2009 at 5:15 PM, robert engels  
> <re...@ix.netcom.com> wrote:
> It is definitely going to increase the index size, but not any more  
> than than the external one would (if my understanding is correct).
>
> The nice thing is that you don't have to try and keep documents  
> numbers in sync - it will be automatic.
>
> Maybe I don't understand what your external index is storing. Given  
> that the document contains 'robert' but the user enters' obert',  
> what is the process to find the matching documents?
>
> heres a simple example. neighborhood stored for robert is 'robert  
> obert rbert roert ...'  this is indexed in a tokenized field.
>
> at query time user typoes robert and enters 'tobert'. again  
> neighborhood is generated 'tobert obert tbert ...'
> the system does a query on tobert OR obert OR tbert ... and robert  
> is returned because 'obert' is present in both neighborhoods.
> in this example, by storing k=1 deletions you guarantee to satisfy  
> all edit distance matches <= 1 without linear scan.
> you get some false positives too with this approach, thats why what  
> comes back is only a CANDIDATE and true edit distance must be used  
> to verify. this might be tricky to do with your method, i don't know.
>
>
>
>
> Is the external index essentially a constant list, that given  
> obert, the source words COULD BE robert, tobert, reobert etc., and  
> it contains no document information so:
>
> no. see above, you generate all possible 1-character deletions of  
> the index term and store them, then at query time you generate all  
> possible 1-character deletions of the query term. basically, LUCENE  
> and LUBENE are 1 character different, but they are the same (LUENE)  
> if you delete 1 character from both of them. so you dont need to  
> store LUCENE LUBENE LUDENE, you just store LUENE.
>
> given the source word X, and an edit distance k, you ask the  
> external dictionary for possible indexed words, and it returns the  
> list, and then use search lucene using each of those words?
>
> If the above is the case, it certainly seems you could generate  
> this list in real-time rather efficiently with no IO (unless the  
> external index only stores words which HAVE BEEN indexed).
>
> I think the confusion may be because I understand Otis's comments,  
> but they don't seem to match what you are stating.
>
> Essentially performing any term match requires efficient searching/ 
> matching of the term index. If this is efficient enough, I don't  
> think either process is needed - just an improved real-time fuzzy  
> possibilities word generator.
>
> On Jan 6, 2009, at 3:58 PM, Robert Muir wrote:
>
>> i see, your idea would definitely simplify some things.
>>
>> What about the index size difference between this approach and  
>> using separate index? Would this separate field increase index size?
>>
>> I guess my line of thinking is if you have 10 docs with robert,  
>> with separate index you just have robert, and its deletion  
>> neighborhood one time. with this approach you have the same thing,  
>> but at least you must have document numbers and the other inverted  
>> index stuff with each neighborhood term. would this be a  
>> significant change to size and/or performance? and since the  
>> documents have multiple terms there is additional positional  
>> information for slop factor for each neighborhood term...
>>
>> i think its worth investigating, maybe performance would actually  
>> be better, just curious. i think i boxed myself in to auxiliary  
>> index because of some other irrelevant thigns i am doing.
>>
>> On Tue, Jan 6, 2009 at 4:42 PM, robert engels  
>> <re...@ix.netcom.com> wrote:
>> I don't think that is the case. You will have single deletion  
>> neighborhood. The number of unique terms in the field is going to  
>> be the union of the deletion dictionaries of each source term.
>>
>> For example, given the following documents A which have field 'X'  
>> with value best, and document B with value jest (and k == 1).
>>
>> A will generate est bst, bet, bes, B will generate est, jest, jst,  
>> jes
>>
>> so field FieldXFuzzy contains  
>> (est:AB,bst:A,bet:A,bes:A,jest:B,jst:B,jes)
>>
>> I don't think the storage requirement is any greater doing it this  
>> way.
>>
>>
>> 3.2.1 Indexing
>> For all words in a dictionary, and a given number of edit  
>> operations k, FastSS
>> generates all variant spellings recursively and save them as  
>> tuples of type
>> v′ ∈ Ud (v, k) → (v, x) where v is a dictionary word and x a  
>> list of deletion
>> positions.
>>
>> Theorem 5. Index uses O(nmk+1) space, as it stores al l the  
>> variants for n
>> dictionary words of length m with k mismatches.
>>
>>
>> 3.2.2 Retrieval
>> For a query p and edit distance k, first generate the neighborhood  
>> Ud (p, k).
>> Then compare the words in the neighborhood with the index, and find
>> matching candidates. Compare deletion positions for each candidate  
>> with
>> the deletion positions in U(p, k), using Theorem 4.
>>
>>
>>
>>
>>
>> -- 
>> Robert Muir
>> rcmuir@gmail.com
>
>
>
>
> -- 
> Robert Muir
> rcmuir@gmail.com


Re: [jira] Commented: (LUCENE-1513) fastss fuzzyquery

Posted by Robert Muir <rc...@gmail.com>.
On Tue, Jan 6, 2009 at 5:15 PM, robert engels <re...@ix.netcom.com> wrote:

> It is definitely going to increase the index size, but not any more than
> than the external one would (if my understanding is correct).
> The nice thing is that you don't have to try and keep documents numbers in
> sync - it will be automatic.
>
> Maybe I don't understand what your external index is storing. Given that
> the document contains 'robert' but the user enters' obert', what is the
> process to find the matching documents?
>

heres a simple example. neighborhood stored for robert is 'robert obert
rbert roert ...'  this is indexed in a tokenized field.

at query time user typoes robert and enters 'tobert'. again neighborhood is
generated 'tobert obert tbert ...'
the system does a query on tobert OR obert OR tbert ... and robert is
returned because 'obert' is present in both neighborhoods.
in this example, by storing k=1 deletions you guarantee to satisfy all edit
distance matches <= 1 without linear scan.
you get some false positives too with this approach, thats why what comes
back is only a CANDIDATE and true edit distance must be used to verify. this
might be tricky to do with your method, i don't know.




>
> Is the external index essentially a constant list, that given obert, the
> source words COULD BE robert, tobert, reobert etc., and it contains no
> document information so:
>

no. see above, you generate all possible 1-character deletions of the index
term and store them, then at query time you generate all possible
1-character deletions of the query term. basically, LUCENE and LUBENE are 1
character different, but they are the same (LUENE) if you delete 1 character
from both of them. so you dont need to store LUCENE LUBENE LUDENE, you just
store LUENE.

>
> given the source word X, and an edit distance k, you ask the external
> dictionary for possible indexed words, and it returns the list, and then use
> search lucene using each of those words?
>
> If the above is the case, it certainly seems you could generate this list
> in real-time rather efficiently with no IO (unless the external index only
> stores words which HAVE BEEN indexed).
>
> I think the confusion may be because I understand Otis's comments, but they
> don't seem to match what you are stating.
>
> Essentially performing any term match requires efficient searching/matching
> of the term index. If this is efficient enough, I don't think either process
> is needed - just an improved real-time fuzzy possibilities word generator.
>
> On Jan 6, 2009, at 3:58 PM, Robert Muir wrote:
>
> i see, your idea would definitely simplify some things.
>
> What about the index size difference between this approach and using
> separate index? Would this separate field increase index size?
>
> I guess my line of thinking is if you have 10 docs with robert, with
> separate index you just have robert, and its deletion neighborhood one time.
> with this approach you have the same thing, but at least you must have
> document numbers and the other inverted index stuff with each neighborhood
> term. would this be a significant change to size and/or performance? and
> since the documents have multiple terms there is additional positional
> information for slop factor for each neighborhood term...
>
> i think its worth investigating, maybe performance would actually be
> better, just curious. i think i boxed myself in to auxiliary index because
> of some other irrelevant thigns i am doing.
>
> On Tue, Jan 6, 2009 at 4:42 PM, robert engels <re...@ix.netcom.com>wrote:
>
>>  I don't think that is the case. You will have single deletion
>> neighborhood. The number of unique terms in the field is going to be the
>> union of the deletion dictionaries of each source term.
>> For example, given the following documents A which have field 'X' with
>> value best, and document B with value jest (and k == 1).
>>
>> A will generate est bst, bet, bes, B will generate est, jest, jst, jes
>>
>> so field FieldXFuzzy contains (est:AB,bst:A,bet:A,bes:A,jest:B,jst:B,jes)
>>
>> I don't think the storage requirement is any greater doing it this way.
>>
>>
>> 3.2.1 Indexing
>> For all words in a dictionary, and a given number of edit operations k,
>> FastSS
>> generates all variant spellings recursively and save them as tuples of
>> type
>> v′ ∈ Ud (v, k) → (v, x) where v is a dictionary word and x a list of
>> deletion
>> positions.
>>
>> Theorem 5. Index uses O(nmk+1) space, as it stores al l the variants for
>> n
>> dictionary words of length m with k mismatches.
>>
>>
>> 3.2.2 Retrieval
>> For a query p and edit distance k, first generate the neighborhood Ud (p,
>> k).
>> Then compare the words in the neighborhood with the index, and find
>> matching candidates. Compare deletion positions for each candidate with
>> the deletion positions in U(p, k), using Theorem 4.
>>
>>
>>
>
>
> --
> Robert Muir
> rcmuir@gmail.com
>
>
>


-- 
Robert Muir
rcmuir@gmail.com

Re: [jira] Commented: (LUCENE-1513) fastss fuzzyquery

Posted by Robert Muir <rc...@gmail.com>.
i see what you are saying here. this is different than fastss but sounds
nice for spelling correction.

i suppose one reason why i like fastss is for my application i need the true
complete edit distance, i'm actually not using it for spelling correction
but as a first step for other tasks.

but maybe fastss is too complex for the general case (spelling correction)
and we should do what you mention below. i think either way it would be nice
to have some kind of fuzzy matching in lucene that isn't a linear scan.

On Tue, Jan 6, 2009 at 5:29 PM, robert engels <re...@ix.netcom.com> wrote:

> To clarify a statement in the last email.
> To generate the 'possible source words' in real-time is not a difficult as
> first seems, if you assume some sort of first character prefix (which is
> what it appears google does).
>
> For example, assume the user typed 'robrt' instead of 'robert'. You see
> that this word has very low frequency (or none), so you want to find
> possible misspellings, so do a fuzzy search starting with r. But this search
> can be optimized, because as the edit/delete position moves to the right,
> the prefix remains the same, so these possibilities can be quickly skipped.
>
> If you don't find any words with high enough frequency as possible edit
> distances, try [a-z]robrt, assuming the user may have dropped the first
> character (possibly try this in "know combination" order, rather than alpha
> (i.e. try sr before nr).
>

> For example, searching google for 'robrt engels' works. So does 'obert
> engels', so does 'robt engels', all ask me if I meant 'robert engels', but
> searching for 'obrt engels' does not.
>
> On Jan 6, 2009, at 4:15 PM, robert engels wrote:
>
> It is definitely going to increase the index size, but not any more than
> than the external one would (if my understanding is correct).
> The nice thing is that you don't have to try and keep documents numbers in
> sync - it will be automatic.
>
> Maybe I don't understand what your external index is storing. Given that
> the document contains 'robert' but the user enters' obert', what is the
> process to find the matching documents?
>
> Is the external index essentially a constant list, that given obert, the
> source words COULD BE robert, tobert, reobert etc., and it contains no
> document information so:
>
> given the source word X, and an edit distance k, you ask the external
> dictionary for possible indexed words, and it returns the list, and then use
> search lucene using each of those words?
>
> If the above is the case, it certainly seems you could generate this list
> in real-time rather efficiently with no IO (unless the external index only
> stores words which HAVE BEEN indexed).
>
> I think the confusion may be because I understand Otis's comments, but they
> don't seem to match what you are stating.
>
> Essentially performing any term match requires efficient searching/matching
> of the term index. If this is efficient enough, I don't think either process
> is needed - just an improved real-time fuzzy possibilities word generator.
>
> On Jan 6, 2009, at 3:58 PM, Robert Muir wrote:
>
> i see, your idea would definitely simplify some things.
>
> What about the index size difference between this approach and using
> separate index? Would this separate field increase index size?
>
> I guess my line of thinking is if you have 10 docs with robert, with
> separate index you just have robert, and its deletion neighborhood one time.
> with this approach you have the same thing, but at least you must have
> document numbers and the other inverted index stuff with each neighborhood
> term. would this be a significant change to size and/or performance? and
> since the documents have multiple terms there is additional positional
> information for slop factor for each neighborhood term...
>
> i think its worth investigating, maybe performance would actually be
> better, just curious. i think i boxed myself in to auxiliary index because
> of some other irrelevant thigns i am doing.
>
> On Tue, Jan 6, 2009 at 4:42 PM, robert engels <re...@ix.netcom.com>wrote:
>
>>  I don't think that is the case. You will have single deletion
>> neighborhood. The number of unique terms in the field is going to be the
>> union of the deletion dictionaries of each source term.
>> For example, given the following documents A which have field 'X' with
>> value best, and document B with value jest (and k == 1).
>>
>> A will generate est bst, bet, bes, B will generate est, jest, jst, jes
>>
>> so field FieldXFuzzy contains (est:AB,bst:A,bet:A,bes:A,jest:B,jst:B,jes)
>>
>> I don't think the storage requirement is any greater doing it this way.
>>
>>
>> 3.2.1 Indexing
>> For all words in a dictionary, and a given number of edit operations k,
>> FastSS
>> generates all variant spellings recursively and save them as tuples of
>> type
>> v′ ∈ Ud (v, k) → (v, x) where v is a dictionary word and x a list of
>> deletion
>> positions.
>>
>> Theorem 5. Index uses O(nmk+1) space, as it stores al l the variants for
>> n
>> dictionary words of length m with k mismatches.
>>
>>
>> 3.2.2 Retrieval
>> For a query p and edit distance k, first generate the neighborhood Ud (p,
>> k).
>> Then compare the words in the neighborhood with the index, and find
>> matching candidates. Compare deletion positions for each candidate with
>> the deletion positions in U(p, k), using Theorem 4.
>>
>>
>>
>
>
> --
> Robert Muir
> rcmuir@gmail.com
>
>
>
>


-- 
Robert Muir
rcmuir@gmail.com

Re: [jira] Commented: (LUCENE-1513) fastss fuzzyquery

Posted by robert engels <re...@ix.netcom.com>.
To clarify a statement in the last email.

To generate the 'possible source words' in real-time is not a  
difficult as first seems, if you assume some sort of first character  
prefix (which is what it appears google does).

For example, assume the user typed 'robrt' instead of 'robert'. You  
see that this word has very low frequency (or none), so you want to  
find possible misspellings, so do a fuzzy search starting with r. But  
this search can be optimized, because as the edit/delete position  
moves to the right, the prefix remains the same, so these  
possibilities can be quickly skipped.

If you don't find any words with high enough frequency as possible  
edit distances, try [a-z]robrt, assuming the user may have dropped  
the first character (possibly try this in "know combination" order,  
rather than alpha (i.e. try sr before nr).

For example, searching google for 'robrt engels' works. So does  
'obert engels', so does 'robt engels', all ask me if I meant 'robert  
engels', but searching for 'obrt engels' does not.

On Jan 6, 2009, at 4:15 PM, robert engels wrote:

> It is definitely going to increase the index size, but not any more  
> than than the external one would (if my understanding is correct).
>
> The nice thing is that you don't have to try and keep documents  
> numbers in sync - it will be automatic.
>
> Maybe I don't understand what your external index is storing. Given  
> that the document contains 'robert' but the user enters' obert',  
> what is the process to find the matching documents?
>
> Is the external index essentially a constant list, that given  
> obert, the source words COULD BE robert, tobert, reobert etc., and  
> it contains no document information so:
>
> given the source word X, and an edit distance k, you ask the  
> external dictionary for possible indexed words, and it returns the  
> list, and then use search lucene using each of those words?
>
> If the above is the case, it certainly seems you could generate  
> this list in real-time rather efficiently with no IO (unless the  
> external index only stores words which HAVE BEEN indexed).
>
> I think the confusion may be because I understand Otis's comments,  
> but they don't seem to match what you are stating.
>
> Essentially performing any term match requires efficient searching/ 
> matching of the term index. If this is efficient enough, I don't  
> think either process is needed - just an improved real-time fuzzy  
> possibilities word generator.
>
> On Jan 6, 2009, at 3:58 PM, Robert Muir wrote:
>
>> i see, your idea would definitely simplify some things.
>>
>> What about the index size difference between this approach and  
>> using separate index? Would this separate field increase index size?
>>
>> I guess my line of thinking is if you have 10 docs with robert,  
>> with separate index you just have robert, and its deletion  
>> neighborhood one time. with this approach you have the same thing,  
>> but at least you must have document numbers and the other inverted  
>> index stuff with each neighborhood term. would this be a  
>> significant change to size and/or performance? and since the  
>> documents have multiple terms there is additional positional  
>> information for slop factor for each neighborhood term...
>>
>> i think its worth investigating, maybe performance would actually  
>> be better, just curious. i think i boxed myself in to auxiliary  
>> index because of some other irrelevant thigns i am doing.
>>
>> On Tue, Jan 6, 2009 at 4:42 PM, robert engels  
>> <re...@ix.netcom.com> wrote:
>> I don't think that is the case. You will have single deletion  
>> neighborhood. The number of unique terms in the field is going to  
>> be the union of the deletion dictionaries of each source term.
>>
>> For example, given the following documents A which have field 'X'  
>> with value best, and document B with value jest (and k == 1).
>>
>> A will generate est bst, bet, bes, B will generate est, jest, jst,  
>> jes
>>
>> so field FieldXFuzzy contains  
>> (est:AB,bst:A,bet:A,bes:A,jest:B,jst:B,jes)
>>
>> I don't think the storage requirement is any greater doing it this  
>> way.
>>
>>
>> 3.2.1 Indexing
>> For all words in a dictionary, and a given number of edit  
>> operations k, FastSS
>> generates all variant spellings recursively and save them as  
>> tuples of type
>> v′ ∈ Ud (v, k) → (v, x) where v is a dictionary word and x a  
>> list of deletion
>> positions.
>>
>> Theorem 5. Index uses O(nmk+1) space, as it stores al l the  
>> variants for n
>> dictionary words of length m with k mismatches.
>>
>>
>> 3.2.2 Retrieval
>> For a query p and edit distance k, first generate the neighborhood  
>> Ud (p, k).
>> Then compare the words in the neighborhood with the index, and find
>> matching candidates. Compare deletion positions for each candidate  
>> with
>> the deletion positions in U(p, k), using Theorem 4.
>>
>>
>>
>>
>>
>> -- 
>> Robert Muir
>> rcmuir@gmail.com
>


Re: [jira] Commented: (LUCENE-1513) fastss fuzzyquery

Posted by robert engels <re...@ix.netcom.com>.
It is definitely going to increase the index size, but not any more  
than than the external one would (if my understanding is correct).

The nice thing is that you don't have to try and keep documents  
numbers in sync - it will be automatic.

Maybe I don't understand what your external index is storing. Given  
that the document contains 'robert' but the user enters' obert', what  
is the process to find the matching documents?

Is the external index essentially a constant list, that given obert,  
the source words COULD BE robert, tobert, reobert etc., and it  
contains no document information so:

given the source word X, and an edit distance k, you ask the external  
dictionary for possible indexed words, and it returns the list, and  
then use search lucene using each of those words?

If the above is the case, it certainly seems you could generate this  
list in real-time rather efficiently with no IO (unless the external  
index only stores words which HAVE BEEN indexed).

I think the confusion may be because I understand Otis's comments,  
but they don't seem to match what you are stating.

Essentially performing any term match requires efficient searching/ 
matching of the term index. If this is efficient enough, I don't  
think either process is needed - just an improved real-time fuzzy  
possibilities word generator.

On Jan 6, 2009, at 3:58 PM, Robert Muir wrote:

> i see, your idea would definitely simplify some things.
>
> What about the index size difference between this approach and  
> using separate index? Would this separate field increase index size?
>
> I guess my line of thinking is if you have 10 docs with robert,  
> with separate index you just have robert, and its deletion  
> neighborhood one time. with this approach you have the same thing,  
> but at least you must have document numbers and the other inverted  
> index stuff with each neighborhood term. would this be a  
> significant change to size and/or performance? and since the  
> documents have multiple terms there is additional positional  
> information for slop factor for each neighborhood term...
>
> i think its worth investigating, maybe performance would actually  
> be better, just curious. i think i boxed myself in to auxiliary  
> index because of some other irrelevant thigns i am doing.
>
> On Tue, Jan 6, 2009 at 4:42 PM, robert engels  
> <re...@ix.netcom.com> wrote:
> I don't think that is the case. You will have single deletion  
> neighborhood. The number of unique terms in the field is going to  
> be the union of the deletion dictionaries of each source term.
>
> For example, given the following documents A which have field 'X'  
> with value best, and document B with value jest (and k == 1).
>
> A will generate est bst, bet, bes, B will generate est, jest, jst, jes
>
> so field FieldXFuzzy contains  
> (est:AB,bst:A,bet:A,bes:A,jest:B,jst:B,jes)
>
> I don't think the storage requirement is any greater doing it this  
> way.
>
>
> 3.2.1 Indexing
> For all words in a dictionary, and a given number of edit  
> operations k, FastSS
> generates all variant spellings recursively and save them as tuples  
> of type
> v′ ∈ Ud (v, k) → (v, x) where v is a dictionary word and x a  
> list of deletion
> positions.
>
> Theorem 5. Index uses O(nmk+1) space, as it stores al l the  
> variants for n
> dictionary words of length m with k mismatches.
>
>
> 3.2.2 Retrieval
> For a query p and edit distance k, first generate the neighborhood  
> Ud (p, k).
> Then compare the words in the neighborhood with the index, and find
> matching candidates. Compare deletion positions for each candidate  
> with
> the deletion positions in U(p, k), using Theorem 4.
>
>
>
>
>
> -- 
> Robert Muir
> rcmuir@gmail.com


Re: [jira] Commented: (LUCENE-1513) fastss fuzzyquery

Posted by Robert Muir <rc...@gmail.com>.
i see, your idea would definitely simplify some things.

What about the index size difference between this approach and using
separate index? Would this separate field increase index size?

I guess my line of thinking is if you have 10 docs with robert, with
separate index you just have robert, and its deletion neighborhood one time.
with this approach you have the same thing, but at least you must have
document numbers and the other inverted index stuff with each neighborhood
term. would this be a significant change to size and/or performance? and
since the documents have multiple terms there is additional positional
information for slop factor for each neighborhood term...

i think its worth investigating, maybe performance would actually be better,
just curious. i think i boxed myself in to auxiliary index because of some
other irrelevant thigns i am doing.

On Tue, Jan 6, 2009 at 4:42 PM, robert engels <re...@ix.netcom.com> wrote:

> I don't think that is the case. You will have single deletion neighborhood.
> The number of unique terms in the field is going to be the union of the
> deletion dictionaries of each source term.
> For example, given the following documents A which have field 'X' with
> value best, and document B with value jest (and k == 1).
>
> A will generate est bst, bet, bes, B will generate est, jest, jst, jes
>
> so field FieldXFuzzy contains (est:AB,bst:A,bet:A,bes:A,jest:B,jst:B,jes)
>
> I don't think the storage requirement is any greater doing it this way.
>
>
> 3.2.1 Indexing
> For all words in a dictionary, and a given number of edit operations k,
> FastSS
> generates all variant spellings recursively and save them as tuples of type
>
> v′ ∈ Ud (v, k) → (v, x) where v is a dictionary word and x a list of
> deletion
> positions.
>
> Theorem 5. Index uses O(nmk+1) space, as it stores al l the variants for n
>
> dictionary words of length m with k mismatches.
>
>
> 3.2.2 Retrieval
> For a query p and edit distance k, first generate the neighborhood Ud (p,
> k).
> Then compare the words in the neighborhood with the index, and find
> matching candidates. Compare deletion positions for each candidate with
> the deletion positions in U(p, k), using Theorem 4.
>
>
>


-- 
Robert Muir
rcmuir@gmail.com

Re: [jira] Commented: (LUCENE-1513) fastss fuzzyquery

Posted by robert engels <re...@ix.netcom.com>.
I don't think that is the case. You will have single deletion  
neighborhood. The number of unique terms in the field is going to be  
the union of the deletion dictionaries of each source term.

For example, given the following documents A which have field 'X'  
with value best, and document B with value jest (and k == 1).

A will generate est bst, bet, bes, B will generate est, jest, jst, jes

so field FieldXFuzzy contains  
(est:AB,bst:A,bet:A,bes:A,jest:B,jst:B,jes)

I don't think the storage requirement is any greater doing it this way.


3.2.1 Indexing
For all words in a dictionary, and a given number of edit operations  
k, FastSS
generates all variant spellings recursively and save them as tuples  
of type
v′ ∈ Ud (v, k) → (v, x) where v is a dictionary word and x a  
list of deletion
positions.

Theorem 5. Index uses O(nmk+1) space, as it stores al l the variants  
for n
dictionary words of length m with k mismatches.


3.2.2 Retrieval
For a query p and edit distance k, first generate the neighborhood Ud  
(p, k).
Then compare the words in the neighborhood with the index, and find
matching candidates. Compare deletion positions for each candidate with
the deletion positions in U(p, k), using Theorem 4.



Re: [jira] Commented: (LUCENE-1513) fastss fuzzyquery

Posted by Robert Muir <rc...@gmail.com>.
a deletion neighborhood can be pretty large (for example robert is something
like robert obert rbert robrt robet ...)
so if you have a 100 million docs with 1 billion words, but only 100k unique
terms, it definitely would be wasteful to have 1 billion deletion
neighborhoods when you only need 100k.

On Tue, Jan 6, 2009 at 4:02 PM, robert engels <re...@ix.netcom.com> wrote:

> Why not just create a new field for this? That is, if you have FieldA,
> create field FieldAFuzzy and put the various permutations there.
>
> The fuzzy scorer/parser can be changed to automatically use the XXXXFuzzy
> field when required.
>
> You could also store positions, and allow that the first term is the
> "closest", next is the second closest, etc. to add support for a slop
> factor.
>
> This is similar to the same way fast phonetic searches can be implemented.
>
> If you do it this way, you don't have any of the synchronization issues
> between the index and the external "fuzzy" index.
>
>
>
> On Jan 6, 2009, at 2:57 PM, Robert Muir (JIRA) wrote:
>
>
>>    [ https://issues.apache.org/jira/browse/LUCENE-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12661314#action_12661314
>> ]
>>
>> Robert Muir commented on LUCENE-1513:
>> -------------------------------------
>>
>> otis, discussion was on java-user.
>>
>> again, I apologize for the messy code. as mentioned there, my setup is
>> very specific to exactly what I am doing and in no way is this code ready.
>> But since i'm currently pretty busy with other things at work I just wanted
>> to put something up here for anyone else interested.
>>
>> theres the issues you mentioned, and also some i mentioned on java-user.
>> for example how to handle updates to indexes that introduce new terms (they
>> must be added to auxiliary index), or even if auxiliary index is the best
>> approach.
>>
>> the general idea is that instead of enumerating terms to find terms, the
>> deletion neighborhood as described in the paper is used instead. this way
>> search time is not linear based on number of terms. yes you are correct that
>> it only can guarantee edit distances of K which is determined at index time.
>> perhaps this should be configurable, but i hardcoded k=1 for simplicity. i
>> think its something like 80% of typos...
>>
>> as i mentioned on the list another idea is you could implement FastSS (not
>> the wC variant) with deletion positions maybe by using payloads. This would
>> require more space but eliminate the candidate verification step. maybe it
>> would be nice to have some of their other algorithms such as block-based,etc
>> available also.
>>
>>
>>
>>  fastss fuzzyquery
>>> -----------------
>>>
>>>                Key: LUCENE-1513
>>>                URL: https://issues.apache.org/jira/browse/LUCENE-1513
>>>            Project: Lucene - Java
>>>         Issue Type: New Feature
>>>         Components: contrib/*
>>>           Reporter: Robert Muir
>>>           Priority: Minor
>>>        Attachments: fastSSfuzzy.zip
>>>
>>>
>>> code for doing fuzzyqueries with fastssWC algorithm.
>>> FuzzyIndexer: given a lucene field, it enumerates all terms and creates
>>> an auxiliary offline index for fuzzy queries.
>>> FastFuzzyQuery: similar to fuzzy query except it queries the auxiliary
>>> index to retrieve a candidate list. this list is then verified with
>>> levenstein algorithm.
>>> sorry but the code is a bit messy... what I'm actually using is very
>>> different from this so its pretty much untested. but at least you can see
>>> whats going on or fix it up.
>>>
>>
>> --
>> This message is automatically generated by JIRA.
>> -
>> You can reply to this email to add a comment to the issue online.
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>
>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>
>


-- 
Robert Muir
rcmuir@gmail.com

Re: [jira] Commented: (LUCENE-1513) fastss fuzzyquery

Posted by robert engels <re...@ix.netcom.com>.
Why not just create a new field for this? That is, if you have  
FieldA, create field FieldAFuzzy and put the various permutations there.

The fuzzy scorer/parser can be changed to automatically use the  
XXXXFuzzy field when required.

You could also store positions, and allow that the first term is the  
"closest", next is the second closest, etc. to add support for a slop  
factor.

This is similar to the same way fast phonetic searches can be  
implemented.

If you do it this way, you don't have any of the synchronization  
issues between the index and the external "fuzzy" index.


On Jan 6, 2009, at 2:57 PM, Robert Muir (JIRA) wrote:

>
>     [ https://issues.apache.org/jira/browse/LUCENE-1513? 
> page=com.atlassian.jira.plugin.system.issuetabpanels:comment- 
> tabpanel&focusedCommentId=12661314#action_12661314 ]
>
> Robert Muir commented on LUCENE-1513:
> -------------------------------------
>
> otis, discussion was on java-user.
>
> again, I apologize for the messy code. as mentioned there, my setup  
> is very specific to exactly what I am doing and in no way is this  
> code ready. But since i'm currently pretty busy with other things  
> at work I just wanted to put something up here for anyone else  
> interested.
>
> theres the issues you mentioned, and also some i mentioned on java- 
> user. for example how to handle updates to indexes that introduce  
> new terms (they must be added to auxiliary index), or even if  
> auxiliary index is the best approach.
>
> the general idea is that instead of enumerating terms to find  
> terms, the deletion neighborhood as described in the paper is used  
> instead. this way search time is not linear based on number of  
> terms. yes you are correct that it only can guarantee edit  
> distances of K which is determined at index time. perhaps this  
> should be configurable, but i hardcoded k=1 for simplicity. i think  
> its something like 80% of typos...
>
> as i mentioned on the list another idea is you could implement  
> FastSS (not the wC variant) with deletion positions maybe by using  
> payloads. This would require more space but eliminate the candidate  
> verification step. maybe it would be nice to have some of their  
> other algorithms such as block-based,etc available also.
>
>
>
>> fastss fuzzyquery
>> -----------------
>>
>>                 Key: LUCENE-1513
>>                 URL: https://issues.apache.org/jira/browse/ 
>> LUCENE-1513
>>             Project: Lucene - Java
>>          Issue Type: New Feature
>>          Components: contrib/*
>>            Reporter: Robert Muir
>>            Priority: Minor
>>         Attachments: fastSSfuzzy.zip
>>
>>
>> code for doing fuzzyqueries with fastssWC algorithm.
>> FuzzyIndexer: given a lucene field, it enumerates all terms and  
>> creates an auxiliary offline index for fuzzy queries.
>> FastFuzzyQuery: similar to fuzzy query except it queries the  
>> auxiliary index to retrieve a candidate list. this list is then  
>> verified with levenstein algorithm.
>> sorry but the code is a bit messy... what I'm actually using is  
>> very different from this so its pretty much untested. but at least  
>> you can see whats going on or fix it up.
>
> -- 
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1513) fastss fuzzyquery

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12661314#action_12661314 ] 

Robert Muir commented on LUCENE-1513:
-------------------------------------

otis, discussion was on java-user.

again, I apologize for the messy code. as mentioned there, my setup is very specific to exactly what I am doing and in no way is this code ready. But since i'm currently pretty busy with other things at work I just wanted to put something up here for anyone else interested.

theres the issues you mentioned, and also some i mentioned on java-user. for example how to handle updates to indexes that introduce new terms (they must be added to auxiliary index), or even if auxiliary index is the best approach.

the general idea is that instead of enumerating terms to find terms, the deletion neighborhood as described in the paper is used instead. this way search time is not linear based on number of terms. yes you are correct that it only can guarantee edit distances of K which is determined at index time. perhaps this should be configurable, but i hardcoded k=1 for simplicity. i think its something like 80% of typos...

as i mentioned on the list another idea is you could implement FastSS (not the wC variant) with deletion positions maybe by using payloads. This would require more space but eliminate the candidate verification step. maybe it would be nice to have some of their other algorithms such as block-based,etc available also. 



> fastss fuzzyquery
> -----------------
>
>                 Key: LUCENE-1513
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1513
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/*
>            Reporter: Robert Muir
>            Priority: Minor
>         Attachments: fastSSfuzzy.zip
>
>
> code for doing fuzzyqueries with fastssWC algorithm.
> FuzzyIndexer: given a lucene field, it enumerates all terms and creates an auxiliary offline index for fuzzy queries.
> FastFuzzyQuery: similar to fuzzy query except it queries the auxiliary index to retrieve a candidate list. this list is then verified with levenstein algorithm.
> sorry but the code is a bit messy... what I'm actually using is very different from this so its pretty much untested. but at least you can see whats going on or fix it up.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Updated: (LUCENE-1513) fastss fuzzyquery

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-1513:
--------------------------------

    Attachment: fastSSfuzzy.zip

> fastss fuzzyquery
> -----------------
>
>                 Key: LUCENE-1513
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1513
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/*
>            Reporter: Robert Muir
>            Priority: Minor
>         Attachments: fastSSfuzzy.zip
>
>
> code for doing fuzzyqueries with fastssWC algorithm.
> FuzzyIndexer: given a lucene field, it enumerates all terms and creates an auxiliary offline index for fuzzy queries.
> FastFuzzyQuery: similar to fuzzy query except it queries the auxiliary index to retrieve a candidate list. this list is then verified with levenstein algorithm.
> sorry but the code is a bit messy... what I'm actually using is very different from this so its pretty much untested. but at least you can see whats going on or fix it up.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1513) fastss fuzzyquery

Posted by "Otis Gospodnetic (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12661286#action_12661286 ] 

Otis Gospodnetic commented on LUCENE-1513:
------------------------------------------

References provided by Glen Newton:

- Fast Similarity Search in Large Dictionaries. http://fastss.csg.uzh.ch/
- Paper: Fast Similarity Search in Large Dictionaries.
http://fastss.csg.uzh.ch/ifi-2007.02.pdf
- FastSimilarSearch.java http://fastss.csg.uzh.ch/FastSimilarSearch.java
- Paper: Fast Similarity Search in Peer-to-Peer Networks.
  http://www.globis.ethz.ch/script/publication/download?docid=506


> fastss fuzzyquery
> -----------------
>
>                 Key: LUCENE-1513
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1513
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/*
>            Reporter: Robert Muir
>            Priority: Minor
>         Attachments: fastSSfuzzy.zip
>
>
> code for doing fuzzyqueries with fastssWC algorithm.
> FuzzyIndexer: given a lucene field, it enumerates all terms and creates an auxiliary offline index for fuzzy queries.
> FastFuzzyQuery: similar to fuzzy query except it queries the auxiliary index to retrieve a candidate list. this list is then verified with levenstein algorithm.
> sorry but the code is a bit messy... what I'm actually using is very different from this so its pretty much untested. but at least you can see whats going on or fix it up.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org