Posted to java-user@lucene.apache.org by Aad Nales <aa...@rotterdam-cs.com> on 2004/09/08 09:07:00 UTC

combining open office spellchecker with Lucene

Hi All,

Before I start reinventing wheels I would like to do a short check to
see if anybody else has already tried this. A customer has requested us
to look into the possibility to perform a spell check on queries. So far
the most promising way of doing this seems to be to create an Analyzer
based on the spellchecker of OpenOffice. My question is: "has anybody
tried this before?" 

Cheers,
Aad


--
Aad Nales
aad.nales@rotterdam-cs.com, +31-(0)6 54 207 340 



---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


RE: frequent terms - Re: combining open office spellchecker with Lucene

Posted by Aad Nales <aa...@rotterdam-cs.com>.
Also,

You could use an alternative spellchecker for the 'checking' part and
use the Ngram algorithm for the 'suggestion' part. Only if the spell
'check' declares a word illegal would the 'suggestion' part perform its
magic.


cheers,
Aad

Doug Cutting wrote:

> David Spencer wrote:
> 
>> [1] The user enters a query like:
>>     recursize descent parser
>>
> >> [2] The search code parses this and sees that the 1st word is not a 
> >> term in the index, but the next 2 are. So it ignores the last 2 terms 
> >> ("recursive" and "descent") and suggests alternatives to 
> >> "recursize"...thus if any term is in the index, regardless of 
> >> frequency,  it is left as-is.
> >>
> >> I guess you're saying that, if the user enters a term that appears in 
> >> the index and thus is sort of spelled correctly ( as it exists in some 
> >> doc), then we use the heuristic that any sufficiently large doc 
> >> collection will have tons of misspellings, so we assume that rare 
> >> terms in the query might be misspelled (i.e. not what the user 
> >> intended) and we suggest alternatives to these words too (in addition 
> >> to the words in the query that are not in the index at all).
> 
> 
> Almost.
> 
> If the user enters "a recursize purser", then: "a", which is in, say, 
>  >50% of the documents, is probably spelled correctly and "recursize", 
> which is in zero documents, is probably misspelled.  But what about 
> "purser"?  If we run the spell check algorithm on "purser" and generate 
> "parser", should we show it to the user?  If "purser" occurs in 1% of 
> documents and "parser" occurs in 5%, then we probably should, since 
> "parser" is a more common word than "purser".  But if "parser" only 
> occurs in 1% of the documents and purser occurs in 5%, then we probably 
> shouldn't bother suggesting "parser".
> 
> If you wanted to get really fancy then you could check how frequently 
> combinations of query terms occur, i.e., does "purser" or "parser" occur 
> more frequently near "descent".  But that gets expensive.

I updated the code to have an optional popularity filter - if true then 
it only returns matches more popular (frequent) than the word that is 
passed in for spelling correction.

If true (default) then for common words like "remove", no results are 
returned now, as expected:

http://www.searchmorph.com/kat/spell.jsp?s=remove

But if you set it to false (bottom slot in the form at the bottom of the 
page) then the algorithm happily looks for alternatives:

http://www.searchmorph.com/kat/spell.jsp?s=remove&min=2&max=5&maxd=5&maxr=10&bstart=2.0&bend=1.0&btranspose=1.0&popular=0

TBD I need to update the javadoc & repost the code I guess. Also, as per 
an earlier post, I store simple transpositions for words in the 
ngram-index.

-- Dave

> 
> Doug
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Re: frequent terms - Re: combining open office spellchecker with Lucene

Posted by David Spencer <da...@tropo.com>.
Doug Cutting wrote:

> David Spencer wrote:
> 
>> [1] The user enters a query like:
>>     recursize descent parser
>>
>> [2] The search code parses this and sees that the 1st word is not a 
>> term in the index, but the next 2 are. So it ignores the last 2 terms 
>> ("recursive" and "descent") and suggests alternatives to 
>> "recursize"...thus if any term is in the index, regardless of 
>> frequency,  it is left as-is.
>>
>> I guess you're saying that, if the user enters a term that appears in 
>> the index and thus is sort of spelled correctly ( as it exists in some 
>> doc), then we use the heuristic that any sufficiently large doc 
>> collection will have tons of misspellings, so we assume that rare 
>> terms in the query might be misspelled (i.e. not what the user 
>> intended) and we suggest alternatives to these words too (in addition 
>> to the words in the query that are not in the index at all).
> 
> 
> Almost.
> 
> If the user enters "a recursize purser", then: "a", which is in, say, 
>  >50% of the documents, is probably spelled correctly and "recursize", 
> which is in zero documents, is probably misspelled.  But what about 
> "purser"?  If we run the spell check algorithm on "purser" and generate 
> "parser", should we show it to the user?  If "purser" occurs in 1% of 
> documents and "parser" occurs in 5%, then we probably should, since 
> "parser" is a more common word than "purser".  But if "parser" only 
> occurs in 1% of the documents and purser occurs in 5%, then we probably 
> shouldn't bother suggesting "parser".
> 
> If you wanted to get really fancy then you could check how frequently 
> combinations of query terms occur, i.e., does "purser" or "parser" occur 
> more frequently near "descent".  But that gets expensive.

I updated the code to have an optional popularity filter - if true then 
it only returns matches more popular (frequent) than the word that is 
passed in for spelling correction.

If true (default) then for common words like "remove", no results are 
returned now, as expected:

http://www.searchmorph.com/kat/spell.jsp?s=remove

But if you set it to false (bottom slot in the form at the bottom of the 
page) then the algorithm happily looks for alternatives:

http://www.searchmorph.com/kat/spell.jsp?s=remove&min=2&max=5&maxd=5&maxr=10&bstart=2.0&bend=1.0&btranspose=1.0&popular=0

TBD I need to update the javadoc & repost the code I guess. Also, as per 
an earlier post, I store simple transpositions for words in the 
ngram-index.

-- Dave

> 
> Doug
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Re: frequent terms - Re: combining open office spellchecker with Lucene

Posted by David Spencer <da...@tropo.com>.
Doug Cutting wrote:

> David Spencer wrote:
> 
>> [1] The user enters a query like:
>>     recursize descent parser
>>
>> [2] The search code parses this and sees that the 1st word is not a 
>> term in the index, but the next 2 are. So it ignores the last 2 terms 
>> ("recursive" and "descent") and suggests alternatives to 
>> "recursize"...thus if any term is in the index, regardless of 
>> frequency,  it is left as-is.
>>
>> I guess you're saying that, if the user enters a term that appears in 
>> the index and thus is sort of spelled correctly ( as it exists in some 
>> doc), then we use the heuristic that any sufficiently large doc 
>> collection will have tons of misspellings, so we assume that rare 
>> terms in the query might be misspelled (i.e. not what the user 
>> intended) and we suggest alternatives to these words too (in addition 
>> to the words in the query that are not in the index at all).
> 
> 
> Almost.
> 
> If the user enters "a recursize purser", then: "a", which is in, say, 
>  >50% of the documents, is probably spelled correctly and "recursize", 
> which is in zero documents, is probably misspelled.  But what about 
> "purser"?  If we run the spell check algorithm on "purser" and generate 
> "parser", should we show it to the user?  If "purser" occurs in 1% of 
> documents and "parser" occurs in 5%, then we probably should, since 
> "parser" is a more common word than "purser".  But if "parser" only 
> occurs in 1% of the documents and purser occurs in 5%, then we probably 
> shouldn't bother suggesting "parser".

OK, sure, got it.
I'll give it a think and try to add this option to my just submitted 
spelling code.


> 
> If you wanted to get really fancy then you could check how frequently 
> combinations of query terms occur, i.e., does "purser" or "parser" occur 
> more frequently near "descent".  But that gets expensive.

Yeah, expensive for a large scale search engine, but probably 
appropriate for a desktop engine.

> 
> Doug
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Re: frequent terms - Re: combining open office spellchecker with Lucene

Posted by Doug Cutting <cu...@apache.org>.
David Spencer wrote:
> [1] The user enters a query like:
>     recursize descent parser
> 
> [2] The search code parses this and sees that the 1st word is not a term 
> in the index, but the next 2 are. So it ignores the last 2 terms 
> ("recursive" and "descent") and suggests alternatives to 
> "recursize"...thus if any term is in the index, regardless of frequency, 
>  it is left as-is.
> 
> I guess you're saying that, if the user enters a term that appears in 
> the index and thus is sort of spelled correctly ( as it exists in some 
> doc), then we use the heuristic that any sufficiently large doc 
> collection will have tons of misspellings, so we assume that rare terms 
> in the query might be misspelled (i.e. not what the user intended) and 
> we suggest alternatives to these words too (in addition to the words in 
> the query that are not in the index at all).

Almost.

If the user enters "a recursize purser", then: "a", which is in, say, 
 >50% of the documents, is probably spelled correctly and "recursize", 
which is in zero documents, is probably misspelled.  But what about 
"purser"?  If we run the spell check algorithm on "purser" and generate 
"parser", should we show it to the user?  If "purser" occurs in 1% of 
documents and "parser" occurs in 5%, then we probably should, since 
"parser" is a more common word than "purser".  But if "parser" only 
occurs in 1% of the documents and purser occurs in 5%, then we probably 
shouldn't bother suggesting "parser".

If you wanted to get really fancy then you could check how frequently 
combinations of query terms occur, i.e., does "purser" or "parser" occur 
more frequently near "descent".  But that gets expensive.

Doug

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


RE: frequent terms - Re: combining open office spellchecker with Lucene

Posted by Aad Nales <aa...@rotterdam-cs.com>.
Doug Cutting wrote:

> David Spencer wrote:
> 
>> Doug Cutting wrote:
>>
>>> And one should not try correction at all for terms which occur in a
>>> large proportion of the collection.
>>
>>
>>
>> I keep thinking over this one and I don't understand it. If a user
> >> misspells a word and the "did you mean" spelling correction algorithm 
> >> determines that a frequent term is a good suggestion, why not suggest 
> >> it? The very fact that it's common could mean that it's more likely 
> >> that the user wanted this word (well, the heuristic here is that users 
> >> frequently search for frequent terms, which is probably wrong, but 
> >> anyway..).
> 
> 
> I think you misunderstood me.  What I meant to say was that if the term 
> the user enters is very common then spell correction may be skipped. 
> Very common words which are similar to the term the user entered should 
> of course be shown.  But if the user's term is very common one need not 
> even attempt to find similarly-spelled words.  Is that any better?

Yes, sure, thx, I understand now - but maybe not - the context I was 
thinking of was something like this:

[1] The user enters a query like:
     recursize descent parser

[2] The search code parses this and sees that the 1st word is not a term 
in the index, but the next 2 are. So it ignores the last 2 terms 
("recursive" and "descent") and suggests alternatives to 
"recursize"...thus if any term is in the index, regardless of frequency, 
  it is left as-is.

>>>>
My idea is to first execute the query and only execute the 'spell check'
if the number of results is lower than a certain threshold (see the
sketch just below).

Secondly, I would like to use the 'stemming' functionality that MySpell
offers for all stuff that is written to the index, together with the POS
information.

Thirdly, I want to regularly scan the index for often-used words to be
added to the list of 'approved' terms. This would serve another purpose
of the customer, which is building a synonym index for Dutch words used
in an educational context.

But having read all the input I think that using the index itself for a
first spellcheck is probably not a bad start. 
>>>>
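
A minimal sketch of that threshold idea against the classic Hits API; the
THRESHOLD value and the suggest() helper are placeholders for whatever
spellchecker gets plugged in, not code from this thread:

    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;

    public class SpellCheckGate {
        private static final int THRESHOLD = 10; // made-up cutoff

        /** Run the query; only fall back to the spellchecker when few hits come back. */
        public static String[] searchOrSuggest(IndexSearcher searcher, Query query,
                                               String[] queryTerms) throws Exception {
            Hits hits = searcher.search(query);   // classic Lucene Hits API
            if (hits.length() >= THRESHOLD) {
                return null;                      // enough results, no "did you mean"
            }
            // too few results: ask the (pluggable) spellchecker for alternatives
            return suggest(queryTerms);
        }

        private static String[] suggest(String[] terms) {
            // stand-in for MySpell / the ngram speller discussed in this thread
            return terms;
        }
    }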


I guess you're saying that, if the user enters a term that appears in 
the index and thus is sort of spelled correctly ( as it exists in some 
doc), then we use the heuristic that any sufficiently large doc 
collection will have tons of misspellings, so we assume that rare terms 
in the query might be misspelled (i.e. not what the user intended) and 
we suggest alternatives to these words too (in addition to the words in 
the query that are not in the index at all).


> 
> Doug
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Re: frequent terms - Re: combining open office spellchecker with Lucene

Posted by David Spencer <da...@tropo.com>.
Doug Cutting wrote:

> David Spencer wrote:
> 
>> Doug Cutting wrote:
>>
>>> And one should not try correction at all for terms which occur in a 
>>> large proportion of the collection.
>>
>>
>>
>> I keep thinking over this one and I don't understand it. If a user 
>> misspells a word and the "did you mean" spelling correction algorithm 
>> determines that a frequent term is a good suggestion, why not suggest 
>> it? The very fact that it's common could mean that it's more likely 
>> that the user wanted this word (well, the heuristic here is that users 
>> frequently search for frequent terms, which is probably wrong, but 
>> anyway..).
> 
> 
> I think you misunderstood me.  What I meant to say was that if the term 
> the user enters is very common then spell correction may be skipped. 
> Very common words which are similar to the term the user entered should 
> of course be shown.  But if the user's term is very common one need not 
> even attempt to find similarly-spelled words.  Is that any better?

Yes, sure, thx, I understand now - but maybe not - the context I was 
thinking of was something like this:

[1] The user enters a query like:
     recursize descent parser

[2] The search code parses this and sees that the 1st word is not a term 
in the index, but the next 2 are. So it ignores the last 2 terms 
("recursive" and "descent") and suggests alternatives to 
"recursize"...thus if any term is in the index, regardless of frequency, 
  it is left as-is.

I guess you're saying that, if the user enters a term that appears in 
the index and thus is sort of spelled correctly ( as it exists in some 
doc), then we use the heuristic that any sufficiently large doc 
collection will have tons of misspellings, so we assume that rare terms 
in the query might be misspelled (i.e. not what the user intended) and 
we suggest alternatives to these words too (in addition to the words in 
the query that are not in the index at all).


> 
> Doug
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Re: frequent terms - Re: combining open office spellchecker with Lucene

Posted by Doug Cutting <cu...@apache.org>.
David Spencer wrote:
> Doug Cutting wrote:
> 
>> And one should not try correction at all for terms which occur in a 
>> large proportion of the collection.
> 
> 
> I keep thinking over this one and I don't understand it. If a user 
> misspells a word and the "did you mean" spelling correction algorithm 
> determines that a frequent term is a good suggestion, why not suggest 
> it? The very fact that it's common could mean that it's more likely that 
> the user wanted this word (well, the heuristic here is that users 
> frequently search for frequent terms, which is probably wrong, but 
> anyway..).

I think you misunderstood me.  What I meant to say was that if the term 
the user enters is very common then spell correction may be skipped. 
Very common words which are similar to the term the user entered should 
of course be shown.  But if the user's term is very common one need not 
even attempt to find similarly-spelled words.  Is that any better?

Doug

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


frequent terms - Re: combining open office spellchecker with Lucene

Posted by David Spencer <da...@tropo.com>.
Doug Cutting wrote:

> Aad Nales wrote:
> 
>> Before I start reinventing wheels I would like to do a short check to
>> see if anybody else has already tried this. A customer has requested us
>> to look into the possibility to perform a spell check on queries. So far
>> the most promising way of doing this seems to be to create an Analyzer
>> based on the spellchecker of OpenOffice. My question is: "has anybody
>> tried this before?" 
> 
> 
> Note that a spell checker used with a search engine should use 
> collection frequency information.  That's to say, only "corrections" 
> which are more frequent in the collection than what the user entered 
> should be displayed.  Frequency information can also be used when 
> constructing the checker.  For example, one need never consider 
> proposing terms that occur in very few documents.  


> And one should not 
> try correction at all for terms which occur in a large proportion of the 
> collection.

I keep thinking over this one and I don't understand it. If a user 
misspells a word and the "did you mean" spelling correction algorithm 
determines that a frequent term is a good suggestion, why not suggest 
it? The very fact that it's common could mean that it's more likely that 
the user wanted this word (well, the heuristic here is that users 
frequently search for frequent terms, which is probably wrong, but 
anyway..).

I know in other contexts of IR frequent terms are penalized but in this 
context it seems that frequent terms should be fine...

-- Dave



> 
> Doug
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Re: combining open office spellchecker with Lucene

Posted by David Spencer <da...@tropo.com>.
eks dev wrote:

> Hi Doug,
> 
> 
>>Perhaps.  Are folks really better at spelling the
>>beginning of words?
> 
> 
> Yes they are. There were some comprehensive empirical
> studies on this topic. Winkler modification on Jaro
> string distance is based on this assumption (boosting
> similarity if first n, I think 4, chars match).
> Jaro-Winkler is well documented and some folks think
> that it is much more efficient and precise than plain
> Edit distance (of course for normal language, not
> numbers or so).
> I will try to dig-out some references from my disk on

Good ole Citeseer finds 2 docs that seem relevant:

http://citeseer.ist.psu.edu/cs?cs=1&q=Winkler+Jaro&submit=Documents&co=Citations&cm=50&cf=Any&ao=Citations&am=20&af=Any

I have some of the ngram spelling suggestion stuff, based on earlier 
msgs in this thread, working in my dev tree. I'll try to get a test site 
up later today for people to fool around with.


> this topic, if you are interested.
> 
> On another note,
> I would even suggest using Jaro-Winkler distance as
> default for fuzzy query. (one could configure max
> prefix required => prefix query to reduce number of
> distance calculations). This could speed up fuzzy
> search dramatically.
> 
> Hope this was helpful,
> Eks
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Re: combining open office spellchecker with Lucene

Posted by eks dev <ek...@yahoo.co.uk>.
Hi Doug,

> Perhaps.  Are folks really better at spelling the
> beginning of words?

Yes they are. There were some comprehensive empirical
studies on this topic. The Winkler modification of the Jaro
string distance is based on this assumption (boosting
similarity if first n, I think 4, chars match).
Jaro-Winkler is well documented and some folks think
that it is much more efficient and precise than plain
Edit distance (of course for normal language, not
numbers or so).
I will try to dig-out some references from my disk on
this topic, if you are interested.

On another note,
I would even suggest using Jaro-Winkler distance as
default for fuzzy query. (one could configure max
prefix required => prefix query to reduce number of
distance calculations). This could speed up fuzzy
search dramatically.
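
A self-contained sketch of Jaro-Winkler for anyone who wants to try it; this
is a textbook formulation (prefix boost of up to 4 chars, scaling factor 0.1),
not code from this thread:

    /** Plain-Java sketch of Jaro-Winkler similarity (1.0 = identical). */
    public class JaroWinkler {

        public static double jaroWinkler(String s1, String s2) {
            double jaro = jaro(s1, s2);
            // Winkler boost: common prefix of up to 4 chars, scaling factor 0.1
            int prefix = 0;
            int max = Math.min(4, Math.min(s1.length(), s2.length()));
            while (prefix < max && s1.charAt(prefix) == s2.charAt(prefix)) {
                prefix++;
            }
            return jaro + prefix * 0.1 * (1.0 - jaro);
        }

        private static double jaro(String s1, String s2) {
            int len1 = s1.length(), len2 = s2.length();
            if (len1 == 0 || len2 == 0) return 0.0;
            int window = Math.max(Math.max(len1, len2) / 2 - 1, 0);
            boolean[] m1 = new boolean[len1], m2 = new boolean[len2];
            int matches = 0;
            for (int i = 0; i < len1; i++) {                 // count matching chars
                int from = Math.max(0, i - window);
                int to = Math.min(len2 - 1, i + window);
                for (int j = from; j <= to; j++) {
                    if (!m2[j] && s1.charAt(i) == s2.charAt(j)) {
                        m1[i] = true; m2[j] = true; matches++;
                        break;
                    }
                }
            }
            if (matches == 0) return 0.0;
            int transpositions = 0;                          // half-transpositions
            for (int i = 0, j = 0; i < len1; i++) {
                if (!m1[i]) continue;
                while (!m2[j]) j++;
                if (s1.charAt(i) != s2.charAt(j)) transpositions++;
                j++;
            }
            double m = matches;
            return (m / len1 + m / len2 + (m - transpositions / 2.0) / m) / 3.0;
        }

        public static void main(String[] args) {
            System.out.println(jaroWinkler("recursize", "recursive")); // high: long shared prefix
            System.out.println(jaroWinkler("conts", "const"));
        }
    }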

Hope this was helpful,
Eks

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Re: combining open office spellchecker with Lucene

Posted by Doug Cutting <cu...@apache.org>.
David Spencer wrote:
> Good heuristics but are there any more precise, standard guidelines as 
> to how to balance or combine what I think are the following possible 
> criteria in suggesting a better choice:

Not that I know of.

> - ignore(penalize?) terms that are rare

I think this one is easy to threshold: ignore matching terms that are 
rarer than the term entered.

> - ignore(penalize?) terms that are common

This, in effect, falls out of the previous criterion.  A term that is 
very common will not have any matching terms that are more common.  As 
an optimization, you could avoid even looking for matching terms when a 
term is very common.
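
A minimal sketch of that optimization, assuming IndexReader.docFreq() and
numDocs(); the 10% cutoff is an arbitrary example, not a recommendation from
this thread:

    import java.io.IOException;

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;

    /** Sketch: skip spell correction entirely for terms found in a large share of docs. */
    public class CommonTermCheck {

        public static boolean worthCorrecting(IndexReader reader, String field, String word)
                throws IOException {
            int df = reader.docFreq(new Term(field, word));
            // e.g. present in more than 10% of documents: treat as "spelled correctly enough"
            return df <= reader.numDocs() / 10;
        }
    }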

> - terms that are closer (string distance) to the term entered are better

This is the meaty one.

> - terms that start w/ the same 'n' chars as the user's term are better

Perhaps.  Are folks really better at spelling the beginning of words?

Doug

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Re: combining open office spellchecker with Lucene

Posted by David Spencer <da...@tropo.com>.
Doug Cutting wrote:

> Aad Nales wrote:
> 
>> Before I start reinventing wheels I would like to do a short check to
>> see if anybody else has already tried this. A customer has requested us
>> to look into the possibility to perform a spell check on queries. So far
>> the most promising way of doing this seems to be to create an Analyzer
>> based on the spellchecker of OpenOffice. My question is: "has anybody
>> tried this before?" 
> 
> 
> Note that a spell checker used with a search engine should use 
> collection frequency information.  That's to say, only "corrections" 
> which are more frequent in the collection than what the user entered 
> should be displayed.  Frequency information can also be used when 
> constructing the checker.  For example, one need never consider 
> proposing terms that occur in very few documents.  And one should not 
> try correction at all for terms which occur in a large proportion of the 
> collection.

Good heuristics but are there any more precise, standard guidelines as 
to how to balance or combine what I think are the following possible 
criteria in suggesting a better choice:

- ignore(penalize?) terms that are rare
- ignore(penalize?) terms that are common
- terms that are closer (string distance) to the term entered are better
- terms that start w/ the same 'n' chars as the user's term are better





> 
> Doug
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Re: combining open office spellchecker with Lucene

Posted by Doug Cutting <cu...@apache.org>.
Aad Nales wrote:
> Before I start reinventing wheels I would like to do a short check to
> see if anybody else has already tried this. A customer has requested us
> to look into the possibility to perform a spell check on queries. So far
> the most promising way of doing this seems to be to create an Analyzer
> based on the spellchecker of OpenOffice. My question is: "has anybody
> tried this before?" 

Note that a spell checker used with a search engine should use 
collection frequency information.  That's to say, only "corrections" 
which are more frequent in the collection than what the user entered 
should be displayed.  Frequency information can also be used when 
constructing the checker.  For example, one need never consider 
proposing terms that occur in very few documents.  And one should not 
try correction at all for terms which occur in a large proportion of the 
collection.
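
A minimal sketch of that frequency rule using IndexReader.docFreq(); the field
name and the candidate list are placeholders, not code from this thread:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;

    /** Sketch: keep only "corrections" that occur in more docs than the user's term. */
    public class FrequencyFilter {

        public static List filter(IndexReader reader, String field,
                                  String entered, List candidates) throws IOException {
            int enteredFreq = reader.docFreq(new Term(field, entered));
            List kept = new ArrayList();
            for (int i = 0; i < candidates.size(); i++) {
                String candidate = (String) candidates.get(i);
                // only suggest words that are more frequent in the collection
                if (reader.docFreq(new Term(field, candidate)) > enteredFreq) {
                    kept.add(candidate);
                }
            }
            return kept;
        }
    }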

Doug

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene

Posted by Morus Walter <mo...@tanto.de>.
David Spencer writes:
> > 
> > could you put the current version of your code on that website as a java
> 
> Weblog entry updated:
> 
> http://searchmorph.com/weblog/index.php?id=23
> 
thanks
> 
> Great suggestion and thanks for that idiom - I should know such things 
> by now. To clarify the "issue", it's just a performance one, not other 
> functionality...anyway I put in the code - and to be scientific I 
> benchmarked it two times before the change and two times after - and the 
> results were surprisingly the same both times (1:45 to 1:50 with an index 
> that takes up > 200MB). Probably there are cases where this will run 
> faster, and the code seems more "correct" now so it's in.
> 
Ahh, I see, you check the field later.
The logging made me think you index all fields you loop over, in which
case one might get unwanted words into the ngram index.
> 
> 
> > 
> > 
> > An interesting application of this might be an ngram-Index enhanced version
> > of the FuzzyQuery. While this introduces more complexity on the indexing
> > side, it might be a large speedup for fuzzy searches.
> 
> I was also thinking of reviewing the list to see if anyone had done a "Jaro 
> Winkler" fuzzy query yet and doing that....
> 
I went in another direction, and changed the ngram index and search
to use a similarity that computes 

   m * m / ( n1 * n2)

where m is the number of matches and n1 is the number of ngrams in the
query and n2 is the number of ngrams in the word.
(At least if I got that right; I'm not sure if I understand all parts
of the similarity class correctly)
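
For anyone following along, a plain-Java sketch of what that score works out
to for a pair of words, ignoring the Lucene plumbing (the helper and n=3 are
just for the example, not from the posted code):

    import java.util.ArrayList;
    import java.util.List;

    /** Sketch: score = m*m / (n1*n2); m = shared ngrams, n1/n2 = ngram counts. */
    public class NGramOverlap {

        static List ngrams(String word, int n) {           // all n-grams of a word
            List grams = new ArrayList();
            for (int i = 0; i + n <= word.length(); i++) {
                grams.add(word.substring(i, i + n));
            }
            return grams;
        }

        static double score(String query, String candidate, int n) {
            List q = ngrams(query, n);
            List c = ngrams(candidate, n);
            if (q.isEmpty() || c.isEmpty()) return 0.0;
            List rest = new ArrayList(c);
            int m = 0;
            for (int i = 0; i < q.size(); i++) {            // count matching grams
                if (rest.remove(q.get(i))) m++;
            }
            return (double) (m * m) / (q.size() * c.size());
        }

        public static void main(String[] args) {
            System.out.println(score("conts", "const", 3)); // shares only "con" -> 1/9
        }
    }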

After removing the document boost in the ngram index based on the 
word frequency in the original index I find the results pretty good.
My data is a number of encyclopedias and dictionaries and I only use the
headwords for the ngram index. Term frequency doesn't seem relevant
in this case.

I still use the levenshtein distance to modify the score and sort according
to  score / distance  but in most cases this does not make a difference.
So I'll probably drop the distance calculation completely.

I also see little difference between using 2- and 3-grams on the one hand
and only using 2-grams on the other. So I'll presumably drop the 3-grams.

I'm not sure, if the similarity I use, is useful in general, but I 
attached it to this message in case someone is interested.
Note that you need to set the similarity for the index writer and searcher
and thus have to reindex in case you want to give it a try.

Morus



Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene

Posted by David Spencer <da...@tropo.com>.
Morus Walter wrote:

> Hi David,
> 
>>Based on this mail I wrote a "ngram speller" for Lucene. It runs in 2 
>>phases. First you build a "fast lookup index" as mentioned above. Then 
>>to correct a word you do a query in this index based on the ngrams in 
>>the misspelled word.
>>
>>Let's see.
>>
>>[1] Source is attached and I'd like to contribute it to the sandbox, esp 
>>if someone can validate that what it's doing is reasonable and useful.
>>
> 
> great :-)
> 
>>[4] Here's source in HTML:
>>
>>http://www.searchmorph.com/pub/ngramspeller/src-html/org/apache/lucene/spell/NGramSpeller.html#line.152
>>
> 
> could you put the current version of your code on that website as a java

Weblog entry updated:

http://searchmorph.com/weblog/index.php?id=23

To link to source code:

http://www.searchmorph.com/pub/ngramspeller/NGramSpeller.java

> source also? At least until it's in the lucene sandbox.
> 
> 
> I created an ngram index on one of my indexes and think I found an issue
> in the indexing code:
> 
> There is an option -f to specify the field on which the ngram index will
> be created. 
> However there is no code to restrict the term enumeration on this field.
> 
> So instead of 
> 		final TermEnum te = r.terms();
> i'd suggest
> 		final TermEnum te = r.terms(new Term(field, ""));
> and a check within the loop over the terms if the enumerated term
> still has fieldname field, e.g.
> 			Term t = te.term();
> 			if ( !t.field().equals(field) ) {
> 			    break;
> 			}
> 
> otherwise you loop over all terms in all fields.

Great suggestion and thanks for that idiom - I should know such things 
by now. To clarify the "issue", it's just a performance one, not other 
functionality...anyway I put in the code - and to be scientific I 
benchmarked it two times before the change and two times after - and the 
results were surprisingly the same both times (1:45 to 1:50 with an index 
that takes up > 200MB). Probably there are cases where this will run 
faster, and the code seems more "correct" now so it's in.



> 
> 
> An interesting application of this might be an ngram-Index enhanced version
> of the FuzzyQuery. While this introduces more complexity on the indexing
> side, it might be a large speedup for fuzzy searches.

I was also thinking of reviewing the list to see if anyone had done a "Jaro 
Winkler" fuzzy query yet and doing that....

> 


Thanks,
  Dave

> Morus
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene

Posted by Morus Walter <mo...@tanto.de>.
Hi David,
> 
> Based on this mail I wrote a "ngram speller" for Lucene. It runs in 2 
> phases. First you build a "fast lookup index" as mentioned above. Then 
> to correct a word you do a query in this index based on the ngrams in 
> the misspelled word.
> 
> Let's see.
> 
> [1] Source is attached and I'd like to contribute it to the sandbox, esp 
> if someone can validate that what it's doing is reasonable and useful.
> 
great :-)
> 
> [4] Here's source in HTML:
> 
> http://www.searchmorph.com/pub/ngramspeller/src-html/org/apache/lucene/spell/NGramSpeller.html#line.152
> 
could you put the current version of your code on that website as a java
source also? At least until it's in the lucene sandbox.


I created an ngram index on one of my indexes and think I found an issue
in the indexing code:

There is an option -f to specify the field on which the ngram index will
be created. 
However there is no code to restrict the term enumeration on this field.

So instead of 
		final TermEnum te = r.terms();
i'd suggest
		final TermEnum te = r.terms(new Term(field, ""));
and a check within the loop over the terms if the enumerated term
still has fieldname field, e.g.
			Term t = te.term();
			if ( !t.field().equals(field) ) {
			    break;
			}

otherwise you loop over all terms in all fields.


An interesting application of this might be an ngram-Index enhanced version
of the FuzzyQuery. While this introduces more complexity on the indexing
side, it might be a large speedup for fuzzy searches.

Morus

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Hits.doc(x) and range queries

Posted by ro...@xemaps.com.
Hi guys!

I've posted previously that Hits.doc(x) was taking a long time.  Turns out it
has to do with a date range in our query.  We usually do date ranges like this:
Date:[(lucene date field) - (lucene date field)]

Sometimes the begin date is "000000000" which is what we get from
DateField.dateToString( new Date( 0 ) ).

This is when getting our search results from the Hits object takes an absurd
amount of time.  It's usually each time the Hits object attempts to get more
results from an IndexSearcher ( aka, every 100? ).

It also takes up more memory...

I was wondering why it affects the search so much even though we're only
returning 350 or so results.  Does the QueryParser do something similar to the
DateFilter on range queries?  Would it be better to use a DateFilter?

We're using Lucene 1.2 (with plans to upgrade).  Do newer versions of Lucene
have this problem?

Roy.

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene

Posted by Doug Cutting <cu...@apache.org>.
Andrzej Bialecki wrote:
> I was wondering about the way you build the n-gram queries. You 
> basically don't care about their position in the input term. Originally 
> I thought about using PhraseQuery with a slop - however, after checking 
> the source of PhraseQuery I realized that this probably wouldn't be that 
> fast... You use BooleanQuery and start/end boosts instead, which may 
> give similar results in the end but much cheaper.

Sloppy PhraseQuery's are slower than BooleanQueries, but not horribly 
slower.  The problem is that they don't handle the case where phrase 
elements are missing altogether, while a BooleanQuery does.  So what you 
really need is maybe a variation of a sloppy PhraseQuery that scores 
matches that do not contain all of the terms...
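
A rough sketch (not the actual NGramSpeller source) of the BooleanQuery-of-ngrams
approach being discussed, with the start/end boosts; the "gram3" field name and
boost values are invented, and BooleanClause.Occur.SHOULD is the later equivalent
of the older add(query, false, false) form:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.TermQuery;

    /** Sketch: one optional clause per ngram; any clause may be missing from a match. */
    public class NGramQueryBuilder {

        public static BooleanQuery build(String word, int n) {
            BooleanQuery query = new BooleanQuery();
            for (int i = 0; i + n <= word.length(); i++) {
                TermQuery clause = new TermQuery(new Term("gram3", word.substring(i, i + n)));
                if (i == 0) {
                    clause.setBoost(2.0f);               // boost the leading ngram ("bstart")
                } else if (i + n == word.length()) {
                    clause.setBoost(1.5f);               // boost the trailing ngram ("bend")
                }
                query.add(clause, BooleanClause.Occur.SHOULD);  // optional, unlike a phrase
            }
            return query;
        }
    }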

Doug

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene

Posted by David Spencer <da...@tropo.com>.
Andrzej Bialecki wrote:

> David Spencer wrote:
> 
>>> ...or prepare in advance a fast lookup index - split all existing
>>> terms to bi- or trigrams, create a separate lookup index, and then
>>> simply for each term ask a phrase query (phrase = all n-grams from
>>> an input term), with a slop > 0, to get similar existing terms.
>>> This should be fast, and you could provide a "did you mean"
>>> function too...
>>>
>>
>> Based on this mail I wrote a "ngram speller" for Lucene. It runs in 2
>> phases. First you build a "fast lookup index" as mentioned above.
>> Then to correct a word you do a query in this index based on the
>> ngrams in the misspelled word.
>>
> 
> The background for this suggestion was that I was playing some time ago 
> with a Luke plugin that builds various sorts of ancillary indexes, but 
> then I never finished it... Kudos for actually making it work ;-)

Sure, it was a fun little edge project. For the most part the code was 
done last week right after this thread appeared, but it always takes a 
while to get it from 95 to 100%.

> 
>> [1] Source is attached and I'd like to contribute it to the sandbox,
>> esp if someone can validate that what it's doing is reasonable and
>> useful.
> 
> 
> There have been many requests for this or similar functionality in the 
> past, I believe it should go into sandbox.
> 
> I was wondering about the way you build the n-gram queries. You 
> basically don't care about their position in the input term. Originally 
> I thought about using PhraseQuery with a slop - however, after checking 
> the source of PhraseQuery I realized that this probably wouldn't be that 
> fast... You use BooleanQuery and start/end boosts instead, which may 
> give similar results in the end but much cheaper.
> 
> I also wonder how this algorithm would behave for smaller values of 

Sure, I'll try to rebuild the demo w/ lengths 2-5 (and then the query 
page can test any contiguous combo).

> start/end lengths (e.g. 2,3,4). In a sense, the smaller the n-gram 
> length, the more "fuzziness" you introduce, which may or may not be 
> desirable (increased recall at the cost of precision - for small indexes 
> this may be useful from the user's perspective because you will always 
> get a plausible hit, for huge indexes it's a loss).
> 
>>
>> [2] Here's a demo page. I built an ngram index for ngrams of length 3
>>  and 4 based on the existing index I have of approx 100k 
>> javadoc-generated pages. You type in a misspelled word like
>> "recursixe" or whatnot to see what suggestions it returns. Note this
>> is not a normal search index query -- rather this is a test page for
>> spelling corrections.
>>
>> http://www.searchmorph.com/kat/spell.jsp
> 
> 
> Very nice demo! 

Thanks, kinda designed for ngram-nerds if you know what I mean :)

> I bet it's running way faster than the linear search 

Indeed, this is almost zero time, whereas the simple and dumb linear 
search was taking me 10sec. I will have to redo the site's main search 
page so it uses this new code, TBD, prob tomorrow.

> over terms :-), even though you have to build the index in advance. But 
> if you work with static or mostly static indexes this doesn't matter.
> 
>> Based on a subsequent mail in this thread I set boosts for the words
>> in the ngram index. The background is each word (er..term for a given
>>  field) in the orig index is a separate Document in the ngram index.
>> This Doc contains all ngrams (in my test case, like #2 above, of
>> length 3 and 4) of the word. I also set a boost of
>> log(word_freq)/log(num_docs) so that more frequent words will tend
>> to be suggested more often.
> 
> 
> You may want to experiment with 2 <= n <= 5. Some n-gram based 
Yep, will do prob tomorrow.

> techniques use all lengths together, some others use just single length, 
> results also vary depending on the language...
> 
>>
>> I think in "plain" English then the way a word is suggested as a 
>> spelling correction is: - frequently occurring words score higher -
>> words that share more ngrams with the orig word score higher - words
>> that share rare ngrams with the orig word score higher
> 
> 
> I think this is a reasonable heuristics. Reading the code I would 
> present it this way:

ok, thx, will update

> 
> - words that share more ngrams with the orig word score higher, and
>   words that share rare ngrams with the orig word score higher
>   (as a natural consequence of using BooleanQuery),
> 
> - and, frequently occurring words score higher (as a consequence of using
>   per-Document boosts),
> 
> - from reading the source code I see that you use Levenshtein distance
>   to prune the resultset of too long/too short results,
> 
> I think also that because you don't use the positional information about 
>  the input n-grams you may be getting some really weird hits.

Good point, though I haven't seen this yet. Might be due to the prefix 
boost and maybe some Markov chain magic tending to only show reasonable 
words.

> You could 
> prune them by simply checking if you find a (threshold) of input ngrams 
> in the right sequence in the found terms. This shouldn't be too costly 

Good point, I'll try to add that in as an optional parameter.

> because you operate on a small result set.
> 
>>
>> [6]
>>
>> If people want to vote me in as a committer to the sandbox then I can
> 
> 
> Well, someone needs to maintain the code after all... ;-)
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene

Posted by Andrzej Bialecki <ab...@getopt.org>.
David Spencer wrote:

>> ...or prepare in advance a fast lookup index - split all existing
>> terms to bi- or trigrams, create a separate lookup index, and then
>> simply for each term ask a phrase query (phrase = all n-grams from
>> an input term), with a slop > 0, to get similar existing terms.
>> This should be fast, and you could provide a "did you mean"
>> function too...
>> 
> 
> Based on this mail I wrote a "ngram speller" for Lucene. It runs in 2
> phases. First you build a "fast lookup index" as mentioned above.
> Then to correct a word you do a query in this index based on the
> ngrams in the misspelled word.
> 

The background for this suggestion was that I was playing some time ago 
with a Luke plugin that builds various sorts of ancillary indexes, but 
then I never finished it... Kudos for actually making it work ;-)

> [1] Source is attached and I'd like to contribute it to the sandbox,
> esp if someone can validate that what it's doing is reasonable and
> useful.

There have been many requests for this or similar functionality in the 
past, I believe it should go into sandbox.

I was wondering about the way you build the n-gram queries. You 
basically don't care about their position in the input term. Originally 
I thought about using PhraseQuery with a slop - however, after checking 
the source of PhraseQuery I realized that this probably wouldn't be that 
fast... You use BooleanQuery and start/end boosts instead, which may 
give similar results in the end but much cheaper.

I also wonder how this algorithm would behave for smaller values of 
start/end lengths (e.g. 2,3,4). In a sense, the smaller the n-gram 
length, the more "fuzziness" you introduce, which may or may not be 
desirable (increased recall at the cost of precision - for small indexes 
this may be useful from the user's perspective because you will always 
get a plausible hit, for huge indexes it's a loss).

> 
> [2] Here's a demo page. I built an ngram index for ngrams of length 3
>  and 4 based on the existing index I have of approx 100k 
> javadoc-generated pages. You type in a misspelled word like
> "recursixe" or whatnot to see what suggestions it returns. Note this
> is not a normal search index query -- rather this is a test page for
> spelling corrections.
> 
> http://www.searchmorph.com/kat/spell.jsp

Very nice demo! I bet it's running way faster than the linear search 
over terms :-), even though you have to build the index in advance. But 
if you work with static or mostly static indexes this doesn't matter.

> Based on a subsequent mail in this thread I set boosts for the words
> in the ngram index. The background is each word (er..term for a given
>  field) in the orig index is a separate Document in the ngram index.
> This Doc contains all ngrams (in my test case, like #2 above, of
> length 3 and 4) of the word. I also set a boost of
> log(word_freq)/log(num_docs) so that more frequent words will tend
> to be suggested more often.

You may want to experiment with 2 <= n <= 5. Some n-gram based 
techniques use all lengths together, some others use just single length, 
results also vary depending on the language...
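
A rough sketch of the per-word Document with the log-frequency boost described
in the quoted text, using the 1.x-era Field.Keyword/Field.UnStored helpers;
the field names are invented and this is not the actual NGramSpeller source:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    /** Sketch: one Document per word in the ngram lookup index. */
    public class NGramDocBuilder {

        public static Document build(String word, int docFreq, int numDocs) {
            Document doc = new Document();
            doc.add(Field.Keyword("word", word));               // the suggestion itself
            StringBuffer grams = new StringBuffer();
            int n = 3;
            for (int i = 0; i + n <= word.length(); i++) {      // all 3-grams of the word
                grams.append(word.substring(i, i + n)).append(' ');
            }
            doc.add(Field.UnStored("gram3", grams.toString()));
            // more frequent words get a higher boost, so they tend to be suggested more
            // often (naive: a word with docFreq 1 gets boost 0 in this form)
            doc.setBoost((float) (Math.log(docFreq) / Math.log(numDocs)));
            return doc;
        }
    }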

> 
> I think in "plain" English then the way a word is suggested as a 
> spelling correction is: - frequently occurring words score higher -
> words that share more ngrams with the orig word score higher - words
> that share rare ngrams with the orig word score higher

I think this is a reasonable heuristics. Reading the code I would 
present it this way:

- words that share more ngrams with the orig word score higher, and
   words that share rare ngrams with the orig word score higher
   (as a natural consequence of using BooleanQuery),

- and, frequently occurring words score higher (as a consequence of using
   per-Document boosts),

- from reading the source code I see that you use Levenshtein distance
   to prune the resultset of too long/too short results,

I think also that because you don't use the positional information about 
  the input n-grams you may be getting some really weird hits. You could 
prune them by simply checking if you find a (threshold) of input ngrams 
in the right sequence in the found terms. This shouldn't be too costly 
because you operate on a small result set.
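
Something like this quick check, perhaps (just a sketch of the idea, not code
from the patch): count how many of the input ngrams appear in order in the
candidate and drop candidates below a threshold:

    /** Sketch: does a candidate contain at least `threshold` of the input ngrams, in order? */
    public class SequenceCheck {

        public static boolean inSequence(String input, String candidate, int n, double threshold) {
            int total = 0, found = 0, from = 0;
            for (int i = 0; i + n <= input.length(); i++) {
                total++;
                String gram = input.substring(i, i + n);
                int pos = candidate.indexOf(gram, from);  // must appear after the previous gram
                if (pos >= 0) {
                    found++;
                    from = pos + 1;
                }
            }
            return total > 0 && ((double) found / total) >= threshold;
        }

        public static void main(String[] args) {
            System.out.println(inSequence("conts", "const", 3, 0.3)); // only "con" found -> 1/3
        }
    }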

> 
> [6]
> 
> If people want to vote me in as a committer to the sandbox then I can

Well, someone needs to maintain the code after all... ;-)

-- 
Best regards,
Andrzej Bialecki

-------------------------------------------------
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-------------------------------------------------
FreeBSD developer (http://www.freebsd.org)


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Re: [VOTE] David Spencer as Lucene Sandbox committer

Posted by Daniel Naber <da...@t-online.de>.
On Wednesday 15 September 2004 01:44, Otis Gospodnetic wrote:

> And +1 from me.  Dave has submitted several nice pieces of code over
> the years.

+1


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


Re: [VOTE] David Spencer as Lucene Sandbox committer

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
If Doug and Otis give a +1 then that is enough for me!  :)

	+1


On Sep 14, 2004, at 7:44 PM, Otis Gospodnetic wrote:

> And +1 from me.  Dave has submitted several nice pieces of code over
> the years.
>
> Otis
>
> --- Doug Cutting <cu...@apache.org> wrote:
>
>> David Spencer wrote:
>>> If people want to vote me in as a committer to the sandbox then I
>> can
>>> check this code in.
>>
>> +1
>>
>> Doug
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


Re: [VOTE] David Spencer as Lucene Sandbox committer

Posted by Jack Park <ja...@thinkalong.com>.
+1 if my vote counts for anything.
Jack

Otis Gospodnetic wrote:

> And +1 from me.  Dave has submitted several nice pieces of code over
> the years.
> 
> Otis
> 
> --- Doug Cutting <cu...@apache.org> wrote:
> 
> 
>>David Spencer wrote:
>>
>>>If people want to vote me in as a committer to the sandbox then I
>>
>>can 
>>
>>>check this code in.
>>
>>+1
>>
>>Doug
> 
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-dev-help@jakarta.apache.org
> 
> 



---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


Re: [VOTE] David Spencer as Lucene Sandbox committer

Posted by Christoph Goller <go...@detego-software.de>.
Otis Gospodnetic wrote:
> And +1 from me.  Dave has submitted several nice pieces of code over
> the years.

+1


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


[VOTE] David Spencer as Lucene Sandbox committer

Posted by Otis Gospodnetic <ot...@yahoo.com>.
And +1 from me.  Dave has submitted several nice pieces of code over
the years.

Otis

--- Doug Cutting <cu...@apache.org> wrote:

> David Spencer wrote:
> > If people want to vote me in as a committer to the sandbox then I
> can 
> > check this code in.
> 
> +1
> 
> Doug


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene

Posted by Doug Cutting <cu...@apache.org>.
David Spencer wrote:
> If people want to vote me in as a committer to the sandbox then I can 
> check this code in.

+1

Doug

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene

Posted by David Spencer <da...@tropo.com>.
Andrzej Bialecki wrote:

> Aad Nales wrote:
> 
>> David,
>>
>> Perhaps I misunderstand something so please correct me if I do. I used
>> http://www.searchmorph.com/kat/spell.jsp to look for conts without
>> changing any of the default values. What I got as results did not
>> include 'const' which has quite a high frequency in your index and
> 
> 
> ??? how do you know that? Remember, this is an index of _Java_docs, and 
> "const" is not a Java keyword.

I added a line of output to the right column under the 'details' box. 
"const" appears 216 times in the index (out of 96k docs), thus it is 
indeed kinda rare.

http://www.searchmorph.com/kat/spell.jsp?s=const

> 
>> should have a pretty low levenshtein distance. Any idea what causes this
>> behavior?
> 
> 
> 
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene

Posted by David Spencer <da...@tropo.com>.
Aad Nales wrote:

> By trying: if you type const you will find that it returns 216 hits. The
> third hit sports 'const' as a term (space separated and all). I would expect
> 'conts' to return with const as well. But again I might be mistaken. I
> am now trying to figure out what the problem might be: 
> 
> 1. my expectations (most likely ;-)
> 2. something in the code..



I enhanced the code to store simple transpositions also and I 
regenerated my site w/ ngrams from 2 to 5 chars. If you set the 
transposition boost up to 10 then "const" is returned 2nd...

http://www.searchmorph.com/kat/spell.jsp?s=conts&min=2&max=5&maxd=5&maxr=10&bstart=2.0&bend=1.0&btranspose=10.0&popular=1

> 
> -----Original Message-----
> From: Andrzej Bialecki [mailto:ab@getopt.org] 
> Sent: Wednesday, 15 September, 2004 12:23
> To: Lucene Users List
> Subject: Re: NGramSpeller contribution -- Re: combining open office
> spellchecker with Lucene
> 
> 
> Aad Nales wrote:
> 
> 
>>David,
>>
>>Perhaps I misunderstand something so please correct me if I do. I used
> 
> 
>>http://www.searchmorph.com/kat/spell.jsp to look for conts without 
>>changing any of the default values. What I got as results did not 
>>include 'const' which has quite a high frequency in your index and
> 
> 
> ??? how do you know that? Remember, this is an index of _Java_docs, and 
> "const" is not a Java keyword.
> 
> 
>>should have a pretty low levenshtein distance. Any idea what causes 
>>this behavior?
> 
> 
> 
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene

Posted by David Spencer <da...@tropo.com>.
Andrzej Bialecki wrote:

> David Spencer wrote:
> 
>> To restate the question for a second.
>>
>> The misspelled word is: "conts".
>> The suggestion expected is "const", which seems reasonable enough as 
>> it's just a transposition away, thus the string distance is low.
>>
>> But - I guess the problem w/ the algorithm is that for short words 
>> like this, with transpositions, the two words won't share many ngrams.
>>
>> Just looking at 3grams...
>>
>> conts -> con ont nts
>> const -> con ons nst
>>
>> Thus they just share 1 3gram, which is why it scores so low. This 
>> is an interesting issue, how to tune the algorithm so that it might 
>> return words this close higher.
> 
> 
> If you added 2-grams, then it would look like this (constructing also 
> special start/end grams):

Oh cute trick to indicate prefixes and suffixes.
Anyway, as per prev post I rebuilt the index w/ ngrams from length 2 to 5, 
and also stored transpositions, and w/ appropriate boosts :) then "const" 
is returned 2nd.

http://www.searchmorph.com/kat/spell.jsp?s=conts&min=2&max=5&maxd=5&maxr=10&bstart=2.0&bend=1.0&btranspose=10.0&popular=1

> 
> conts -> _c co on nt ts s_
> const -> _c co on ns st t_
> 
> which gives 50% of overlap.
> 
> In another system that I designed we were using a combination of 2-4 
> grams, albeit for a slightly different purpose, so in this case it would 
> be:
> 
> conts:
>     _c co on nt ts s_, _co con ont nts ts_, _con cont onts nts_
> 
> const:
>     _c co on ns st t_, _co con ons nst st_, _con cons onst nst_
> 
> and the overlap is 40%.
> 
> 
>>
>> I guess one way is to add all simple transpositions to the lookup 
>> table (the "ngram index") so that these could easily be found, with 
>> the heuristic that "a frequent way of misspelling words is to 
>> transpose two adjacent letters".
> 
> 
> Yes, sounds like a good idea. Even though it increases the size of the 
> lookup index, it still beats using the linear search...
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene

Posted by Andrzej Bialecki <ab...@getopt.org>.
David Spencer wrote:
> To restate the question for a second.
> 
> The misspelled word is: "conts".
> The suggestion expected is "const", which seems reasonable enough as 
> it's just a transposition away, thus the string distance is low.
> 
> But - I guess the problem w/ the algorithm is that for short words like 
> this, with transpositions, the two words won't share many ngrams.
> 
> Just looking at 3grams...
> 
> conts -> con ont nts
> const -> con ons nst
> 
> Thus they share just one 3-gram, which is why it scores so low. This is 
> an interesting issue, how to tune the algorithm so that it might return 
> words this close higher.

If you added 2-grams, then it would look like this (constructing also 
special start/end grams):

conts -> _c co on nt ts s_
const -> _c co on ns st t_

which gives 50% of overlap.

In another system that I designed we were using a combination of 2-4 
grams, albeit for a slightly different purpose, so in this case it would be:

conts:
	_c co on nt ts s_, _co con ont nts ts_, _con cont onts nts_

const:
	_c co on ns st t_, _co con ons nst st_, _con cons onst nst_

and the overlap is 40%.
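
For what it's worth, producing those boundary-marked grams and measuring 
the overlap takes only a few lines. An untested sketch (class and method 
names are made up for illustration):

import java.util.HashSet;
import java.util.Set;

public class GramOverlap {

    // 2-grams of "conts" with start/end markers: _c co on nt ts s_
    static Set<String> grams(String word, int n) {
        String padded = "_" + word + "_";
        Set<String> out = new HashSet<String>();
        for (int i = 0; i + n <= padded.length(); i++) {
            out.add(padded.substring(i, i + n));
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> a = grams("conts", 2);
        Set<String> b = grams("const", 2);
        Set<String> shared = new HashSet<String>(a);
        shared.retainAll(b); // grams the two words have in common
        // prints "3 of 6 grams shared", i.e. the 50% overlap mentioned above
        System.out.println(shared.size() + " of " + a.size() + " grams shared");
    }
}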


> 
> I guess one way is to add all simple transpositions to the lookup table 
> (the "ngram index") so that these could easily be found, with the 
> heuristic that "a frequent way of misspelling words is to transpose two 
> adjacent letters".

Yes, sounds like a good idea. Even though it increases the size of the 
lookup index, it still beats using the linear search...

-- 
Best regards,
Andrzej Bialecki

-------------------------------------------------
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-------------------------------------------------
FreeBSD developer (http://www.freebsd.org)


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene

Posted by David Spencer <da...@tropo.com>.
Aad Nales wrote:

> By trying: if you type const you will find that it returns 216 hits. The
> third sports 'const' as a term (space seperated and all). I would expect
> 'conts' to return with const as well. But again I might be mistaken. I
> am now trying to figure what the problem might be: 
> 
> 1. my expectations (most likely ;-)
> 2. something in the code..


Good question.

If I use the form at the bottom of the page and ask for more results, 
the suggestion of "const" does eventually show up - 99th however(!).

http://www.searchmorph.com/kat/spell.jsp?s=conts&min=3&max=4&maxd=5&maxr=1000&bstart=2.0&bend=1.0

Even boosting the prefix match from 2.0 to 10.0 only changes the ranking 
a few slots.
http://www.searchmorph.com/kat/spell.jsp?s=conts&min=3&max=4&maxd=5&maxr=1000&bstart=10.0&bend=1.0

To restate the question for a second.

The misspelled word is: "conts".
The suggestion expected is "const", which seems reasonable enough as 
it's just a transposition away, thus the string distance is low.

But - I guess the problem w/ the algorithm is that for short words like 
this, with transpositions, the two words won't share many ngrams.

Just looking at 3grams...

conts -> con ont nts
const -> con ons nst

Thus they share just one 3-gram, which is why it scores so low. This is 
an interesting issue, how to tune the algorithm so that it might return 
words this close higher.

I guess one way is to add all simple transpositions to the lookup table 
(the "ngram index") so that these could easily be found, with the 
heuristic that "a frequent way of misspelling words is to transpose two 
adjacent letters".

Based on other mails I'll make some additions to the code and will 
report back if anything of interest changes here.




> 
> -----Original Message-----
> From: Andrzej Bialecki [mailto:ab@getopt.org] 
> Sent: Wednesday, 15 September, 2004 12:23
> To: Lucene Users List
> Subject: Re: NGramSpeller contribution -- Re: combining open office
> spellchecker with Lucene
> 
> 
> Aad Nales wrote:
> 
> 
>>David,
>>
>>Perhaps I misunderstand something so please correct me if I do. I used
> 
> 
>>http://www.searchmorph.com/kat/spell.jsp to look for conts without 
>>changing any of the default values. What I got as results did not 
>>include 'const' which has quite a high frequency in your index and
> 
> 
> ??? how do you know that? Remember, this is an index of _Java_docs, and 
> "const" is not a Java keyword.
> 
> 
>>should have a pretty low Levenshtein distance. Any idea what causes 
>>this behavior?
> 
> 
> 
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


RE: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene

Posted by Aad Nales <aa...@rotterdam-cs.com>.
By trying: if you type const you will find that it returns 216 hits. The
third sports 'const' as a term (space separated and all). I would expect
'conts' to return with const as well. But again I might be mistaken. I
am now trying to figure what the problem might be: 

1. my expectations (most likely ;-)
2. something in the code..

-----Original Message-----
From: Andrzej Bialecki [mailto:ab@getopt.org] 
Sent: Wednesday, 15 September, 2004 12:23
To: Lucene Users List
Subject: Re: NGramSpeller contribution -- Re: combining open office
spellchecker with Lucene


Aad Nales wrote:

> David,
> 
> Perhaps I misunderstand something so please correct me if I do. I used

> http://www.searchmorph.com/kat/spell.jsp to look for conts without 
> changing any of the default values. What I got as results did not 
> include 'const' which has quite a high frequency in your index and

??? how do you know that? Remember, this is an index of _Java_docs, and 
"const" is not a Java keyword.

> should have a pretty low Levenshtein distance. Any idea what causes 
> this behavior?



-- 
Best regards,
Andrzej Bialecki

-------------------------------------------------
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-------------------------------------------------
FreeBSD developer (http://www.freebsd.org)


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org





---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene

Posted by Andrzej Bialecki <ab...@getopt.org>.
Aad Nales wrote:

> David,
> 
> Perhaps I misunderstand something so please correct me if I do. I used
> http://www.searchmorph.com/kat/spell.jsp to look for conts without
> changing any of the default values. What I got as results did not
> include 'const' which has quite a high frequency in your index and

??? how do you know that? Remember, this is an index of _Java_docs, and 
"const" is not a Java keyword.

> should have a pretty low Levenshtein distance. Any idea what causes this
> behavior?



-- 
Best regards,
Andrzej Bialecki

-------------------------------------------------
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-------------------------------------------------
FreeBSD developer (http://www.freebsd.org)


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


RE: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene

Posted by Aad Nales <aa...@rotterdam-cs.com>.
David,

Perhaps I misunderstand something so please correct me if I do. I used
http://www.searchmorph.com/kat/spell.jsp to look for conts without
changing any of the default values. What I got as results did not
include 'const' which has quite a high frequency in your index and
should have a pretty low Levenshtein distance. Any idea what causes this
behavior?

cheers,
Aad

-----Original Message-----
From: David Spencer [mailto:dave-lucene-user@tropo.com] 
Sent: Tuesday, 14 September, 2004 21:23
To: Lucene Users List
Subject: NGramSpeller contribution -- Re: combining open office
spellchecker with Lucene


Andrzej Bialecki wrote:

> David Spencer wrote:
> 
>>
>> I can/should send the code out. The logic is that for any terms in a
>> query that have zero matches, go thru all the terms(!) and calculate 
>> the Levenshtein string distance, and return the best matches. A more 
>> intelligent way of doing this is to instead look for terms that also 
>> match on the 1st "n" (prob 3) chars.
> 
> 
> ...or prepare in advance a fast lookup index - split all existing terms 
> to bi- or trigrams, create a separate lookup index, and then simply for 
> each term ask a phrase query (phrase = all n-grams from an input term), 
> with a slop > 0, to get similar existing terms. This should be fast, and 
> you could provide a "did you mean" function too...
> 

Based on this mail I wrote a "ngram speller" for Lucene. It runs in 2 
phases. First you build a "fast lookup index" as mentioned above. Then 
to correct a word you do a query in this index based on the ngrams in 
the misspelled word.

Let's see.

[1] Source is attached and I'd like to contribute it to the sandbox, esp 
if someone can validate that what it's doing is reasonable and useful.

[2] Here's a demo page. I built an ngram index for ngrams of length 3 
and 4 based on the existing index I have of approx 100k 
javadoc-generated pages. You type in a misspelled word like "recursixe" 
or whatnot to see what suggestions it returns. Note this is not a normal 
search index query -- rather this is a test page for spelling corrections.

http://www.searchmorph.com/kat/spell.jsp

[3] Here's the javadoc:

http://www.searchmorph.com/pub/ngramspeller/org/apache/lucene/spell/NGramSpeller.html

[4] Here's source in HTML:

http://www.searchmorph.com/pub/ngramspeller/src-html/org/apache/lucene/spell/NGramSpeller.html#line.152

[5] A few more details:

Based on a subsequent mail in this thread I set boosts for the words in 
the ngram index. The background is each word (er..term for a given 
field) in the orig index is a separate Document in the ngram index. This 
Doc contains all ngrams (in my test case, like #2 above, of length 3 and 
4) of the word. I also set a boost of log(word_freq)/log(num_docs) so 
that more frequent words will tend to be suggested more often.

I think in "plain" English then the way a word is suggested as a 
spelling correction is:
- frequently occurring words score higher
- words that share more ngrams with the orig word score higher
- words that share rare ngrams with the orig word score higher

[6]

If people want to vote me in as a committer to the sandbox then I can 
check this code in - though again, I'd appreciate feedback.

thx,
  Dave






---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


NGramSpeller contribution -- Re: combining open office spellchecker with Lucene

Posted by David Spencer <da...@tropo.com>.
Andrzej Bialecki wrote:

> David Spencer wrote:
> 
>>
>> I can/should send the code out. The logic is that for any terms in a 
>> query that have zero matches, go thru all the terms(!) and calculate 
>> the Levenshtein string distance, and return the best matches. A more 
>> intelligent way of doing this is to instead look for terms that also 
>> match on the 1st "n" (prob 3) chars.
> 
> 
> ...or prepare in advance a fast lookup index - split all existing terms 
> to bi- or trigrams, create a separate lookup index, and then simply for 
> each term ask a phrase query (phrase = all n-grams from an input term), 
> with a slop > 0, to get similar existing terms. This should be fast, and 
> you could provide a "did you mean" function too...
> 

Based on this mail I wrote a "ngram speller" for Lucene. It runs in 2 
phases. First you build a "fast lookup index" as mentioned above. Then 
to correct a word you do a query in this index based on the ngrams in 
the misspelled word.

Let's see.

[1] Source is attached and I'd like to contribute it to the sandbox, esp 
if someone can validate that what it's doing is reasonable and useful.

[2] Here's a demo page. I built an ngram index for ngrams of length 3 
and 4 based on the existing index I have of approx 100k 
javadoc-generated pages. You type in a misspelled word like "recursixe" 
or whatnot to see what suggestions it returns. Note this is not a normal 
search index query -- rather this is a test page for spelling corrections.

http://www.searchmorph.com/kat/spell.jsp

[3] Here's the javadoc:

http://www.searchmorph.com/pub/ngramspeller/org/apache/lucene/spell/NGramSpeller.html

[4] Here's source in HTML:

http://www.searchmorph.com/pub/ngramspeller/src-html/org/apache/lucene/spell/NGramSpeller.html#line.152

[5] A few more details:

Based on a subsequent mail in this thread I set boosts for the words in 
the ngram index. The background is each word (er..term for a given 
field) in the orig index is a separate Document in the ngram index. This 
Doc contains all ngrams (in my test case, like #2 above, of length 3 and 
4) of the word. I also set a boost of log(word_freq)/log(num_docs) so 
that more frequent words will tend to be suggested more often.

I think in "plain" English then the way a word is suggested as a 
spelling correction is:
- frequently occurring words score higher
- words that share more ngrams with the orig word score higher
- words that share rare ngrams with the orig word score higher
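
In code, the index-building phase boils down to roughly the following. 
This is only an untested sketch of the idea - the field names, analyzer 
and path are illustrative, not the actual NGramSpeller source (see 
[3]/[4] for that):

import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class NGramIndexSketch {

    // one Document per term from the original index
    static Document forWord(String word, int docFreq, int numDocs) {
        Document doc = new Document();
        doc.add(Field.Keyword("word", word));  // the correctly spelled term itself
        for (int n = 3; n <= 4; n++) {         // same ngram lengths as the demo index
            for (int i = 0; i + n <= word.length(); i++) {
                doc.add(Field.Keyword("gram" + n, word.substring(i, i + n)));
            }
        }
        // more frequent words get a higher boost and so are suggested more often
        doc.setBoost((float) (Math.log(docFreq) / Math.log(numDocs)));
        return doc;
    }

    public static void main(String[] args) throws Exception {
        IndexWriter writer =
            new IndexWriter("/path/to/ngram-index", new WhitespaceAnalyzer(), true);
        writer.addDocument(forWord("recursive", 1200, 96000));  // made-up frequencies
        writer.close();
    }
}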

[6]

If people want to vote me in as a committer to the sandbox then I can 
check this code in - though again, I'd appreciate feedback.

thx,
  Dave




Re: combining open office spellchecker with Lucene

Posted by David Spencer <da...@tropo.com>.
Andrzej Bialecki wrote:

> David Spencer wrote:
> 
>>
>> I can/should send the code out. The logic is that for any terms in a 
>> query that have zero matches, go thru all the terms(!) and calculate 
>> the Levenshtein string distance, and return the best matches. A more 
>> intelligent way of doing this is to instead look for terms that also 
>> match on the 1st "n" (prob 3) chars.
> 
> 
> ...or prepare in advance a fast lookup index - split all existing terms 
> to bi- or trigrams, create a separate lookup index, and then simply for 
> each term ask a phrase query (phrase = all n-grams from an input term), 
> with a slop > 0, to get similar existing terms. This should be fast, and 
> you could provide a "did you mean" function too...

Sounds interesting/fun but I'm not sure if I'm following exactly.

Let's talk thru the trigram index case.
Are you saying that for every trigram in every word there will be a 
mapping of trigram -> term?
Thus if "recursive" is in the (orig) index then we'd create entries like:

rec -> recursive
ecu -> ...
cur -> ...
urs -> ...
rsi -> ...
siv -> ...
ive -> ...

And so on for all terms in the orig index.
OK fine.
But now the user types in a query like "recursivz".
What's the algorithm - obviously I guess take all trigrams in the bad 
term and go thru the trigram-index, but there will be lots of 
suggestions. Now what - use string distance to score them? I guess that 
makes sense - plz confirm if I understand.... And so I guess the point 
here is we precalculate the trigram->term mappings to avoid an expensive 
traversal of all terms in an index, but we still use string distance as 
a 2nd pass (and prob should force the matches to always match on the 1st 
n (3) chars, using the heuristic that people can usually start spelling a 
word correctly).
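
A rough, untested sketch of that two-pass idea - a precomputed trigram 
table plus a prefix filter; the real second pass would rank the surviving 
candidates by string (e.g. Levenshtein) distance:

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class TrigramLookup {

    // trigram -> terms containing it, built once from the original index's terms
    static Map<String, Set<String>> table = new HashMap<String, Set<String>>();

    static void addTerm(String term) {
        for (int i = 0; i + 3 <= term.length(); i++) {
            String g = term.substring(i, i + 3);
            Set<String> terms = table.get(g);
            if (terms == null) {
                terms = new HashSet<String>();
                table.put(g, terms);
            }
            terms.add(term);
        }
    }

    // candidates sharing a trigram and the first 3 chars with the misspelled word;
    // rank these by string distance in a second pass
    static Set<String> candidates(String bad) {
        Set<String> out = new HashSet<String>();
        for (int i = 0; i + 3 <= bad.length(); i++) {
            Set<String> terms = table.get(bad.substring(i, i + 3));
            if (terms == null) continue;
            for (String t : terms) {
                if (t.regionMatches(0, bad, 0, 3)) out.add(t);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        addTerm("recursive");
        addTerm("recursion");
        addTerm("cursor");
        // suggests recursive and recursion, but not cursor (prefix filter)
        System.out.println(candidates("recursivz"));
    }
}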


> 


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Re: combining open office spellchecker with Lucene

Posted by Andrzej Bialecki <ab...@getopt.org>.
David Spencer wrote:
> 
> I can/should send the code out. The logic is that for any terms in a 
> query that have zero matches, go thru all the terms(!) and calculate the 
> Levenshtein string distance, and return the best matches. A more 
> intelligent way of doing this is to instead look for terms that also 
> match on the 1st "n" (prob 3) chars.

...or prepare in advance a fast lookup index - split all existing terms 
to bi- or trigrams, create a separate lookup index, and then simply for 
each term ask a phrase query (phrase = all n-grams from an input term), 
with a slop > 0, to get similar existing terms. This should be fast, and 
you could provide a "did you mean" function too...

-- 
Best regards,
Andrzej Bialecki

-------------------------------------------------
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-------------------------------------------------
FreeBSD developer (http://www.freebsd.org)


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Re: combining open office spellchecker with Lucene

Posted by David Spencer <da...@tropo.com>.
Aad Nales wrote:

> Hi All,
> 
> Before I start reinventing wheels I would like to do a short check to
> see if anybody else has already tried this. A customer has requested us
> to look into the possibility to perform a spell check on queries. So far
> the most promising way of doing this seems to be to create an Analyzer
> based on the spellchecker of OpenOffice. My question is: "has anybody
> tried this before?" 

I did a WordNet/synonym query expander. Search for "WordNet" on this 
page. Of interest is that it stores the WordNet info in a separate Lucene 
index, since at its essence an index is just a database.

http://jakarta.apache.org/lucene/docs/lucene-sandbox/

Also, another variation, is to instead spell based on what terms are in 
the index, not what an external dictionary says. I've done this on my 
experimental site searchmorph.com in a dumb/inefficient way. Here's an 
example:

http://www.searchmorph.com/kat/search.jsp?s=recursivz

After you click above it takes ~10sec as it produces terms close to 
"recursivz". Opps - looking at the output, it looks like the same word 
is suggest multiple times - ouch - I must be considering all fields, not 
just the contents field. TBD is fixing this. (or no wonder it's so slow :))

I can/should send the code out. The logic is that for any terms in a 
query that have zero matches, go thru all the terms(!) and calculate the 
Levenshtein string distance, and return the best matches. A more 
intelligent way of doing this is to instead look for terms that also 
match on the 1st "n" (prob 3) chars.




> 
> Cheers,
> Aad
> 
> 
> --
> Aad Nales
> aad.nales@rotterdam-cs.com, +31-(0)6 54 207 340 
> 
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org