Posted to java-user@lucene.apache.org by David Spencer <da...@tropo.com> on 2004/09/11 02:38:46 UTC

frequent terms - Re: combining open office spellchecker with Lucene

Doug Cutting wrote:

> Aad Nales wrote:
> 
>> Before I start reinventing wheels I would like to do a short check to
>> see if anybody else has already tried this. A customer has requested us
>> to look into the possibility to perform a spell check on queries. So far
>> the most promising way of doing this seems to be to create an Analyzer
>> based on the spellchecker of OpenOffice. My question is: "has anybody
>> tried this before?" 
> 
> 
> Note that a spell checker used with a search engine should use 
> collection frequency information.  That's to say, only "corrections" 
> which are more frequent in the collection than what the user entered 
> should be displayed.  Frequency information can also be used when 
> constructing the checker.  For example, one need never consider 
> proposing terms that occur in very few documents.  


> And one should not 
> try correction at all for terms which occur in a large proportion of the 
> collection.

I keep thinking over this one and I don't understand it. If a user 
misspells a word and the "did you mean" spelling correction algorithm 
determines that a frequent term is a good suggestion, why not suggest 
it? The very fact that it's common could mean that it's more likely that 
the user wanted this word (well, the heuristic here is that users 
frequently search for frequent terms, which is probably wrong, but
anyway..).

I know in other contexts of IR frequent terms are penalized but in this 
context it seems that frequent terms should be fine...

-- Dave



> 
> Doug
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


RE: frequent terms - Re: combining open office spellchecker with Lucene

Posted by Aad Nales <aa...@rotterdam-cs.com>.
Also,

You can also use an alternative spellchecker for the 'checking' part and
use the n-gram algorithm for the 'suggestion' part. The 'suggestion' part
would perform its magic only if the spell 'check' declares a word
illegal.


cheers,
Aad
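
The two-stage idea above (check first, suggest only for rejected words)
could be sketched roughly as follows. This is only an illustration: the
class NgramSuggester, the in-memory dictionary set, and the scoring by
shared character bigrams are assumptions made here, not the OpenOffice
spellchecker or the n-gram code discussed in this thread.

```java
import java.util.HashSet;
import java.util.Optional;
import java.util.Set;

// Hypothetical two-stage checker: a dictionary lookup decides whether a word
// is legal, and only rejected words are passed to the n-gram suggestion step.
class NgramSuggester {
    // All character n-grams of a word, e.g. "parser" with n=2 -> pa, ar, rs, se, er.
    static Set<String> ngrams(String word, int n) {
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + n <= word.length(); i++) {
            grams.add(word.substring(i, i + n));
        }
        return grams;
    }

    // Number of distinct n-grams the two words have in common.
    static int sharedNgrams(String a, String b, int n) {
        Set<String> shared = ngrams(a, n);
        shared.retainAll(ngrams(b, n));
        return shared.size();
    }

    // Stage 1: the 'check'. Stage 2: the 'suggestion', run only for illegal words.
    static Optional<String> suggest(Set<String> dictionary, String word) {
        if (dictionary.contains(word)) {
            return Optional.empty(); // word is legal, no suggestion needed
        }
        String best = null;
        int bestScore = 0;
        for (String entry : dictionary) {
            int score = sharedNgrams(word, entry, 2);
            if (score > bestScore) {
                bestScore = score;
                best = entry;
            }
        }
        return Optional.ofNullable(best); // empty if nothing shares a bigram
    }
}
```

With a dictionary of "recursive", "descent" and "parser",
suggest(dictionary, "recursize") picks "recursive" because the two words
share six bigrams, while "parser" passes the check and gets no suggestion.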

Doug Cutting wrote:

> David Spencer wrote:
> 
>> [1] The user enters a query like:
>>     recursize descent parser
>>
>> [2] The search code parses this and sees that the 1st word is not a
>> term in the index, but the next 2 are. So it ignores the last 2 terms
>> ("descent" and "parser") and suggests alternatives to
>> "recursize"...thus if any term is in the index, regardless of
>> frequency, it is left as-is.
>>
>> I guess you're saying that, if the user enters a term that appears in
>> the index and thus is sort of spelled correctly (as it exists in some
>> doc), then we use the heuristic that any sufficiently large doc
>> collection will have tons of misspellings, so we assume that rare
>> terms in the query might be misspelled (i.e. not what the user
>> intended) and we suggest alternatives to these words too (in addition
>> to the words in the query that are not in the index at all).
> 
> 
> Almost.
> 
> If the user enters "a recursize purser", then: "a", which is in, say,
>  >50% of the documents, is probably spelled correctly and "recursize",
> which is in zero documents, is probably misspelled.  But what about
> "purser"?  If we run the spell check algorithm on "purser" and generate
> "parser", should we show it to the user?  If "purser" occurs in 1% of
> documents and "parser" occurs in 5%, then we probably should, since
> "parser" is a more common word than "purser".  But if "parser" only
> occurs in 1% of the documents and purser occurs in 5%, then we probably
> shouldn't bother suggesting "parser".
>
> If you wanted to get really fancy then you could check how frequently
> combinations of query terms occur, i.e., does "purser" or "parser" occur
> more frequently near "descent".  But that gets expensive.

I updated the code to have an optional popularity filter - if true then 
it only returns matches more popular (frequent) than the word that is 
passed in for spelling correction.

If true (default) then for common words like "remove", no results are 
returned now, as expected:

http://www.searchmorph.com/kat/spell.jsp?s=remove

But if you set it to false (bottom slot in the form at the bottom of the
page) then the algorithm happily looks for alternatives:

http://www.searchmorph.com/kat/spell.jsp?s=remove&min=2&max=5&maxd=5&maxr=10&bstart=2.0&bend=1.0&btranspose=1.0&popular=0

TBD I need to update the javadoc & repost the code I guess. Also as per 
earlier post I also store simple transpositions for words in the 
ngram-index.

-- Dave

> 
> Doug
> 


Re: frequent terms - Re: combining open office spellchecker with Lucene

Posted by David Spencer <da...@tropo.com>.
Doug Cutting wrote:

> David Spencer wrote:
> 
>> [1] The user enters a query like:
>>     recursize descent parser
>>
>> [2] The search code parses this and sees that the 1st word is not a 
>> term in the index, but the next 2 are. So it ignores the last 2 terms 
>> ("descent" and "parser") and suggests alternatives to
>> "recursize"...thus if any term is in the index, regardless of 
>> frequency,  it is left as-is.
>>
>> I guess you're saying that, if the user enters a term that appears in 
>> the index and thus is sort of spelled correctly ( as it exists in some 
>> doc), then we use the heuristic that any sufficiently large doc 
>> collection will have tons of misspellings, so we assume that rare 
>> terms in the query might be misspelled (i.e. not what the user 
>> intended) and we suggest alternatives to these words too (in addition
>> to the words in the query that are not in the index at all).
> 
> 
> Almost.
> 
> If the user enters "a recursize purser", then: "a", which is in, say, 
>  >50% of the documents, is probably spelled correctly and "recursize", 
> which is in zero documents, is probably misspelled.  But what about
> "purser"?  If we run the spell check algorithm on "purser" and generate 
> "parser", should we show it to the user?  If "purser" occurs in 1% of 
> documents and "parser" occurs in 5%, then we probably should, since 
> "parser" is a more common word than "purser".  But if "parser" only 
> occurs in 1% of the documents and purser occurs in 5%, then we probably 
> shouldn't bother suggesting "parser".
> 
> If you wanted to get really fancy then you could check how frequently 
> combinations of query terms occur, i.e., does "purser" or "parser" occur 
> more frequently near "descent".  But that gets expensive.

I updated the code to have an optional popularity filter - if true then 
it only returns matches more popular (frequent) than the word that is 
passed in for spelling correction.

If true (default) then for common words like "remove", no results are 
returned now, as expected:

http://www.searchmorph.com/kat/spell.jsp?s=remove

But if you set it to false (bottom slot in the form at the bottom of the 
page) then the algorithm happily looks for alternatives:

http://www.searchmorph.com/kat/spell.jsp?s=remove&min=2&max=5&maxd=5&maxr=10&bstart=2.0&bend=1.0&btranspose=1.0&popular=0
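
A popularity filter of this kind might look roughly like the sketch
below. It is not the code behind the demo links: PopularityFilter, the
docFreq map standing in for the index's document frequencies, and any
counts passed to it are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical popularity filter: when enabled, keeps only those candidate
// corrections that occur in more documents than the word being corrected.
class PopularityFilter {
    // docFreq maps a term to the number of documents containing it;
    // a term that is absent from the map is treated as frequency 0.
    static List<String> filter(Map<String, Integer> docFreq, String word,
                               List<String> candidates, boolean onlyMorePopular) {
        if (!onlyMorePopular) {
            return new ArrayList<>(candidates); // filter off: keep everything
        }
        int wordFreq = docFreq.getOrDefault(word, 0);
        List<String> kept = new ArrayList<>();
        for (String candidate : candidates) {
            if (docFreq.getOrDefault(candidate, 0) > wordFreq) {
                kept.add(candidate);
            }
        }
        return kept;
    }
}
```

For a common word such as "remove", every candidate is likely to be less
frequent, so the filtered list comes back empty; with the flag off the
candidates pass through unchanged. In a Lucene-based version the counts
would presumably come from IndexReader.docFreq(Term) rather than a map.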

TBD I need to update the javadoc & repost the code I guess. Also as per 
earlier post I also store simple transpositions for words in the 
ngram-index.
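
How the transpositions are stored is not spelled out here. As a rough
illustration of what "simple transpositions" of a word could mean, the
hypothetical helper below generates every variant obtained by swapping
two adjacent characters; the class and method names are made up for this
sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: lists the variants of a word produced by swapping
// one pair of adjacent characters.
class Transpositions {
    static List<String> adjacentSwaps(String word) {
        List<String> variants = new ArrayList<>();
        char[] chars = word.toCharArray();
        for (int i = 0; i + 1 < chars.length; i++) {
            if (chars[i] == chars[i + 1]) {
                continue; // swapping identical letters gives the same word
            }
            char[] copy = chars.clone();
            copy[i] = chars[i + 1];
            copy[i + 1] = chars[i];
            variants.add(new String(copy));
        }
        return variants;
    }
}
```

For example, adjacentSwaps("form") returns "ofrm", "from" and "fomr", so
a user who types "form" for "from" could be matched through the stored
variant.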

-- Dave

> 
> Doug
> 


Re: frequent terms - Re: combining open office spellchecker with Lucene

Posted by David Spencer <da...@tropo.com>.
Doug Cutting wrote:

> David Spencer wrote:
> 
>> [1] The user enters a query like:
>>     recursize descent parser
>>
>> [2] The search code parses this and sees that the 1st word is not a 
>> term in the index, but the next 2 are. So it ignores the last 2 terms 
>> ("descent" and "parser") and suggests alternatives to
>> "recursize"...thus if any term is in the index, regardless of 
>> frequency,  it is left as-is.
>>
>> I guess you're saying that, if the user enters a term that appears in 
>> the index and thus is sort of spelled correctly ( as it exists in some 
>> doc), then we use the heuristic that any sufficiently large doc 
>> collection will have tons of misspellings, so we assume that rare 
>> terms in the query might be misspelled (i.e. not what the user 
>> intended) and we suggest alternatives to these words too (in addition
>> to the words in the query that are not in the index at all).
> 
> 
> Almost.
> 
> If the user enters "a recursize purser", then: "a", which is in, say, 
>  >50% of the documents, is probably spelled correctly and "recursize", 
> which is in zero documents, is probably misspelled.  But what about
> "purser"?  If we run the spell check algorithm on "purser" and generate 
> "parser", should we show it to the user?  If "purser" occurs in 1% of 
> documents and "parser" occurs in 5%, then we probably should, since 
> "parser" is a more common word than "purser".  But if "parser" only 
> occurs in 1% of the documents and purser occurs in 5%, then we probably 
> shouldn't bother suggesting "parser".

OK, sure, got it.
I'll give it a think and try to add this option to my just submitted 
spelling code.


> 
> If you wanted to get really fancy then you could check how frequently 
> combinations of query terms occur, i.e., does "purser" or "parser" occur 
> more frequently near "descent".  But that gets expensive.

Yeah, expensive for a large scale search engine, but probably 
appropriate for a desktop engine.

> 
> Doug
> 


Re: frequent terms - Re: combining open office spellchecker with Lucene

Posted by Doug Cutting <cu...@apache.org>.
David Spencer wrote:
> [1] The user enters a query like:
>     recursize descent parser
> 
> [2] The search code parses this and sees that the 1st word is not a term 
> in the index, but the next 2 are. So it ignores the last 2 terms 
> ("descent" and "parser") and suggests alternatives to
> "recursize"...thus if any term is in the index, regardless of frequency, 
>  it is left as-is.
> 
> I guess you're saying that, if the user enters a term that appears in 
> the index and thus is sort of spelled correctly ( as it exists in some 
> doc), then we use the heuristic that any sufficiently large doc 
> collection will have tons of misspellings, so we assume that rare terms 
> in the query might be misspelled (i.e. not what the user intended) and 
> we suggest alternatives to these words too (in addition to the words in
> the query that are not in the index at all).

Almost.

If the user enters "a recursize purser", then: "a", which is in, say, 
 >50% of the documents, is probably spelled correctly and "recursize", 
which is in zero documents, is probably misspelled.  But what about
"purser"?  If we run the spell check algorithm on "purser" and generate 
"parser", should we show it to the user?  If "purser" occurs in 1% of 
documents and "parser" occurs in 5%, then we probably should, since 
"parser" is a more common word than "purser".  But if "parser" only 
occurs in 1% of the documents and purser occurs in 5%, then we probably 
shouldn't bother suggesting "parser".

If you wanted to get really fancy then you could check how frequently 
combinations of query terms occur, i.e., does "purser" or "parser" occur 
more frequently near "descent".  But that gets expensive.

Doug
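
The "really fancy" variant above, checking which candidate occurs more
often near another query term, could be sketched as below. Everything
here is an assumption for illustration: documents are plain token lists,
NearCounter and its methods are hypothetical names, and the brute-force
scan over every document is exactly the kind of cost that makes the idea
expensive on a large collection.

```java
import java.util.List;

// Hypothetical brute-force proximity counter over already-tokenized documents.
class NearCounter {
    // Number of documents in which a and b occur within `window` token
    // positions of each other.
    static int docsWithTermsNear(List<List<String>> docs, String a, String b, int window) {
        int count = 0;
        for (List<String> tokens : docs) {
            if (occursNear(tokens, a, b, window)) {
                count++;
            }
        }
        return count;
    }

    // True if some occurrence of a has an occurrence of b at most `window`
    // positions before or after it.
    static boolean occursNear(List<String> tokens, String a, String b, int window) {
        for (int i = 0; i < tokens.size(); i++) {
            if (!tokens.get(i).equals(a)) {
                continue;
            }
            int from = Math.max(0, i - window);
            int to = Math.min(tokens.size() - 1, i + window);
            for (int j = from; j <= to; j++) {
                if (j != i && tokens.get(j).equals(b)) {
                    return true;
                }
            }
        }
        return false;
    }
}
```

A caller would compare docsWithTermsNear(docs, "parser", "descent", w)
against docsWithTermsNear(docs, "purser", "descent", w) and prefer the
candidate with the higher count.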



RE: frequent terms - Re: combining open office spellchecker with Lucene

Posted by Aad Nales <aa...@rotterdam-cs.com>.
Doug Cutting wrote:

> David Spencer wrote:
> 
>> Doug Cutting wrote:
>>
>>> And one should not try correction at all for terms which occur in a
>>> large proportion of the collection.
>>
>>
>>
>> I keep thinking over this one and I don't understand it. If a user
>> misspells a word and the "did you mean" spelling correction algorithm
>> determines that a frequent term is a good suggestion, why not suggest
>> it? The very fact that it's common could mean that it's more likely
>> that the user wanted this word (well, the heuristic here is that users
>> frequently search for frequent terms, which is probably wrong, but
>> anyway..).
> 
> 
> I think you misunderstood me.  What I meant to say was that if the term
> the user enters is very common then spell correction may be skipped.
> Very common words which are similar to the term the user entered should
> of course be shown.  But if the user's term is very common one need not
> even attempt to find similarly-spelled words.  Is that any better?

Yes, sure, thx, I understand now - but maybe not - the context I was
thinking of was something like this:

[1] The user enters a query like:
     recursize descent parser

[2] The search code parses this and sees that the 1st word is not a term
in the index, but the next 2 are. So it ignores the last 2 terms
("descent" and "parser") and suggests alternatives to
"recursize"...thus if any term is in the index, regardless of frequency,
it is left as-is.

>>>>
My idea is to first execute the query and only execute the 'spell check'
if the number of results is lower than a certain threshold.

Secondly, I would like to apply the 'stemming' functionality that MySpell
offers to everything that is written to the index, together with the POS
appearance.

Thirdly, I want to regularly scan the index for often-used words to be
added to the list of 'approved' terms. This would serve another purpose
of the customer, which is building a synonym index for Dutch words used
in an educational context.

But having read all the input I think that using the index itself for a
first spellcheck is probably not a bad start. 
>>>>
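
The first idea above (run the query, and spell check only when it returns
few results) might be wired up along these lines. This is a sketch under
assumptions: LowHitFallback, the countHits and spellCheck callbacks, and
the minHits parameter are hypothetical stand-ins for the real search and
spell check calls.

```java
import java.util.List;
import java.util.function.Function;
import java.util.function.ToIntFunction;

// Hypothetical wrapper: the spell checker is consulted only when the
// original query returns fewer hits than a chosen threshold.
class LowHitFallback {
    static List<String> suggestIfFewHits(String query, int minHits,
                                         ToIntFunction<String> countHits,
                                         Function<String, List<String>> spellCheck) {
        int hits = countHits.applyAsInt(query);
        if (hits >= minHits) {
            return List.of(); // enough results, skip the spell check entirely
        }
        return spellCheck.apply(query);
    }
}
```

The design choice here is that a well-performing query never pays for the
spell check; the price is that a misspelling which happens to match many
documents goes unnoticed.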


I guess you're saying that, if the user enters a term that appears in
the index and thus is sort of spelled correctly (as it exists in some
doc), then we use the heuristic that any sufficiently large doc
collection will have tons of misspellings, so we assume that rare terms
in the query might be misspelled (i.e. not what the user intended) and
we suggest alternatives to these words too (in addition to the words in
the query that are not in the index at all).


> 
> Doug
> 


Re: frequent terms - Re: combining open office spellchecker with Lucene

Posted by David Spencer <da...@tropo.com>.
Doug Cutting wrote:

> David Spencer wrote:
> 
>> Doug Cutting wrote:
>>
>>> And one should not try correction at all for terms which occur in a 
>>> large proportion of the collection.
>>
>>
>>
>> I keep thinking over this one and I don't understand it. If a user 
>> misspells a word and the "did you mean" spelling correction algorithm 
>> determines that a frequent term is a good suggestion, why not suggest 
>> it? The very fact that it's common could mean that it's more likely 
>> that the user wanted this word (well, the heuristic here is that users 
>> frequently search for frequent terms, which is probably wrong, but
>> anyway..).
> 
> 
> I think you misunderstood me.  What I meant to say was that if the term 
> the user enters is very common then spell correction may be skipped. 
> Very common words which are similar to the term the user entered should 
> of course be shown.  But if the user's term is very common one need not 
> even attempt to find similarly-spelled words.  Is that any better?

Yes, sure, thx, I understand now - but maybe not - the context I was
thinking of was something like this:

[1] The user enters a query like:
     recursize descent parser

[2] The search code parses this and sees that the 1st word is not a term 
in the index, but the next 2 are. So it ignores the last 2 terms 
("descent" and "parser") and suggests alternatives to
"recursize"...thus if any term is in the index, regardless of frequency, 
  it is left as-is.

I guess you're saying that, if the user enters a term that appears in 
the index and thus is sort of spelled correctly ( as it exists in some 
doc), then we use the heuristic that any sufficiently large doc 
collection will have tons of misspellings, so we assume that rare terms 
in the query might be misspelled (i.e. not what the user intended) and 
we suggest alternatives to these words too (in addition to the words in
the query that are not in the index at all).


> 
> Doug
> 


Re: frequent terms - Re: combining open office spellchecker with Lucene

Posted by Doug Cutting <cu...@apache.org>.
David Spencer wrote:
> Doug Cutting wrote:
> 
>> And one should not try correction at all for terms which occur in a 
>> large proportion of the collection.
> 
> 
> I keep thinking over this one and I don't understand it. If a user 
> misspells a word and the "did you mean" spelling correction algorithm 
> determines that a frequent term is a good suggestion, why not suggest 
> it? The very fact that it's common could mean that it's more likely that 
> the user wanted this word (well, the heuristic here is that users 
> frequently search for frequent terms, which is probably wrong, but
> anyway..).

I think you misunderstood me.  What I meant to say was that if the term 
the user enters is very common then spell correction may be skipped. 
Very common words which are similar to the term the user entered should 
of course be shown.  But if the user's term is very common one need not 
even attempt to find similarly-spelled words.  Is that any better?

Doug
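
The rule above, skipping correction when the user's own term is already
very common, comes down to a single ratio test. The sketch below assumes
a hypothetical CommonTermCheck class and a caller-chosen commonFraction
cutoff; no particular threshold is prescribed in this thread. In Lucene
the two counts could presumably be read with IndexReader.docFreq(Term)
and IndexReader.numDocs().

```java
// Hypothetical gate that decides whether a query term is worth spell checking.
class CommonTermCheck {
    // docFreq: number of documents containing the term.
    // numDocs: total number of documents in the collection.
    // commonFraction: share of the collection above which a term counts as
    //                 "very common" (an assumed tuning parameter).
    static boolean needsSpellCheck(int docFreq, int numDocs, double commonFraction) {
        if (numDocs <= 0) {
            return true; // empty collection: no frequency evidence either way
        }
        return (double) docFreq / numDocs < commonFraction;
    }
}
```

With a cutoff of 0.5, a term found in 600 of 1000 documents is left
alone, while a term found in 10 of them, or in none, is still passed to
the spell checker.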
