Posted to solr-user@lucene.apache.org by Stéphane Tellier <st...@cgi.com> on 2009/03/16 16:14:30 UTC

Problems with the spellchecker

Hi,
 
    We think there may be a bug in the spellchecker. It concerns accents
and ISO-Latin special characters. If I send a request like this (for
the word "considération"):
 
http://localhost/solr/spellCheckCompRH?q=considération&spellcheck=on&spellcheck.dictionary=file
 
and I have a good number of words in my dictionary, it returns
suggestions for "consid" and "ration". It looks like it is treating the
"é" character as a space or a separator.
 
Having looked through the code, I found the class
SpellingQueryConverter, which seems to do this work. I think the problem
is its regular expression: the predefined character class \w does not match
accented characters. As defined by the Java API, \w = [a-zA-Z_0-9], which
does not include ISO-Latin accented characters. I didn't find a
regular expression that would handle all of this, but I think
it would be important to fix this for the next version.
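
To make the suspected behavior concrete, here is a minimal Java sketch. It
assumes a simplified pattern, \w+, as a stand-in; the actual expression in
SpellingQueryConverter is not reproduced here, and the class name
AsciiWordDemo and the helper tokens() are hypothetical names used only for
illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AsciiWordDemo {
    // Simplified stand-in pattern (an assumption), NOT the real
    // SpellingQueryConverter regex. With default flags, \w is ASCII-only.
    static final Pattern ASCII_WORD = Pattern.compile("\\w+");

    // Collects every run of \w characters found in the input.
    static List<String> tokens(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = ASCII_WORD.matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // \u00e9 is the precomposed "e with acute accent" character.
        System.out.println(tokens("consid\u00e9ration")); // prints [consid, ration]
    }
}
```

Because the accented character falls outside [a-zA-Z_0-9], the matcher stops
before it and starts a new token after it, which matches the two suggestions
observed.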
 
The version we're working with is the nightly build of 2009-03-04 (because
we need the improved facet module, which is much faster).
 
Thanks.
-- 
View this message in context: http://www.nabble.com/Problems-with-the-spellchecker-tp22540347p22540347.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Problems with the spellchecker

Posted by Grant Ingersoll <gs...@apache.org>.

What's the saying?  "It's not a bug, it's a feature!"

The QueryConverter is by definition a simple implementation that
handles the basics and is designed to be replaced by those with
specific needs. See:
http://wiki.apache.org/solr/SpellCheckComponent#head-8a3a4d45708be416cec61a9387131cd52fcdbbbf

It would probably be good to at least have a few different  
implementations for handling various common scenarios.
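
As one illustration of what an alternative for accented text might match on,
here is a minimal sketch in plain java.util.regex. It is not the Solr
QueryConverter API, and wiring it into a custom converter is left out. The
class name UnicodeWordDemo, the helper tokens(), and the choice of pattern
are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UnicodeWordDemo {
    // \p{L} = Unicode letters, \p{M} = combining marks, \p{N} = numbers.
    // The underscore is kept so the class stays a superset of ASCII \w.
    static final Pattern UNICODE_WORD = Pattern.compile("[\\p{L}\\p{M}\\p{N}_]+");

    // Collects every run of Unicode word characters found in the input.
    static List<String> tokens(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = UNICODE_WORD.matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // The accented word now comes back as a single token.
        System.out.println(tokens("une consid\u00e9ration").size()); // prints 2
    }
}
```

Including \p{M} also keeps decomposed input (a base letter followed by a
combining accent) in one token. Whether that is desirable depends on how the
dictionary itself was indexed.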

-Grant

