Posted to solr-user@lucene.apache.org by Stéphane Tellier <st...@cgi.com> on 2009/03/16 16:14:30 UTC
Problems with the spellchecker
Hi,
We think there may be a bug in the spellchecker. It concerns accents
and ISO Latin special characters. If I send a request like this (for
the word "considération"):
http://localhost/solr/spellCheckCompRH?q=considération&spellcheck=on&spellcheck.dictionary=file
and if I have a good number of words in my dictionary, it returns
suggestions for "consid" and "ration". It looks like it is treating the
"é" character as a space or a separator.
Having looked through the code, I found the class SpellingQueryConverter,
which seems to do this work. I think the problem is the regular
expression: the predefined character class \w does not cover special
characters. As defined by the Java API, \w = [a-zA-Z_0-9], which does not
include accented ISO Latin characters. I did not find a regular
expression that would handle all of this, but I think it would be
important to fix it for the next version.
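
To illustrate the behaviour described above, here is a minimal,
standalone sketch. It assumes a simplified pattern (a bare \w+) and is
not the actual expression used in SpellingQueryConverter; the class name
WordPatternDemo and the helper tokens() are made up for this example.
It compares \w+ with a Unicode-aware character class built from \p{L}
(letters), \p{M} (combining marks) and \p{N} (digits), which is one
possible alternative rather than a confirmed fix.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of Solr.
public class WordPatternDemo {
    // By default, Java's \w is ASCII-only: [a-zA-Z_0-9].
    static final Pattern ASCII_WORD = Pattern.compile("\\w+");

    // Assumed alternative: any Unicode letter, combining mark, digit, or underscore.
    static final Pattern UNICODE_WORD = Pattern.compile("[\\p{L}\\p{M}\\p{N}_]+");

    // Collect every match of the pattern in the input, in order.
    static List<String> tokens(Pattern pattern, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = pattern.matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String query = "consid\u00e9ration"; // "considération"
        System.out.println(tokens(ASCII_WORD, query));   // [consid, ration]
        System.out.println(tokens(UNICODE_WORD, query)); // [considération]
    }
}
```

With \w+ the "é" falls outside the class and the word is split in two,
which matches the suggestions for "consid" and "ration" that we are
seeing.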
The version we're working with is the nightly build of 2009-03-04 (because
we need the better-tuned facet module, which is much faster).
Thanks.
--
View this message in context: http://www.nabble.com/Problems-with-the-spellchecker-tp22540347p22540347.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Problems with the spellchecker
Posted by Grant Ingersoll <gs...@apache.org>.
What's the saying? "It's not a bug, it's a feature!"
The QueryConverter is by definition a simple implementation that
handles the basics and is designed to be replaced by those with
specific needs. http://wiki.apache.org/solr/SpellCheckComponent#head-8a3a4d45708be416cec61a9387131cd52fcdbbbf
It would probably be good to at least have a few different
implementations for handling various common scenarios.
-Grant