Posted to solr-user@lucene.apache.org by Luis Cappa Banda <lu...@gmail.com> on 2012/11/07 11:28:58 UTC
Latin characters encoding. Example: letter "ñ".
Hello!
I've got some encoding problems with my new analyzer configuration. I've
deployed a Solr server in Apache Tomcat, setting Tomcat's encoding to UTF-8
in server.xml. Solr's encoding is also set to UTF-8 in schema.xml. I have
defined a fieldType like the following:
<fieldType name="textSearch" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.MappingCharFilterFactory" mapping="charsToRemove.txt"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_es.txt"/>
    <filter class="solr.WordDelimiterFilterFactory"
            splitOnCaseChange="1"
            splitOnNumerics="1"
            stemEnglishPossessive="1"
            generateWordParts="1"
            generateNumberParts="1"
            preserveOriginal="1"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="Spanish"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
I don't know why, but Solr immediately turns an input like "sueños" ("dreams"
in English) into something like "sueños". As a result,
WordDelimiterFilterFactory splits the token into "sue à os", which obviously
breaks search queries that include the original term "sueños". It looks as
if Solr's encoding isn't UTF-8.
Any tips or suggestions?
Thank you very much.
--
- Luis Cappa
Re: Latin characters encoding. Example: letter "ñ".
Posted by Luis Cappa Banda <lu...@gmail.com>.
Latest news!
It was a simple misspelling in Tomcat's server.xml: I had specified "utf8"
instead of "UTF-8". After fixing that, the problem was solved and everything
is O.K. I hope this thread will be useful to someone, because this kind of
Latin character encoding problem is very common.
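For anyone landing here later, this is roughly where such a setting usually
lives. It is only a sketch: the thread does not say which attribute was
misspelled, so URIEncoding on the HTTP Connector is an assumption, and the
port, protocol, timeout and redirect values are just placeholders taken from
a stock server.xml.

```xml
<!-- Hypothetical Connector entry in Tomcat's conf/server.xml.
     Assumption: the encoding in question is the Connector's URIEncoding
     attribute; port, protocol, timeout and redirectPort are placeholders. -->
<Connector port="8080" protocol="HTTP/1.1"
           connectionTimeout="20000"
           redirectPort="8443"
           URIEncoding="UTF-8"/>
```

Tomcat needs a restart before a change to server.xml takes effect.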
Goodbye!
--
- Luis Cappa