Posted to solr-user@lucene.apache.org by Marian Steinbach <ma...@sendung.de> on 2011/05/17 22:48:03 UTC

Synonym mapping not working with spaces and non-ASCII characters

Hi!

I have a text field type "countystring" which I need for faceting.
This single-valued field should contain names of German counties like
"Südliche Weinstraße". No tokenizing, stemming etc. is intended. Only
one SynonymFilterFactory is applied.

    <fieldType name="countystring" class="solr.TextField">
        <analyzer>
            <filter class="solr.SynonymFilterFactory" synonyms="county-corrections.txt" ignoreCase="false" expand="false"/>
            <tokenizer class="solr.KeywordTokenizerFactory"/>
        </analyzer>
    </fieldType>
    <field name="county" type="countystring" indexed="true" stored="true" required="false" />


In "county-corrections.txt" (which is UTF-8-encoded) I have mappings
like the following. Some of them work, others don't:

# these are applied as expected:
Vogelbergkreis => Vogelsbergkreis
Weissenburg-Gunzenhausen => Weißenburg-Gunzenhausen

# these aren't applied:
Südliche Weinstrasse => Südliche Weinstraße
"Südliche Weinstrasse" => Südliche Weinstraße
Stadtkreis Amberg => Amberg
"Stadtkreis Amberg" => Amberg
Köthen => Anhalt-Bitterfeld
K\u00F6then => Anhalt-Bitterfeld


It seems as if only those mappings without whitespace and without
non-ASCII characters are accepted. As you can see, I have tried out
various things like quoting and encoding non-ASCII characters in
hexadecimal notation. None of them seems to work.

Is there a solution?

Thanks!

Marian

Re: Synonym mapping not working with spaces and non-ASCII characters

Posted by Markus Jelsma <ma...@openindex.io>.
Hi,

Whitespace must be escaped:
Südliche\ Weinstrasse => Südliche\ Weinstraße

I haven't used non-ASCII characters in synonyms, but they should work.
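
As an alternative to escaping every space, here is a minimal sketch of the
same field type with the synonyms file parsed by the keyword tokenizer. It
assumes the Solr version in use supports the tokenizerFactory attribute on
SynonymFilterFactory, and it has not been tested against this setup. The
tokenizer is also listed before the filter, which is the conventional order.

```xml
<!-- Sketch only: field type from the question, adjusted under the assumption
     that SynonymFilterFactory accepts a tokenizerFactory attribute. -->
<fieldType name="countystring" class="solr.TextField">
  <analyzer>
    <!-- The whole field value is emitted as one token. -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <!-- tokenizerFactory controls how entries in the synonyms file are split.
         With the keyword tokenizer, each side of a mapping stays a single
         token, so unescaped spaces in county names should match. -->
    <filter class="solr.SynonymFilterFactory"
            synonyms="county-corrections.txt"
            ignoreCase="false" expand="false"
            tokenizerFactory="solr.KeywordTokenizerFactory"/>
  </analyzer>
</fieldType>
```

With this variant the mapping lines could stay unescaped, for example
"Stadtkreis Amberg => Amberg" without quotes or backslashes. Reindexing is
needed after either change for the corrected values to appear in facets.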

Cheers,
