Posted to solr-user@lucene.apache.org by Marian Steinbach <ma...@sendung.de> on 2011/05/17 22:48:03 UTC
Synonym mapping not working with spaces and non-ASCII characters
Hi!
I have a text field type "countystring" which I need for faceting.
This single-valued field should contain names of German counties like
"Südliche Weinstraße". No tokenizing, stemming etc. is intended. Only
one SynonymFilterFactory is applied.
<fieldType name="countystring" class="solr.TextField">
  <analyzer>
    <filter class="solr.SynonymFilterFactory"
            synonyms="county-corrections.txt" ignoreCase="false" expand="false"/>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
</fieldType>
<field name="county" type="countystring" indexed="true"
       stored="true" required="false" />
In "county-corrections.txt" (which is UTF-8-encoded) I have mappings
like the following. Some of them work, others don't:
# these are applied as expected:
Vogelbergkreis => Vogelsbergkreis
Weissenburg-Gunzenhausen => Weißenburg-Gunzenhausen
# these aren't applied:
Südliche Weinstrasse => Südliche Weinstraße
"Südliche Weinstrasse" => Südliche Weinstraße
Stadtkreis Amberg => Amberg
"Stadtkreis Amberg" => Amberg
Köthen => Anhalt-Bitterfeld
K\u00F6then => Anhalt-Bitterfeld
It seems as if only those mappings without whitespace and without
non-ASCII characters are accepted. As you can see, I have tried out
various things like quoting and encoding non-ASCII characters in
hexadecimal notation. None of them seems to work.
Is there a solution?
Thanks!
Marian
Re: Synonym mapping not working with spaces and non-ASCII characters
Posted by Markus Jelsma <ma...@openindex.io>.
Hi,
Whitespace inside a synonym term must be escaped with a backslash:
Südliche\ Weinstrasse => Südliche\ Weinstraße
I haven't used non-ASCII characters in synonyms, but they should work.
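If the file has many multi-word entries, escaping them by hand gets tedious. Below is a minimal Python sketch that backslash-escapes the whitespace inside every term of a synonyms line. The function names escape_term and escape_line are made up for illustration and are not part of Solr. It assumes the simple "a, b => c" line format used above, leaves comment lines alone, and does not treat double quotes specially.

```python
import re


def escape_term(term: str) -> str:
    """Backslash-escape every whitespace character inside one synonym term.

    Whitespace that is already preceded by a backslash is left untouched,
    so running the function twice does not double-escape.
    """
    return re.sub(r"(?<!\\)(\s)", r"\\\1", term.strip())


def escape_line(line: str) -> str:
    """Escape whitespace in every term of a single synonyms-file line."""
    stripped = line.strip()
    if not stripped or stripped.startswith("#"):
        # Blank lines and comments are passed through unchanged.
        return line.rstrip("\n")
    # Split into left and right side of the mapping (if "=>" is present),
    # then into comma-separated terms on each side.
    sides = stripped.split("=>")
    escaped_sides = [
        ",".join(escape_term(term) for term in side.split(","))
        for side in sides
    ]
    return " => ".join(escaped_sides)


if __name__ == "__main__":
    # Sample lines taken from the original question.
    for sample in [
        "Südliche Weinstrasse => Südliche Weinstraße",
        "Stadtkreis Amberg => Amberg",
        "Vogelbergkreis => Vogelsbergkreis",
    ]:
        print(escape_line(sample))
```

To convert a whole file, read it with open("county-corrections.txt", encoding="utf-8"), pass each line through escape_line, and write the result back as UTF-8, so that characters like "ß" and "ö" survive the round trip.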
Cheers,