Posted to solr-user@lucene.apache.org by Bernd Fehling <be...@uni-bielefeld.de> on 2013/05/13 13:35:28 UTC
CJK question
A question about CJK: how is U+3000 handled?
U+3000 belongs to "CJK Symbols and Punctuation" and is named "IDEOGRAPHIC SPACE".
Is it wrong if I just map it to U+0020 (SPACE)?
What does the CJK Analyzer do with U+3000?
If "two CJK words" have U+3000 between them, does that mean these "two CJK words"
belong together, and will changing U+3000 to U+0020 break the meaning of the
whole CJK word?
Actually I have no idea about CJK.
Any help welcome.
Bernd
RE: CJK question
Posted by Markus Jelsma <ma...@openindex.io>.
Hi,
The CJK Analyzer uses the StandardTokenizer, which does split on IDEOGRAPHIC SPACE.
Cheers,
Markus
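For readers who want to check the character-level facts behind this thread, here is a minimal sketch using only Python's standard library. It does not run Lucene or Solr, so it says nothing definitive about the tokenizer itself; it only shows how Unicode classifies U+3000 and what NFKC normalization does with it. The sample string and the helper names (describe, map_ideographic_space) are assumptions made up for illustration.

```python
import unicodedata

IDEOGRAPHIC_SPACE = "\u3000"


def describe(ch):
    """Return the Unicode name and general category of a single character."""
    return unicodedata.name(ch), unicodedata.category(ch)


def map_ideographic_space(text):
    """Replace every U+3000 with U+0020, as proposed in the question."""
    return text.replace(IDEOGRAPHIC_SPACE, " ")


# Hypothetical sample: two CJK words separated by U+3000.
sample = "東京\u3000大学"

name, category = describe(IDEOGRAPHIC_SPACE)
mapped = map_ideographic_space(sample)

# NFKC applies compatibility decompositions; for U+3000 that is U+0020.
nfkc = unicodedata.normalize("NFKC", sample)

print(name, category)          # Unicode name and general category
print(sample.split())          # whitespace split on the original text
print(mapped.split())          # whitespace split after the manual mapping
print(mapped == nfkc)          # manual mapping vs. NFKC on this sample
```

On this sample the whitespace split gives the same two tokens before and after the mapping, which is at least consistent with U+3000 acting as a separator rather than as part of a word. Whether a given analysis chain behaves the same way is best confirmed on the Solr analysis page with real data.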