Posted to solr-user@lucene.apache.org by Bernd Fehling <be...@uni-bielefeld.de> on 2013/05/13 13:35:28 UTC

CJK question

A question about CJK: how is U+3000 handled?

U+3000 belongs to "CJK Symbols and Punctuation" and is named "IDEOGRAPHIC SPACE".

Is it wrong if I just map it to U+0020 (SPACE)?

What does the CJK Analyzer do with U+3000?

If "two CJK words" have U+3000 between them, does that mean these "two CJK words"
belong together, and that changing U+3000 to U+0020 will break the meaning of the
whole CJK word?

Actually I have no idea about CJK.
Any help welcome.

Bernd

RE: CJK question

Posted by Markus Jelsma <ma...@openindex.io>.
Hi,

The CJK Analyzer uses the StandardTokenizer, which does split on IDEOGRAPHIC SPACE.

Cheers,
Markus
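
For illustration only, here is a minimal Python sketch of the mapping asked about
above. It does not call Lucene or Solr; it only uses the Python standard library to
show that U+3000 is classified as a space separator (category Zs) and that replacing
it with U+0020 leaves a plain whitespace split unchanged. The helper name
map_ideographic_space and the sample string are assumptions made up for this sketch.

```python
import unicodedata

IDEOGRAPHIC_SPACE = "\u3000"


def map_ideographic_space(text: str) -> str:
    """Replace every U+3000 (IDEOGRAPHIC SPACE) with U+0020 (SPACE)."""
    return text.replace(IDEOGRAPHIC_SPACE, " ")


# Hypothetical sample: two CJK words separated by U+3000.
sample = "東京\u3000大阪"

# Unicode classifies U+3000 as a space separator, like U+0020.
print(unicodedata.name(IDEOGRAPHIC_SPACE))
print(unicodedata.category(IDEOGRAPHIC_SPACE))

# Python's whitespace split already treats U+3000 as a separator,
# so the tokens are the same before and after the mapping.
tokens_before = sample.split()
tokens_after = map_ideographic_space(sample).split()
print(tokens_before == tokens_after)
```

Whether the same holds in a given Solr field type depends on the tokenizer configured
for it, so it is worth checking with the analysis screen in the Solr admin UI.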
 
 