Posted to java-user@lucene.apache.org by "Burton-West, Tom" <tb...@umich.edu> on 2010/11/23 00:50:55 UTC

ICUTokenizer and CJK

Hi all,

I see in the javadoc for the ICUTokenizer that it has special handling for Lao, Myanmar, and Khmer word breaking, but the javadoc gives no details about what it does with CJK. For C and J it appears to break the text into unigrams. Is this correct?


Tom

Re: ICUTokenizer and CJK

Posted by Robert Muir <rc...@gmail.com>.

On Mon, Nov 22, 2010 at 6:50 PM, Burton-West, Tom <tb...@umich.edu> wrote:
> Hi all,
>
> I see in the javadoc for the ICUTokenizer that it has special handling for Lao, Myanmar, and Khmer word breaking, but the javadoc gives no details about what it does with CJK. For C and J it appears to break the text into unigrams. Is this correct?
>

Yes: Han ideographs are segmented into unigrams (this is the UAX #29
default behavior). I don't know off the top of my head what the rules
are for Japanese kana...
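
To make the unigram behavior concrete, here is a minimal sketch that only
imitates it with the plain JDK. It does not call ICUTokenizer or ICU at all:
the class HanUnigramSketch and its tokenize method are hypothetical names made
up for this illustration. The one thing it is meant to show is that each Han
ideograph becomes its own token. Grouping the remaining letters and digits
into runs is a simplifying assumption and says nothing about how kana or other
scripts are really handled.

```java
import java.util.ArrayList;
import java.util.List;

public class HanUnigramSketch {

    // Hypothetical helper, NOT the ICUTokenizer API: emits one token per Han
    // ideograph and groups other letters/digits into runs (an assumption
    // made for illustration only).
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN) {
                // A Han ideograph ends any pending run and stands alone.
                flush(run, tokens);
                tokens.add(new String(Character.toChars(cp)));
            } else if (Character.isLetterOrDigit(cp)) {
                // Simplification: keep non-Han letters and digits together.
                run.appendCodePoint(cp);
            } else {
                // Whitespace, punctuation, etc. just end the pending run.
                flush(run, tokens);
            }
        }
        flush(run, tokens);
        return tokens;
    }

    // Moves the pending run, if any, into the token list.
    private static void flush(StringBuilder run, List<String> tokens) {
        if (run.length() > 0) {
            tokens.add(run.toString());
            run.setLength(0);
        }
    }

    public static void main(String[] args) {
        // prints [Lucene, 中, 文, 分, 词]
        System.out.println(tokenize("Lucene 中文分词"));
    }
}
```

To see what ICUTokenizer itself emits, the reliable check is still to run it
over sample text and print the terms.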
