Posted to java-user@lucene.apache.org by Jerome Lanneluc <je...@fr.ibm.com> on 2013/01/24 15:25:58 UTC
Chinese analyzer
Hi,
I'm using the 3.6.1 Chinese analyzer, and when tokenizing some Chinese
words containing CJK Unified Ideographs Extension B characters, the
resulting tokens do not contain the original words. Instead, it seems that
each CJK Unified Ideographs Extension B character is split into two
separate chars.
In the attached example,
the output is:
Sentence: 我是中国人(25105 26159 20013 22269 20154)
Tokens: [我(25105) 是(26159) 中国(20013 22269) 人(20154) ]
Sentence: ?(55401 57046)
Tokens: [?(55401) ?(57046) ]
Note the 2 tokens in the second sample when I would expect to have only
one token with the (55401 57046) characters.
I could not figure out if I'm doing something wrong, or if this is a bug
in the Chinese analyzer.
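For reference, the two values printed in the second sample, 55401 and 57046, are 0xD869 and 0xDED6: the high and low UTF-16 surrogates of a single code point, U+2A6D6, in CJK Unified Ideographs Extension B. A minimal standalone sketch that shows this with the JDK only; the class and method names (SurrogatePairDemo, codeUnits, codePoints) are made up for illustration and are not part of Lucene:

```java
public class SurrogatePairDemo {

    /** Formats each UTF-16 code unit (Java char) as a decimal number. */
    public static String codeUnits(CharSequence text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append((int) text.charAt(i));
        }
        return sb.toString();
    }

    /** Formats each Unicode code point as U+XXXX, joining surrogate pairs. */
    public static String codePoints(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
            i += Character.charCount(cp); // 2 for a surrogate pair, else 1
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The two chars reported in the second sample of the output.
        String extB = new String(new char[] {(char) 55401, (char) 57046});

        System.out.println(codeUnits(extB));   // 55401 57046
        System.out.println(codePoints(extB));  // U+2A6D6
        System.out.println(Character.isSurrogatePair(extB.charAt(0), extB.charAt(1))); // true
        System.out.println(extB.codePointCount(0, extB.length())); // 1
    }
}
```

So the string holds one character in two Java chars, and a tokenizer that walks the text char by char can end up emitting each surrogate as its own token.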
Thanks,
Jerome
Sauf indication contraire ci-dessus:/ Unless stated otherwise above:
Compagnie IBM France
Siège Social : 17 avenue de l'Europe, 92275 Bois-Colombes Cedex
RCS Nanterre 552 118 465
Forme Sociale : S.A.S.
Capital Social : 653.242.306,20 €
SIREN/SIRET : 552 118 465 03644 - Code NAF 6202A
Re: Chinese analyzer
Posted by Jerome Lanneluc <je...@fr.ibm.com>.
Thanks Robert. Is there another analyzer I should use?
Jerome
From: Robert Muir <rc...@gmail.com>
To: java-user@lucene.apache.org,
Date: 01/24/2013 06:20 PM
Subject: Re: Chinese analyzer
On Thu, Jan 24, 2013 at 10:53 AM, Jerome Lanneluc
<je...@fr.ibm.com> wrote:
> It looks like my attachment was lost. It referred to
> org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer.
>
I think this analyzer will not properly tokenize text outside of the
BMP: it pretty much only works for simplified text (e.g. chars from
GB2312 range)
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: Chinese analyzer
Posted by Robert Muir <rc...@gmail.com>.
On Thu, Jan 24, 2013 at 10:53 AM, Jerome Lanneluc
<je...@fr.ibm.com> wrote:
> It looks like my attachment was lost. It referred to
> org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer.
>
I think this analyzer will not properly tokenize text outside of the
BMP: it pretty much only works for simplified text (e.g. chars from
GB2312 range)
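A minimal sketch of how one might check whether a given input falls into that category before handing it to the analyzer. It uses only the JDK; BmpCheck and hasSupplementary are hypothetical names, not Lucene APIs, and the check only detects characters outside the BMP (it says nothing about whether BMP text is inside the GB2312 range):

```java
public class BmpCheck {

    /** Returns true if the text contains at least one code point above U+FFFF. */
    public static boolean hasSupplementary(CharSequence text) {
        for (int i = 0; i < text.length(); ) {
            int cp = Character.codePointAt(text, i);
            if (Character.isSupplementaryCodePoint(cp)) {
                return true;
            }
            i += Character.charCount(cp);
        }
        return false;
    }

    public static void main(String[] args) {
        // First sample sentence from the thread, written with escapes
        // so the source does not depend on the file encoding.
        String bmpOnly = "\u6211\u662F\u4E2D\u56FD\u4EBA";
        // Second sample: U+2A6D6 as the surrogate pair (55401 57046).
        String extB = "\uD869\uDED6";

        System.out.println(hasSupplementary(bmpOnly)); // false
        System.out.println(hasSupplementary(extB));    // true
    }
}
```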
Re: Chinese analyzer
Posted by Jerome Lanneluc <je...@fr.ibm.com>.
It looks like my attachment was lost. It referred to
org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer.
I'm inlining it here:
import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class ChineseTokenizerTest {

    public static void main(String[] args) throws IOException {
        // "我" (I) "是" (am) "中国" "人" (Chinese = people of China)
        tokenizeChineseWords("我是中国人");
        // One Extension B character, written as its surrogate pair
        // (55401 57046) so that it survives any source or mail encoding.
        tokenizeChineseWords("\uD869\uDED6");
    }

    private static void tokenizeChineseWords(String chineseWords) throws IOException {
        SmartChineseAnalyzer analyzer = new SmartChineseAnalyzer(Version.LUCENE_36);
        TokenStream tokenizer =
                analyzer.tokenStream(null /* field name */, new StringReader(chineseWords));
        // The attribute instance is reused for every token, so fetch it once.
        CharTermAttribute term = tokenizer.getAttribute(CharTermAttribute.class);

        System.out.print("Sentence: ");
        print(chineseWords);
        System.out.println();

        System.out.print("Tokens: [");
        // Documented consumer workflow: reset(), incrementToken(), end(), close().
        tokenizer.reset();
        while (tokenizer.incrementToken()) {
            print(term);
            System.out.print(" ");
        }
        tokenizer.end();
        tokenizer.close();
        System.out.println("]");
        System.out.println();
    }

    // Prints the text followed by the decimal value of each UTF-16 char.
    private static void print(CharSequence text) {
        System.out.print(text);
        System.out.print("(");
        for (int i = 0, length = text.length(); i < length; i++) {
            System.out.print((int) text.charAt(i));
            if (i < length - 1) {
                System.out.print(" ");
            }
        }
        System.out.print(")");
    }
}
From: Robert Muir <rc...@gmail.com>
To: java-user@lucene.apache.org,
Date: 01/24/2013 04:31 PM
Subject: Re: Chinese analyzer
On Thu, Jan 24, 2013 at 9:25 AM, Jerome Lanneluc
<je...@fr.ibm.com> wrote:
> Note the 2 tokens in the second sample when I would expect to have only one
> token with the (55401 57046) characters.
>
> I could not figure out if I'm doing something wrong, or if this is a bug in
> the Chinese analyzer.
>
Which analyzer specifically? There is more than one...
Re: Chinese analyzer
Posted by Robert Muir <rc...@gmail.com>.
On Thu, Jan 24, 2013 at 9:25 AM, Jerome Lanneluc
<je...@fr.ibm.com> wrote:
> Note the 2 tokens in the second sample when I would expect to have only one
> token with the (55401 57046) characters.
>
> I could not figure out if I'm doing something wrong, or if this is a bug in
> the Chinese analyzer.
>
Which analyzer specifically? There is more than one...