Posted to java-user@lucene.apache.org by Jerome Lanneluc <je...@fr.ibm.com> on 2013/01/24 15:25:58 UTC

Chinese analyzer

Hi,

I'm using the 3.6.1 Chinese analyzer, and when tokenizing Chinese words 
containing CJK Unified Ideographs Extension B characters, the resulting 
tokens do not contain the original words. Instead, it seems that each CJK 
Unified Ideographs Extension B character is split into two characters.

In the attached example, 
the output is:

Sentence: 我是中国人(25105 26159 20013 22269 20154)
Tokens: [我(25105) 是(26159) 中国(20013 22269) 人(20154) ]

Sentence: ?(55401 57046)
Tokens: [?(55401) ?(57046) ]

Note the 2 tokens in the second sample when I would expect to have only 
one token with the (55401 57046) characters.
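
For reference, 55401 and 57046 are the two UTF-16 code units of a surrogate 
pair that together encode the single code point U+2A6D6. A minimal standalone 
check of that (not part of my attached example, using only java.lang.Character) 
looks like this:

public class SurrogateCheck {
        public static void main(String[] args) {
                char high = 55401; // 0xD869, a high surrogate
                char low  = 57046; // 0xDED6, a low surrogate
                // Together the pair encodes one supplementary code point (CJK Extension B)
                int codePoint = Character.toCodePoint(high, low);
                System.out.println(Integer.toHexString(codePoint));  // 2a6d6
                System.out.println(Character.isHighSurrogate(high)); // true
                System.out.println(Character.isLowSurrogate(low));   // true
        }
}

So the two tokens above are just the two halves of that surrogate pair, rather 
than the original character.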

I could not figure out if I'm doing something wrong, or if this is a bug 
in the Chinese analyzer.

Thanks,
Jerome




Re: Chinese analyzer

Posted by Jerome Lanneluc <je...@fr.ibm.com>.
Thanks Robert. Is there another analyzer I should use?
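
In the meantime, I am thinking of trying something along the lines of the 
sketch below, swapping in StandardAnalyzer (whose tokenizer follows Unicode 
text segmentation). Whether it actually keeps Extension B characters together 
as single tokens is only my assumption at this point and still needs to be 
verified:

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class StandardAnalyzerCheck {
        public static void main(String[] args) throws IOException {
                // Assumption to verify: the tokenizer keeps the surrogate
                // pair (55401 57046) together as one token.
                Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);
                TokenStream stream = analyzer.tokenStream(null/*field name*/, new StringReader("\uD869\uDED6"));
                CharTermAttribute term = stream.getAttribute(CharTermAttribute.class);
                while (stream.incrementToken()) {
                        // print each token followed by its UTF-16 code units
                        System.out.print(term + " (");
                        for (int i = 0; i < term.length(); i++) {
                                System.out.print((int) term.charAt(i));
                                if (i < term.length() - 1)
                                        System.out.print(" ");
                        }
                        System.out.println(")");
                }
        }
}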

Jerome



From:   Robert Muir <rc...@gmail.com>
To:     java-user@lucene.apache.org, 
Date:   01/24/2013 06:20 PM
Subject:        Re: Chinese analyzer



On Thu, Jan 24, 2013 at 10:53 AM, Jerome Lanneluc
<je...@fr.ibm.com> wrote:
> It looks like my attachment was lost. It referred to
> org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer.
>

I think this analyzer will not properly tokenize text outside of the
BMP: it pretty much only works for simplified text (e.g. chars from
the GB2312 range).






Re: Chinese analyzer

Posted by Robert Muir <rc...@gmail.com>.
On Thu, Jan 24, 2013 at 10:53 AM, Jerome Lanneluc
<je...@fr.ibm.com> wrote:
> It looks like my attachment was lost. It referred to
> org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer.
>

I think this analyzer will not properly tokenize text outside of the
BMP: it pretty much only works for simplified text (e.g. chars from
the GB2312 range).
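
If you need to check up front whether your input contains such characters, 
something along these lines (just a sketch using plain java.lang methods, 
nothing Lucene-specific) would tell you:

public class BmpCheck {
        // true if the string contains any supplementary (non-BMP) code point,
        // e.g. a CJK Unified Ideographs Extension B character
        public static boolean containsSupplementary(String s) {
                for (int i = 0; i < s.length(); ) {
                        int cp = s.codePointAt(i);
                        if (Character.isSupplementaryCodePoint(cp)) {
                                return true; // code point above U+FFFF
                        }
                        i += Character.charCount(cp); // advance by 1 or 2 char positions
                }
                return false;
        }

        public static void main(String[] args) {
                System.out.println(containsSupplementary("\u6211\u662F\u4E2D\u56FD\u4EBA")); // false (all BMP)
                System.out.println(containsSupplementary("\uD869\uDED6"));                   // true  (U+2A6D6)
        }
}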



Re: Chinese analyzer

Posted by Jerome Lanneluc <je...@fr.ibm.com>.
It looks like my attachment was lost. It referred to 
org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer.

I'm inlining it here:

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class ChineseTokenizerTest {
        public static void main(String[] args) throws IOException {
                tokenizeChineseWords("我是中国人"/*"我"(I) "是"(am) "中国" "人"(Chinese = people of China)*/);
                tokenizeChineseWords("\uD869\uDED6"/*the surrogate pair for U+2A6D6, a CJK Unified Ideographs Extension B character*/);
        }

        private static void tokenizeChineseWords(String chineseWords) throws IOException {
                SmartChineseAnalyzer analyzer = new SmartChineseAnalyzer(Version.LUCENE_36);
                TokenStream tokenizer = analyzer.tokenStream(null/*field name*/, new StringReader(chineseWords));
                System.out.print("Sentence: ");
                print(chineseWords);
                System.out.println();
                System.out.print("Tokens: [");
                while (tokenizer.incrementToken()) {
                        CharSequence charTermAttribute = tokenizer.getAttribute(CharTermAttribute.class);
                        print(charTermAttribute);
                        System.out.print(" ");
                }
                System.out.println("]");
                System.out.println();
        }

        private static void print(CharSequence charTermAttribute) {
                System.out.print(charTermAttribute);
                System.out.print("(");
                // print each UTF-16 code unit of the term as a decimal value
                for (int i = 0, length = charTermAttribute.length(); i < length; i++) {
                        System.out.print((int) charTermAttribute.charAt(i));
                        if (i < length - 1)
                                System.out.print(" ");
                }
                System.out.print(")");
        }
}



From:   Robert Muir <rc...@gmail.com>
To:     java-user@lucene.apache.org, 
Date:   01/24/2013 04:31 PM
Subject:        Re: Chinese analyzer



On Thu, Jan 24, 2013 at 9:25 AM, Jerome Lanneluc
<je...@fr.ibm.com> wrote:
> Note the 2 tokens in the second sample when I would expect to have only one
> token with the (55401 57046) characters.
>
> I could not figure out if I'm doing something wrong, or if this is a bug in
> the Chinese analyzer.
>

Which analyzer specifically? There is more than one...






Re: Chinese analyzer

Posted by Robert Muir <rc...@gmail.com>.
On Thu, Jan 24, 2013 at 9:25 AM, Jerome Lanneluc
<je...@fr.ibm.com> wrote:
> Note the 2 tokens in the second sample when I would expect to have only one
> token with the (55401 57046) characters.
>
> I could not figure out if I'm doing something wrong, or if this is a bug in
> the Chinese analyzer.
>

Which analyzer specifically? There is more than one...
