Posted to dev@lucene.apache.org by "peina (JIRA)" <ji...@apache.org> on 2016/10/20 03:09:58 UTC

[jira] [Created] (LUCENE-7509) [smartcn] Some chinese text is not tokenized correctly with Chinese punctuation marks appended

peina created LUCENE-7509:
-----------------------------

             Summary: [smartcn] Some chinese text is not tokenized correctly with Chinese punctuation marks appended
                 Key: LUCENE-7509
                 URL: https://issues.apache.org/jira/browse/LUCENE-7509
             Project: Lucene - Core
          Issue Type: Bug
          Components: modules/analysis
    Affects Versions: 6.2.1
         Environment: Mac OS X 10.10
            Reporter: peina


Some Chinese text is not tokenized correctly when a Chinese punctuation mark is appended to it.

e.g.
碧绿的眼珠 is tokenized as 碧绿|的|眼珠, which is correct.

But
碧绿的眼珠, (with a Chinese punctuation mark appended) is tokenized as 碧绿|的|眼|珠, which splits 眼珠 into two tokens.

A similar problem occurs when numbers are appended to the text.

e.g.
生活报8月4号 --> 生活|报|8|月|4|号
生活报 --> 生活报

Test Sample:
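import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// The class name is arbitrary; the original snippet did not include one.
public class SmartcnSample {

  public static void main(String[] args) throws IOException {
    Analyzer analyzer = new SmartChineseAnalyzer(); /* will load stopwords */
    System.out.println("Sample1=======");
    String sentence = "生活报8月4号";
    printTokens(analyzer, sentence);
    sentence = "生活报";
    printTokens(analyzer, sentence);
    System.out.println("Sample2=======");

    sentence = "碧绿的眼珠,";
    printTokens(analyzer, sentence);
    sentence = "碧绿的眼珠";
    printTokens(analyzer, sentence);

    analyzer.close();
  }

  // Prints each token the analyzer produces for the sentence, one per line.
  private static void printTokens(Analyzer analyzer, String sentence) throws IOException {
    System.out.println("sentence:" + sentence);
    TokenStream tokens = analyzer.tokenStream("dummyfield", sentence);
    tokens.reset();
    CharTermAttribute termAttr = tokens.getAttribute(CharTermAttribute.class);
    while (tokens.incrementToken()) {
      System.out.println(termAttr.toString());
    }
    tokens.end();
    tokens.close();
  }
}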

Output:
Sample1=======
sentence:生活报8月4号
生活
报
8
月
4
号
sentence:生活报
生活报
Sample2=======
sentence:碧绿的眼珠,
碧绿
的
眼
珠
sentence:碧绿的眼珠
碧绿
的
眼珠
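
The comparison above can also be expressed as a small check. This is only a sketch and uses no Lucene classes: the token lists are copied by hand from the output above, and the class and method names (PrefixStabilityCheck, render, keepsBaseSegmentation) are made up for illustration. It assumes the expected behavior is that appending punctuation or digits should not change how the preceding text is segmented. That holds for these two samples but is not a general rule for word segmentation.

```java
import java.util.Arrays;
import java.util.List;

public class PrefixStabilityCheck {

    // Joins tokens in the "a|b|c" notation used in this report.
    static String render(List<String> tokens) {
        return String.join("|", tokens);
    }

    // True if the tokens produced for the longer text begin with exactly
    // the tokens produced for the base text alone.
    static boolean keepsBaseSegmentation(List<String> baseTokens, List<String> extendedTokens) {
        return extendedTokens.size() >= baseTokens.size()
                && extendedTokens.subList(0, baseTokens.size()).equals(baseTokens);
    }

    public static void main(String[] args) {
        // Token lists copied from the output above, not computed here.
        List<String> base1 = Arrays.asList("生活报");
        List<String> ext1 = Arrays.asList("生活", "报", "8", "月", "4", "号");
        List<String> base2 = Arrays.asList("碧绿", "的", "眼珠");
        List<String> ext2 = Arrays.asList("碧绿", "的", "眼", "珠");

        System.out.println(render(base1) + " vs " + render(ext1)
                + " -> " + keepsBaseSegmentation(base1, ext1));
        System.out.println(render(base2) + " vs " + render(ext2)
                + " -> " + keepsBaseSegmentation(base2, ext2));
    }
}
```

Both lines report false for the token lists in this report, which is the inconsistency being described. A regression test for a fix could feed real analyzer output into the same check.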



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
