Posted to dev@lucene.apache.org by "peina (JIRA)" <ji...@apache.org> on 2016/10/20 03:09:58 UTC
[jira] [Created] (LUCENE-7509) [smartcn] Some chinese text is not
tokenized correctly with Chinese punctuation marks appended
peina created LUCENE-7509:
-----------------------------
Summary: [smartcn] Some chinese text is not tokenized correctly with Chinese punctuation marks appended
Key: LUCENE-7509
URL: https://issues.apache.org/jira/browse/LUCENE-7509
Project: Lucene - Core
Issue Type: Bug
Components: modules/analysis
Affects Versions: 6.2.1
Environment: Mac OS X 10.10
Reporter: peina
Some Chinese text is not tokenized correctly when Chinese punctuation marks are appended.
e.g.
碧绿的眼珠 is tokenized as 碧绿|的|眼珠, which is correct.
But 碧绿的眼珠, (with a Chinese punctuation mark appended) is tokenized as 碧绿|的|眼|珠: the word 眼珠 is split into 眼 and 珠.
A similar problem occurs when numbers are appended to the text.
e.g.
生活报8月4号 --> 生活|报|8|月|4|号
生活报 --> 生活报
Test Sample:
import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// The class name is a placeholder; the original report shows only the two methods.
public class SmartcnTokenizeSample {

    public static void main(String[] args) throws IOException {
        // The no-arg constructor loads the default stopword list.
        try (Analyzer analyzer = new SmartChineseAnalyzer()) {
            System.out.println("Sample1=======");
            printTokens(analyzer, "生活报8月4号");
            printTokens(analyzer, "生活报");

            System.out.println("Sample2=======");
            printTokens(analyzer, "碧绿的眼珠,");
            printTokens(analyzer, "碧绿的眼珠");
        }
    }

    // Prints the sentence, then each token the analyzer emits, one per line.
    private static void printTokens(Analyzer analyzer, String sentence) throws IOException {
        System.out.println("sentence:" + sentence);
        try (TokenStream tokens = analyzer.tokenStream("dummyfield", sentence)) {
            CharTermAttribute termAttr = tokens.addAttribute(CharTermAttribute.class);
            tokens.reset();
            while (tokens.incrementToken()) {
                System.out.println(termAttr.toString());
            }
            tokens.end(); // complete the TokenStream contract before close()
        }
    }
}
Output:
Sample1=======
sentence:生活报8月4号
生活
报
8
月
4
号
sentence:生活报
生活报
Sample2=======
sentence:碧绿的眼珠,
碧绿
的
眼
珠
sentence:碧绿的眼珠
碧绿
的
眼珠
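
The expectation behind both samples is that appending a suffix should not change how the preceding text is split. A minimal sketch of that check follows. It is not part of the original report: the class PrefixStabilityCheck and its methods are hypothetical, and the tokenizer in main is a stand-in that replays the output listed above rather than calling smartcn. With Lucene on the classpath, a function that collects the terms from analyzer.tokenStream(...) could be passed in instead.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical helper, not part of Lucene or of the original report.
public class PrefixStabilityCheck {

    // Joins tokens in the notation used in this issue, e.g. 碧绿|的|眼珠.
    static String render(List<String> tokens) {
        return String.join("|", tokens);
    }

    // Returns true when the tokens for `text` are a prefix of the tokens
    // for `text + suffix`, i.e. appending the suffix did not change how
    // the original text was split.
    static boolean isPrefixStable(Function<String, List<String>> tokenizer,
                                  String text, String suffix) {
        List<String> base = tokenizer.apply(text);
        List<String> extended = tokenizer.apply(text + suffix);
        return extended.size() >= base.size()
                && extended.subList(0, base.size()).equals(base);
    }

    public static void main(String[] args) {
        // Stand-in tokenizer: replays the output reported in this issue.
        Map<String, List<String>> reported = new HashMap<>();
        reported.put("碧绿的眼珠", Arrays.asList("碧绿", "的", "眼珠"));
        reported.put("碧绿的眼珠,", Arrays.asList("碧绿", "的", "眼", "珠"));
        reported.put("生活报", Arrays.asList("生活报"));
        reported.put("生活报8月4号", Arrays.asList("生活", "报", "8", "月", "4", "号"));
        Function<String, List<String>> tokenizer = reported::get;

        System.out.println(render(tokenizer.apply("碧绿的眼珠")));           // 碧绿|的|眼珠
        System.out.println(isPrefixStable(tokenizer, "碧绿的眼珠", ","));   // false
        System.out.println(isPrefixStable(tokenizer, "生活报", "8月4号"));   // false
    }
}
```

Both checks return false for the reported output, which is the behavior this issue describes. Whether a prefix-stable split is always the right one is a separate question, since a suffix can legitimately change segmentation; the check only flags cases worth inspecting.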