Posted to dev@lucene.apache.org by "Michael McCandless (JIRA)" <ji...@apache.org> on 2016/10/21 15:09:58 UTC
[jira] [Commented] (LUCENE-7509) [smartcn] Some chinese text is not
tokenized correctly with Chinese punctuation marks appended
[ https://issues.apache.org/jira/browse/LUCENE-7509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15595364#comment-15595364 ]
Michael McCandless commented on LUCENE-7509:
--------------------------------------------
Hi [~peina], could you please turn your test fragments into a test that fails? See e.g. https://wiki.apache.org/lucene-java/HowToContribute
Do you know how to fix this? Is there a Unicode API we should be using to more generally check for punctuation, so that Chinese punctuation is included?
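One possible answer to the Unicode question, offered only as a sketch: the JDK's Character.getType(int) returns the Unicode general category of a code point, and the seven punctuation categories (Pc, Pd, Ps, Pe, Pi, Pf, Po) cover full-width and ideographic marks as well as ASCII ones. The class and method names below (PunctuationCheck, isPunctuation) are hypothetical and are not part of smartcn; whether smartcn's character classification should use such a check is an open question here.

```java
class PunctuationCheck {
    // Hypothetical helper, not smartcn code: returns true when the code point
    // belongs to any Unicode punctuation general category (Pc, Pd, Ps, Pe, Pi, Pf, Po).
    public static boolean isPunctuation(int codePoint) {
        switch (Character.getType(codePoint)) {
            case Character.CONNECTOR_PUNCTUATION:
            case Character.DASH_PUNCTUATION:
            case Character.START_PUNCTUATION:
            case Character.END_PUNCTUATION:
            case Character.INITIAL_QUOTE_PUNCTUATION:
            case Character.FINAL_QUOTE_PUNCTUATION:
            case Character.OTHER_PUNCTUATION:
                return true;
            default:
                return false;
        }
    }

    public static void main(String[] args) {
        // U+FF0C FULLWIDTH COMMA, U+3002 IDEOGRAPHIC FULL STOP, ASCII comma,
        // U+773C (a CJK ideograph from the report's sample), and the digit 8.
        int[] samples = {0xFF0C, 0x3002, ',', 0x773C, '8'};
        for (int cp : samples) {
            System.out.printf("U+%04X punctuation=%b%n", cp, isPunctuation(cp));
        }
    }
}
```

With this check, U+FF0C and U+3002 are reported as punctuation (both are in category Po), while the ideograph and the digit are not. Symbol categories (Sm, Sc, Sk, So) are deliberately left out of the sketch; a real fix would need to decide whether those should count as delimiters too.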
> [smartcn] Some chinese text is not tokenized correctly with Chinese punctuation marks appended
> ----------------------------------------------------------------------------------------------
>
> Key: LUCENE-7509
> URL: https://issues.apache.org/jira/browse/LUCENE-7509
> Project: Lucene - Core
> Issue Type: Bug
> Components: modules/analysis
> Affects Versions: 6.2.1
> Environment: Mac OS X 10.10
> Reporter: peina
> Labels: chinese, tokenization
>
> Some Chinese text is not tokenized correctly when a Chinese punctuation mark is appended.
> e.g.
> 碧绿的眼珠 is tokenized as 碧绿|的|眼珠, which is correct.
> But
> 碧绿的眼珠,(with a Chinese punctuation mark appended) is tokenized as 碧绿|的|眼|珠,
> A similar problem occurs when numbers are appended to the text.
> e.g.
> 生活报8月4号 -->生活|报|8|月|4|号
> 生活报-->生活报
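> Test Sample:
> import java.io.IOException;
> import org.apache.lucene.analysis.Analyzer;
> import org.apache.lucene.analysis.TokenStream;
> import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
> import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
>
> // Class name is illustrative; the original report gave only the two methods.
> public class SmartcnTokenizationSample {
>   public static void main(String[] args) throws IOException {
>     Analyzer analyzer = new SmartChineseAnalyzer(); /* will load stopwords */
>     System.out.println("Sample1=======");
>     String sentence = "生活报8月4号";
>     printTokens(analyzer, sentence);
>     sentence = "生活报";
>     printTokens(analyzer, sentence);
>     System.out.println("Sample2=======");
>
>     sentence = "碧绿的眼珠,";
>     printTokens(analyzer, sentence);
>     sentence = "碧绿的眼珠";
>     printTokens(analyzer, sentence);
>
>     analyzer.close();
>   }
>
>   // Prints each token the analyzer produces for the sentence, one per line.
>   private static void printTokens(Analyzer analyzer, String sentence) throws IOException {
>     System.out.println("sentence:" + sentence);
>     try (TokenStream tokens = analyzer.tokenStream("dummyfield", sentence)) {
>       CharTermAttribute termAttr = tokens.addAttribute(CharTermAttribute.class);
>       tokens.reset();
>       while (tokens.incrementToken()) {
>         System.out.println(termAttr.toString());
>       }
>       tokens.end(); // complete the TokenStream contract before closing
>     }
>   }
> }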
> Output:
> Sample1=======
> sentence:生活报8月4号
> 生活
> 报
> 8
> 月
> 4
> 号
> sentence:生活报
> 生活报
> Sample2=======
> sentence:碧绿的眼珠,
> 碧绿
> 的
> 眼
> 珠
> sentence:碧绿的眼珠
> 碧绿
> 的
> 眼珠