You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucene.apache.org by "Michael McCandless (JIRA)" <ji...@apache.org> on 2016/10/21 15:09:58 UTC
[jira] [Commented] (LUCENE-7509) [smartcn] Some chinese text is not tokenized correctly with Chinese punctuation marks appended

    [ https://issues.apache.org/jira/browse/LUCENE-7509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15595364#comment-15595364 ] 

Michael McCandless commented on LUCENE-7509:
--------------------------------------------

Hi [~peina], could you please turn your test fragments into a test that fails?  See e.g. https://wiki.apache.org/lucene-java/HowToContribute

Do you know how to fix this?  Is there a Unicode API we should be using to more generally check for punctuation, so that Chinese punctuation is included?

> [smartcn] Some chinese text is not tokenized correctly with Chinese punctuation marks appended
> ----------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-7509
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7509
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>    Affects Versions: 6.2.1
>         Environment: Mac OS X 10.10
>            Reporter: peina
>              Labels: chinese, tokenization
>
> Some chinese text is not tokenized correctly with Chinese punctuation marks appended.
> e.g.
> 碧绿的眼珠 is tokenized as 碧绿|的|眼珠. Which is correct.
> But 
> 碧绿的眼珠，（with a Chinese punctuation appended )is tokenized as 碧绿|的|眼|珠，
> The similar case happens when text with numbers appended.
> e.g.
> 生活报8月4号 -->生活|报|8|月|4|号
> 生活报-->生活报
> Test Sample:
> public static void main(String[] args) throws IOException{
>     Analyzer analyzer = new SmartChineseAnalyzer(); /* will load stopwords */
>     System.out.println("Sample1=======");
>     String sentence = "生活报8月4号";
>     printTokens(analyzer, sentence);
>     sentence = "生活报";
>     printTokens(analyzer, sentence);
>     System.out.println("Sample2=======");
>     
>     sentence = "碧绿的眼珠，";
>     printTokens(analyzer, sentence);
>     sentence = "碧绿的眼珠";
>     printTokens(analyzer, sentence);
>     
>     analyzer.close();
>   }
>   private static void printTokens(Analyzer analyzer, String sentence) throws IOException{
>     System.out.println("sentence:" + sentence);
>     TokenStream tokens = analyzer.tokenStream("dummyfield", sentence);
>     tokens.reset();
>     CharTermAttribute termAttr = (CharTermAttribute) tokens.getAttribute(CharTermAttribute.class);
>     while (tokens.incrementToken()) {
>       System.out.println(termAttr.toString());
>     }
>     tokens.close();
>   }
> Output:
> Sample1=======
> sentence:生活报8月4号
> 生活
> 报
> 8
> 月
> 4
> 号
> sentence:生活报
> 生活报
> Sample2=======
> sentence:碧绿的眼珠，
> 碧绿
> 的
> 眼
> 珠
> sentence:碧绿的眼珠
> 碧绿
> 的
> 眼珠



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org