Posted to dev@lucene.apache.org by "Chang KaiShin (JIRA)" <ji...@apache.org> on 2016/12/02 07:56:58 UTC

[jira] [Comment Edited] (LUCENE-7509) [smartcn] Some chinese text is not tokenized correctly with Chinese punctuation marks appended

    [ https://issues.apache.org/jira/browse/LUCENE-7509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15714052#comment-15714052 ] 

Chang KaiShin edited comment on LUCENE-7509 at 12/2/16 7:56 AM:
----------------------------------------------------------------

This is not a bug. The underlying Viterbi algorithm that segments Chinese sentences is based on the occurrence probabilities of Chinese characters. Take the sentence "生活报8月4号" as an example. The character "报" has two meanings: at the end of a word it means "newspaper" (a daily paper), but in conjunction with other characters it means "to report" something. So in "生活报8月4号" the algorithm segments "报" as an independent word meaning "to report", whereas "生活报" on its own is judged more likely to mean "daily newspaper". You need to add such words to the dictionary for the algorithm to learn from, so that you get the result you want.

The same reasoning applies to the case "碧绿的眼珠,". It was segmented into 碧绿|的|眼|珠, and since the punctuation mark "," is a stopword, the result is 碧绿|的|眼|珠. I suggest putting the word "眼珠" into the dictionary; that should solve the problem.
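The dictionary-probability argument above can be sketched with a toy Viterbi-style segmenter. This is a minimal illustration, not the real smartcn model: the dictionary words, log-probabilities, and the unknown-character penalty below are all invented for demonstration.

```java
import java.util.*;

// Toy dictionary-based Viterbi segmenter: picks the segmentation whose
// words have the highest total log-probability. Unknown single characters
// are allowed but heavily penalized, mimicking why a missing dictionary
// entry (e.g. 眼珠) falls apart into single characters.
public class ToyViterbiSegmenter {
    private static final double UNKNOWN_CHAR_COST = -10.0; // invented penalty
    private final Map<String, Double> dict;

    public ToyViterbiSegmenter(Map<String, Double> dict) {
        this.dict = dict;
    }

    public List<String> segment(String sentence) {
        int n = sentence.length();
        double[] best = new double[n + 1]; // best[i] = best score for prefix of length i
        int[] back = new int[n + 1];       // back[i] = start of the last word on the best path
        Arrays.fill(best, Double.NEGATIVE_INFINITY);
        best[0] = 0.0;
        for (int end = 1; end <= n; end++) {
            for (int start = 0; start < end; start++) {
                String word = sentence.substring(start, end);
                Double logProb = dict.get(word);
                if (logProb == null) {
                    if (end - start > 1) continue;   // unknown multi-char spans not allowed
                    logProb = UNKNOWN_CHAR_COST;     // unknown single character
                }
                double score = best[start] + logProb;
                if (score > best[end]) {
                    best[end] = score;
                    back[end] = start;
                }
            }
        }
        // Walk the back-pointers to recover the best segmentation.
        LinkedList<String> result = new LinkedList<>();
        for (int i = n; i > 0; i = back[i]) {
            result.addFirst(sentence.substring(back[i], i));
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Double> dict = new HashMap<>();
        dict.put("碧绿", -2.0);
        dict.put("的", -1.0);
        dict.put("眼", -5.0);
        dict.put("珠", -5.0);
        ToyViterbiSegmenter seg = new ToyViterbiSegmenter(dict);
        System.out.println(seg.segment("碧绿的眼珠")); // [碧绿, 的, 眼, 珠]
        dict.put("眼珠", -3.0); // adding the word makes the joint reading score higher
        System.out.println(seg.segment("碧绿的眼珠")); // [碧绿, 的, 眼珠]
    }
}
```

With "眼珠" absent, the best path is 碧绿(-2)+的(-1)+眼(-5)+珠(-5) = -13; after adding 眼珠 at -3, the path 碧绿+的+眼珠 scores -6 and wins, which is the same mechanism by which a dictionary entry fixes the reported segmentation.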



> [smartcn] Some chinese text is not tokenized correctly with Chinese punctuation marks appended
> ----------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-7509
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7509
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>    Affects Versions: 6.2.1
>         Environment: Mac OS X 10.10
>            Reporter: peina
>              Labels: chinese, tokenization
>
> Some Chinese text is not tokenized correctly when Chinese punctuation marks are appended.
> e.g.
> 碧绿的眼珠 is tokenized as 碧绿|的|眼珠, which is correct.
> But
> 碧绿的眼珠, (with a Chinese punctuation mark appended) is tokenized as 碧绿|的|眼|珠,
> A similar case happens when text has numbers appended.
> e.g.
> 生活报8月4号 -->生活|报|8|月|4|号
> 生活报-->生活报
> Test Sample:
> import java.io.IOException;
> import org.apache.lucene.analysis.Analyzer;
> import org.apache.lucene.analysis.TokenStream;
> import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
> import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
> public class SmartcnTokenizeTest {
>   public static void main(String[] args) throws IOException {
>     Analyzer analyzer = new SmartChineseAnalyzer(); /* will load stopwords */
>     System.out.println("Sample1=======");
>     String sentence = "生活报8月4号";
>     printTokens(analyzer, sentence);
>     sentence = "生活报";
>     printTokens(analyzer, sentence);
>     System.out.println("Sample2=======");
>     sentence = "碧绿的眼珠,";
>     printTokens(analyzer, sentence);
>     sentence = "碧绿的眼珠";
>     printTokens(analyzer, sentence);
>     analyzer.close();
>   }
>   private static void printTokens(Analyzer analyzer, String sentence) throws IOException {
>     System.out.println("sentence:" + sentence);
>     TokenStream tokens = analyzer.tokenStream("dummyfield", sentence);
>     CharTermAttribute termAttr = tokens.addAttribute(CharTermAttribute.class);
>     tokens.reset();
>     while (tokens.incrementToken()) {
>       System.out.println(termAttr.toString());
>     }
>     tokens.end();
>     tokens.close();
>   }
> }
> Output:
> Sample1=======
> sentence:生活报8月4号
> 生活
> 报
> 8
> 月
> 4
> 号
> sentence:生活报
> 生活报
> Sample2=======
> sentence:碧绿的眼珠,
> 碧绿
> 的
> 眼
> 珠
> sentence:碧绿的眼珠
> 碧绿
> 的
> 眼珠



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
