Posted to dev@lucene.apache.org by "Jan Høydahl (JIRA)" <ji...@apache.org> on 2015/08/14 08:39:46 UTC
[jira] [Closed] (LUCENE-6736) SmartChineseAnalyzer chops English tokens in a chinese-english mixed sentence.
[ https://issues.apache.org/jira/browse/LUCENE-6736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jan Høydahl closed LUCENE-6736.
-------------------------------
Resolution: Invalid
> SmartChineseAnalyzer chops English tokens in a chinese-english mixed sentence.
> ------------------------------------------------------------------------------
>
> Key: LUCENE-6736
> URL: https://issues.apache.org/jira/browse/LUCENE-6736
> Project: Lucene - Core
> Issue Type: Bug
> Components: modules/analysis
> Affects Versions: 5.1
> Environment: linux Java 1.7
> Reporter: Wayne Xin
> Labels: chinese, tokenization
>
> I am new to Lucene analyzers. The test code below tokenizes the sentence held in "testStr":
> String testStr = "女单方面,王适娴second seed和头号种子卫冕冠军西班牙选手马林first seed同处1/4区,3号种子李雪芮和韩国选手Korean player成池铉处在2/4区,不过成池铉先要过日本小将(Japanese player)奥原希望这关。下半区,6号种子王仪涵若想晋级决赛secure position. congratulations.";
> The printed tokenized result is:
> 女 单 方面 王 适 娴 second seed 和 头号 种子 卫冕 冠军 西班牙 选手 马 林 first seed 同 处 1 4 区 3 号 种子 李 雪 芮 和 韩国 选手 korean player 成 池 铉 处在 2 4 区 不过 成 池 铉 先 要 过 日本 小将 japanes player 奥 原 希望 这 关 下 半 区 6 号 种子 王 仪 涵 若 想 晋级 决赛 secur posit congratul
> As you can see, some longer English tokens such as "Japanese", "position" and "congratulations" are cut short during tokenization (they come out as "japanes", "posit" and "congratul"). I hope I am not using the analyzer incorrectly.
> Test code:
> import java.io.IOException;
> import java.util.ArrayList;
> import java.util.List;
> import org.apache.lucene.analysis.Analyzer;
> import org.apache.lucene.analysis.TokenStream;
> import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
> import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
>
> private static void testChineseTokenizer() {
>     String testStr = "女单方面,王适娴second seed和头号种子卫冕冠军西班牙选手马林first seed同处1/4区,3号种子李雪芮和韩国选手Korean player成池铉处在2/4区,不过成池铉先要过日本小将(Japanese player)奥原希望这关。下半区,6号种子王仪涵若想晋级决赛secure position. congratulations.";
>     List<String> result = new ArrayList<String>();
>     // try-with-resources closes the stream and the analyzer even if tokenization fails.
>     // "text" is an arbitrary field name; the original passed null here.
>     try (Analyzer analyzer = new SmartChineseAnalyzer();
>          TokenStream stream = analyzer.tokenStream("text", testStr)) {
>         CharTermAttribute cattr = stream.addAttribute(CharTermAttribute.class);
>         stream.reset();
>         while (stream.incrementToken()) {
>             result.add(cattr.toString());
>         }
>         stream.end();
>     } catch (IOException e) {
>         // Not expected for in-memory input, but do not swallow it silently.
>         throw new RuntimeException(e);
>     }
>     for (String tok : result) {
>         System.out.print(" " + tok);
>     }
>     System.out.println();
> }
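
To make the report's observation concrete, here is a minimal sketch that lists which English tokens in the printed result no longer match a word of the input. It uses only the JDK; the class AlteredTokenCheck and the method alteredTokens are hypothetical names introduced for illustration and are not part of Lucene. The input is reduced to the English fragments of the sentence, and the token list is copied from the printed result above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of Lucene): finds ASCII tokens that differ from the input words. */
class AlteredTokenCheck {

    private static final Pattern ASCII_WORD = Pattern.compile("[A-Za-z]+");

    /** Returns the alphabetic ASCII tokens that are not whole words of the lowercased input. */
    static List<String> alteredTokens(String input, List<String> tokens) {
        // Collect every run of ASCII letters in the input, lowercased.
        Set<String> words = new HashSet<String>();
        Matcher m = ASCII_WORD.matcher(input);
        while (m.find()) {
            words.add(m.group().toLowerCase(Locale.ROOT));
        }
        // Keep the tokens that are ASCII words but do not appear among the input words.
        List<String> altered = new ArrayList<String>();
        for (String token : tokens) {
            if (ASCII_WORD.matcher(token).matches() && !words.contains(token)) {
                altered.add(token);
            }
        }
        return altered;
    }

    public static void main(String[] args) {
        // English fragments of the sentence from the report (Chinese text omitted for brevity).
        String input = "second seed first seed Korean player (Japanese player) "
                + "secure position. congratulations.";
        // English tokens copied from the printed result in the report.
        List<String> tokens = Arrays.asList("second", "seed", "first", "seed", "korean",
                "player", "japanes", "player", "secur", "posit", "congratul");
        System.out.println(alteredTokens(input, tokens)); // prints [japanes, secur, posit, congratul]
    }
}
```

The four altered tokens look like English stems rather than tokens truncated by the segmenter. To my knowledge SmartChineseAnalyzer applies a PorterStemFilter after segmentation, which would be consistent with the "Invalid" resolution, although the closing message does not state the reason. If unstemmed English tokens are needed, one option might be a custom Analyzer built on HMMChineseTokenizer without the stemming filter; that is an untested suggestion, not something confirmed in this issue.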