Posted to java-user@lucene.apache.org by Wayne Xin <wa...@hotmail.com> on 2015/08/14 17:20:00 UTC

getting full english word from tokenizing with SmartChineseAnalyzer

Hi,



I am new to Lucene analyzers. I would like to get full English tokens
from SmartChineseAnalyzer, but I'm only getting stems. The following code
defines the test sentence in "testStr":
String testStr = "女单方面，王适娴second seed和头号种子卫冕冠军西班牙选手马"
    + "林first seed同处1/4区，3号种子李雪芮和韩国选手Korean player成池铉处在2/4区，不"
    + "过成池铉先要过日本小将(Japanese player)奥原希望这关。下半区，6号种子王仪涵若想"
    + "晋级决赛secure position. congratulations.";

The printed tokenized result is:

女 单 方面 王 适 娴 second seed 和 头号 种子 卫冕 冠军 西班牙 选手 马 林
first seed 同 处 1 4 区 3 号 种子 李 雪 芮 和 韩国 选手 korean player 成 池
铉 处在 2 4 区 不过 成 池 铉 先 要 过 日本 小将 japanes player 奥 原 希望 这
关 下 半 区 6 号 种子 王 仪 涵 若 想 晋级 决赛 secur posit congratul

As you can see, some longer English tokens such as "Japanese", "position" and
"congratulations" are cut short (japanes, posit, congratul) during
tokenization. I hope I am not using the analyzer incorrectly.

Test code:

import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

private static void testChineseTokenizer() {
    String testStr = "女单方面，王适娴second seed和头号种子卫冕冠军西班牙选手马"
        + "林first seed同处1/4区，3号种子李雪芮和韩国选手Korean player成池铉处在2/4区，不"
        + "过成池铉先要过日本小将(Japanese player)奥原希望这关。下半区，6号种子王仪涵若想"
        + "晋级决赛secure position. congratulations.";
    Analyzer analyzer = new SmartChineseAnalyzer();
    List<String> result = new ArrayList<String>();
    StringReader sr = new StringReader(testStr);

    try {
        TokenStream stream = analyzer.tokenStream(null, sr);
        CharTermAttribute cattr = stream.addAttribute(CharTermAttribute.class);
        stream.reset();
        // Collect the text of every token the analyzer emits.
        while (stream.incrementToken()) {
            String token = cattr.toString();
            result.add(token);
        }

        stream.end();
        stream.close();
        sr.close();
        analyzer.close();
        for (String tok : result) {
            System.out.print(" " + tok);
        }

        System.out.println();
    } catch (IOException e) {
        // not thrown b/c we're using a string reader...
    }
}

Re: getting full english word from tokenizing with SmartChineseAnalyzer

Posted by Wayne Xin <wa...@hotmail.com>.

Thanks Uwe. This seems to be a handy tool. What I still need is a better
example (maybe a tutorial) showing which filters are necessary, or used by
default, in SmartChineseAnalyzer or JapaneseAnalyzer. In this case, I guess I
need an HMMChineseTokenizer and a stop filter, but not a Porter stem filter.
I can give it a try later, but a tutorial would be nice. Thanks for the
suggestion, though.

-Wayne

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

RE: getting full english word from tokenizing with SmartChineseAnalyzer

Posted by Uwe Schindler <uw...@thetaphi.de>.

Hi,

Since Lucene 5.0 it has been much easier to create your own analyzers (without defining your own classes):
https://lucene.apache.org/core/5_2_1/analyzers-common/org/apache/lucene/analysis/custom/CustomAnalyzer.html
Using the builder, you can create your own analyzer with just a few lines of code. The names and parameters used are those of the factories known from Apache Solr.
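
For the case discussed in this thread, a minimal sketch with the builder might look like the following. It assumes Lucene 5.0 or later with both the analyzers-common and analyzers-smartcn modules on the classpath, that the names "hmmChinese" and "stop" resolve to HMMChineseTokenizerFactory and StopFilterFactory, and that the stop word list bundled with the smartcn module is reachable at the classpath path shown. The class name, the field name "body" and the helper method are made up for illustration. Leaving the Porter stem filter out of the chain is what should keep the English words whole.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.custom.CustomAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Hypothetical class name; a sketch, not the SmartChineseAnalyzer source.
final class NoStemChineseAnalyzerSketch {

    // Chain: HMM Chinese tokenizer -> stop filter, with no Porter stemmer.
    static Analyzer build() throws IOException {
        return CustomAnalyzer.builder()
            // Assumption: SPI name of HMMChineseTokenizerFactory (smartcn module).
            .withTokenizer("hmmChinese")
            // Assumption: stop word list shipped inside the smartcn jar;
            // it is intended to drop punctuation tokens.
            .addTokenFilter("stop",
                "words", "org/apache/lucene/analysis/cn/smart/stopwords.txt")
            .build();
    }

    // Runs the analyzer over the text and collects the emitted terms.
    static List<String> tokens(Analyzer analyzer, String text) throws IOException {
        List<String> out = new ArrayList<>();
        try (TokenStream stream = analyzer.tokenStream("body", new StringReader(text))) {
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            stream.reset();
            while (stream.incrementToken()) {
                out.add(term.toString());
            }
            stream.end();
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        try (Analyzer analyzer = build()) {
            for (String tok : tokens(analyzer, "日本小将(Japanese player) secure position.")) {
                System.out.print(" " + tok);
            }
            System.out.println();
        }
    }
}
```

If stemming is wanted again later, adding .addTokenFilter("porterStem") to the chain should restore it, assuming that factory name is available in the Lucene version in use.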

Analyzers are final by design.

Uwe
-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de


Re: getting full english word from tokenizing with SmartChineseAnalyzer

Posted by Wayne Xin <wa...@hotmail.com>.

Thanks Michael. That works well. I'm not sure why SmartChineseAnalyzer is
final; otherwise we could simply override createComponents().

New output:

女 单 方面 王 适 娴 second seed 和 头号 种子 卫冕 冠军 西班牙 选手 马 林
first seed 同 处 1 4 区 3 号 种子 李 雪 芮 和 韩国 选手 korean player 成 池
铉 处在 2 4 区 不过 成 池 铉 先 要 过 日本 小将 japanese player 奥 原 希望 这
关 下 半 区 6 号 种子 王 仪 涵 若 想 晋级 决赛 secure position congratulations

-Wayne







Re: getting full english word from tokenizing with SmartChineseAnalyzer

Posted by Michael Mastroianni <mm...@placester.com>.

The easiest thing to do is to create your own analyzer: copy and paste the
code from org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer into it,
and remove the line in createComponents(String fieldName, Reader reader)
that says

    result = new PorterStemFilter(result);

