Posted to java-user@lucene.apache.org by Thushara Wijeratna <th...@gmail.com> on 2012/03/02 22:41:46 UTC

lucene gosen diff btn jars

I'm testing lucene-gosen for Japanese tokenization and am wondering what the
differences are between the two jars provided (ipadic and naist-chasen).
In my preliminary testing, I'm not seeing any difference in tokenization between
these two jars. (The jar with no dictionary did not work; I assume I would need
to supply a custom dictionary, header.sen, which I did not try.)

I tried to tokenize this phrase:

ゴルフが大好きなあなた。
アメリカにあるベスト・ゴルフコース情報が満載のイエローページ・ジャパンでは、オンラインまたはガイド・ブックからもあらゆる情報が簡単に入手できます。
詳しい情報は


which Google Translate renders as


You love golf. Best golf course information in the United States is in the
Yellow Pages Japan is full of, any information can be obtained easily from
online or book guide. For more information


I'm getting identical tokenization from both jars, namely:

ゴルフ / Golf
大好き / I love
あなた / You
アメリカ / America
ベスト / best
ゴルフコース / Golf course
情報 / information
満載 / save
イエロ / Hierro
ページ / page
ジャパン / Japan
オンライン / online
ガイド / guide
ブック / book
あらゆる / all
情報 / information
簡単 / simple
入手 / obtaining
できる / able to
詳しい / detailed
情報 / information


Note: translations based on Google Translate
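
For anyone repeating this comparison: rather than eyeballing the two outputs, a position-by-position check makes any mismatch obvious. Below is a minimal sketch in plain Java, assuming the tokens from each jar have already been collected into lists of strings. The class name TokenDiff and the method diff are hypothetical helpers for illustration only, not part of lucene-gosen or Lucene, and the sample lists just repeat the first three tokens shown above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of lucene-gosen: compares two token lists.
public class TokenDiff {

    // Returns one entry per position where the two token lists disagree.
    // A missing token (lists of different length) is reported as "<none>".
    public static List<String> diff(List<String> a, List<String> b) {
        List<String> mismatches = new ArrayList<>();
        int n = Math.max(a.size(), b.size());
        for (int i = 0; i < n; i++) {
            String x = i < a.size() ? a.get(i) : "<none>";
            String y = i < b.size() ? b.get(i) : "<none>";
            if (!x.equals(y)) {
                mismatches.add(i + ": " + x + " != " + y);
            }
        }
        return mismatches;
    }

    public static void main(String[] args) {
        // First three tokens from the post, written as Unicode escapes so the
        // source compiles regardless of the platform's default encoding:
        // "gorufu" (golf), "daisuki" (love), "anata" (you).
        List<String> fromIpadicJar = Arrays.asList(
                "\u30B4\u30EB\u30D5", "\u5927\u597D\u304D", "\u3042\u306A\u305F");
        List<String> fromChasenJar = Arrays.asList(
                "\u30B4\u30EB\u30D5", "\u5927\u597D\u304D", "\u3042\u306A\u305F");

        List<String> mismatches = diff(fromIpadicJar, fromChasenJar);
        System.out.println(mismatches.isEmpty()
                ? "identical"
                : String.join("\n", mismatches));
    }
}
```

An empty result means the two lists match token for token. Note that this only compares the token text; differences in other token attributes (part of speech, for example) would need to be compared separately.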


Any pointers on the difference between the two methods of tokenizing would
be highly appreciated.


thx,

thushara

Re: lucene gosen diff btn jars

Posted by Koji Sekiguchi <ko...@r.email.ne.jp>.
Hi Thushara,

Please use the lucene-gosen mailing list for lucene-gosen questions:

http://groups.google.com/group/lucene-gosen

Thanks,

koji
-- 
Query Log Visualizer for Apache Solr
http://soleami.com/



