Posted to java-user@lucene.apache.org by Thushara Wijeratna <th...@gmail.com> on 2012/03/02 22:41:46 UTC
lucene gosen diff btn jars
I'm testing lucene-gosen for Japanese tokenization and wondering what the
differences are between the two jars provided (ipadic / chasen).
In my preliminary testing, I'm not seeing any difference in tokenization
between these two jars. (The jar with no dictionary did not work; I assume I
need to make a custom dictionary available, header.sen, which I did not try.)
I tried to tokenize this phrase:
ゴルフが大好きなあなた。
アメリカにあるベスト・ゴルフコース情報が満載のイエローページ・ジャパンでは、オンラインまたはガイド・ブックからもあらゆる情報が簡単に入手できます。
詳しい情報は
which google translates as
You love golf. Best golf course information in the United States is in the
Yellow Pages Japan is full of, any information can be obtained easily from
online or book guide. For more information
I'm getting identical tokenization from both jars, namely:
ゴルフ / Golf
大好き / I love
あなた / You
アメリカ / America
ベスト / best
ゴルフコース / Golf course
情報 / information
満載 / save
イエロ / Hierro
ページ / page
ジャパン / Japan
オンライン / online
ガイド / guide
ブック / book
あらゆる / all
情報 / information
簡単 / simple
入手 / obtaining
できる / able to
詳しい / detailed
情報 / information
Note: translations based on Google Translate
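For anyone repeating this comparison, here is a minimal sketch of checking two token lists position by position. It is plain Java and does not call lucene-gosen at all; the class name TokenDiff and the method diff are hypothetical, and the two lists are just the first three tokens from the output above, typed in by hand rather than produced by either jar.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TokenDiff {
    /** Returns one entry per position where the two token lists disagree. */
    static List<String> diff(List<String> a, List<String> b) {
        List<String> out = new ArrayList<>();
        int n = Math.max(a.size(), b.size());
        for (int i = 0; i < n; i++) {
            // A shorter list is padded with a placeholder so length
            // mismatches are reported too.
            String x = i < a.size() ? a.get(i) : "<none>";
            String y = i < b.size() ? b.get(i) : "<none>";
            if (!x.equals(y)) {
                out.add(i + ": " + x + " != " + y);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Hand-copied sample tokens (assumption: stand-ins for real
        // output collected from each jar).
        List<String> ipadic = Arrays.asList("ゴルフ", "大好き", "あなた");
        List<String> chasen = Arrays.asList("ゴルフ", "大好き", "あなた");
        List<String> d = diff(ipadic, chasen);
        System.out.println(d.isEmpty() ? "identical" : String.join("\n", d));
    }
}
```

The comparison is deliberately naive: if one dictionary splits a word that the other keeps whole, every later position is reported as different too, so it only answers whether the two outputs match, not where they first diverge in a meaningful way.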
Any pointers you can provide as to the difference between the two methods of
tokenizing would be highly appreciated.
thx,
thushara
Re: lucene gosen diff btn jars
Posted by Koji Sekiguchi <ko...@r.email.ne.jp>.
Hi Thushara,
Please use the lucene-gosen mailing list for lucene-gosen questions:
http://groups.google.com/group/lucene-gosen
Thanks,
koji
--
Query Log Visualizer for Apache Solr
http://soleami.com/