Posted to java-user@lucene.apache.org by "Shaw, James" <Ja...@intuit.com> on 2007/07/24 22:01:17 UTC

Lucene and Eastern languages (Japanese, Korean and Chinese)

Hi, guys,
I found Analyzers for Japanese, Korean and Chinese, but not stemmers;
the Snowball stemmers only include European languages.  Does stemming
not make sense for ideograph-based languages (i.e., no stemming is
needed for Japanese, Korean and Chinese)?

Also for spell checking, does the default Lucene SpellChecker work for
Japanese, Korean and Chinese?  Does edit distance make sense for these
languages?

What other gotchas can you guys think of when making Lucene work with
foreign languages, besides analyzer, stemmer and spell checking?  Thanks
in advance for your help.

Re: Lucene and Eastern languages (Japanese, Korean and Chinese)

Posted by Maximilian Hütter <mh...@blue-elephant-systems.com>.
Mathieu Lecarme wrote:
> On Tuesday, 24 July 2007 at 13:01 -0700, Shaw, James wrote:
>> Hi, guys,
>> I found Analyzers for Japanese, Korean and Chinese, but not stemmers;
>> the Snowball stemmers only include European languages.  Does stemming
>> not make sense for ideograph-based languages (i.e., no stemming is
>> needed for Japanese, Korean and Chinese)?
> No.

This is not quite correct: Chinese doesn't need any stemming, but
Japanese is not completely ideograph-based and could use stemming. I
doubt anyone has done this, apart from some commercial software for the
Japanese market. I don't know about Korean.
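
To make the point concrete: a Japanese verb such as 食べる (taberu, "to
eat") keeps its kanji-plus-kana stem 食べ and changes only the trailing
hiragana (食べた, 食べます, 食べない). Below is a minimal sketch of a
suffix stripper for exactly that one pattern. The class name, the method
and the four-entry suffix list are all made up for illustration; this is
not a Lucene API, and a real solution would need proper morphological
analysis, since a list like this would wrongly cut many other words.

```java
final class ToyJapaneseStemmer {

    // Hypothetical, deliberately tiny suffix list (longest entries first).
    // Written as Unicode escapes so the source file stays ASCII-only.
    private static final String[] SUFFIXES = {
        "\u307E\u3059", // -masu (polite form)
        "\u306A\u3044", // -nai  (negative form)
        "\u308B",       // -ru   (dictionary form)
        "\u305F"        // -ta   (past form)
    };

    /** Strips the first matching suffix, but never the whole word. */
    static String stem(String word) {
        for (String suffix : SUFFIXES) {
            if (word.length() > suffix.length() && word.endsWith(suffix)) {
                return word.substring(0, word.length() - suffix.length());
            }
        }
        return word; // no known suffix: leave the word untouched
    }

    public static void main(String[] args) {
        String[] forms = {
            "\u98DF\u3079\u308B",       // taberu   (dictionary form)
            "\u98DF\u3079\u305F",       // tabeta   (past)
            "\u98DF\u3079\u307E\u3059", // tabemasu (polite)
            "\u98DF\u3079\u306A\u3044"  // tabenai  (negative)
        };
        for (String form : forms) {
            // Every form should reduce to the same two-character stem.
            System.out.println(form + " -> " + stem(form));
        }
    }
}
```

All four forms reduce to the same stem, which is the behaviour a stemmer
would give you at index and query time. Chinese words do not inflect
this way, which fits the remark above that Chinese needs no stemming.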

>> Also for spell checking, does the default Lucene SpellChecker work for
>> Japanese, Korean and Chinese?  Does edit distance make sense for these
>> languages?
> Japanese uses groups of ideograms, but Levenshtein distance doesn't make
> sense with so few letters. I'm not a CJK expert, though.
> 
> M.

Edit distance only seems to work for languages written in Latin-based
scripts. Spell checking Chinese, Japanese (and Korean?) is more or
less pointless, as the text is entered through input methods, which
should produce "correct" words.

Best regards,

Max


-- 
Maximilian Hütter
blue elephant systems GmbH
Wollgrasweg 49
D-70599 Stuttgart

Tel            :  (+49) 0711 - 45 10 17 578
Fax            :  (+49) 0711 - 45 10 17 573
e-mail         :  max.huetter@blue-elephant-systems.com
Sitz           :  Stuttgart, Amtsgericht Stuttgart, HRB 24106
Geschäftsführer:  Joachim Hörnle, Thomas Gentsch, Holger Dietrich

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Lucene and Eastern languages (Japanese, Korean and Chinese)

Posted by Mathieu Lecarme <ma...@garambrogne.net>.
On Tuesday, 24 July 2007 at 13:01 -0700, Shaw, James wrote:
> Hi, guys,
> I found Analyzers for Japanese, Korean and Chinese, but not stemmers;
> the Snowball stemmers only include European languages.  Does stemming
> not make sense for ideograph-based languages (i.e., no stemming is
> needed for Japanese, Korean and Chinese)?
No.

> Also for spell checking, does the default Lucene SpellChecker work for
> Japanese, Korean and Chinese?  Does edit distance make sense for these
> languages?
Japanese uses groups of ideograms, but Levenshtein distance doesn't make
sense with so few letters. I'm not a CJK expert, though.
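
A rough way to see the "few letters" problem is to compute the distance
yourself. The sketch below is a plain textbook Levenshtein distance over
Unicode code points plus a simple length-normalised similarity; it is
not Lucene's SpellChecker code, and the class and method names are
invented for the example. It compares "lucene"/"lucent" with 東京
(Tokyo) / 東北 (Tohoku), two unrelated real words that differ in one
character.

```java
final class CjkEditDistanceDemo {

    /** Levenshtein distance counted in Unicode code points, not UTF-16 chars. */
    static int distance(String a, String b) {
        int[] s = a.codePoints().toArray();
        int[] t = b.codePoints().toArray();
        int[] prev = new int[t.length + 1];
        int[] curr = new int[t.length + 1];
        for (int j = 0; j <= t.length; j++) {
            prev[j] = j; // building t[0..j) from "" takes j insertions
        }
        for (int i = 1; i <= s.length; i++) {
            curr[0] = i; // reducing s[0..i) to "" takes i deletions
            for (int j = 1; j <= t.length; j++) {
                int cost = (s[i - 1] == t[j - 1]) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1], prev[j]) + 1;
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] tmp = prev; // reuse the two rows instead of allocating
            prev = curr;
            curr = tmp;
        }
        return prev[t.length];
    }

    /** Similarity in [0, 1]: 1 minus distance divided by the longer length. */
    static double similarity(String a, String b) {
        int longer = Math.max(a.codePointCount(0, a.length()),
                              b.codePointCount(0, b.length()));
        return longer == 0 ? 1.0 : 1.0 - (double) distance(a, b) / longer;
    }

    public static void main(String[] args) {
        String tokyo = "\u6771\u4EAC";  // Tokyo, two kanji
        String tohoku = "\u6771\u5317"; // Tohoku, two kanji
        System.out.println(similarity("lucene", "lucent")); // one edit in six letters
        System.out.println(similarity(tokyo, tohoku));      // one edit in two characters
    }
}
```

Both pairs are one edit apart, but the Latin pair scores about 0.83
while the kanji pair scores only 0.5, and the "neighbour" is a different
valid word rather than a typo. With words this short, almost any
threshold either suggests nothing or suggests unrelated words.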

M.

