Posted to java-user@lucene.apache.org by John Wang <jo...@gmail.com> on 2005/10/04 01:46:45 UTC

korean and lucene

Hi:

We are running into problems with searching on Korean documents. We are
using the StandardAnalyzer, and everything works with Chinese and Japanese.
Are there known problems with Korean in Lucene?

Thanks

-John

Re: korean and lucene

Posted by Cheolgoo Kang <ap...@gmail.com>.
On 11/8/05, Cheolgoo Kang <ap...@gmail.com> wrote:
> Hello,
>
> I've created a new JIRA issue for Korean analysis: StandardAnalyzer
> splits one word into several tokens of one character each. Because Korean
> is not a phonogram, one character in Korean

Sorry for the confusion. I meant 'ideographic characters', not 'phonogram'. :)

> has almost no meaning at all. So words in Korean should be preserved,
> unlike Chinese or Japanese.
>
> I've attached a patch to StandardTokenizer that does this; it passes the
> test case TestStandardAnalyzer (also patched to test Korean words).
>
> Thanks!
>
> On 10/27/05, Youngho Cho <yo...@nannet.co.kr> wrote:
> > Hello,
> >
> > OK, I've attached my test code for Korean, which is a slightly modified version of Koji's code.
> >
> > Just put it into the lia.analysis.i18n package in LuceneInAction
> > and run ant.
> >
> > Hopefully this helps someone.
> >
> > -------- build.xml  ---------
> >
> >   <target name="JapaneseDemo" depends="prepare"
> >           description="Examples of Japanese analysis">
> >     <info>
> >
> >       Japanese Test...
> >
> >     </info>
> >
> >     <run-main class="lia.analysis.i18n.JapaneseDemo"/>
> >   </target>
> >
> >   <target name="KoreanDemo" depends="prepare"
> >           description="Examples of Korean analysis">
> >     <info>
> >
> >       Korean Test...
> >
> >     </info>
> >
> >     <run-main class="lia.analysis.i18n.KoreanDemo"/>
> >   </target>
> >
> >
> > Thanks,
> >
> > Youngho
> >
> >
> > ----- Original Message -----
> > From: "Youngho Cho" <yo...@nannet.co.kr>
> > To: <ja...@lucene.apache.org>; "Youngho Cho" <yo...@nannet.co.kr>
> > Sent: Thursday, October 27, 2005 12:47 PM
> > Subject: Re: korean and lucene
> >
> >
> > > Hello all
> > > Please forgive my previous confusing message.
> > >
> > >      [echo] Running lia.analysis.i18n.KoreanDemo...
> > >      [java] [경] [기]  analyzer = org.apache.lucene.analysis.standard.StandardAnalyzer
> > >      [java] phrase = 경기
> > >      [java] query = "경 기"
> > >
> > > I got a good result.
> > >
> > > When I compiled, I had just renamed the old lucene-1.4.3.jar to lucene-1.4.3.jar_bak
> > > and added the new Lucene 1.9 jar, then built the test package.
> > > After I removed lucene-1.4.3.jar_bak from the lib directory completely,
> > > I got the expected result!
> > >
> > > I don't know the reason... (it looks like I slipped up somewhere...)
> > >
> > > Anyway thanks Koji and Cheolgoo
> > > I will further test now...
> > >
> > > Youngho
> > >
> > >
> > >
> > >
> > > ----- Original Message -----
> > > From: "Youngho Cho" <yo...@nannet.co.kr>
> > > To: <ja...@lucene.apache.org>
> > > Sent: Thursday, October 27, 2005 12:28 PM
> > > Subject: Re: korean and lucene
> > >
> > >
> > > > Hello Koji
> > > >
> > > > Here is the test result.
> > > > Japanese is OK!
> > > > Maybe ant clean had some effect.
> > > >
> > > > Anyway, please refer to the results using 1.9:
> > > >
> > > >      [echo] Running lia.analysis.i18n.JapaneseDemo...
> > > >      [java] [ラ] [メ] [ン] [屋]  analyzer = org.apache.lucene.analysis.standard.StandardAnalyzer
> > > >      [java] phrase = ラ?メン屋
> > > >      [java] query = content:ラ?メン屋
> > > >
> > > >     [echo] Running lia.analysis.i18n.KoreanDemo...
> > > >      [java]  analyzer = org.apache.lucene.analysis.standard.StandardAnalyzer
> > > >      [java] phrase = 경
> > > >      [java] query =
> > > >
> > > >      [echo] Running lia.analysis.i18n.JapaneseDemo...
> > > >      [java] [ラ] [メン] [ン屋]  analyzer = org.apache.lucene.analysis.cjk.CJKAnalyzer
> > > >      [java] phrase = ラ?メン屋
> > > >      [java] query = content:ラ?メン屋
> > > >
> > > >     [echo] Running lia.analysis.i18n.KoreanDemo...
> > > >      [java] [경]  analyzer = org.apache.lucene.analysis.cjk.CJKAnalyzer
> > > >      [java] phrase = 경
> > > >      [java] query = 경
> > > >
> > > >      [echo] Running lia.analysis.i18n.KoreanDemo...
> > > >      [java]  analyzer = org.apache.lucene.analysis.standard.StandardAnalyzer
> > > >      [java] phrase = 경기
> > > >      [java] query =
> > > >
> > > >      [echo] Running lia.analysis.i18n.KoreanDemo...
> > > >      [java] [경기]  analyzer = org.apache.lucene.analysis.cjk.CJKAnalyzer
> > > >      [java] phrase = 경기
> > > >      [java] query = 경기
> > > >
> > > >
> > > > StandardAnalyzer didn't tokenize the Korean characters at all....
> > > >
> > > > Ugh.... it looks like
> > > >  http://issues.apache.org/jira/browse/LUCENE-444
> > > >  had no effect at all for Korean.
> > > >
> > > >
> > > > Thanks
> > > >
> > > > Youngho
> > > >
> > > > ----- Original Message -----
> > > > From: "Koji Sekiguchi" <ko...@m4.dion.ne.jp>
> > > > To: <ja...@lucene.apache.org>; "Youngho Cho" <yo...@nannet.co.kr>
> > > > Sent: Thursday, October 27, 2005 11:47 AM
> > > > Subject: RE: korean and lucene
> > > >
> > > >
> > > > > Hello Youngho,
> > > > >
> > > > > I don't understand why you couldn't get hit results in Japanese;
> > > > > in any case, you had better check why the query was empty with the Korean data:
> > > > >
> > > > > > For Korean
> > > > > >      [echo] Running lia.analysis.i18n.KoreanDemo...
> > > > > >      [java] phrase = 경
> > > > > >      [java] query =
> > > > >
> > > > > The last line should be query = 경
> > > > > to get hit results. Can you check why StandardAnalyzer
> > > > > removes "경" during tokenizing?
> > > > >
> > > > > Koji
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: Youngho Cho [mailto:youngho@nannet.co.kr]
> > > > > > Sent: Thursday, October 27, 2005 11:37 AM
> > > > > > To: java-user@lucene.apache.org
> > > > > > Subject: Re: korean and lucene
> > > > > >
> > > > > >
> > > > > > Hello Koji,
> > > > > >
> > > > > > Thanks for your kind reply.
> > > > > >
> > > > > > Yes, I used QueryParser. Normally I use the
> > > > > > Query query = QueryParser.parse( ... ) method.
> > > > > >
> > > > > > I put your sample code into the lia.analysis.i18n package in LuceneInAction
> > > > > > and ran JapaneseDemo using 1.4 and 1.9.
> > > > > >
> > > > > > The results are:
> > > > > >
> > > > > >      [echo] Running lia.analysis.i18n.JapaneseDemo...
> > > > > >      [java] query = content:ラ?メン屋
> > > > > >
> > > > > > I can't get any hit results.
> > > > > >
> > > > > > For Korean
> > > > > >      [echo] Running lia.analysis.i18n.KoreanDemo...
> > > > > >      [java] phrase = 경
> > > > > >      [java] query =
> > > > > >
> > > > > > I can't get a query parse result.
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > Youngho
> > > > > >
> > > > > >
> > > > > >
> > > > > > ----- Original Message -----
> > > > > > From: "Koji Sekiguchi" <ko...@m4.dion.ne.jp>
> > > > > > To: <ja...@lucene.apache.org>; "Youngho Cho" <yo...@nannet.co.kr>
> > > > > > Sent: Thursday, October 27, 2005 9:48 AM
> > > > > > Subject: RE: korean and lucene
> > > > > >
> > > > > >
> > > > > > > Hi Youngho,
> > > > > > >
> > > > > > > With regard to Japanese, using StandardAnalyzer,
> > > > > > > I can search for a word/phrase.
> > > > > > >
> > > > > > > Did you use QueryParser? StandardAnalyzer tokenizes
> > > > > > > CJK characters into a stream of single characters.
> > > > > > > Use QueryParser to get a PhraseQuery and search with that query.
> > > > > > >
> > > > > > > Please see the following sample code. Replace the Japanese
> > > > > > > "contents" and (search target) "phrase" with Korean in the
> > > > > > > program and run it.
> > > > > > >
> > > > > > > regards,
> > > > > > >
> > > > > > > Koji
> > > > > > >
> > > > > > > =============================================
> > > > > > > import java.io.IOException;
> > > > > > > import org.apache.lucene.analysis.Analyzer;
> > > > > > > import org.apache.lucene.analysis.standard.StandardAnalyzer;
> > > > > > > import org.apache.lucene.analysis.cjk.CJKAnalyzer;
> > > > > > > import org.apache.lucene.store.Directory;
> > > > > > > import org.apache.lucene.store.RAMDirectory;
> > > > > > > import org.apache.lucene.index.IndexWriter;
> > > > > > > import org.apache.lucene.document.Document;
> > > > > > > import org.apache.lucene.document.Field;
> > > > > > > import org.apache.lucene.search.IndexSearcher;
> > > > > > > import org.apache.lucene.search.Hits;
> > > > > > > import org.apache.lucene.search.Query;
> > > > > > > import org.apache.lucene.queryParser.QueryParser;
> > > > > > > import org.apache.lucene.queryParser.ParseException;
> > > > > > >
> > > > > > > public class JapaneseByStandardAnalyzer {
> > > > > > >
> > > > > > >     private static final String FIELD_CONTENT = "content";
> > > > > > >     private static final String[] contents = {
> > > > > > > "東京にはおいしいラーメン屋がたくさんあります。",
> > > > > > > "北海道にもおいしいラーメン屋があります。"
> > > > > > >     };
> > > > > > >     private static final String phrase = "ラーメン屋";
> > > > > > >     //private static final String phrase = "屋";
> > > > > > >     private static Analyzer analyzer = null;
> > > > > > >
> > > > > > >     public static void main( String[] args ) throws
> > > > > > IOException, ParseException {
> > > > > > > Directory directory = makeIndex();
> > > > > > > search( directory );
> > > > > > > directory.close();
> > > > > > >     }
> > > > > > >
> > > > > > >     private static Analyzer getAnalyzer(){
> > > > > > > if( analyzer == null ){
> > > > > > >     analyzer = new StandardAnalyzer();
> > > > > > >     //analyzer = new CJKAnalyzer();
> > > > > > > }
> > > > > > > return analyzer;
> > > > > > >     }
> > > > > > >
> > > > > > >     private static Directory makeIndex() throws IOException {
> > > > > > > Directory directory = new RAMDirectory();
> > > > > > > IndexWriter writer = new IndexWriter( directory, getAnalyzer(), true );
> > > > > > > for( int i = 0; i < contents.length; i++ ){
> > > > > > >     Document doc = new Document();
> > > > > > >     doc.add( new Field( FIELD_CONTENT, contents[i],
> > > > > > Field.Store.YES, Field.Index.TOKENIZED ) );
> > > > > > >     writer.addDocument( doc );
> > > > > > > }
> > > > > > > writer.close();
> > > > > > > return directory;
> > > > > > >     }
> > > > > > >
> > > > > > >     private static void search( Directory directory ) throws
> > > > > > IOException, ParseException {
> > > > > > > IndexSearcher searcher = new IndexSearcher( directory );
> > > > > > > QueryParser parser = new QueryParser( FIELD_CONTENT, getAnalyzer() );
> > > > > > > Query query = parser.parse( phrase );
> > > > > > > System.out.println( "query = " + query );
> > > > > > > Hits hits = searcher.search( query );
> > > > > > > for( int i = 0; i < hits.length(); i++ )
> > > > > > >     System.out.println( "doc = " + hits.doc( i ).get( FIELD_CONTENT ) );
> > > > > > > searcher.close();
> > > > > > >     }
> > > > > > > }
> > > > > > >
> > > > > > >
> > > > > > > > -----Original Message-----
> > > > > > > > From: Youngho Cho [mailto:youngho@nannet.co.kr]
> > > > > > > > Sent: Thursday, October 27, 2005 8:18 AM
> > > > > > > > To: java-user@lucene.apache.org; Cheolgoo Kang
> > > > > > > > Subject: Re: korean and lucene
> > > > > > > >
> > > > > > > >
> > > > > > > > Hello Cheolgoo,
> > > > > > > >
> > > > > > > > Now I have updated my Lucene version to 1.9 in order to use
> > > > > > > > StandardAnalyzer for Korean,
> > > > > > > > and tested your patch, which is already included in 1.9:
> > > > > > > >
> > > > > > > > http://issues.apache.org/jira/browse/LUCENE-444
> > > > > > > >
> > > > > > > > But I still get no good results with Korean compared with
> > > > > > > > CJKAnalyzer.
> > > > > > > >
> > > > > > > > A single character matches fine, but a word of two or more
> > > > > > > > characters doesn't match at all.
> > > > > > > >
> > > > > > > > Am I missing something, or is more work still needed?
> > > > > > > >
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > >
> > > > > > > > Youngho.
> > > > > > > >
> > > > > > > >
> > > > > > > > ----- Original Message -----
> > > > > > > > From: "Cheolgoo Kang" <ap...@gmail.com>
> > > > > > > > To: <ja...@lucene.apache.org>; "John Wang" <jo...@gmail.com>
> > > > > > > > Sent: Tuesday, October 04, 2005 10:11 AM
> > > > > > > > Subject: Re: korean and lucene
> > > > > > > >
> > > > > > > >
> > > > > > > > > StandardAnalyzer's JavaCC-based StandardTokenizer.jj cannot read
> > > > > > > > > the Korean part of the Unicode character blocks.
> > > > > > > > >
> > > > > > > > > You should 1) use CJKAnalyzer, or 2) add the Korean character
> > > > > > > > > block (0xAC00~0xD7AF) to the CJK token definition in the
> > > > > > > > > StandardTokenizer.jj file.
> > > > > > > > >
> > > > > > > > > Hope it helps.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On 10/4/05, John Wang <jo...@gmail.com> wrote:
> > > > > > > > > > Hi:
> > > > > > > > > >
> > > > > > > > > > We are running into problems with searching on korean
> > > > > > > > documents. We are
> > > > > > > > > > using the StandardAnalyzer and everything works with Chinese
> > > > > > > > and Japanese.
> > > > > > > > > > Are there known problems with Korean with Lucene?
> > > > > > > > > >
> > > > > > > > > > Thanks
> > > > > > > > > >
> > > > > > > > > > -John
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > Cheolgoo
> > > > > > > > >
> > > > > > > > >
> > > > > > ---------------------------------------------------------------------
> > > > > > > > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > > > > > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > ---------------------------------------------------------------------
> > > > > > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > > > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > > ---------------------------------------------------------------------
> > > > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
> >
> >
>
>
> --
> Cheolgoo
>


--
Cheolgoo
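
As a rough illustration of the behaviour discussed in this post (keeping a whole run of Hangul syllables, 0xAC00~0xD7AF, as one token instead of one token per character), here is a minimal sketch against the Lucene 1.x analysis API. The class name KoreanWordTokenizer and the sample strings are invented for the example; this is not the actual LUCENE-444 patch, which modifies StandardTokenizer.jj instead.

=============================================
import java.io.Reader;
import java.io.StringReader;
import org.apache.lucene.analysis.CharTokenizer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;

// Hypothetical illustration: keep each run of Hangul syllables
// (U+AC00..U+D7AF, the block mentioned above) as a single token,
// which is roughly what the LUCENE-444 patch aims for.
public class KoreanWordTokenizer extends CharTokenizer {

    public KoreanWordTokenizer(Reader in) {
        super(in);
    }

    protected boolean isTokenChar(char c) {
        return c >= 0xAC00 && c <= 0xD7AF;   // Hangul Syllables block
    }

    public static void main(String[] args) throws Exception {
        TokenStream ts = new KoreanWordTokenizer(new StringReader("경기 시간을"));
        for (Token t = ts.next(); t != null; t = ts.next()) {
            System.out.println(t.termText());  // expected: 경기, 시간을
        }
    }
}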

Re: korean and lucene

Posted by Andrzej Bialecki <ab...@getopt.org>.
Cheolgoo Kang wrote:

>Thanks Bialecki,
>  
>

Bialecki is my last name, my first name is Andrzej. No problem, it's
similarly confusing for Europeans to decide between the first and last
name in Asian names... :-) Is your first name Kang?

>I'm trying to test your program, thanks a lot!
>
>Also, can you give me the papers you cited as [1] and [2]? I've
>googled (the entire web and Google Scholar) but found nothing.
>  
>

I got these two papers directly from the author, and he asked me not to
re-distribute them - please write him directly: Leo.G@seznam.cz to
obtain a copy.


I would be most interested in the results of your testing.

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com





Re: korean and lucene

Posted by Cheolgoo Kang <ap...@gmail.com>.
Thanks Bialecki,

I'm trying to test your program, thanks a lot!

Also, can you give me the papers you cited as [1] and [2]? I've
googled (the entire web and Google Scholar) but found nothing.

On 11/8/05, Andrzej Bialecki <ab...@getopt.org> wrote:
> KwonNam Son wrote:
>
> >First of all, I really appreciate your work on Lucene for Korean words.
> >
> >But if we cannot support a stemming analyzer for Korean words, I think one
> >token per Korean character is better.
> >
> >When we search for a word, we usually use "검색", not "검색하다" ("하다" is like
> >the "ed" of "searched").
> >If we cannot get any results from "검색", StandardAnalyzer is useless for
> >Korean, and I may have to go back to using CJKAnalyzer.
> >
> >How about leaving the StandardAnalyzer unchanged and adding a new
> >Analyzer for Korean words?
> >
> >
>
> Hello,
>
> My knowledge of Korean is near absolute zero... however, your example
> above looks like a typical stemming process for any Western language.
> The stem is not necessarily a valid dictionary word, just something that
> uniquely "labels" a group of related words created from the same root -
> and the transformation from inflected words to a stem can be expressed
> as a series of "patch commands" (insert/remove substring).
>
> I successfully used a Java package, originally created by Leon Galambos
> from Egothor project, to create an algorithmic stemmer for Polish
> (http://www.getopt.org/stempel). The advantage of this particular
> approach is that you don't have to encode specific grammar rules in the
> stemmer, the stemmer learns rules by itself from a training corpus. Such
> training corpus consists of pairs of inflected and base forms, and the
> library automatically learns these "patch commands", i.e. instructions
> for inserting/removing parts of an inflected word to arrive at the base
> form. This training process results in creating a stemmer table,
> reusable even for previously unseen words (based on the similarity of
> character patterns in input words).
>
> I suggest trying the code from the link above and testing how it works;
> even if you only have a moderately sized training corpus (~500 pairs),
> the results should be positive.
>
> --
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>


--
Cheolgoo

Re: korean and lucene

Posted by Andrzej Bialecki <ab...@getopt.org>.
KwonNam Son wrote:

>First of all, I really appreciate your work on Lucene for Korean words.
>
>But if we cannot support a stemming analyzer for Korean words, I think one
>token per Korean character is better.
>
>When we search for a word, we usually use "검색", not "검색하다" ("하다" is like
>the "ed" of "searched").
>If we cannot get any results from "검색", StandardAnalyzer is useless for
>Korean, and I may have to go back to using CJKAnalyzer.
>
>How about leaving the StandardAnalyzer unchanged and adding a new
>Analyzer for Korean words?
>  
>

Hello,

My knowledge of Korean is near absolute zero... however, your example 
above looks like a typical stemming process for any Western language. 
The stem is not necessarily a valid dictionary word, just something that 
uniquely "labels" a group of related words created from the same root - 
and the transformation from inflected words to a stem can be expressed 
as a series of "patch commands" (insert/remove substring).

I successfully used a Java package, originally created by Leon Galambos 
from the Egothor project, to create an algorithmic stemmer for Polish 
(http://www.getopt.org/stempel). The advantage of this particular 
approach is that you don't have to encode specific grammar rules in the 
stemmer, the stemmer learns rules by itself from a training corpus. Such 
training corpus consists of pairs of inflected and base forms, and the 
library automatically learns these "patch commands", i.e. instructions 
for inserting/removing parts of an inflected word to arrive at the base 
form. This training process results in creating a stemmer table, 
reusable even for previously unseen words (based on the similarity of 
character patterns in input words).

I suggest trying the code from the link above and testing how it works;
even if you only have a moderately sized training corpus (~500 pairs),
the results should be positive.
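
To make the training-pair idea above concrete, here is a deliberately simplified sketch. It is not the stempel API: the real trainer learns full insert/remove "patch commands" rather than plain suffix stripping, and every class and method name below is invented for the example.

=============================================
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;

// Conceptual sketch only (not the actual stempel API): learn a crude
// "strip this suffix" rule from (inflected form, base form) training pairs,
// then apply the longest matching rule to unseen words.
public class SuffixRuleStemmer {

    private final Set suffixes = new HashSet();

    public void learn(String inflected, String base) {
        // only the simple case where the base form is a prefix of the
        // inflected form, e.g. ("searched", "search") -> suffix "ed"
        if (inflected.startsWith(base) && inflected.length() > base.length()) {
            suffixes.add(inflected.substring(base.length()));
        }
    }

    public String stem(String word) {
        // strip the longest learned suffix that matches
        String best = null;
        for (Iterator it = suffixes.iterator(); it.hasNext();) {
            String suffix = (String) it.next();
            if (word.endsWith(suffix) && word.length() > suffix.length()
                    && (best == null || suffix.length() > best.length())) {
                best = suffix;
            }
        }
        return best == null ? word : word.substring(0, word.length() - best.length());
    }

    public static void main(String[] args) {
        SuffixRuleStemmer stemmer = new SuffixRuleStemmer();
        stemmer.learn("searched", "search");   // learns "-ed"
        stemmer.learn("searching", "search");  // learns "-ing"
        System.out.println(stemmer.stem("matched"));   // prints "match"
        System.out.println(stemmer.stem("matching"));  // prints "match"
    }
}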

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com





Re: korean and lucene

Posted by KwonNam Son <kw...@gmail.com>.
First of all, I really appreciate your work on Lucene for Korean words.

But if we cannot support a stemming analyzer for Korean words, I think one
token per Korean character is better.

When we search for a word, we usually use "검색", not "검색하다" ("하다" is like
the "ed" of "searched").
If we cannot get any results from "검색", StandardAnalyzer is useless for
Korean, and I may have to go back to using CJKAnalyzer.

How about leaving the StandardAnalyzer unchanged and adding a new
Analyzer for Korean words?

Thanks.
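
To make the comparison concrete, here is a minimal sketch (the class name KoreanTokenDump is invented for the example; the API is the Lucene 1.9-era one used elsewhere in this thread) that prints the tokens each analyzer emits for a Korean word. According to this thread, the patched StandardAnalyzer keeps 검색하다 as a single token, so a query for 검색 alone will not match it, while CJKAnalyzer's overlapping two-character tokens do include 검색.

=============================================
import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cjk.CJKAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

// Small sketch: print the tokens each analyzer produces for Korean text,
// to see why "검색" does or does not match a document containing "검색하다".
public class KoreanTokenDump {

    private static void dump(Analyzer analyzer, String text) throws Exception {
        System.out.print(analyzer.getClass().getName() + ": ");
        TokenStream ts = analyzer.tokenStream("content", new StringReader(text));
        for (Token t = ts.next(); t != null; t = ts.next()) {
            System.out.print("[" + t.termText() + "] ");
        }
        System.out.println();
    }

    public static void main(String[] args) throws Exception {
        String text = "검색하다";
        dump(new StandardAnalyzer(), text);  // one whole-word token with the LUCENE-444 patch (per this thread)
        dump(new CJKAnalyzer(), text);       // overlapping 2-character tokens, including 검색
    }
}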

2005/11/8, Cheolgoo Kang <ap...@gmail.com>:
> On 11/8/05, Youngho Cho <yo...@nannet.co.kr> wrote:
> > Hello,
> >
> > just simple test ...
> > If I compile the javacc correctly..
> > the patched version doesn't match some situation
> > for example
> > in text
> > '엔진박지성(맨체스터 유나이티드)이 주말 프리미어리그를 위해 벤치를 지키며 재충전의 시간을 가졌다.'
> > if query word is '시간'   than nothing match
> > but if query word is '시간을'  than good match.
>
> That's exactly what I wanted to do with this patch. You need a more
> sophisticated morphological analyzer to do what you want.
> AFAIK, I'm afraid there is no open source software for Korean
> morphological analysis.
>
> And also, if you have the Korean translation of Lucene in Action, there is a
> simple introduction to Korean stemming analysis in Appendix D. It is
> simple enough to hold a list of Korean word endings and check
> each Token for a matching ending. :)
>
> >
> > I think there is some tradeoff here.
> >
> > Maybe need some good stop filter for korean etc.......
> >
> >
> > Thanks
> >
> > Youngho
> >
> > ----- Original Message -----
> > From: "Youngho Cho" <yo...@nannet.co.kr>
> > To: <ja...@lucene.apache.org>
> > Sent: Tuesday, November 08, 2005 4:44 PM
> > Subject: Re: korean and lucene
> >
> >
> > > Hello Cheolgoo,
> > >
> > > I will test the patch.
> > >
> > >
> > > Thanks,
> > >
> > > Youngho
> > >
> > > ----- Original Message -----
> > > From: "Cheolgoo Kang" <ap...@gmail.com>
> > > To: <ja...@lucene.apache.org>; "Youngho Cho" <yo...@nannet.co.kr>
> > > Sent: Tuesday, November 08, 2005 4:06 PM
> > > Subject: Re: korean and lucene
> > >
> > >
> > > > Hello,
> > > >
> > > > I've created a new JIRA issue with Korean analysis that
> > > > StandardAnalyzer splits one word into several tokens each with one
> > > > character. Cause Korean is not a phonogram, one character in Korean
> > > > has almost no meaning at all. So word in Korean should be preserved
> > > > not like Chinese or Japanese.
> > > >
> > > > I've attached a patch to StandardTokenizer to do this and passed the
> > > > test case TestStandardAnalyzer(also patched to test Korean words).
> > > >
> > > > Thanks!
> > > >
> > > > On 10/27/05, Youngho Cho <yo...@nannet.co.kr> wrote:
> > > > > Hello,
> > > > >
> > > > > OK, I've attached my test code for Korean, which is a slightly modified version of Koji's code.
> > > > >
> > > > > Just put into the lia.analysis.i18n package at LuceneInAction
> > > > > and run ant.
> > > > >
> > > > > Hopefully this helps someone.
> > > > >
> > > > > -------- build.xml  ---------
> > > > >
> > > > >   <target name="JapaneseDemo" depends="prepare"
> > > > >           description="Examples of Japanese analysis">
> > > > >     <info>
> > > > >
> > > > >       Japanese Test...
> > > > >
> > > > >     </info>
> > > > >
> > > > >     <run-main class="lia.analysis.i18n.JapaneseDemo"/>
> > > > >   </target>
> > > > >
> > > > >   <target name="KoreanDemo" depends="prepare"
> > > > >           description="Examples of Korean analysis">
> > > > >     <info>
> > > > >
> > > > >       Korean Test...
> > > > >
> > > > >     </info>
> > > > >
> > > > >     <run-main class="lia.analysis.i18n.KoreanDemo"/>
> > > > >   </target>
> > > > >
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Youngho
> > > > >
> > > > >
> > > > > ----- Original Message -----
> > > > > From: "Youngho Cho" <yo...@nannet.co.kr>
> > > > > To: <ja...@lucene.apache.org>; "Youngho Cho" <yo...@nannet.co.kr>
> > > > > Sent: Thursday, October 27, 2005 12:47 PM
> > > > > Subject: Re: korean and lucene
> > > > >
> > > > >
> > > > > > Hello all
> > > > > > Please forgive my previous confusing message.
> > > > > >
> > > > > >      [echo] Running lia.analysis.i18n.KoreanDemo...
> > > > > >      [java] [경] [기]  analyzer = org.apache.lucene.analysis.standard.StandardAnalyzer
> > > > > >      [java] phrase = 경기
> > > > > >      [java] query = "경 기"
> > > > > >
> > > > > > I got the good result.
> > > > > >
> > > > > > When I compile I just rename old version lucene-1.4.3.jar to lucene-1.4.3.jar_bak
> > > > > > and all new 1.9 lucene. and build the test package.
> > > > > > After I remove lucene-1.4.3.jar_bak in lib directory completely
> > > > > > I got the expected result !!!.
> > > > > >
> > > > > > I don't know the reason... ( looks like my finger make some trouble... )
> > > > > >
> > > > > > Anyway thanks Koji and Cheolgoo
> > > > > > I will further test now...
> > > > > >
> > > > > > Youngho
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > ----- Original Message -----
> > > > > > From: "Youngho Cho" <yo...@nannet.co.kr>
> > > > > > To: <ja...@lucene.apache.org>
> > > > > > Sent: Thursday, October 27, 2005 12:28 PM
> > > > > > Subject: Re: korean and lucene
> > > > > >
> > > > > >
> > > > > > > Hello Koji
> > > > > > >
> > > > > > > Here is test result.
> > > > > > > Japanese is OK !.
> > > > > > > maybe ant clean  did some effect.
> > > > > > >
> > > > > > > Anyway please refer to the result using 1.9
> > > > > > >
> > > > > > >      [echo] Running lia.analysis.i18n.JapaneseDemo...
> > > > > > >      [java] [ラ] [メ] [ン] [屋]  analyzer = org.apache.lucene.analysis.standard.StandardAnalyzer
> > > > > > >      [java] phrase = ラ?メン屋
> > > > > > >      [java] query = content:ラ?メン屋
> > > > > > >
> > > > > > >     [echo] Running lia.analysis.i18n.KoreanDemo...
> > > > > > >      [java]  analyzer = org.apache.lucene.analysis.standard.StandardAnalyzer
> > > > > > >      [java] phrase = 경
> > > > > > >      [java] query =
> > > > > > >
> > > > > > >      [echo] Running lia.analysis.i18n.JapaneseDemo...
> > > > > > >      [java] [ラ] [メン] [ン屋]  analyzer = org.apache.lucene.analysis.cjk.CJKAnalyzer
> > > > > > >      [java] phrase = ラ?メン屋
> > > > > > >      [java] query = content:ラ?メン屋
> > > > > > >
> > > > > > >     [echo] Running lia.analysis.i18n.KoreanDemo...
> > > > > > >      [java] [경]  analyzer = org.apache.lucene.analysis.cjk.CJKAnalyzer
> > > > > > >      [java] phrase = 경
> > > > > > >      [java] query = 경
> > > > > > >
> > > > > > >      [echo] Running lia.analysis.i18n.KoreanDemo...
> > > > > > >      [java]  analyzer = org.apache.lucene.analysis.standard.StandardAnalyzer
> > > > > > >      [java] phrase = 경기
> > > > > > >      [java] query =
> > > > > > >
> > > > > > >      [echo] Running lia.analysis.i18n.KoreanDemo...
> > > > > > >      [java] [경기]  analyzer = org.apache.lucene.analysis.cjk.CJKAnalyzer
> > > > > > >      [java] phrase = 경기
> > > > > > >      [java] query = 경기
> > > > > > >
> > > > > > >
> > > > > > > Standard analyzer didn't tokenized the Korean Character at all....
> > > > > > >
> > > > > > > Ug....  look like
> > > > > > >  http://issues.apache.org/jira/browse/LUCENE-444
> > > > > > >  didn't effect at all for Korean.
> > > > > > >
> > > > > > >
> > > > > > > Thanks
> > > > > > >
> > > > > > > Youngho
> > > > > > >
> > > > > > > ----- Original Message -----
> > > > > > > From: "Koji Sekiguchi" <ko...@m4.dion.ne.jp>
> > > > > > > To: <ja...@lucene.apache.org>; "Youngho Cho" <yo...@nannet.co.kr>
> > > > > > > Sent: Thursday, October 27, 2005 11:47 AM
> > > > > > > Subject: RE: korean and lucene
> > > > > > >
> > > > > > >
> > > > > > > > Hello Youngho,
> > > > > > > >
> > > > > > > > I don't understand why you couldn't get hits result in Japanese,
> > > > > > > > though, you had better check why the query was empty with Korean data:
> > > > > > > >
> > > > > > > > > For Korean
> > > > > > > > >      [echo] Running lia.analysis.i18n.KoreanDemo...
> > > > > > > > >      [java] phrase = 경
> > > > > > > > >      [java] query =
> > > > > > > >
> > > > > > > > The last line should be query = 경
> > > > > > > > to get hits result. Can you check why StandardAnalyzer
> > > > > > > > removes "경" during tokenizing?
> > > > > > > >
> > > > > > > > Koji
> > > > > > > >
> > > > > > > > > -----Original Message-----
> > > > > > > > > From: Youngho Cho [mailto:youngho@nannet.co.kr]
> > > > > > > > > Sent: Thursday, October 27, 2005 11:37 AM
> > > > > > > > > To: java-user@lucene.apache.org
> > > > > > > > > Subject: Re: korean and lucene
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Hello Koji,
> > > > > > > > >
> > > > > > > > > Thanks for your kind reply.
> > > > > > > > >
> > > > > > > > > Yes, I used QueryParser. normaly I used
> > > > > > > > > Query = QueryParser.parse( ) method.
> > > > > > > > >
> > > > > > > > > I put your sample code into lia.analysis.i18n package in LuceneAction
> > > > > > > > > and run JapaneseDemo using 1.4 and 1.9
> > > > > > > > >
> > > > > > > > > results are
> > > > > > > > >
> > > > > > > > >      [echo] Running lia.analysis.i18n.JapaneseDemo...
> > > > > > > > >      [java] query = content:ラ?メン屋
> > > > > > > > >
> > > > > > > > > I can't get hits result.
> > > > > > > > >
> > > > > > > > > For Korean
> > > > > > > > >      [echo] Running lia.analysis.i18n.KoreanDemo...
> > > > > > > > >      [java] phrase = 경
> > > > > > > > >      [java] query =
> > > > > > > > >
> > > > > > > > > I can't get query parse result.
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > >
> > > > > > > > > Youngho
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > ----- Original Message -----
> > > > > > > > > From: "Koji Sekiguchi" <ko...@m4.dion.ne.jp>
> > > > > > > > > To: <ja...@lucene.apache.org>; "Youngho Cho" <yo...@nannet.co.kr>
> > > > > > > > > Sent: Thursday, October 27, 2005 9:48 AM
> > > > > > > > > Subject: RE: korean and lucene
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > Hi Youngho,
> > > > > > > > > >
> > > > > > > > > > With regard to Japanese, using StandardAnalyzer,
> > > > > > > > > > I can search a word/phase.
> > > > > > > > > >
> > > > > > > > > > Did you use QueryParser? StandardAnalyzer tokenizes
> > > > > > > > > > CJK characters into a stream of single character.
> > > > > > > > > > Use QueryParser to get a PhraseQuery and search the query.
> > > > > > > > > >
> > > > > > > > > > Please see the following sample code. Replace Japanese
> > > > > > > > > > "contents" and (search target) "phrase" with Korean in the
> > > > > > > > > program and run.
> > > > > > > > > >
> > > > > > > > > > regards,
> > > > > > > > > >
> > > > > > > > > > Koji
> > > > > > > > > >
> > > > > > > > > > =============================================
> > > > > > > > > > import java.io.IOException;
> > > > > > > > > > import org.apache.lucene.analysis.Analyzer;
> > > > > > > > > > import org.apache.lucene.analysis.standard.StandardAnalyzer;
> > > > > > > > > > import org.apache.lucene.analysis.cjk.CJKAnalyzer;
> > > > > > > > > > import org.apache.lucene.store.Directory;
> > > > > > > > > > import org.apache.lucene.store.RAMDirectory;
> > > > > > > > > > import org.apache.lucene.index.IndexWriter;
> > > > > > > > > > import org.apache.lucene.document.Document;
> > > > > > > > > > import org.apache.lucene.document.Field;
> > > > > > > > > > import org.apache.lucene.search.IndexSearcher;
> > > > > > > > > > import org.apache.lucene.search.Hits;
> > > > > > > > > > import org.apache.lucene.search.Query;
> > > > > > > > > > import org.apache.lucene.queryParser.QueryParser;
> > > > > > > > > > import org.apache.lucene.queryParser.ParseException;
> > > > > > > > > >
> > > > > > > > > > public class JapaneseByStandardAnalyzer {
> > > > > > > > > >
> > > > > > > > > >     private static final String FIELD_CONTENT = "content";
> > > > > > > > > >     private static final String[] contents = {
> > > > > > > > > > "東京にはおいしいラーメン屋がたくさんあります。",
> > > > > > > > > > "北海道にもおいしいラーメン屋があります。"
> > > > > > > > > >     };
> > > > > > > > > >     private static final String phrase = "ラーメン屋";
> > > > > > > > > >     //private static final String phrase = "屋";
> > > > > > > > > >     private static Analyzer analyzer = null;
> > > > > > > > > >
> > > > > > > > > >     public static void main( String[] args ) throws
> > > > > > > > > IOException, ParseException {
> > > > > > > > > > Directory directory = makeIndex();
> > > > > > > > > > search( directory );
> > > > > > > > > > directory.close();
> > > > > > > > > >     }
> > > > > > > > > >
> > > > > > > > > >     private static Analyzer getAnalyzer(){
> > > > > > > > > > if( analyzer == null ){
> > > > > > > > > >     analyzer = new StandardAnalyzer();
> > > > > > > > > >     //analyzer = new CJKAnalyzer();
> > > > > > > > > > }
> > > > > > > > > > return analyzer;
> > > > > > > > > >     }
> > > > > > > > > >
> > > > > > > > > >     private static Directory makeIndex() throws IOException {
> > > > > > > > > > Directory directory = new RAMDirectory();
> > > > > > > > > > IndexWriter writer = new IndexWriter( directory, getAnalyzer(), true );
> > > > > > > > > > for( int i = 0; i < contents.length; i++ ){
> > > > > > > > > >     Document doc = new Document();
> > > > > > > > > >     doc.add( new Field( FIELD_CONTENT, contents[i],
> > > > > > > > > Field.Store.YES, Field.Index.TOKENIZED ) );
> > > > > > > > > >     writer.addDocument( doc );
> > > > > > > > > > }
> > > > > > > > > > writer.close();
> > > > > > > > > > return directory;
> > > > > > > > > >     }
> > > > > > > > > >
> > > > > > > > > >     private static void search( Directory directory ) throws
> > > > > > > > > IOException, ParseException {
> > > > > > > > > > IndexSearcher searcher = new IndexSearcher( directory );
> > > > > > > > > > QueryParser parser = new QueryParser( FIELD_CONTENT, getAnalyzer() );
> > > > > > > > > > Query query = parser.parse( phrase );
> > > > > > > > > > System.out.println( "query = " + query );
> > > > > > > > > > Hits hits = searcher.search( query );
> > > > > > > > > > for( int i = 0; i < hits.length(); i++ )
> > > > > > > > > >     System.out.println( "doc = " + hits.doc( i ).get( FIELD_CONTENT ) );
> > > > > > > > > > searcher.close();
> > > > > > > > > >     }
> > > > > > > > > > }
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > -----Original Message-----
> > > > > > > > > > > From: Youngho Cho [mailto:youngho@nannet.co.kr]
> > > > > > > > > > > Sent: Thursday, October 27, 2005 8:18 AM
> > > > > > > > > > > To: java-user@lucene.apache.org; Cheolgoo Kang
> > > > > > > > > > > Subject: Re: korean and lucene
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Hello Cheolgoo,
> > > > > > > > > > >
> > > > > > > > > > > Now I updated my lucene version to 1.9 for using StandardAnalyzer
> > > > > > > > > > > for Korean.
> > > > > > > > > > > And tested your patch which is already adopted in 1.9
> > > > > > > > > > >
> > > > > > > > > > > http://issues.apache.org/jira/browse/LUCENE-444
> > > > > > > > > > >
> > > > > > > > > > > But Still I have no good  results with Korean compare with
> > > > > > > > > CJKAnalyzer.
> > > > > > > > > > >
> > > > > > > > > > > Single character is good match but more two character word
> > > > > > > > > > > doesn't match at all.
> > > > > > > > > > >
> > > > > > > > > > > Am I something missing or still there need some more works ?
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Thanks,
> > > > > > > > > > >
> > > > > > > > > > > Youngho.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > ----- Original Message -----
> > > > > > > > > > > From: "Cheolgoo Kang" <ap...@gmail.com>
> > > > > > > > > > > To: <ja...@lucene.apache.org>; "John Wang" <jo...@gmail.com>
> > > > > > > > > > > Sent: Tuesday, October 04, 2005 10:11 AM
> > > > > > > > > > > Subject: Re: korean and lucene
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > StandardAnalyzer's JavaCC based StandardTokenizer.jj cannot read
> > > > > > > > > > > > Korean part of Unicode character blocks.
> > > > > > > > > > > >
> > > > > > > > > > > > You should 1) use CJKAnalyzer or 2) add Korean character
> > > > > > > > > > > > block(0xAC00~0xD7AF) to the CJK token definition on the
> > > > > > > > > > > > StandardTokenizer.jj file.
> > > > > > > > > > > >
> > > > > > > > > > > > Hope it helps.
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > On 10/4/05, John Wang <jo...@gmail.com> wrote:
> > > > > > > > > > > > > Hi:
> > > > > > > > > > > > >
> > > > > > > > > > > > > We are running into problems with searching on korean
> > > > > > > > > > > documents. We are
> > > > > > > > > > > > > using the StandardAnalyzer and everything works with Chinese
> > > > > > > > > > > and Japanese.
> > > > > > > > > > > > > Are there known problems with Korean with Lucene?
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thanks
> > > > > > > > > > > > >
> > > > > > > > > > > > > -John
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > --
> > > > > > > > > > > > Cheolgoo
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > ---------------------------------------------------------------------
> > > > > > > > > > > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > > > > > > > > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > ---------------------------------------------------------------------
> > > > > > > > > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > > > > > > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > ---------------------------------------------------------------------
> > > > > > > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > > > > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > > > > ---------------------------------------------------------------------
> > > > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > > > >
> > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > > Cheolgoo
>
>
> --
> Cheolgoo
>

Re: korean and lucene

Posted by Youngho Cho <yo...@nannet.co.kr>.
----- Original Message ----- 
From: "Cheolgoo Kang" <ap...@gmail.com>
To: "Youngho Cho" <yo...@nannet.co.kr>
Cc: <ja...@lucene.apache.org>
Sent: Tuesday, November 08, 2005 5:53 PM
Subject: Re: korean and lucene


> On 11/8/05, Youngho Cho <yo...@nannet.co.kr> wrote:
> > Hello,
> >
> > just simple test ...
> > If I compile the javacc correctly..
> > the patched version doesn't match some situation
> > for example
> > in text
> > '엔진박지성(맨체스터 유나이티드)이 주말 프리미어리그를 위해 벤치를 지키며 재충전의 시간을 가졌다.'
> > if query word is '시간'   than nothing match
> > but if query word is '시간을'  than good match.
> 
> That's exactly what I wanted to do with this patch. You need a more
> sophisticated morphological analyzer to do what you want.
> AFAIK, I'm afraid there is no open source software for Korean
> morphological analysis.
>
I hope the Lucene community will support Korean better.

> And also, if you have the Korean translation of Lucene in Action, there is a
> simple introduction to Korean stemming analysis in Appendix D. It is
> simple enough to hold a list of Korean word endings and check
> each Token for a matching ending. :)
> 

Great !
You have at least one user.


Thanks.

Youngho



> >
> > I think there is some tradeoff here.
> >
> > Maybe need some good stop filter for korean etc.......
> >
> >
> > Thanks
> >
> > Youngho
> >
> > ----- Original Message -----
> > From: "Youngho Cho" <yo...@nannet.co.kr>
> > To: <ja...@lucene.apache.org>
> > Sent: Tuesday, November 08, 2005 4:44 PM
> > Subject: Re: korean and lucene
> >
> >
> > > Hello Cheolgoo,
> > >
> > > I will test the patch.
> > >
> > >
> > > Thanks,
> > >
> > > Youngho
> > >
> > > ----- Original Message -----
> > > From: "Cheolgoo Kang" <ap...@gmail.com>
> > > To: <ja...@lucene.apache.org>; "Youngho Cho" <yo...@nannet.co.kr>
> > > Sent: Tuesday, November 08, 2005 4:06 PM
> > > Subject: Re: korean and lucene
> > >
> > >
> > > > Hello,
> > > >
> > > > I've created a new JIRA issue with Korean analysis that
> > > > StandardAnalyzer splits one word into several tokens each with one
> > > > character. Cause Korean is not a phonogram, one character in Korean
> > > > has almost no meaning at all. So word in Korean should be preserved
> > > > not like Chinese or Japanese.
> > > >
> > > > I've attached a patch to StandardTokenizer to do this and passed the
> > > > test case TestStandardAnalyzer(also patched to test Korean words).
> > > >
> > > > Thanks!
> > > >
> > > > On 10/27/05, Youngho Cho <yo...@nannet.co.kr> wrote:
> > > > > Hello,
> > > > >
> > > > > OK, I've attached my test code for Korean, which is a slightly modified version of Koji's code.
> > > > >
> > > > > Just put into the lia.analysis.i18n package at LuceneInAction
> > > > > and run ant.
> > > > >
> > > > > Hopefully this helps someone.
> > > > >
> > > > > -------- build.xml  ---------
> > > > >
> > > > >   <target name="JapaneseDemo" depends="prepare"
> > >           description="Examples of Japanese analysis">
> > > > >     <info>
> > > > >
> > > > >       Japanese Test...
> > > > >
> > > > >     </info>
> > > > >
> > > > >     <run-main class="lia.analysis.i18n.JapaneseDemo"/>
> > > > >   </target>
> > > > >
> > > > >   <target name="KoreanDemo" depends="prepare"
> > > > >           description="Examples of Korean analysis">
> > > > >     <info>
> > > > >
> > > > >       Korean Test...
> > > > >
> > > > >     </info>
> > > > >
> > > > >     <run-main class="lia.analysis.i18n.KoreanDemo"/>
> > > > >   </target>
> > > > >
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Youngho
> > > > >
> > > > >
> > > > > ----- Original Message -----
> > > > > From: "Youngho Cho" <yo...@nannet.co.kr>
> > > > > To: <ja...@lucene.apache.org>; "Youngho Cho" <yo...@nannet.co.kr>
> > > > > Sent: Thursday, October 27, 2005 12:47 PM
> > > > > Subject: Re: korean and lucene
> > > > >
> > > > >
> > > > > > Hello all
> > > > > > Please forgive my previous confusing message.
> > > > > >
> > > > > >      [echo] Running lia.analysis.i18n.KoreanDemo...
> > > > > >      [java] [경] [기]  analyzer = org.apache.lucene.analysis.standard.StandardAnalyzer
> > > > > >      [java] phrase = 경기
> > > > > >      [java] query = "경 기"
> > > > > >
> > > > > > I got the good result.
> > > > > >
> > > > > > When I compile I just rename old version lucene-1.4.3.jar to lucene-1.4.3.jar_bak
> > > > > > and all new 1.9 lucene. and build the test package.
> > > > > > After I remove lucene-1.4.3.jar_bak in lib directory completely
> > > > > > I got the expected result !!!.
> > > > > >
> > > > > > I don't know the reason... ( looks like my finger make some trouble... )
> > > > > >
> > > > > > Anyway thanks Koji and Cheolgoo
> > > > > > I will further test now...
> > > > > >
> > > > > > Youngho
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > ----- Original Message -----
> > > > > > From: "Youngho Cho" <yo...@nannet.co.kr>
> > > > > > To: <ja...@lucene.apache.org>
> > > > > > Sent: Thursday, October 27, 2005 12:28 PM
> > > > > > Subject: Re: korean and lucene
> > > > > >
> > > > > >
> > > > > > > Hello Koji
> > > > > > >
> > > > > > > Here is test result.
> > > > > > > Japanese is OK !.
> > > > > > > maybe ant clean  did some effect.
> > > > > > >
> > > > > > > Anyway please refer to the result using 1.9
> > > > > > >
> > > > > > >      [echo] Running lia.analysis.i18n.JapaneseDemo...
> > > > > > >      [java] [ラ] [メ] [ン] [屋]  analyzer = org.apache.lucene.analysis.standard.StandardAnalyzer
> > > > > > >      [java] phrase = ラ?メン屋
> > > > > > >      [java] query = content:ラ?メン屋
> > > > > > >
> > > > > > >     [echo] Running lia.analysis.i18n.KoreanDemo...
> > > > > > >      [java]  analyzer = org.apache.lucene.analysis.standard.StandardAnalyzer
> > > > > > >      [java] phrase = 경
> > > > > > >      [java] query =
> > > > > > >
> > > > > > >      [echo] Running lia.analysis.i18n.JapaneseDemo...
> > > > > > >      [java] [ラ] [メン] [ン屋]  analyzer = org.apache.lucene.analysis.cjk.CJKAnalyzer
> > > > > > >      [java] phrase = ラ?メン屋
> > > > > > >      [java] query = content:ラ?メン屋
> > > > > > >
> > > > > > >     [echo] Running lia.analysis.i18n.KoreanDemo...
> > > > > > >      [java] [경]  analyzer = org.apache.lucene.analysis.cjk.CJKAnalyzer
> > > > > > >      [java] phrase = 경
> > > > > > >      [java] query = 경
> > > > > > >
> > > > > > >      [echo] Running lia.analysis.i18n.KoreanDemo...
> > > > > > >      [java]  analyzer = org.apache.lucene.analysis.standard.StandardAnalyzer
> > > > > > >      [java] phrase = 경기
> > > > > > >      [java] query =
> > > > > > >
> > > > > > >      [echo] Running lia.analysis.i18n.KoreanDemo...
> > > > > > >      [java] [경기]  analyzer = org.apache.lucene.analysis.cjk.CJKAnalyzer
> > > > > > >      [java] phrase = 경기
> > > > > > >      [java] query = 경기
> > > > > > >
> > > > > > >
> > > > > > > Standard analyzer didn't tokenized the Korean Character at all....
> > > > > > >
> > > > > > > Ug....  look like
> > > > > > >  http://issues.apache.org/jira/browse/LUCENE-444
> > > > > > >  didn't effect at all for Korean.
> > > > > > >
> > > > > > >
> > > > > > > Thanks
> > > > > > >
> > > > > > > Youngho
> > > > > > >
> > > > > > > ----- Original Message -----
> > > > > > > From: "Koji Sekiguchi" <ko...@m4.dion.ne.jp>
> > > > > > > To: <ja...@lucene.apache.org>; "Youngho Cho" <yo...@nannet.co.kr>
> > > > > > > Sent: Thursday, October 27, 2005 11:47 AM
> > > > > > > Subject: RE: korean and lucene
> > > > > > >
> > > > > > >
> > > > > > > > Hello Youngho,
> > > > > > > >
> > > > > > > > I don't understand why you couldn't get hits result in Japanese,
> > > > > > > > though, you had better check why the query was empty with Korean data:
> > > > > > > >
> > > > > > > > > For Korean
> > > > > > > > >      [echo] Running lia.analysis.i18n.KoreanDemo...
> > > > > > > > >      [java] phrase = 경
> > > > > > > > >      [java] query =
> > > > > > > >
> > > > > > > > The last line should be query = 경
> > > > > > > > to get hits result. Can you check why StandardAnalyzer
> > > > > > > > removes "경" during tokenizing?
> > > > > > > >
> > > > > > > > Koji
> > > > > > > >
> > > > > > > > > -----Original Message-----
> > > > > > > > > From: Youngho Cho [mailto:youngho@nannet.co.kr]
> > > > > > > > > Sent: Thursday, October 27, 2005 11:37 AM
> > > > > > > > > To: java-user@lucene.apache.org
> > > > > > > > > Subject: Re: korean and lucene
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Hello Koji,
> > > > > > > > >
> > > > > > > > > Thanks for your kind reply.
> > > > > > > > >
> > > > > > > > > Yes, I used QueryParser. normaly I used
> > > > > > > > > Query = QueryParser.parse( ) method.
> > > > > > > > >
> > > > > > > > > I put your sample code into lia.analysis.i18n package in LuceneAction
> > > > > > > > > and run JapaneseDemo using 1.4 and 1.9
> > > > > > > > >
> > > > > > > > > results are
> > > > > > > > >
> > > > > > > > >      [echo] Running lia.analysis.i18n.JapaneseDemo...
> > > > > > > > >      [java] query = content:ラ?メン屋
> > > > > > > > >
> > > > > > > > > I can't get hits result.
> > > > > > > > >
> > > > > > > > > For Korean
> > > > > > > > >      [echo] Running lia.analysis.i18n.KoreanDemo...
> > > > > > > > >      [java] phrase = 경
> > > > > > > > >      [java] query =
> > > > > > > > >
> > > > > > > > > I can't get query parse result.
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > >
> > > > > > > > > Youngho
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > ----- Original Message -----
> > > > > > > > > From: "Koji Sekiguchi" <ko...@m4.dion.ne.jp>
> > > > > > > > > To: <ja...@lucene.apache.org>; "Youngho Cho" <yo...@nannet.co.kr>
> > > > > > > > > Sent: Thursday, October 27, 2005 9:48 AM
> > > > > > > > > Subject: RE: korean and lucene
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > Hi Youngho,
> > > > > > > > > >
> > > > > > > > > > With regard to Japanese, using StandardAnalyzer,
> > > > > > > > > > I can search a word/phase.
> > > > > > > > > >
> > > > > > > > > > Did you use QueryParser? StandardAnalyzer tokenizes
> > > > > > > > > > CJK characters into a stream of single character.
> > > > > > > > > > Use QueryParser to get a PhraseQuery and search the query.
> > > > > > > > > >
> > > > > > > > > > Please see the following sample code. Replace Japanese
> > > > > > > > > > "contents" and (search target) "phrase" with Korean in the
> > > > > > > > > program and run.
> > > > > > > > > >
> > > > > > > > > > regards,
> > > > > > > > > >
> > > > > > > > > > Koji
> > > > > > > > > >
> > > > > > > > > > =============================================
> > > > > > > > > > import java.io.IOException;
> > > > > > > > > > import org.apache.lucene.analysis.Analyzer;
> > > > > > > > > > import org.apache.lucene.analysis.standard.StandardAnalyzer;
> > > > > > > > > > import org.apache.lucene.analysis.cjk.CJKAnalyzer;
> > > > > > > > > > import org.apache.lucene.store.Directory;
> > > > > > > > > > import org.apache.lucene.store.RAMDirectory;
> > > > > > > > > > import org.apache.lucene.index.IndexWriter;
> > > > > > > > > > import org.apache.lucene.document.Document;
> > > > > > > > > > import org.apache.lucene.document.Field;
> > > > > > > > > > import org.apache.lucene.search.IndexSearcher;
> > > > > > > > > > import org.apache.lucene.search.Hits;
> > > > > > > > > > import org.apache.lucene.search.Query;
> > > > > > > > > > import org.apache.lucene.queryParser.QueryParser;
> > > > > > > > > > import org.apache.lucene.queryParser.ParseException;
> > > > > > > > > >
> > > > > > > > > > public class JapaneseByStandardAnalyzer {
> > > > > > > > > >
> > > > > > > > > >     private static final String FIELD_CONTENT = "content";
> > > > > > > > > >     private static final String[] contents = {
> > > > > > > > > > "東京にはおいしいラーメン屋がたくさんあります。",
> > > > > > > > > > "北海道にもおいしいラーメン屋があります。"
> > > > > > > > > >     };
> > > > > > > > > >     private static final String phrase = "ラーメン屋";
> > > > > > > > > >     //private static final String phrase = "屋";
> > > > > > > > > >     private static Analyzer analyzer = null;
> > > > > > > > > >
> > > > > > > > > >     public static void main( String[] args ) throws
> > > > > > > > > IOException, ParseException {
> > > > > > > > > > Directory directory = makeIndex();
> > > > > > > > > > search( directory );
> > > > > > > > > > directory.close();
> > > > > > > > > >     }
> > > > > > > > > >
> > > > > > > > > >     private static Analyzer getAnalyzer(){
> > > > > > > > > > if( analyzer == null ){
> > > > > > > > > >     analyzer = new StandardAnalyzer();
> > > > > > > > > >     //analyzer = new CJKAnalyzer();
> > > > > > > > > > }
> > > > > > > > > > return analyzer;
> > > > > > > > > >     }
> > > > > > > > > >
> > > > > > > > > >     private static Directory makeIndex() throws IOException {
> > > > > > > > > > Directory directory = new RAMDirectory();
> > > > > > > > > > IndexWriter writer = new IndexWriter( directory, getAnalyzer(), true );
> > > > > > > > > > for( int i = 0; i < contents.length; i++ ){
> > > > > > > > > >     Document doc = new Document();
> > > > > > > > > >     doc.add( new Field( FIELD_CONTENT, contents[i],
> > > > > > > > > Field.Store.YES, Field.Index.TOKENIZED ) );
> > > > > > > > > >     writer.addDocument( doc );
> > > > > > > > > > }
> > > > > > > > > > writer.close();
> > > > > > > > > > return directory;
> > > > > > > > > >     }
> > > > > > > > > >
> > > > > > > > > >     private static void search( Directory directory ) throws
> > > > > > > > > IOException, ParseException {
> > > > > > > > > > IndexSearcher searcher = new IndexSearcher( directory );
> > > > > > > > > > QueryParser parser = new QueryParser( FIELD_CONTENT, getAnalyzer() );
> > > > > > > > > > Query query = parser.parse( phrase );
> > > > > > > > > > System.out.println( "query = " + query );
> > > > > > > > > > Hits hits = searcher.search( query );
> > > > > > > > > > for( int i = 0; i < hits.length(); i++ )
> > > > > > > > > >     System.out.println( "doc = " + hits.doc( i ).get( FIELD_CONTENT ) );
> > > > > > > > > > searcher.close();
> > > > > > > > > >     }
> > > > > > > > > > }
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > -----Original Message-----
> > > > > > > > > > > From: Youngho Cho [mailto:youngho@nannet.co.kr]
> > > > > > > > > > > Sent: Thursday, October 27, 2005 8:18 AM
> > > > > > > > > > > To: java-user@lucene.apache.org; Cheolgoo Kang
> > > > > > > > > > > Subject: Re: korean and lucene
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Hello Cheolgoo,
> > > > > > > > > > >
> > > > > > > > > > > Now I updated my lucene version to 1.9 for using StandardAnalyzer
> > > > > > > > > > > for Korean.
> > > > > > > > > > > And tested your patch which is already adopted in 1.9
> > > > > > > > > > >
> > > > > > > > > > > http://issues.apache.org/jira/browse/LUCENE-444
> > > > > > > > > > >
> > > > > > > > > > > But Still I have no good  results with Korean compare with
> > > > > > > > > CJKAnalyzer.
> > > > > > > > > > >
> > > > > > > > > > > Single character is good match but more two character word
> > > > > > > > > > > doesn't match at all.
> > > > > > > > > > >
> > > > > > > > > > > Am I something missing or still there need some more works ?
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Thanks,
> > > > > > > > > > >
> > > > > > > > > > > Youngho.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > ----- Original Message -----
> > > > > > > > > > > From: "Cheolgoo Kang" <ap...@gmail.com>
> > > > > > > > > > > To: <ja...@lucene.apache.org>; "John Wang" <jo...@gmail.com>
> > > > > > > > > > > Sent: Tuesday, October 04, 2005 10:11 AM
> > > > > > > > > > > Subject: Re: korean and lucene
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > StandardAnalyzer's JavaCC based StandardTokenizer.jj cannot read
> > > > > > > > > > > > Korean part of Unicode character blocks.
> > > > > > > > > > > >
> > > > > > > > > > > > You should 1) use CJKAnalyzer or 2) add Korean character
> > > > > > > > > > > > block(0xAC00~0xD7AF) to the CJK token definition on the
> > > > > > > > > > > > StandardTokenizer.jj file.
> > > > > > > > > > > >
> > > > > > > > > > > > Hope it helps.
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > On 10/4/05, John Wang <jo...@gmail.com> wrote:
> > > > > > > > > > > > > Hi:
> > > > > > > > > > > > >
> > > > > > > > > > > > > We are running into problems with searching on korean
> > > > > > > > > > > documents. We are
> > > > > > > > > > > > > using the StandardAnalyzer and everything works with Chinese
> > > > > > > > > > > and Japanese.
> > > > > > > > > > > > > Are there known problems with Korean with Lucene?
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thanks
> > > > > > > > > > > > >
> > > > > > > > > > > > > -John
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > --
> > > > > > > > > > > > Cheolgoo
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > ---------------------------------------------------------------------
> > > > > > > > > > > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > > > > > > > > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > ---------------------------------------------------------------------
> > > > > > > > > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > > > > > > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > ---------------------------------------------------------------------
> > > > > > > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > > > > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > > > > ---------------------------------------------------------------------
> > > > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > > > >
> > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > > Cheolgoo
> 
> 
> --
> Cheolgoo

Re: korean and lucene

Posted by Cheolgoo Kang <ap...@gmail.com>.
On 11/8/05, Youngho Cho <yo...@nannet.co.kr> wrote:
> Hello,
>
> Just a simple test...
> If I compiled the JavaCC grammar correctly,
> the patched version doesn't match in some situations.
> For example, in the text
> '엔진박지성(맨체스터 유나이티드)이 주말 프리미어리그를 위해 벤치를 지키며 재충전의 시간을 가졌다.'
> if the query word is '시간' then nothing matches,
> but if the query word is '시간을' then it matches.

That's exactly what I wanted to do with this patch. You need a more
sophisticated morphological analyzer to do what you want here.
AFAIK, I'm afraid there is no open-source software that does Korean
morphological analysis.

Also, if you have the Korean translation of Lucene in Action, there is a
simple introduction to Korean stemming analysis in Appendix D. The idea
is simple: keep a list of Korean word endings and check each Token
against that list, stripping any ending that matches. :)
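
If you want to try that approach without the book, here is a rough,
untested sketch of such a filter against the old 1.x TokenFilter API.
The ending list below is only an illustration I made up, not the list
from Appendix D, and a real filter would need a much larger list:

import java.io.IOException;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

public class SimpleKoreanEndingFilter extends TokenFilter {

    // Hypothetical sample of common word endings; extend as needed.
    private static final String[] ENDINGS = { "을", "를", "이", "가", "은", "는", "의", "에" };

    public SimpleKoreanEndingFilter(TokenStream input) {
        super(input);
    }

    public Token next() throws IOException {
        Token token = input.next();
        if (token == null) {
            return null;
        }
        String text = token.termText();
        for (int i = 0; i < ENDINGS.length; i++) {
            String ending = ENDINGS[i];
            // Strip the ending only if something is left of the word.
            if (text.length() > ending.length() && text.endsWith(ending)) {
                String stemmed = text.substring(0, text.length() - ending.length());
                return new Token(stemmed, token.startOffset(), token.endOffset(), token.type());
            }
        }
        return token;
    }
}

You would wrap it around the tokenizer in your Analyzer at both index
and query time, so that '시간을' and '시간' end up as the same term.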

>
> I think there is a tradeoff here.
>
> Maybe we need a good stop filter for Korean, etc.
>
>
> Thanks
>
> Youngho
>
> ----- Original Message -----
> From: "Youngho Cho" <yo...@nannet.co.kr>
> To: <ja...@lucene.apache.org>
> Sent: Tuesday, November 08, 2005 4:44 PM
> Subject: Re: korean and lucene
>
>
> > Hello Cheolgoo,
> >
> > I will test the patch.
> >
> >
> > Thanks,
> >
> > Youngho
> >
> > ----- Original Message -----
> > From: "Cheolgoo Kang" <ap...@gmail.com>
> > To: <ja...@lucene.apache.org>; "Youngho Cho" <yo...@nannet.co.kr>
> > Sent: Tuesday, November 08, 2005 4:06 PM
> > Subject: Re: korean and lucene
> >
> >
> > > Hello,
> > >
> > > I've created a new JIRA issue with Korean analysis that
> > > StandardAnalyzer splits one word into several tokens each with one
> > > character. Cause Korean is not a phonogram, one character in Korean
> > > has almost no meaning at all. So word in Korean should be preserved
> > > not like Chinese or Japanese.
> > >
> > > I've attached a patch to StandardTokenizer to do this and passed the
> > > test case TestStandardAnalyzer(also patched to test Korean words).
> > >
> > > Thanks!
> > >
> > > On 10/27/05, Youngho Cho <yo...@nannet.co.kr> wrote:
> > > > Hello,
> > > >
> > > > Ok , I've attached my test code for Korean which is slitely modified Koji's code.
> > > >
> > > > Just put into the lia.analysis.i18n package at LuceneInAction
> > > > and run ant.
> > > >
> > > > Hopely someone is helped.
> > > >
> > > > -------- build.xml  ---------
> > > >
> > > >   <target name="JapaneseDemo" depends="prepare"
> > > >           description="Examples of Jananese analysis">
> > > >     <info>
> > > >
> > > >       Japanese Test...
> > > >
> > > >     </info>
> > > >
> > > >     <run-main class="lia.analysis.i18n.JapaneseDemo"/>
> > > >   </target>
> > > >
> > > >   <target name="KoreanDemo" depends="prepare"
> > > >           description="Examples of Korean analysis">
> > > >     <info>
> > > >
> > > >       Korean Test...
> > > >
> > > >     </info>
> > > >
> > > >     <run-main class="lia.analysis.i18n.KoreanDemo"/>
> > > >   </target>
> > > >
> > > >
> > > > Thanks,
> > > >
> > > > Youngho
> > > >
> > > >
> > > > ----- Original Message -----
> > > > From: "Youngho Cho" <yo...@nannet.co.kr>
> > > > To: <ja...@lucene.apache.org>; "Youngho Cho" <yo...@nannet.co.kr>
> > > > Sent: Thursday, October 27, 2005 12:47 PM
> > > > Subject: Re: korean and lucene
> > > >
> > > >
> > > > > Hello all
> > > > > Plese forgive me pervious my stupid message
> > > > >
> > > > >      [echo] Running lia.analysis.i18n.KoreanDemo...
> > > > >      [java] [경] [기]  analyzer = org.apache.lucene.analysis.standard.StandardAnalyzer
> > > > >      [java] phrase = 경기
> > > > >      [java] query = "경 기"
> > > > >
> > > > > I got the good result.
> > > > >
> > > > > When I compile I just rename old version lucene-1.4.3.jar to lucene-1.4.3.jar_bak
> > > > > and all new 1.9 lucene. and build the test package.
> > > > > After I remove lucene-1.4.3.jar_bak in lib directory completely
> > > > > I got the expected result !!!.
> > > > >
> > > > > I don't know the reason... ( looks like my finger make some trouble... )
> > > > >
> > > > > Anyway thanks Koji and Cheolgoo
> > > > > I will further test now...
> > > > >
> > > > > Youngho
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > ----- Original Message -----
> > > > > From: "Youngho Cho" <yo...@nannet.co.kr>
> > > > > To: <ja...@lucene.apache.org>
> > > > > Sent: Thursday, October 27, 2005 12:28 PM
> > > > > Subject: Re: korean and lucene
> > > > >
> > > > >
> > > > > > Hello Koji
> > > > > >
> > > > > > Here is test result.
> > > > > > Japanese is OK !.
> > > > > > maybe ant clean  did some effect.
> > > > > >
> > > > > > Anyway please refer to the result using 1.9
> > > > > >
> > > > > >      [echo] Running lia.analysis.i18n.JapaneseDemo...
> > > > > >      [java] [ラ] [メ] [ン] [屋]  analyzer = org.apache.lucene.analysis.standard.StandardAnalyzer
> > > > > >      [java] phrase = ラ?メン屋
> > > > > >      [java] query = content:ラ?メン屋
> > > > > >
> > > > > >     [echo] Running lia.analysis.i18n.KoreanDemo...
> > > > > >      [java]  analyzer = org.apache.lucene.analysis.standard.StandardAnalyzer
> > > > > >      [java] phrase = 경
> > > > > >      [java] query =
> > > > > >
> > > > > >      [echo] Running lia.analysis.i18n.JapaneseDemo...
> > > > > >      [java] [ラ] [メン] [ン屋]  analyzer = org.apache.lucene.analysis.cjk.CJKAnalyzer
> > > > > >      [java] phrase = ラ?メン屋
> > > > > >      [java] query = content:ラ?メン屋
> > > > > >
> > > > > >     [echo] Running lia.analysis.i18n.KoreanDemo...
> > > > > >      [java] [경]  analyzer = org.apache.lucene.analysis.cjk.CJKAnalyzer
> > > > > >      [java] phrase = 경
> > > > > >      [java] query = 경
> > > > > >
> > > > > >      [echo] Running lia.analysis.i18n.KoreanDemo...
> > > > > >      [java]  analyzer = org.apache.lucene.analysis.standard.StandardAnalyzer
> > > > > >      [java] phrase = 경기
> > > > > >      [java] query =
> > > > > >
> > > > > >      [echo] Running lia.analysis.i18n.KoreanDemo...
> > > > > >      [java] [경기]  analyzer = org.apache.lucene.analysis.cjk.CJKAnalyzer
> > > > > >      [java] phrase = 경기
> > > > > >      [java] query = 경기
> > > > > >
> > > > > >
> > > > > > Standard analyzer didn't tokenized the Korean Character at all....
> > > > > >
> > > > > > Ug....  look like
> > > > > >  http://issues.apache.org/jira/browse/LUCENE-444
> > > > > >  didn't effect at all for Korean.
> > > > > >
> > > > > >
> > > > > > Thanks
> > > > > >
> > > > > > Youngho
> > > > > >
> > > > > > ----- Original Message -----
> > > > > > From: "Koji Sekiguchi" <ko...@m4.dion.ne.jp>
> > > > > > To: <ja...@lucene.apache.org>; "Youngho Cho" <yo...@nannet.co.kr>
> > > > > > Sent: Thursday, October 27, 2005 11:47 AM
> > > > > > Subject: RE: korean and lucene
> > > > > >
> > > > > >
> > > > > > > Hello Youngho,
> > > > > > >
> > > > > > > I don't understand why you couldn't get hits result in Japanese,
> > > > > > > though, you had better check why the query was empty with Korean data:
> > > > > > >
> > > > > > > > For Korean
> > > > > > > >      [echo] Running lia.analysis.i18n.KoreanDemo...
> > > > > > > >      [java] phrase = 경
> > > > > > > >      [java] query =
> > > > > > >
> > > > > > > The last line should be query = 경
> > > > > > > to get hits result. Can you check why StandardAnalyzer
> > > > > > > removes "경" during tokenizing?
> > > > > > >
> > > > > > > Koji
> > > > > > >
> > > > > > > > -----Original Message-----
> > > > > > > > From: Youngho Cho [mailto:youngho@nannet.co.kr]
> > > > > > > > Sent: Thursday, October 27, 2005 11:37 AM
> > > > > > > > To: java-user@lucene.apache.org
> > > > > > > > Subject: Re: korean and lucene
> > > > > > > >
> > > > > > > >
> > > > > > > > Hello Koji,
> > > > > > > >
> > > > > > > > Thanks for your kind reply.
> > > > > > > >
> > > > > > > > Yes, I used QueryParser. normaly I used
> > > > > > > > Query = QueryParser.parse( ) method.
> > > > > > > >
> > > > > > > > I put your sample code into lia.analysis.i18n package in LuceneAction
> > > > > > > > and run JapaneseDemo using 1.4 and 1.9
> > > > > > > >
> > > > > > > > results are
> > > > > > > >
> > > > > > > >      [echo] Running lia.analysis.i18n.JapaneseDemo...
> > > > > > > >      [java] query = content:ラ?メン屋
> > > > > > > >
> > > > > > > > I can't get hits result.
> > > > > > > >
> > > > > > > > For Korean
> > > > > > > >      [echo] Running lia.analysis.i18n.KoreanDemo...
> > > > > > > >      [java] phrase = 경
> > > > > > > >      [java] query =
> > > > > > > >
> > > > > > > > I can't get query parse result.
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > >
> > > > > > > > Youngho
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > ----- Original Message -----
> > > > > > > > From: "Koji Sekiguchi" <ko...@m4.dion.ne.jp>
> > > > > > > > To: <ja...@lucene.apache.org>; "Youngho Cho" <yo...@nannet.co.kr>
> > > > > > > > Sent: Thursday, October 27, 2005 9:48 AM
> > > > > > > > Subject: RE: korean and lucene
> > > > > > > >
> > > > > > > >
> > > > > > > > > Hi Youngho,
> > > > > > > > >
> > > > > > > > > With regard to Japanese, using StandardAnalyzer,
> > > > > > > > > I can search a word/phase.
> > > > > > > > >
> > > > > > > > > Did you use QueryParser? StandardAnalyzer tokenizes
> > > > > > > > > CJK characters into a stream of single character.
> > > > > > > > > Use QueryParser to get a PhraseQuery and search the query.
> > > > > > > > >
> > > > > > > > > Please see the following sample code. Replace Japanese
> > > > > > > > > "contents" and (search target) "phrase" with Korean in the
> > > > > > > > program and run.
> > > > > > > > >
> > > > > > > > > regards,
> > > > > > > > >
> > > > > > > > > Koji
> > > > > > > > >
> > > > > > > > > =============================================
> > > > > > > > > import java.io.IOException;
> > > > > > > > > import org.apache.lucene.analysis.Analyzer;
> > > > > > > > > import org.apache.lucene.analysis.standard.StandardAnalyzer;
> > > > > > > > > import org.apache.lucene.analysis.cjk.CJKAnalyzer;
> > > > > > > > > import org.apache.lucene.store.Directory;
> > > > > > > > > import org.apache.lucene.store.RAMDirectory;
> > > > > > > > > import org.apache.lucene.index.IndexWriter;
> > > > > > > > > import org.apache.lucene.document.Document;
> > > > > > > > > import org.apache.lucene.document.Field;
> > > > > > > > > import org.apache.lucene.search.IndexSearcher;
> > > > > > > > > import org.apache.lucene.search.Hits;
> > > > > > > > > import org.apache.lucene.search.Query;
> > > > > > > > > import org.apache.lucene.queryParser.QueryParser;
> > > > > > > > > import org.apache.lucene.queryParser.ParseException;
> > > > > > > > >
> > > > > > > > > public class JapaneseByStandardAnalyzer {
> > > > > > > > >
> > > > > > > > >     private static final String FIELD_CONTENT = "content";
> > > > > > > > >     private static final String[] contents = {
> > > > > > > > > "東京にはおいしいラーメン屋がたくさんあります。",
> > > > > > > > > "北海道にもおいしいラーメン屋があります。"
> > > > > > > > >     };
> > > > > > > > >     private static final String phrase = "ラーメン屋";
> > > > > > > > >     //private static final String phrase = "屋";
> > > > > > > > >     private static Analyzer analyzer = null;
> > > > > > > > >
> > > > > > > > >     public static void main( String[] args ) throws
> > > > > > > > IOException, ParseException {
> > > > > > > > > Directory directory = makeIndex();
> > > > > > > > > search( directory );
> > > > > > > > > directory.close();
> > > > > > > > >     }
> > > > > > > > >
> > > > > > > > >     private static Analyzer getAnalyzer(){
> > > > > > > > > if( analyzer == null ){
> > > > > > > > >     analyzer = new StandardAnalyzer();
> > > > > > > > >     //analyzer = new CJKAnalyzer();
> > > > > > > > > }
> > > > > > > > > return analyzer;
> > > > > > > > >     }
> > > > > > > > >
> > > > > > > > >     private static Directory makeIndex() throws IOException {
> > > > > > > > > Directory directory = new RAMDirectory();
> > > > > > > > > IndexWriter writer = new IndexWriter( directory, getAnalyzer(), true );
> > > > > > > > > for( int i = 0; i < contents.length; i++ ){
> > > > > > > > >     Document doc = new Document();
> > > > > > > > >     doc.add( new Field( FIELD_CONTENT, contents[i],
> > > > > > > > Field.Store.YES, Field.Index.TOKENIZED ) );
> > > > > > > > >     writer.addDocument( doc );
> > > > > > > > > }
> > > > > > > > > writer.close();
> > > > > > > > > return directory;
> > > > > > > > >     }
> > > > > > > > >
> > > > > > > > >     private static void search( Directory directory ) throws
> > > > > > > > IOException, ParseException {
> > > > > > > > > IndexSearcher searcher = new IndexSearcher( directory );
> > > > > > > > > QueryParser parser = new QueryParser( FIELD_CONTENT, getAnalyzer() );
> > > > > > > > > Query query = parser.parse( phrase );
> > > > > > > > > System.out.println( "query = " + query );
> > > > > > > > > Hits hits = searcher.search( query );
> > > > > > > > > for( int i = 0; i < hits.length(); i++ )
> > > > > > > > >     System.out.println( "doc = " + hits.doc( i ).get( FIELD_CONTENT ) );
> > > > > > > > > searcher.close();
> > > > > > > > >     }
> > > > > > > > > }
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > -----Original Message-----
> > > > > > > > > > From: Youngho Cho [mailto:youngho@nannet.co.kr]
> > > > > > > > > > Sent: Thursday, October 27, 2005 8:18 AM
> > > > > > > > > > To: java-user@lucene.apache.org; Cheolgoo Kang
> > > > > > > > > > Subject: Re: korean and lucene
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Hello Cheolgoo,
> > > > > > > > > >
> > > > > > > > > > Now I updated my lucene version to 1.9 for using StandardAnalyzer
> > > > > > > > > > for Korean.
> > > > > > > > > > And tested your patch which is already adopted in 1.9
> > > > > > > > > >
> > > > > > > > > > http://issues.apache.org/jira/browse/LUCENE-444
> > > > > > > > > >
> > > > > > > > > > But Still I have no good  results with Korean compare with
> > > > > > > > CJKAnalyzer.
> > > > > > > > > >
> > > > > > > > > > Single character is good match but more two character word
> > > > > > > > > > doesn't match at all.
> > > > > > > > > >
> > > > > > > > > > Am I something missing or still there need some more works ?
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > >
> > > > > > > > > > Youngho.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > ----- Original Message -----
> > > > > > > > > > From: "Cheolgoo Kang" <ap...@gmail.com>
> > > > > > > > > > To: <ja...@lucene.apache.org>; "John Wang" <jo...@gmail.com>
> > > > > > > > > > Sent: Tuesday, October 04, 2005 10:11 AM
> > > > > > > > > > Subject: Re: korean and lucene
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > StandardAnalyzer's JavaCC based StandardTokenizer.jj cannot read
> > > > > > > > > > > Korean part of Unicode character blocks.
> > > > > > > > > > >
> > > > > > > > > > > You should 1) use CJKAnalyzer or 2) add Korean character
> > > > > > > > > > > block(0xAC00~0xD7AF) to the CJK token definition on the
> > > > > > > > > > > StandardTokenizer.jj file.
> > > > > > > > > > >
> > > > > > > > > > > Hope it helps.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > On 10/4/05, John Wang <jo...@gmail.com> wrote:
> > > > > > > > > > > > Hi:
> > > > > > > > > > > >
> > > > > > > > > > > > We are running into problems with searching on korean
> > > > > > > > > > documents. We are
> > > > > > > > > > > > using the StandardAnalyzer and everything works with Chinese
> > > > > > > > > > and Japanese.
> > > > > > > > > > > > Are there known problems with Korean with Lucene?
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks
> > > > > > > > > > > >
> > > > > > > > > > > > -John
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > --
> > > > > > > > > > > Cheolgoo
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > ---------------------------------------------------------------------
> > > > > > > > > > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > > > > > > > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > ---------------------------------------------------------------------
> > > > > > > > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > > > > > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > ---------------------------------------------------------------------
> > > > > > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > > > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > > > ---------------------------------------------------------------------
> > > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > > >
> > > >
> > > >
> > >
> > >
> > > --
> > > Cheolgoo


--
Cheolgoo

Re: korean and lucene

Posted by Youngho Cho <yo...@nannet.co.kr>.
Hello,

Just a simple test...
If I compiled the JavaCC grammar correctly,
the patched version doesn't match in some situations.
For example, in the text
'엔진박지성(맨체스터 유나이티드)이 주말 프리미어리그를 위해 벤치를 지키며 재충전의 시간을 가졌다.'
if the query word is '시간' then nothing matches,
but if the query word is '시간을' then it matches.
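
A quick, untested way to see what the patched StandardAnalyzer actually
emits for part of that sentence (using the 1.9 tokenStream API):

import java.io.IOException;
import java.io.StringReader;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class KoreanTokenDump {
    public static void main(String[] args) throws IOException {
        // Print each token the analyzer produces for a small Korean fragment.
        TokenStream stream = new StandardAnalyzer().tokenStream(
                "content", new StringReader("재충전의 시간을 가졌다"));
        for (Token token = stream.next(); token != null; token = stream.next()) {
            System.out.print("[" + token.termText() + "] ");
        }
        System.out.println();
    }
}

If whole words such as [시간을] come out as single tokens, that explains
why the bare query '시간' finds nothing.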

I think there is a tradeoff here.

Maybe we need a good stop filter for Korean, etc.


Thanks

Youngho

----- Original Message ----- 
From: "Youngho Cho" <yo...@nannet.co.kr>
To: <ja...@lucene.apache.org>
Sent: Tuesday, November 08, 2005 4:44 PM
Subject: Re: korean and lucene


> Hello Cheolgoo,
> 
> I will test the patch.
> 
> 
> Thanks,
> 
> Youngho
> 
> ----- Original Message ----- 
> From: "Cheolgoo Kang" <ap...@gmail.com>
> To: <ja...@lucene.apache.org>; "Youngho Cho" <yo...@nannet.co.kr>
> Sent: Tuesday, November 08, 2005 4:06 PM
> Subject: Re: korean and lucene
> 
> 
> > Hello,
> > 
> > I've created a new JIRA issue with Korean analysis that
> > StandardAnalyzer splits one word into several tokens each with one
> > character. Cause Korean is not a phonogram, one character in Korean
> > has almost no meaning at all. So word in Korean should be preserved
> > not like Chinese or Japanese.
> > 
> > I've attached a patch to StandardTokenizer to do this and passed the
> > test case TestStandardAnalyzer(also patched to test Korean words).
> > 
> > Thanks!
> > 
> > On 10/27/05, Youngho Cho <yo...@nannet.co.kr> wrote:
> > > Hello,
> > >
> > > Ok , I've attached my test code for Korean which is slitely modified Koji's code.
> > >
> > > Just put into the lia.analysis.i18n package at LuceneInAction
> > > and run ant.
> > >
> > > Hopely someone is helped.
> > >
> > > -------- build.xml  ---------
> > >
> > >   <target name="JapaneseDemo" depends="prepare"
> > >           description="Examples of Jananese analysis">
> > >     <info>
> > >
> > >       Japanese Test...
> > >
> > >     </info>
> > >
> > >     <run-main class="lia.analysis.i18n.JapaneseDemo"/>
> > >   </target>
> > >
> > >   <target name="KoreanDemo" depends="prepare"
> > >           description="Examples of Korean analysis">
> > >     <info>
> > >
> > >       Korean Test...
> > >
> > >     </info>
> > >
> > >     <run-main class="lia.analysis.i18n.KoreanDemo"/>
> > >   </target>
> > >
> > >
> > > Thanks,
> > >
> > > Youngho
> > >
> > >
> > > ----- Original Message -----
> > > From: "Youngho Cho" <yo...@nannet.co.kr>
> > > To: <ja...@lucene.apache.org>; "Youngho Cho" <yo...@nannet.co.kr>
> > > Sent: Thursday, October 27, 2005 12:47 PM
> > > Subject: Re: korean and lucene
> > >
> > >
> > > > Hello all
> > > > Plese forgive me pervious my stupid message
> > > >
> > > >      [echo] Running lia.analysis.i18n.KoreanDemo...
> > > >      [java] [경] [기]  analyzer = org.apache.lucene.analysis.standard.StandardAnalyzer
> > > >      [java] phrase = 경기
> > > >      [java] query = "경 기"
> > > >
> > > > I got the good result.
> > > >
> > > > When I compile I just rename old version lucene-1.4.3.jar to lucene-1.4.3.jar_bak
> > > > and all new 1.9 lucene. and build the test package.
> > > > After I remove lucene-1.4.3.jar_bak in lib directory completely
> > > > I got the expected result !!!.
> > > >
> > > > I don't know the reason... ( looks like my finger make some trouble... )
> > > >
> > > > Anyway thanks Koji and Cheolgoo
> > > > I will further test now...
> > > >
> > > > Youngho
> > > >
> > > >
> > > >
> > > >
> > > > ----- Original Message -----
> > > > From: "Youngho Cho" <yo...@nannet.co.kr>
> > > > To: <ja...@lucene.apache.org>
> > > > Sent: Thursday, October 27, 2005 12:28 PM
> > > > Subject: Re: korean and lucene
> > > >
> > > >
> > > > > Hello Koji
> > > > >
> > > > > Here is test result.
> > > > > Japanese is OK !.
> > > > > maybe ant clean  did some effect.
> > > > >
> > > > > Anyway please refer to the result using 1.9
> > > > >
> > > > >      [echo] Running lia.analysis.i18n.JapaneseDemo...
> > > > >      [java] [ラ] [メ] [ン] [屋]  analyzer = org.apache.lucene.analysis.standard.StandardAnalyzer
> > > > >      [java] phrase = ラ?メン屋
> > > > >      [java] query = content:ラ?メン屋
> > > > >
> > > > >     [echo] Running lia.analysis.i18n.KoreanDemo...
> > > > >      [java]  analyzer = org.apache.lucene.analysis.standard.StandardAnalyzer
> > > > >      [java] phrase = 경
> > > > >      [java] query =
> > > > >
> > > > >      [echo] Running lia.analysis.i18n.JapaneseDemo...
> > > > >      [java] [ラ] [メン] [ン屋]  analyzer = org.apache.lucene.analysis.cjk.CJKAnalyzer
> > > > >      [java] phrase = ラ?メン屋
> > > > >      [java] query = content:ラ?メン屋
> > > > >
> > > > >     [echo] Running lia.analysis.i18n.KoreanDemo...
> > > > >      [java] [경]  analyzer = org.apache.lucene.analysis.cjk.CJKAnalyzer
> > > > >      [java] phrase = 경
> > > > >      [java] query = 경
> > > > >
> > > > >      [echo] Running lia.analysis.i18n.KoreanDemo...
> > > > >      [java]  analyzer = org.apache.lucene.analysis.standard.StandardAnalyzer
> > > > >      [java] phrase = 경기
> > > > >      [java] query =
> > > > >
> > > > >      [echo] Running lia.analysis.i18n.KoreanDemo...
> > > > >      [java] [경기]  analyzer = org.apache.lucene.analysis.cjk.CJKAnalyzer
> > > > >      [java] phrase = 경기
> > > > >      [java] query = 경기
> > > > >
> > > > >
> > > > > Standard analyzer didn't tokenized the Korean Character at all....
> > > > >
> > > > > Ug....  look like
> > > > >  http://issues.apache.org/jira/browse/LUCENE-444
> > > > >  didn't effect at all for Korean.
> > > > >
> > > > >
> > > > > Thanks
> > > > >
> > > > > Youngho
> > > > >
> > > > > ----- Original Message -----
> > > > > From: "Koji Sekiguchi" <ko...@m4.dion.ne.jp>
> > > > > To: <ja...@lucene.apache.org>; "Youngho Cho" <yo...@nannet.co.kr>
> > > > > Sent: Thursday, October 27, 2005 11:47 AM
> > > > > Subject: RE: korean and lucene
> > > > >
> > > > >
> > > > > > Hello Youngho,
> > > > > >
> > > > > > I don't understand why you couldn't get hits result in Japanese,
> > > > > > though, you had better check why the query was empty with Korean data:
> > > > > >
> > > > > > > For Korean
> > > > > > >      [echo] Running lia.analysis.i18n.KoreanDemo...
> > > > > > >      [java] phrase = 경
> > > > > > >      [java] query =
> > > > > >
> > > > > > The last line should be query = 경
> > > > > > to get hits result. Can you check why StandardAnalyzer
> > > > > > removes "경" during tokenizing?
> > > > > >
> > > > > > Koji
> > > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: Youngho Cho [mailto:youngho@nannet.co.kr]
> > > > > > > Sent: Thursday, October 27, 2005 11:37 AM
> > > > > > > To: java-user@lucene.apache.org
> > > > > > > Subject: Re: korean and lucene
> > > > > > >
> > > > > > >
> > > > > > > Hello Koji,
> > > > > > >
> > > > > > > Thanks for your kind reply.
> > > > > > >
> > > > > > > Yes, I used QueryParser. normaly I used
> > > > > > > Query = QueryParser.parse( ) method.
> > > > > > >
> > > > > > > I put your sample code into lia.analysis.i18n package in LuceneAction
> > > > > > > and run JapaneseDemo using 1.4 and 1.9
> > > > > > >
> > > > > > > results are
> > > > > > >
> > > > > > >      [echo] Running lia.analysis.i18n.JapaneseDemo...
> > > > > > >      [java] query = content:ラ?メン屋
> > > > > > >
> > > > > > > I can't get hits result.
> > > > > > >
> > > > > > > For Korean
> > > > > > >      [echo] Running lia.analysis.i18n.KoreanDemo...
> > > > > > >      [java] phrase = 경
> > > > > > >      [java] query =
> > > > > > >
> > > > > > > I can't get query parse result.
> > > > > > >
> > > > > > > Thanks,
> > > > > > >
> > > > > > > Youngho
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > ----- Original Message -----
> > > > > > > From: "Koji Sekiguchi" <ko...@m4.dion.ne.jp>
> > > > > > > To: <ja...@lucene.apache.org>; "Youngho Cho" <yo...@nannet.co.kr>
> > > > > > > Sent: Thursday, October 27, 2005 9:48 AM
> > > > > > > Subject: RE: korean and lucene
> > > > > > >
> > > > > > >
> > > > > > > > Hi Youngho,
> > > > > > > >
> > > > > > > > With regard to Japanese, using StandardAnalyzer,
> > > > > > > > I can search a word/phase.
> > > > > > > >
> > > > > > > > Did you use QueryParser? StandardAnalyzer tokenizes
> > > > > > > > CJK characters into a stream of single character.
> > > > > > > > Use QueryParser to get a PhraseQuery and search the query.
> > > > > > > >
> > > > > > > > Please see the following sample code. Replace Japanese
> > > > > > > > "contents" and (search target) "phrase" with Korean in the
> > > > > > > program and run.
> > > > > > > >
> > > > > > > > regards,
> > > > > > > >
> > > > > > > > Koji
> > > > > > > >
> > > > > > > > =============================================
> > > > > > > > import java.io.IOException;
> > > > > > > > import org.apache.lucene.analysis.Analyzer;
> > > > > > > > import org.apache.lucene.analysis.standard.StandardAnalyzer;
> > > > > > > > import org.apache.lucene.analysis.cjk.CJKAnalyzer;
> > > > > > > > import org.apache.lucene.store.Directory;
> > > > > > > > import org.apache.lucene.store.RAMDirectory;
> > > > > > > > import org.apache.lucene.index.IndexWriter;
> > > > > > > > import org.apache.lucene.document.Document;
> > > > > > > > import org.apache.lucene.document.Field;
> > > > > > > > import org.apache.lucene.search.IndexSearcher;
> > > > > > > > import org.apache.lucene.search.Hits;
> > > > > > > > import org.apache.lucene.search.Query;
> > > > > > > > import org.apache.lucene.queryParser.QueryParser;
> > > > > > > > import org.apache.lucene.queryParser.ParseException;
> > > > > > > >
> > > > > > > > public class JapaneseByStandardAnalyzer {
> > > > > > > >
> > > > > > > >     private static final String FIELD_CONTENT = "content";
> > > > > > > >     private static final String[] contents = {
> > > > > > > > "東京にはおいしいラーメン屋がたくさんあります。",
> > > > > > > > "北海道にもおいしいラーメン屋があります。"
> > > > > > > >     };
> > > > > > > >     private static final String phrase = "ラーメン屋";
> > > > > > > >     //private static final String phrase = "屋";
> > > > > > > >     private static Analyzer analyzer = null;
> > > > > > > >
> > > > > > > >     public static void main( String[] args ) throws
> > > > > > > IOException, ParseException {
> > > > > > > > Directory directory = makeIndex();
> > > > > > > > search( directory );
> > > > > > > > directory.close();
> > > > > > > >     }
> > > > > > > >
> > > > > > > >     private static Analyzer getAnalyzer(){
> > > > > > > > if( analyzer == null ){
> > > > > > > >     analyzer = new StandardAnalyzer();
> > > > > > > >     //analyzer = new CJKAnalyzer();
> > > > > > > > }
> > > > > > > > return analyzer;
> > > > > > > >     }
> > > > > > > >
> > > > > > > >     private static Directory makeIndex() throws IOException {
> > > > > > > > Directory directory = new RAMDirectory();
> > > > > > > > IndexWriter writer = new IndexWriter( directory, getAnalyzer(), true );
> > > > > > > > for( int i = 0; i < contents.length; i++ ){
> > > > > > > >     Document doc = new Document();
> > > > > > > >     doc.add( new Field( FIELD_CONTENT, contents[i],
> > > > > > > Field.Store.YES, Field.Index.TOKENIZED ) );
> > > > > > > >     writer.addDocument( doc );
> > > > > > > > }
> > > > > > > > writer.close();
> > > > > > > > return directory;
> > > > > > > >     }
> > > > > > > >
> > > > > > > >     private static void search( Directory directory ) throws
> > > > > > > IOException, ParseException {
> > > > > > > > IndexSearcher searcher = new IndexSearcher( directory );
> > > > > > > > QueryParser parser = new QueryParser( FIELD_CONTENT, getAnalyzer() );
> > > > > > > > Query query = parser.parse( phrase );
> > > > > > > > System.out.println( "query = " + query );
> > > > > > > > Hits hits = searcher.search( query );
> > > > > > > > for( int i = 0; i < hits.length(); i++ )
> > > > > > > >     System.out.println( "doc = " + hits.doc( i ).get( FIELD_CONTENT ) );
> > > > > > > > searcher.close();
> > > > > > > >     }
> > > > > > > > }
> > > > > > > >
> > > > > > > >
> > > > > > > > > -----Original Message-----
> > > > > > > > > From: Youngho Cho [mailto:youngho@nannet.co.kr]
> > > > > > > > > Sent: Thursday, October 27, 2005 8:18 AM
> > > > > > > > > To: java-user@lucene.apache.org; Cheolgoo Kang
> > > > > > > > > Subject: Re: korean and lucene
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Hello Cheolgoo,
> > > > > > > > >
> > > > > > > > > Now I updated my lucene version to 1.9 for using StandardAnalyzer
> > > > > > > > > for Korean.
> > > > > > > > > And tested your patch which is already adopted in 1.9
> > > > > > > > >
> > > > > > > > > http://issues.apache.org/jira/browse/LUCENE-444
> > > > > > > > >
> > > > > > > > > But Still I have no good  results with Korean compare with
> > > > > > > CJKAnalyzer.
> > > > > > > > >
> > > > > > > > > Single character is good match but more two character word
> > > > > > > > > doesn't match at all.
> > > > > > > > >
> > > > > > > > > Am I something missing or still there need some more works ?
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > >
> > > > > > > > > Youngho.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > ----- Original Message -----
> > > > > > > > > From: "Cheolgoo Kang" <ap...@gmail.com>
> > > > > > > > > To: <ja...@lucene.apache.org>; "John Wang" <jo...@gmail.com>
> > > > > > > > > Sent: Tuesday, October 04, 2005 10:11 AM
> > > > > > > > > Subject: Re: korean and lucene
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > StandardAnalyzer's JavaCC based StandardTokenizer.jj cannot read
> > > > > > > > > > Korean part of Unicode character blocks.
> > > > > > > > > >
> > > > > > > > > > You should 1) use CJKAnalyzer or 2) add Korean character
> > > > > > > > > > block(0xAC00~0xD7AF) to the CJK token definition on the
> > > > > > > > > > StandardTokenizer.jj file.
> > > > > > > > > >
> > > > > > > > > > Hope it helps.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On 10/4/05, John Wang <jo...@gmail.com> wrote:
> > > > > > > > > > > Hi:
> > > > > > > > > > >
> > > > > > > > > > > We are running into problems with searching on korean
> > > > > > > > > documents. We are
> > > > > > > > > > > using the StandardAnalyzer and everything works with Chinese
> > > > > > > > > and Japanese.
> > > > > > > > > > > Are there known problems with Korean with Lucene?
> > > > > > > > > > >
> > > > > > > > > > > Thanks
> > > > > > > > > > >
> > > > > > > > > > > -John
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > --
> > > > > > > > > > Cheolgoo
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > ---------------------------------------------------------------------
> > > > > > > > > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > > > > > > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > ---------------------------------------------------------------------
> > > > > > > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > > > > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > ---------------------------------------------------------------------
> > > > > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > >
> > >
> > >
> > 
> > 
> > --
> > Cheolgoo

Re: korean and lucene

Posted by Youngho Cho <yo...@nannet.co.kr>.
Hello Cheolgoo,

I will test the patch.


Thanks,

Youngho

----- Original Message ----- 
From: "Cheolgoo Kang" <ap...@gmail.com>
To: <ja...@lucene.apache.org>; "Youngho Cho" <yo...@nannet.co.kr>
Sent: Tuesday, November 08, 2005 4:06 PM
Subject: Re: korean and lucene


> Hello,
> 
> I've created a new JIRA issue with Korean analysis that
> StandardAnalyzer splits one word into several tokens each with one
> character. Cause Korean is not a phonogram, one character in Korean
> has almost no meaning at all. So word in Korean should be preserved
> not like Chinese or Japanese.
> 
> I've attached a patch to StandardTokenizer to do this and passed the
> test case TestStandardAnalyzer(also patched to test Korean words).
> 
> Thanks!
> 
> On 10/27/05, Youngho Cho <yo...@nannet.co.kr> wrote:
> > Hello,
> >
> > Ok , I've attached my test code for Korean which is slitely modified Koji's code.
> >
> > Just put into the lia.analysis.i18n package at LuceneInAction
> > and run ant.
> >
> > Hopely someone is helped.
> >
> > -------- build.xml  ---------
> >
> >   <target name="JapaneseDemo" depends="prepare"
> >           description="Examples of Jananese analysis">
> >     <info>
> >
> >       Japanese Test...
> >
> >     </info>
> >
> >     <run-main class="lia.analysis.i18n.JapaneseDemo"/>
> >   </target>
> >
> >   <target name="KoreanDemo" depends="prepare"
> >           description="Examples of Korean analysis">
> >     <info>
> >
> >       Korean Test...
> >
> >     </info>
> >
> >     <run-main class="lia.analysis.i18n.KoreanDemo"/>
> >   </target>
> >
> >
> > Thanks,
> >
> > Youngho
> >
> >
> > ----- Original Message -----
> > From: "Youngho Cho" <yo...@nannet.co.kr>
> > To: <ja...@lucene.apache.org>; "Youngho Cho" <yo...@nannet.co.kr>
> > Sent: Thursday, October 27, 2005 12:47 PM
> > Subject: Re: korean and lucene
> >
> >
> > > Hello all
> > > Plese forgive me pervious my stupid message
> > >
> > >      [echo] Running lia.analysis.i18n.KoreanDemo...
> > >      [java] [경] [기]  analyzer = org.apache.lucene.analysis.standard.StandardAnalyzer
> > >      [java] phrase = 경기
> > >      [java] query = "경 기"
> > >
> > > I got the good result.
> > >
> > > When I compile I just rename old version lucene-1.4.3.jar to lucene-1.4.3.jar_bak
> > > and all new 1.9 lucene. and build the test package.
> > > After I remove lucene-1.4.3.jar_bak in lib directory completely
> > > I got the expected result !!!.
> > >
> > > I don't know the reason... ( looks like my finger make some trouble... )
> > >
> > > Anyway thanks Koji and Cheolgoo
> > > I will further test now...
> > >
> > > Youngho
> > >
> > >
> > >
> > >
> > > ----- Original Message -----
> > > From: "Youngho Cho" <yo...@nannet.co.kr>
> > > To: <ja...@lucene.apache.org>
> > > Sent: Thursday, October 27, 2005 12:28 PM
> > > Subject: Re: korean and lucene
> > >
> > >
> > > > Hello Koji
> > > >
> > > > Here is test result.
> > > > Japanese is OK !.
> > > > maybe ant clean  did some effect.
> > > >
> > > > Anyway please refer to the result using 1.9
> > > >
> > > >      [echo] Running lia.analysis.i18n.JapaneseDemo...
> > > >      [java] [ラ] [メ] [ン] [屋]  analyzer = org.apache.lucene.analysis.standard.StandardAnalyzer
> > > >      [java] phrase = ラ?メン屋
> > > >      [java] query = content:ラ?メン屋
> > > >
> > > >     [echo] Running lia.analysis.i18n.KoreanDemo...
> > > >      [java]  analyzer = org.apache.lucene.analysis.standard.StandardAnalyzer
> > > >      [java] phrase = 경
> > > >      [java] query =
> > > >
> > > >      [echo] Running lia.analysis.i18n.JapaneseDemo...
> > > >      [java] [ラ] [メン] [ン屋]  analyzer = org.apache.lucene.analysis.cjk.CJKAnalyzer
> > > >      [java] phrase = ラ?メン屋
> > > >      [java] query = content:ラ?メン屋
> > > >
> > > >     [echo] Running lia.analysis.i18n.KoreanDemo...
> > > >      [java] [경]  analyzer = org.apache.lucene.analysis.cjk.CJKAnalyzer
> > > >      [java] phrase = 경
> > > >      [java] query = 경
> > > >
> > > >      [echo] Running lia.analysis.i18n.KoreanDemo...
> > > >      [java]  analyzer = org.apache.lucene.analysis.standard.StandardAnalyzer
> > > >      [java] phrase = 경기
> > > >      [java] query =
> > > >
> > > >      [echo] Running lia.analysis.i18n.KoreanDemo...
> > > >      [java] [경기]  analyzer = org.apache.lucene.analysis.cjk.CJKAnalyzer
> > > >      [java] phrase = 경기
> > > >      [java] query = 경기
> > > >
> > > >
> > > > Standard analyzer didn't tokenized the Korean Character at all....
> > > >
> > > > Ug....  look like
> > > >  http://issues.apache.org/jira/browse/LUCENE-444
> > > >  didn't effect at all for Korean.
> > > >
> > > >
> > > > Thanks
> > > >
> > > > Youngho
> > > >
> > > > ----- Original Message -----
> > > > From: "Koji Sekiguchi" <ko...@m4.dion.ne.jp>
> > > > To: <ja...@lucene.apache.org>; "Youngho Cho" <yo...@nannet.co.kr>
> > > > Sent: Thursday, October 27, 2005 11:47 AM
> > > > Subject: RE: korean and lucene
> > > >
> > > >
> > > > > Hello Youngho,
> > > > >
> > > > > I don't understand why you couldn't get hits result in Japanese,
> > > > > though, you had better check why the query was empty with Korean data:
> > > > >
> > > > > > For Korean
> > > > > >      [echo] Running lia.analysis.i18n.KoreanDemo...
> > > > > >      [java] phrase = 경
> > > > > >      [java] query =
> > > > >
> > > > > The last line should be query = 경
> > > > > to get hits result. Can you check why StandardAnalyzer
> > > > > removes "경" during tokenizing?
> > > > >
> > > > > Koji
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: Youngho Cho [mailto:youngho@nannet.co.kr]
> > > > > > Sent: Thursday, October 27, 2005 11:37 AM
> > > > > > To: java-user@lucene.apache.org
> > > > > > Subject: Re: korean and lucene
> > > > > >
> > > > > >
> > > > > > Hello Koji,
> > > > > >
> > > > > > Thanks for your kind reply.
> > > > > >
> > > > > > Yes, I used QueryParser. normaly I used
> > > > > > Query = QueryParser.parse( ) method.
> > > > > >
> > > > > > I put your sample code into lia.analysis.i18n package in LuceneAction
> > > > > > and run JapaneseDemo using 1.4 and 1.9
> > > > > >
> > > > > > results are
> > > > > >
> > > > > >      [echo] Running lia.analysis.i18n.JapaneseDemo...
> > > > > >      [java] query = content:ラ?メン屋
> > > > > >
> > > > > > I can't get hits result.
> > > > > >
> > > > > > For Korean
> > > > > >      [echo] Running lia.analysis.i18n.KoreanDemo...
> > > > > >      [java] phrase = 경
> > > > > >      [java] query =
> > > > > >
> > > > > > I can't get query parse result.
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > Youngho
> > > > > >
> > > > > >
> > > > > >
> > > > > > ----- Original Message -----
> > > > > > From: "Koji Sekiguchi" <ko...@m4.dion.ne.jp>
> > > > > > To: <ja...@lucene.apache.org>; "Youngho Cho" <yo...@nannet.co.kr>
> > > > > > Sent: Thursday, October 27, 2005 9:48 AM
> > > > > > Subject: RE: korean and lucene
> > > > > >
> > > > > >
> > > > > > > Hi Youngho,
> > > > > > >
> > > > > > > With regard to Japanese, using StandardAnalyzer,
> > > > > > > I can search a word/phase.
> > > > > > >
> > > > > > > Did you use QueryParser? StandardAnalyzer tokenizes
> > > > > > > CJK characters into a stream of single character.
> > > > > > > Use QueryParser to get a PhraseQuery and search the query.
> > > > > > >
> > > > > > > Please see the following sample code. Replace Japanese
> > > > > > > "contents" and (search target) "phrase" with Korean in the
> > > > > > program and run.
> > > > > > >
> > > > > > > regards,
> > > > > > >
> > > > > > > Koji
> > > > > > >
> > > > > > > =============================================
> > > > > > > import java.io.IOException;
> > > > > > > import org.apache.lucene.analysis.Analyzer;
> > > > > > > import org.apache.lucene.analysis.standard.StandardAnalyzer;
> > > > > > > import org.apache.lucene.analysis.cjk.CJKAnalyzer;
> > > > > > > import org.apache.lucene.store.Directory;
> > > > > > > import org.apache.lucene.store.RAMDirectory;
> > > > > > > import org.apache.lucene.index.IndexWriter;
> > > > > > > import org.apache.lucene.document.Document;
> > > > > > > import org.apache.lucene.document.Field;
> > > > > > > import org.apache.lucene.search.IndexSearcher;
> > > > > > > import org.apache.lucene.search.Hits;
> > > > > > > import org.apache.lucene.search.Query;
> > > > > > > import org.apache.lucene.queryParser.QueryParser;
> > > > > > > import org.apache.lucene.queryParser.ParseException;
> > > > > > >
> > > > > > > public class JapaneseByStandardAnalyzer {
> > > > > > >
> > > > > > >     private static final String FIELD_CONTENT = "content";
> > > > > > >     private static final String[] contents = {
> > > > > > > "東京にはおいしいラーメン屋がたくさんあります。",
> > > > > > > "北海道にもおいしいラーメン屋があります。"
> > > > > > >     };
> > > > > > >     private static final String phrase = "ラーメン屋";
> > > > > > >     //private static final String phrase = "屋";
> > > > > > >     private static Analyzer analyzer = null;
> > > > > > >
> > > > > > >     public static void main( String[] args ) throws
> > > > > > IOException, ParseException {
> > > > > > > Directory directory = makeIndex();
> > > > > > > search( directory );
> > > > > > > directory.close();
> > > > > > >     }
> > > > > > >
> > > > > > >     private static Analyzer getAnalyzer(){
> > > > > > > if( analyzer == null ){
> > > > > > >     analyzer = new StandardAnalyzer();
> > > > > > >     //analyzer = new CJKAnalyzer();
> > > > > > > }
> > > > > > > return analyzer;
> > > > > > >     }
> > > > > > >
> > > > > > >     private static Directory makeIndex() throws IOException {
> > > > > > > Directory directory = new RAMDirectory();
> > > > > > > IndexWriter writer = new IndexWriter( directory, getAnalyzer(), true );
> > > > > > > for( int i = 0; i < contents.length; i++ ){
> > > > > > >     Document doc = new Document();
> > > > > > >     doc.add( new Field( FIELD_CONTENT, contents[i],
> > > > > > Field.Store.YES, Field.Index.TOKENIZED ) );
> > > > > > >     writer.addDocument( doc );
> > > > > > > }
> > > > > > > writer.close();
> > > > > > > return directory;
> > > > > > >     }
> > > > > > >
> > > > > > >     private static void search( Directory directory ) throws
> > > > > > IOException, ParseException {
> > > > > > > IndexSearcher searcher = new IndexSearcher( directory );
> > > > > > > QueryParser parser = new QueryParser( FIELD_CONTENT, getAnalyzer() );
> > > > > > > Query query = parser.parse( phrase );
> > > > > > > System.out.println( "query = " + query );
> > > > > > > Hits hits = searcher.search( query );
> > > > > > > for( int i = 0; i < hits.length(); i++ )
> > > > > > >     System.out.println( "doc = " + hits.doc( i ).get( FIELD_CONTENT ) );
> > > > > > > searcher.close();
> > > > > > >     }
> > > > > > > }
> > > > > > >
> > > > > > >
> > > > > > > > -----Original Message-----
> > > > > > > > From: Youngho Cho [mailto:youngho@nannet.co.kr]
> > > > > > > > Sent: Thursday, October 27, 2005 8:18 AM
> > > > > > > > To: java-user@lucene.apache.org; Cheolgoo Kang
> > > > > > > > Subject: Re: korean and lucene
> > > > > > > >
> > > > > > > >
> > > > > > > > Hello Cheolgoo,
> > > > > > > >
> > > > > > > > Now I updated my lucene version to 1.9 for using StandardAnalyzer
> > > > > > > > for Korean.
> > > > > > > > And tested your patch which is already adopted in 1.9
> > > > > > > >
> > > > > > > > http://issues.apache.org/jira/browse/LUCENE-444
> > > > > > > >
> > > > > > > > But Still I have no good  results with Korean compare with
> > > > > > CJKAnalyzer.
> > > > > > > >
> > > > > > > > Single character is good match but more two character word
> > > > > > > > doesn't match at all.
> > > > > > > >
> > > > > > > > Am I something missing or still there need some more works ?
> > > > > > > >
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > >
> > > > > > > > Youngho.
> > > > > > > >
> > > > > > > >
> > > > > > > > ----- Original Message -----
> > > > > > > > From: "Cheolgoo Kang" <ap...@gmail.com>
> > > > > > > > To: <ja...@lucene.apache.org>; "John Wang" <jo...@gmail.com>
> > > > > > > > Sent: Tuesday, October 04, 2005 10:11 AM
> > > > > > > > Subject: Re: korean and lucene
> > > > > > > >
> > > > > > > >
> > > > > > > > > StandardAnalyzer's JavaCC based StandardTokenizer.jj cannot read
> > > > > > > > > Korean part of Unicode character blocks.
> > > > > > > > >
> > > > > > > > > You should 1) use CJKAnalyzer or 2) add Korean character
> > > > > > > > > block(0xAC00~0xD7AF) to the CJK token definition on the
> > > > > > > > > StandardTokenizer.jj file.
> > > > > > > > >
> > > > > > > > > Hope it helps.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On 10/4/05, John Wang <jo...@gmail.com> wrote:
> > > > > > > > > > Hi:
> > > > > > > > > >
> > > > > > > > > > We are running into problems with searching on korean
> > > > > > > > documents. We are
> > > > > > > > > > using the StandardAnalyzer and everything works with Chinese
> > > > > > > > and Japanese.
> > > > > > > > > > Are there known problems with Korean with Lucene?
> > > > > > > > > >
> > > > > > > > > > Thanks
> > > > > > > > > >
> > > > > > > > > > -John
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > Cheolgoo
> > > > > > > > >
> > > > > > > > >
> > > > > > ---------------------------------------------------------------------
> > > > > > > > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > > > > > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > ---------------------------------------------------------------------
> > > > > > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > > > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > > ---------------------------------------------------------------------
> > > > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
> >
> >
> 
> 
> --
> Cheolgoo

Re: korean and lucene

Posted by Cheolgoo Kang <ap...@gmail.com>.
Hello,

I've created a new JIRA issue about Korean analysis: StandardAnalyzer
splits one Korean word into several tokens of one character each.
Because Korean is not written with ideographic characters, a single
Korean character carries almost no meaning on its own, so a Korean word
should be preserved whole, unlike Chinese or Japanese.

I've attached a patch to StandardTokenizer that does this; it passes
the test case TestStandardAnalyzer (also patched to cover Korean words).
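
Not the attached patch or test itself, but a rough sketch of the kind of
check involved, with made-up example words (assuming a JUnit 3 style
TestCase, as the Lucene tests of that time use):

import java.io.IOException;
import java.io.StringReader;
import junit.framework.TestCase;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class KoreanTokenizationSketch extends TestCase {

    // A whole Korean word should come through StandardAnalyzer as one token.
    public void testKoreanWordIsOneToken() throws IOException {
        TokenStream stream = new StandardAnalyzer().tokenStream(
                "content", new StringReader("한글 분석"));
        Token first = stream.next();
        assertEquals("한글", first.termText());
        Token second = stream.next();
        assertEquals("분석", second.termText());
        assertNull(stream.next());
    }
}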

Thanks!

On 10/27/05, Youngho Cho <yo...@nannet.co.kr> wrote:
> Hello,
>
> Ok , I've attached my test code for Korean which is slitely modified Koji's code.
>
> Just put into the lia.analysis.i18n package at LuceneInAction
> and run ant.
>
> Hopely someone is helped.
>
> -------- build.xml  ---------
>
>   <target name="JapaneseDemo" depends="prepare"
>           description="Examples of Jananese analysis">
>     <info>
>
>       Japanese Test...
>
>     </info>
>
>     <run-main class="lia.analysis.i18n.JapaneseDemo"/>
>   </target>
>
>   <target name="KoreanDemo" depends="prepare"
>           description="Examples of Korean analysis">
>     <info>
>
>       Korean Test...
>
>     </info>
>
>     <run-main class="lia.analysis.i18n.KoreanDemo"/>
>   </target>
>
>
> Thanks,
>
> Youngho
>
>
> ----- Original Message -----
> From: "Youngho Cho" <yo...@nannet.co.kr>
> To: <ja...@lucene.apache.org>; "Youngho Cho" <yo...@nannet.co.kr>
> Sent: Thursday, October 27, 2005 12:47 PM
> Subject: Re: korean and lucene
>
>
> > Hello all
> > Plese forgive me pervious my stupid message
> >
> >      [echo] Running lia.analysis.i18n.KoreanDemo...
> >      [java] [경] [기]  analyzer = org.apache.lucene.analysis.standard.StandardAnalyzer
> >      [java] phrase = 경기
> >      [java] query = "경 기"
> >
> > I got the good result.
> >
> > When I compile I just rename old version lucene-1.4.3.jar to lucene-1.4.3.jar_bak
> > and all new 1.9 lucene. and build the test package.
> > After I remove lucene-1.4.3.jar_bak in lib directory completely
> > I got the expected result !!!.
> >
> > I don't know the reason... ( looks like my finger make some trouble... )
> >
> > Anyway thanks Koji and Cheolgoo
> > I will further test now...
> >
> > Youngho
> >
> >
> >
> >
> > ----- Original Message -----
> > From: "Youngho Cho" <yo...@nannet.co.kr>
> > To: <ja...@lucene.apache.org>
> > Sent: Thursday, October 27, 2005 12:28 PM
> > Subject: Re: korean and lucene
> >
> >
> > > Hello Koji
> > >
> > > Here is test result.
> > > Japanese is OK !.
> > > maybe ant clean  did some effect.
> > >
> > > Anyway please refer to the result using 1.9
> > >
> > >      [echo] Running lia.analysis.i18n.JapaneseDemo...
> > >      [java] [ラ] [メ] [ン] [屋]  analyzer = org.apache.lucene.analysis.standard.StandardAnalyzer
> > >      [java] phrase = ラ?メン屋
> > >      [java] query = content:ラ?メン屋
> > >
> > >     [echo] Running lia.analysis.i18n.KoreanDemo...
> > >      [java]  analyzer = org.apache.lucene.analysis.standard.StandardAnalyzer
> > >      [java] phrase = 경
> > >      [java] query =
> > >
> > >      [echo] Running lia.analysis.i18n.JapaneseDemo...
> > >      [java] [ラ] [メン] [ン屋]  analyzer = org.apache.lucene.analysis.cjk.CJKAnalyzer
> > >      [java] phrase = ラ?メン屋
> > >      [java] query = content:ラ?メン屋
> > >
> > >     [echo] Running lia.analysis.i18n.KoreanDemo...
> > >      [java] [경]  analyzer = org.apache.lucene.analysis.cjk.CJKAnalyzer
> > >      [java] phrase = 경
> > >      [java] query = 경
> > >
> > >      [echo] Running lia.analysis.i18n.KoreanDemo...
> > >      [java]  analyzer = org.apache.lucene.analysis.standard.StandardAnalyzer
> > >      [java] phrase = 경기
> > >      [java] query =
> > >
> > >      [echo] Running lia.analysis.i18n.KoreanDemo...
> > >      [java] [경기]  analyzer = org.apache.lucene.analysis.cjk.CJKAnalyzer
> > >      [java] phrase = 경기
> > >      [java] query = 경기
> > >
> > >
> > > Standard analyzer didn't tokenized the Korean Character at all....
> > >
> > > Ug....  look like
> > >  http://issues.apache.org/jira/browse/LUCENE-444
> > >  didn't effect at all for Korean.
> > >
> > >
> > > Thanks
> > >
> > > Youngho
> > >
> > > ----- Original Message -----
> > > From: "Koji Sekiguchi" <ko...@m4.dion.ne.jp>
> > > To: <ja...@lucene.apache.org>; "Youngho Cho" <yo...@nannet.co.kr>
> > > Sent: Thursday, October 27, 2005 11:47 AM
> > > Subject: RE: korean and lucene
> > >
> > >
> > > > Hello Youngho,
> > > >
> > > > I don't understand why you couldn't get hits result in Japanese,
> > > > though, you had better check why the query was empty with Korean data:
> > > >
> > > > > For Korean
> > > > >      [echo] Running lia.analysis.i18n.KoreanDemo...
> > > > >      [java] phrase = 경
> > > > >      [java] query =
> > > >
> > > > The last line should be query = 경
> > > > to get hits result. Can you check why StandardAnalyzer
> > > > removes "경" during tokenizing?
> > > >
> > > > Koji
> > > >
> > > > > -----Original Message-----
> > > > > From: Youngho Cho [mailto:youngho@nannet.co.kr]
> > > > > Sent: Thursday, October 27, 2005 11:37 AM
> > > > > To: java-user@lucene.apache.org
> > > > > Subject: Re: korean and lucene
> > > > >
> > > > >
> > > > > Hello Koji,
> > > > >
> > > > > Thanks for your kind reply.
> > > > >
> > > > > Yes, I used QueryParser. normaly I used
> > > > > Query = QueryParser.parse( ) method.
> > > > >
> > > > > I put your sample code into lia.analysis.i18n package in LuceneAction
> > > > > and run JapaneseDemo using 1.4 and 1.9
> > > > >
> > > > > results are
> > > > >
> > > > >      [echo] Running lia.analysis.i18n.JapaneseDemo...
> > > > >      [java] query = content:ラ?メン屋
> > > > >
> > > > > I can't get hits result.
> > > > >
> > > > > For Korean
> > > > >      [echo] Running lia.analysis.i18n.KoreanDemo...
> > > > >      [java] phrase = 경
> > > > >      [java] query =
> > > > >
> > > > > I can't get query parse result.
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Youngho
> > > > >
> > > > >
> > > > >
> > > > > ----- Original Message -----
> > > > > From: "Koji Sekiguchi" <ko...@m4.dion.ne.jp>
> > > > > To: <ja...@lucene.apache.org>; "Youngho Cho" <yo...@nannet.co.kr>
> > > > > Sent: Thursday, October 27, 2005 9:48 AM
> > > > > Subject: RE: korean and lucene
> > > > >
> > > > >
> > > > > > Hi Youngho,
> > > > > >
> > > > > > With regard to Japanese, using StandardAnalyzer,
> > > > > > I can search a word/phase.
> > > > > >
> > > > > > Did you use QueryParser? StandardAnalyzer tokenizes
> > > > > > CJK characters into a stream of single character.
> > > > > > Use QueryParser to get a PhraseQuery and search the query.
> > > > > >
> > > > > > Please see the following sample code. Replace Japanese
> > > > > > "contents" and (search target) "phrase" with Korean in the
> > > > > program and run.
> > > > > >
> > > > > > regards,
> > > > > >
> > > > > > Koji
> > > > > >
> > > > > > =============================================
> > > > > > import java.io.IOException;
> > > > > > import org.apache.lucene.analysis.Analyzer;
> > > > > > import org.apache.lucene.analysis.standard.StandardAnalyzer;
> > > > > > import org.apache.lucene.analysis.cjk.CJKAnalyzer;
> > > > > > import org.apache.lucene.store.Directory;
> > > > > > import org.apache.lucene.store.RAMDirectory;
> > > > > > import org.apache.lucene.index.IndexWriter;
> > > > > > import org.apache.lucene.document.Document;
> > > > > > import org.apache.lucene.document.Field;
> > > > > > import org.apache.lucene.search.IndexSearcher;
> > > > > > import org.apache.lucene.search.Hits;
> > > > > > import org.apache.lucene.search.Query;
> > > > > > import org.apache.lucene.queryParser.QueryParser;
> > > > > > import org.apache.lucene.queryParser.ParseException;
> > > > > >
> > > > > > public class JapaneseByStandardAnalyzer {
> > > > > >
> > > > > >     private static final String FIELD_CONTENT = "content";
> > > > > >     private static final String[] contents = {
> > > > > > "東京にはおいしいラーメン屋がたくさんあります。",
> > > > > > "北海道にもおいしいラーメン屋があります。"
> > > > > >     };
> > > > > >     private static final String phrase = "ラーメン屋";
> > > > > >     //private static final String phrase = "屋";
> > > > > >     private static Analyzer analyzer = null;
> > > > > >
> > > > > >     public static void main( String[] args ) throws
> > > > > IOException, ParseException {
> > > > > > Directory directory = makeIndex();
> > > > > > search( directory );
> > > > > > directory.close();
> > > > > >     }
> > > > > >
> > > > > >     private static Analyzer getAnalyzer(){
> > > > > > if( analyzer == null ){
> > > > > >     analyzer = new StandardAnalyzer();
> > > > > >     //analyzer = new CJKAnalyzer();
> > > > > > }
> > > > > > return analyzer;
> > > > > >     }
> > > > > >
> > > > > >     private static Directory makeIndex() throws IOException {
> > > > > > Directory directory = new RAMDirectory();
> > > > > > IndexWriter writer = new IndexWriter( directory, getAnalyzer(), true );
> > > > > > for( int i = 0; i < contents.length; i++ ){
> > > > > >     Document doc = new Document();
> > > > > >     doc.add( new Field( FIELD_CONTENT, contents[i],
> > > > > Field.Store.YES, Field.Index.TOKENIZED ) );
> > > > > >     writer.addDocument( doc );
> > > > > > }
> > > > > > writer.close();
> > > > > > return directory;
> > > > > >     }
> > > > > >
> > > > > >     private static void search( Directory directory ) throws
> > > > > IOException, ParseException {
> > > > > > IndexSearcher searcher = new IndexSearcher( directory );
> > > > > > QueryParser parser = new QueryParser( FIELD_CONTENT, getAnalyzer() );
> > > > > > Query query = parser.parse( phrase );
> > > > > > System.out.println( "query = " + query );
> > > > > > Hits hits = searcher.search( query );
> > > > > > for( int i = 0; i < hits.length(); i++ )
> > > > > >     System.out.println( "doc = " + hits.doc( i ).get( FIELD_CONTENT ) );
> > > > > > searcher.close();
> > > > > >     }
> > > > > > }
> > > > > >
> > > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: Youngho Cho [mailto:youngho@nannet.co.kr]
> > > > > > > Sent: Thursday, October 27, 2005 8:18 AM
> > > > > > > To: java-user@lucene.apache.org; Cheolgoo Kang
> > > > > > > Subject: Re: korean and lucene
> > > > > > >
> > > > > > >
> > > > > > > Hello Cheolgoo,
> > > > > > >
> > > > > > > Now I updated my lucene version to 1.9 for using StandardAnalyzer
> > > > > > > for Korean.
> > > > > > > And tested your patch which is already adopted in 1.9
> > > > > > >
> > > > > > > http://issues.apache.org/jira/browse/LUCENE-444
> > > > > > >
> > > > > > > But Still I have no good  results with Korean compare with
> > > > > CJKAnalyzer.
> > > > > > >
> > > > > > > Single character is good match but more two character word
> > > > > > > doesn't match at all.
> > > > > > >
> > > > > > > Am I something missing or still there need some more works ?
> > > > > > >
> > > > > > >
> > > > > > > Thanks,
> > > > > > >
> > > > > > > Youngho.
> > > > > > >
> > > > > > >
> > > > > > > ----- Original Message -----
> > > > > > > From: "Cheolgoo Kang" <ap...@gmail.com>
> > > > > > > To: <ja...@lucene.apache.org>; "John Wang" <jo...@gmail.com>
> > > > > > > Sent: Tuesday, October 04, 2005 10:11 AM
> > > > > > > Subject: Re: korean and lucene
> > > > > > >
> > > > > > >
> > > > > > > > StandardAnalyzer's JavaCC based StandardTokenizer.jj cannot read
> > > > > > > > Korean part of Unicode character blocks.
> > > > > > > >
> > > > > > > > You should 1) use CJKAnalyzer or 2) add Korean character
> > > > > > > > block(0xAC00~0xD7AF) to the CJK token definition on the
> > > > > > > > StandardTokenizer.jj file.
> > > > > > > >
> > > > > > > > Hope it helps.
> > > > > > > >
> > > > > > > >
> > > > > > > > On 10/4/05, John Wang <jo...@gmail.com> wrote:
> > > > > > > > > Hi:
> > > > > > > > >
> > > > > > > > > We are running into problems with searching on korean
> > > > > > > documents. We are
> > > > > > > > > using the StandardAnalyzer and everything works with Chinese
> > > > > > > and Japanese.
> > > > > > > > > Are there known problems with Korean with Lucene?
> > > > > > > > >
> > > > > > > > > Thanks
> > > > > > > > >
> > > > > > > > > -John
> > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > > Cheolgoo
> > > > > > > >
> > > > > > > >
> > > > > ---------------------------------------------------------------------
> > > > > > > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > > > > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > > > > >
> > > > > >
> > > > > >
> > > > > > ---------------------------------------------------------------------
> > > > > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > > > >
> > > >
> > > >
> > > >
> > > > ---------------------------------------------------------------------
> > > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > For additional commands, e-mail: java-user-help@lucene.apache.org
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
>


--
Cheolgoo

Re: korean and lucene

Posted by Youngho Cho <yo...@nannet.co.kr>.
Hello,

OK, I've attached my test code for Korean, which is a slightly modified version of Koji's code.

Just put it into the lia.analysis.i18n package of Lucene in Action
and run ant.

Hopefully it helps someone.

-------- build.xml  ---------

  <target name="JapaneseDemo" depends="prepare"
          description="Examples of Jananese analysis">
    <info>

      Japanese Test...
   
    </info>

    <run-main class="lia.analysis.i18n.JapaneseDemo"/>
  </target>  

  <target name="KoreanDemo" depends="prepare"
          description="Examples of Korean analysis">
    <info>

      Korean Test...
   
    </info>

    <run-main class="lia.analysis.i18n.KoreanDemo"/>
  </target>  
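
The attached Java sources themselves are not preserved in this archive. As a rough
guide, here is a minimal sketch of what KoreanDemo might look like, adapted from
Koji's sample further down this thread; the class layout, test phrase, and output
format are illustrative assumptions, not the original attachment.

-------- KoreanDemo.java (sketch)  ---------

package lia.analysis.i18n;

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

public class KoreanDemo {

    private static final String FIELD = "content";
    private static final String phrase = "경기";

    public static void main(String[] args) throws IOException, ParseException {
        Analyzer analyzer = new StandardAnalyzer();

        // Dump the tokens the analyzer produces for the Korean phrase.
        TokenStream stream = analyzer.tokenStream(FIELD, new StringReader(phrase));
        for (Token token = stream.next(); token != null; token = stream.next()) {
            System.out.print("[" + token.termText() + "] ");
        }
        stream.close();
        System.out.println(" analyzer = " + analyzer.getClass().getName());
        System.out.println("phrase = " + phrase);

        // Parse the same phrase, so a multi-character word shows up as a PhraseQuery.
        QueryParser parser = new QueryParser(FIELD, analyzer);
        Query query = parser.parse(phrase);
        System.out.println("query = " + query.toString(FIELD));
    }
}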


Thanks,

Youngho


----- Original Message ----- 
From: "Youngho Cho" <yo...@nannet.co.kr>
To: <ja...@lucene.apache.org>; "Youngho Cho" <yo...@nannet.co.kr>
Sent: Thursday, October 27, 2005 12:47 PM
Subject: Re: korean and lucene


> Hello all
> Plese forgive me pervious my stupid message
> 
>      [echo] Running lia.analysis.i18n.KoreanDemo...
>      [java] [경] [기]  analyzer = org.apache.lucene.analysis.standard.StandardAnalyzer
>      [java] phrase = 경기
>      [java] query = "경 기"
> 
> I got the good result.
> 
> When I compile I just rename old version lucene-1.4.3.jar to lucene-1.4.3.jar_bak
> and all new 1.9 lucene. and build the test package.
> After I remove lucene-1.4.3.jar_bak in lib directory completely
> I got the expected result !!!.
> 
> I don't know the reason... ( looks like my finger make some trouble... )
> 
> Anyway thanks Koji and Cheolgoo
> I will further test now...
> 
> Youngho
> 
> 
> 
> 
> ----- Original Message ----- 
> From: "Youngho Cho" <yo...@nannet.co.kr>
> To: <ja...@lucene.apache.org>
> Sent: Thursday, October 27, 2005 12:28 PM
> Subject: Re: korean and lucene
> 
> 
> > Hello Koji
> > 
> > Here is test result.
> > Japanese is OK !.
> > maybe ant clean  did some effect.
> > 
> > Anyway please refer to the result using 1.9
> > 
> >      [echo] Running lia.analysis.i18n.JapaneseDemo...
> >      [java] [ラ] [メ] [ン] [屋]  analyzer = org.apache.lucene.analysis.standard.StandardAnalyzer
> >      [java] phrase = ラ?メン屋
> >      [java] query = content:ラ?メン屋
> >   
> >     [echo] Running lia.analysis.i18n.KoreanDemo...
> >      [java]  analyzer = org.apache.lucene.analysis.standard.StandardAnalyzer
> >      [java] phrase = 경
> >      [java] query =  
> >   
> >      [echo] Running lia.analysis.i18n.JapaneseDemo...
> >      [java] [ラ] [メン] [ン屋]  analyzer = org.apache.lucene.analysis.cjk.CJKAnalyzer
> >      [java] phrase = ラ?メン屋
> >      [java] query = content:ラ?メン屋
> >   
> >     [echo] Running lia.analysis.i18n.KoreanDemo...
> >      [java] [경]  analyzer = org.apache.lucene.analysis.cjk.CJKAnalyzer
> >      [java] phrase = 경
> >      [java] query = 경 
> > 
> >      [echo] Running lia.analysis.i18n.KoreanDemo...
> >      [java]  analyzer = org.apache.lucene.analysis.standard.StandardAnalyzer
> >      [java] phrase = 경기
> >      [java] query = 
> >   
> >      [echo] Running lia.analysis.i18n.KoreanDemo...
> >      [java] [경기]  analyzer = org.apache.lucene.analysis.cjk.CJKAnalyzer
> >      [java] phrase = 경기
> >      [java] query = 경기
> >   
> > 
> > Standard analyzer didn't tokenized the Korean Character at all....
> > 
> > Ug....  look like 
> >  http://issues.apache.org/jira/browse/LUCENE-444
> >  didn't effect at all for Korean.
> > 
> > 
> > Thanks 
> > 
> > Youngho
> > 
> > ----- Original Message ----- 
> > From: "Koji Sekiguchi" <ko...@m4.dion.ne.jp>
> > To: <ja...@lucene.apache.org>; "Youngho Cho" <yo...@nannet.co.kr>
> > Sent: Thursday, October 27, 2005 11:47 AM
> > Subject: RE: korean and lucene
> > 
> > 
> > > Hello Youngho,
> > > 
> > > I don't understand why you couldn't get hits result in Japanese,
> > > though, you had better check why the query was empty with Korean data:
> > > 
> > > > For Korean
> > > >      [echo] Running lia.analysis.i18n.KoreanDemo...
> > > >      [java] phrase = 경
> > > >      [java] query = 
> > > 
> > > The last line should be query = 경
> > > to get hits result. Can you check why StandardAnalyzer
> > > removes "경" during tokenizing?
> > > 
> > > Koji
> > > 
> > > > -----Original Message-----
> > > > From: Youngho Cho [mailto:youngho@nannet.co.kr]
> > > > Sent: Thursday, October 27, 2005 11:37 AM
> > > > To: java-user@lucene.apache.org
> > > > Subject: Re: korean and lucene
> > > > 
> > > > 
> > > > Hello Koji,
> > > > 
> > > > Thanks for your kind reply.
> > > > 
> > > > Yes, I used QueryParser. normaly I used
> > > > Query = QueryParser.parse( ) method.
> > > > 
> > > > I put your sample code into lia.analysis.i18n package in LuceneAction
> > > > and run JapaneseDemo using 1.4 and 1.9 
> > > > 
> > > > results are 
> > > > 
> > > >      [echo] Running lia.analysis.i18n.JapaneseDemo...
> > > >      [java] query = content:ラ?メン屋
> > > > 
> > > > I can't get hits result.
> > > > 
> > > > For Korean
> > > >      [echo] Running lia.analysis.i18n.KoreanDemo...
> > > >      [java] phrase = 경
> > > >      [java] query = 
> > > > 
> > > > I can't get query parse result.
> > > > 
> > > > Thanks,
> > > > 
> > > > Youngho
> > > > 
> > > > 
> > > > 
> > > > ----- Original Message ----- 
> > > > From: "Koji Sekiguchi" <ko...@m4.dion.ne.jp>
> > > > To: <ja...@lucene.apache.org>; "Youngho Cho" <yo...@nannet.co.kr>
> > > > Sent: Thursday, October 27, 2005 9:48 AM
> > > > Subject: RE: korean and lucene
> > > > 
> > > > 
> > > > > Hi Youngho,
> > > > > 
> > > > > With regard to Japanese, using StandardAnalyzer,
> > > > > I can search a word/phase.
> > > > > 
> > > > > Did you use QueryParser? StandardAnalyzer tokenizes
> > > > > CJK characters into a stream of single character.
> > > > > Use QueryParser to get a PhraseQuery and search the query.
> > > > > 
> > > > > Please see the following sample code. Replace Japanese
> > > > > "contents" and (search target) "phrase" with Korean in the 
> > > > program and run.
> > > > > 
> > > > > regards,
> > > > > 
> > > > > Koji
> > > > > 
> > > > > =============================================
> > > > > import java.io.IOException;
> > > > > import org.apache.lucene.analysis.Analyzer;
> > > > > import org.apache.lucene.analysis.standard.StandardAnalyzer;
> > > > > import org.apache.lucene.analysis.cjk.CJKAnalyzer;
> > > > > import org.apache.lucene.store.Directory;
> > > > > import org.apache.lucene.store.RAMDirectory;
> > > > > import org.apache.lucene.index.IndexWriter;
> > > > > import org.apache.lucene.document.Document;
> > > > > import org.apache.lucene.document.Field;
> > > > > import org.apache.lucene.search.IndexSearcher;
> > > > > import org.apache.lucene.search.Hits;
> > > > > import org.apache.lucene.search.Query;
> > > > > import org.apache.lucene.queryParser.QueryParser;
> > > > > import org.apache.lucene.queryParser.ParseException;
> > > > > 
> > > > > public class JapaneseByStandardAnalyzer {
> > > > > 
> > > > >     private static final String FIELD_CONTENT = "content";
> > > > >     private static final String[] contents = {
> > > > > "東京にはおいしいラーメン屋がたくさんあります。",
> > > > > "北海道にもおいしいラーメン屋があります。"
> > > > >     };
> > > > >     private static final String phrase = "ラーメン屋";
> > > > >     //private static final String phrase = "屋";
> > > > >     private static Analyzer analyzer = null;
> > > > > 
> > > > >     public static void main( String[] args ) throws 
> > > > IOException, ParseException {
> > > > > Directory directory = makeIndex();
> > > > > search( directory );
> > > > > directory.close();
> > > > >     }
> > > > > 
> > > > >     private static Analyzer getAnalyzer(){
> > > > > if( analyzer == null ){
> > > > >     analyzer = new StandardAnalyzer();
> > > > >     //analyzer = new CJKAnalyzer();
> > > > > }
> > > > > return analyzer;
> > > > >     }
> > > > > 
> > > > >     private static Directory makeIndex() throws IOException {
> > > > > Directory directory = new RAMDirectory();
> > > > > IndexWriter writer = new IndexWriter( directory, getAnalyzer(), true );
> > > > > for( int i = 0; i < contents.length; i++ ){
> > > > >     Document doc = new Document();
> > > > >     doc.add( new Field( FIELD_CONTENT, contents[i], 
> > > > Field.Store.YES, Field.Index.TOKENIZED ) );
> > > > >     writer.addDocument( doc );
> > > > > }
> > > > > writer.close();
> > > > > return directory;
> > > > >     }
> > > > > 
> > > > >     private static void search( Directory directory ) throws 
> > > > IOException, ParseException {
> > > > > IndexSearcher searcher = new IndexSearcher( directory );
> > > > > QueryParser parser = new QueryParser( FIELD_CONTENT, getAnalyzer() );
> > > > > Query query = parser.parse( phrase );
> > > > > System.out.println( "query = " + query );
> > > > > Hits hits = searcher.search( query );
> > > > > for( int i = 0; i < hits.length(); i++ )
> > > > >     System.out.println( "doc = " + hits.doc( i ).get( FIELD_CONTENT ) );
> > > > > searcher.close();
> > > > >     }
> > > > > }
> > > > > 
> > > > > 
> > > > > > -----Original Message-----
> > > > > > From: Youngho Cho [mailto:youngho@nannet.co.kr]
> > > > > > Sent: Thursday, October 27, 2005 8:18 AM
> > > > > > To: java-user@lucene.apache.org; Cheolgoo Kang
> > > > > > Subject: Re: korean and lucene
> > > > > > 
> > > > > > 
> > > > > > Hello Cheolgoo,
> > > > > > 
> > > > > > Now I updated my lucene version to 1.9 for using StandardAnalyzer 
> > > > > > for Korean.
> > > > > > And tested your patch which is already adopted in 1.9
> > > > > > 
> > > > > > http://issues.apache.org/jira/browse/LUCENE-444
> > > > > > 
> > > > > > But Still I have no good  results with Korean compare with 
> > > > CJKAnalyzer.
> > > > > > 
> > > > > > Single character is good match but more two character word 
> > > > > > doesn't match at all.
> > > > > > 
> > > > > > Am I something missing or still there need some more works ?
> > > > > > 
> > > > > > 
> > > > > > Thanks,
> > > > > > 
> > > > > > Youngho.
> > > > > >  
> > > > > > 
> > > > > > ----- Original Message ----- 
> > > > > > From: "Cheolgoo Kang" <ap...@gmail.com>
> > > > > > To: <ja...@lucene.apache.org>; "John Wang" <jo...@gmail.com>
> > > > > > Sent: Tuesday, October 04, 2005 10:11 AM
> > > > > > Subject: Re: korean and lucene
> > > > > > 
> > > > > > 
> > > > > > > StandardAnalyzer's JavaCC based StandardTokenizer.jj cannot read
> > > > > > > Korean part of Unicode character blocks.
> > > > > > > 
> > > > > > > You should 1) use CJKAnalyzer or 2) add Korean character
> > > > > > > block(0xAC00~0xD7AF) to the CJK token definition on the
> > > > > > > StandardTokenizer.jj file.
> > > > > > > 
> > > > > > > Hope it helps.
> > > > > > > 
> > > > > > > 
> > > > > > > On 10/4/05, John Wang <jo...@gmail.com> wrote:
> > > > > > > > Hi:
> > > > > > > >
> > > > > > > > We are running into problems with searching on korean 
> > > > > > documents. We are
> > > > > > > > using the StandardAnalyzer and everything works with Chinese 
> > > > > > and Japanese.
> > > > > > > > Are there known problems with Korean with Lucene?
> > > > > > > >
> > > > > > > > Thanks
> > > > > > > >
> > > > > > > > -John
> > > > > > > >
> > > > > > > >
> > > > > > > 
> > > > > > > 
> > > > > > > --
> > > > > > > Cheolgoo
> > > > > > > 
> > > > > > > 
> > > > ---------------------------------------------------------------------
> > > > > > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > > > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > > > > 
> > > > > 
> > > > > 
> > > > > ---------------------------------------------------------------------
> > > > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > > > 
> > > 
> > > 
> > > 
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > For additional commands, e-mail: java-user-help@lucene.apache.org

Re: korean and lucene

Posted by Youngho Cho <yo...@nannet.co.kr>.
Hello all
Please forgive my previous, mistaken message.

     [echo] Running lia.analysis.i18n.KoreanDemo...
     [java] [경] [기]  analyzer = org.apache.lucene.analysis.standard.StandardAnalyzer
     [java] phrase = 경기
     [java] query = "경 기"

I got the right result.

When I compiled, I had just renamed the old lucene-1.4.3.jar to lucene-1.4.3.jar_bak,
added the new 1.9 Lucene jar, and built the test package.
After I removed lucene-1.4.3.jar_bak from the lib directory completely,
I got the expected result!

I don't know the reason... (it looks like I slipped up somewhere.)

Anyway, thanks Koji and Cheolgoo.
I will test further now...

Youngho




----- Original Message ----- 
From: "Youngho Cho" <yo...@nannet.co.kr>
To: <ja...@lucene.apache.org>
Sent: Thursday, October 27, 2005 12:28 PM
Subject: Re: korean and lucene


> Hello Koji
> 
> Here is test result.
> Japanese is OK !.
> maybe ant clean  did some effect.
> 
> Anyway please refer to the result using 1.9
> 
>      [echo] Running lia.analysis.i18n.JapaneseDemo...
>      [java] [ラ] [メ] [ン] [屋]  analyzer = org.apache.lucene.analysis.standard.StandardAnalyzer
>      [java] phrase = ラ?メン屋
>      [java] query = content:ラ?メン屋
>   
>     [echo] Running lia.analysis.i18n.KoreanDemo...
>      [java]  analyzer = org.apache.lucene.analysis.standard.StandardAnalyzer
>      [java] phrase = 경
>      [java] query =  
>   
>      [echo] Running lia.analysis.i18n.JapaneseDemo...
>      [java] [ラ] [メン] [ン屋]  analyzer = org.apache.lucene.analysis.cjk.CJKAnalyzer
>      [java] phrase = ラ?メン屋
>      [java] query = content:ラ?メン屋
>   
>     [echo] Running lia.analysis.i18n.KoreanDemo...
>      [java] [경]  analyzer = org.apache.lucene.analysis.cjk.CJKAnalyzer
>      [java] phrase = 경
>      [java] query = 경 
> 
>      [echo] Running lia.analysis.i18n.KoreanDemo...
>      [java]  analyzer = org.apache.lucene.analysis.standard.StandardAnalyzer
>      [java] phrase = 경기
>      [java] query = 
>   
>      [echo] Running lia.analysis.i18n.KoreanDemo...
>      [java] [경기]  analyzer = org.apache.lucene.analysis.cjk.CJKAnalyzer
>      [java] phrase = 경기
>      [java] query = 경기
>   
> 
> Standard analyzer didn't tokenized the Korean Character at all....
> 
> Ug....  look like 
>  http://issues.apache.org/jira/browse/LUCENE-444
>  didn't effect at all for Korean.
> 
> 
> Thanks 
> 
> Youngho
> 
> ----- Original Message ----- 
> From: "Koji Sekiguchi" <ko...@m4.dion.ne.jp>
> To: <ja...@lucene.apache.org>; "Youngho Cho" <yo...@nannet.co.kr>
> Sent: Thursday, October 27, 2005 11:47 AM
> Subject: RE: korean and lucene
> 
> 
> > Hello Youngho,
> > 
> > I don't understand why you couldn't get hits result in Japanese,
> > though, you had better check why the query was empty with Korean data:
> > 
> > > For Korean
> > >      [echo] Running lia.analysis.i18n.KoreanDemo...
> > >      [java] phrase = 경
> > >      [java] query = 
> > 
> > The last line should be query = 경
> > to get hits result. Can you check why StandardAnalyzer
> > removes "경" during tokenizing?
> > 
> > Koji
> > 
> > > -----Original Message-----
> > > From: Youngho Cho [mailto:youngho@nannet.co.kr]
> > > Sent: Thursday, October 27, 2005 11:37 AM
> > > To: java-user@lucene.apache.org
> > > Subject: Re: korean and lucene
> > > 
> > > 
> > > Hello Koji,
> > > 
> > > Thanks for your kind reply.
> > > 
> > > Yes, I used QueryParser. normaly I used
> > > Query = QueryParser.parse( ) method.
> > > 
> > > I put your sample code into lia.analysis.i18n package in LuceneAction
> > > and run JapaneseDemo using 1.4 and 1.9 
> > > 
> > > results are 
> > > 
> > >      [echo] Running lia.analysis.i18n.JapaneseDemo...
> > >      [java] query = content:ラ?メン屋
> > > 
> > > I can't get hits result.
> > > 
> > > For Korean
> > >      [echo] Running lia.analysis.i18n.KoreanDemo...
> > >      [java] phrase = 경
> > >      [java] query = 
> > > 
> > > I can't get query parse result.
> > > 
> > > Thanks,
> > > 
> > > Youngho
> > > 
> > > 
> > > 
> > > ----- Original Message ----- 
> > > From: "Koji Sekiguchi" <ko...@m4.dion.ne.jp>
> > > To: <ja...@lucene.apache.org>; "Youngho Cho" <yo...@nannet.co.kr>
> > > Sent: Thursday, October 27, 2005 9:48 AM
> > > Subject: RE: korean and lucene
> > > 
> > > 
> > > > Hi Youngho,
> > > > 
> > > > With regard to Japanese, using StandardAnalyzer,
> > > > I can search a word/phase.
> > > > 
> > > > Did you use QueryParser? StandardAnalyzer tokenizes
> > > > CJK characters into a stream of single character.
> > > > Use QueryParser to get a PhraseQuery and search the query.
> > > > 
> > > > Please see the following sample code. Replace Japanese
> > > > "contents" and (search target) "phrase" with Korean in the 
> > > program and run.
> > > > 
> > > > regards,
> > > > 
> > > > Koji
> > > > 
> > > > =============================================
> > > > import java.io.IOException;
> > > > import org.apache.lucene.analysis.Analyzer;
> > > > import org.apache.lucene.analysis.standard.StandardAnalyzer;
> > > > import org.apache.lucene.analysis.cjk.CJKAnalyzer;
> > > > import org.apache.lucene.store.Directory;
> > > > import org.apache.lucene.store.RAMDirectory;
> > > > import org.apache.lucene.index.IndexWriter;
> > > > import org.apache.lucene.document.Document;
> > > > import org.apache.lucene.document.Field;
> > > > import org.apache.lucene.search.IndexSearcher;
> > > > import org.apache.lucene.search.Hits;
> > > > import org.apache.lucene.search.Query;
> > > > import org.apache.lucene.queryParser.QueryParser;
> > > > import org.apache.lucene.queryParser.ParseException;
> > > > 
> > > > public class JapaneseByStandardAnalyzer {
> > > > 
> > > >     private static final String FIELD_CONTENT = "content";
> > > >     private static final String[] contents = {
> > > > "東京にはおいしいラーメン屋がたくさんあります。",
> > > > "北海道にもおいしいラーメン屋があります。"
> > > >     };
> > > >     private static final String phrase = "ラーメン屋";
> > > >     //private static final String phrase = "屋";
> > > >     private static Analyzer analyzer = null;
> > > > 
> > > >     public static void main( String[] args ) throws 
> > > IOException, ParseException {
> > > > Directory directory = makeIndex();
> > > > search( directory );
> > > > directory.close();
> > > >     }
> > > > 
> > > >     private static Analyzer getAnalyzer(){
> > > > if( analyzer == null ){
> > > >     analyzer = new StandardAnalyzer();
> > > >     //analyzer = new CJKAnalyzer();
> > > > }
> > > > return analyzer;
> > > >     }
> > > > 
> > > >     private static Directory makeIndex() throws IOException {
> > > > Directory directory = new RAMDirectory();
> > > > IndexWriter writer = new IndexWriter( directory, getAnalyzer(), true );
> > > > for( int i = 0; i < contents.length; i++ ){
> > > >     Document doc = new Document();
> > > >     doc.add( new Field( FIELD_CONTENT, contents[i], 
> > > Field.Store.YES, Field.Index.TOKENIZED ) );
> > > >     writer.addDocument( doc );
> > > > }
> > > > writer.close();
> > > > return directory;
> > > >     }
> > > > 
> > > >     private static void search( Directory directory ) throws 
> > > IOException, ParseException {
> > > > IndexSearcher searcher = new IndexSearcher( directory );
> > > > QueryParser parser = new QueryParser( FIELD_CONTENT, getAnalyzer() );
> > > > Query query = parser.parse( phrase );
> > > > System.out.println( "query = " + query );
> > > > Hits hits = searcher.search( query );
> > > > for( int i = 0; i < hits.length(); i++ )
> > > >     System.out.println( "doc = " + hits.doc( i ).get( FIELD_CONTENT ) );
> > > > searcher.close();
> > > >     }
> > > > }
> > > > 
> > > > 
> > > > > -----Original Message-----
> > > > > From: Youngho Cho [mailto:youngho@nannet.co.kr]
> > > > > Sent: Thursday, October 27, 2005 8:18 AM
> > > > > To: java-user@lucene.apache.org; Cheolgoo Kang
> > > > > Subject: Re: korean and lucene
> > > > > 
> > > > > 
> > > > > Hello Cheolgoo,
> > > > > 
> > > > > Now I updated my lucene version to 1.9 for using StandardAnalyzer 
> > > > > for Korean.
> > > > > And tested your patch which is already adopted in 1.9
> > > > > 
> > > > > http://issues.apache.org/jira/browse/LUCENE-444
> > > > > 
> > > > > But Still I have no good  results with Korean compare with 
> > > CJKAnalyzer.
> > > > > 
> > > > > Single character is good match but more two character word 
> > > > > doesn't match at all.
> > > > > 
> > > > > Am I something missing or still there need some more works ?
> > > > > 
> > > > > 
> > > > > Thanks,
> > > > > 
> > > > > Youngho.
> > > > >  
> > > > > 
> > > > > ----- Original Message ----- 
> > > > > From: "Cheolgoo Kang" <ap...@gmail.com>
> > > > > To: <ja...@lucene.apache.org>; "John Wang" <jo...@gmail.com>
> > > > > Sent: Tuesday, October 04, 2005 10:11 AM
> > > > > Subject: Re: korean and lucene
> > > > > 
> > > > > 
> > > > > > StandardAnalyzer's JavaCC based StandardTokenizer.jj cannot read
> > > > > > Korean part of Unicode character blocks.
> > > > > > 
> > > > > > You should 1) use CJKAnalyzer or 2) add Korean character
> > > > > > block(0xAC00~0xD7AF) to the CJK token definition on the
> > > > > > StandardTokenizer.jj file.
> > > > > > 
> > > > > > Hope it helps.
> > > > > > 
> > > > > > 
> > > > > > On 10/4/05, John Wang <jo...@gmail.com> wrote:
> > > > > > > Hi:
> > > > > > >
> > > > > > > We are running into problems with searching on korean 
> > > > > documents. We are
> > > > > > > using the StandardAnalyzer and everything works with Chinese 
> > > > > and Japanese.
> > > > > > > Are there known problems with Korean with Lucene?
> > > > > > >
> > > > > > > Thanks
> > > > > > >
> > > > > > > -John
> > > > > > >
> > > > > > >
> > > > > > 
> > > > > > 
> > > > > > --
> > > > > > Cheolgoo
> > > > > > 
> > > > > > 
> > > ---------------------------------------------------------------------
> > > > > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > > > 
> > > > 
> > > > 
> > > > ---------------------------------------------------------------------
> > > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > > 
> > 
> > 
> > 
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org

Re: korean and lucene

Posted by Youngho Cho <yo...@nannet.co.kr>.
Hello Koji

Here is the test result.
Japanese is OK!
Maybe ant clean had some effect.

Anyway, please refer to the results using 1.9:

     [echo] Running lia.analysis.i18n.JapaneseDemo...
     [java] [ラ] [メ] [ン] [屋]  analyzer = org.apache.lucene.analysis.standard.StandardAnalyzer
     [java] phrase = ラ?メン屋
     [java] query = content:ラ?メン屋
  
    [echo] Running lia.analysis.i18n.KoreanDemo...
     [java]  analyzer = org.apache.lucene.analysis.standard.StandardAnalyzer
     [java] phrase = 경
     [java] query =  
  
     [echo] Running lia.analysis.i18n.JapaneseDemo...
     [java] [ラ] [メン] [ン屋]  analyzer = org.apache.lucene.analysis.cjk.CJKAnalyzer
     [java] phrase = ラ?メン屋
     [java] query = content:ラ?メン屋
  
    [echo] Running lia.analysis.i18n.KoreanDemo...
     [java] [경]  analyzer = org.apache.lucene.analysis.cjk.CJKAnalyzer
     [java] phrase = 경
     [java] query = 경 

     [echo] Running lia.analysis.i18n.KoreanDemo...
     [java]  analyzer = org.apache.lucene.analysis.standard.StandardAnalyzer
     [java] phrase = 경기
     [java] query = 
  
     [echo] Running lia.analysis.i18n.KoreanDemo...
     [java] [경기]  analyzer = org.apache.lucene.analysis.cjk.CJKAnalyzer
     [java] phrase = 경기
     [java] query = 경기
  

StandardAnalyzer didn't tokenize the Korean characters at all...

Ugh... it looks like
 http://issues.apache.org/jira/browse/LUCENE-444
had no effect for Korean.
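
For reference, the bracketed CJKAnalyzer lines above are its overlapping two-character
(bigram) tokens, which is why メン屋 shows up as [メン] [ン屋]. Below is a small sketch
of that kind of dump; the three-character word 경기도 is made-up input, and the expected
output is my reading of CJKTokenizer's bigram scheme, not taken from the actual test run.

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cjk.CJKAnalyzer;

public class CjkBigramDump {
    public static void main(String[] args) throws IOException {
        // CJKAnalyzer turns a run of CJK characters into overlapping
        // two-character tokens, so a three-character word gives two tokens.
        TokenStream stream = new CJKAnalyzer().tokenStream("content",
                new StringReader("경기도"));
        for (Token token = stream.next(); token != null; token = stream.next()) {
            System.out.print("[" + token.termText() + "] ");
        }
        stream.close();
        System.out.println();   // presumably prints: [경기] [기도]
    }
}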


Thanks 

Youngho

----- Original Message ----- 
From: "Koji Sekiguchi" <ko...@m4.dion.ne.jp>
To: <ja...@lucene.apache.org>; "Youngho Cho" <yo...@nannet.co.kr>
Sent: Thursday, October 27, 2005 11:47 AM
Subject: RE: korean and lucene


> Hello Youngho,
> 
> I don't understand why you couldn't get hits result in Japanese,
> though, you had better check why the query was empty with Korean data:
> 
> > For Korean
> >      [echo] Running lia.analysis.i18n.KoreanDemo...
> >      [java] phrase = 경
> >      [java] query = 
> 
> The last line should be query = 경
> to get hits result. Can you check why StandardAnalyzer
> removes "경" during tokenizing?
> 
> Koji
> 
> > -----Original Message-----
> > From: Youngho Cho [mailto:youngho@nannet.co.kr]
> > Sent: Thursday, October 27, 2005 11:37 AM
> > To: java-user@lucene.apache.org
> > Subject: Re: korean and lucene
> > 
> > 
> > Hello Koji,
> > 
> > Thanks for your kind reply.
> > 
> > Yes, I used QueryParser. normaly I used
> > Query = QueryParser.parse( ) method.
> > 
> > I put your sample code into lia.analysis.i18n package in LuceneAction
> > and run JapaneseDemo using 1.4 and 1.9 
> > 
> > results are 
> > 
> >      [echo] Running lia.analysis.i18n.JapaneseDemo...
> >      [java] query = content:ラ?メン屋
> > 
> > I can't get hits result.
> > 
> > For Korean
> >      [echo] Running lia.analysis.i18n.KoreanDemo...
> >      [java] phrase = 경
> >      [java] query = 
> > 
> > I can't get query parse result.
> > 
> > Thanks,
> > 
> > Youngho
> > 
> > 
> > 
> > ----- Original Message ----- 
> > From: "Koji Sekiguchi" <ko...@m4.dion.ne.jp>
> > To: <ja...@lucene.apache.org>; "Youngho Cho" <yo...@nannet.co.kr>
> > Sent: Thursday, October 27, 2005 9:48 AM
> > Subject: RE: korean and lucene
> > 
> > 
> > > Hi Youngho,
> > > 
> > > With regard to Japanese, using StandardAnalyzer,
> > > I can search a word/phase.
> > > 
> > > Did you use QueryParser? StandardAnalyzer tokenizes
> > > CJK characters into a stream of single character.
> > > Use QueryParser to get a PhraseQuery and search the query.
> > > 
> > > Please see the following sample code. Replace Japanese
> > > "contents" and (search target) "phrase" with Korean in the 
> > program and run.
> > > 
> > > regards,
> > > 
> > > Koji
> > > 
> > > =============================================
> > > import java.io.IOException;
> > > import org.apache.lucene.analysis.Analyzer;
> > > import org.apache.lucene.analysis.standard.StandardAnalyzer;
> > > import org.apache.lucene.analysis.cjk.CJKAnalyzer;
> > > import org.apache.lucene.store.Directory;
> > > import org.apache.lucene.store.RAMDirectory;
> > > import org.apache.lucene.index.IndexWriter;
> > > import org.apache.lucene.document.Document;
> > > import org.apache.lucene.document.Field;
> > > import org.apache.lucene.search.IndexSearcher;
> > > import org.apache.lucene.search.Hits;
> > > import org.apache.lucene.search.Query;
> > > import org.apache.lucene.queryParser.QueryParser;
> > > import org.apache.lucene.queryParser.ParseException;
> > > 
> > > public class JapaneseByStandardAnalyzer {
> > > 
> > >     private static final String FIELD_CONTENT = "content";
> > >     private static final String[] contents = {
> > > "東京にはおいしいラーメン屋がたくさんあります。",
> > > "北海道にもおいしいラーメン屋があります。"
> > >     };
> > >     private static final String phrase = "ラーメン屋";
> > >     //private static final String phrase = "屋";
> > >     private static Analyzer analyzer = null;
> > > 
> > >     public static void main( String[] args ) throws 
> > IOException, ParseException {
> > > Directory directory = makeIndex();
> > > search( directory );
> > > directory.close();
> > >     }
> > > 
> > >     private static Analyzer getAnalyzer(){
> > > if( analyzer == null ){
> > >     analyzer = new StandardAnalyzer();
> > >     //analyzer = new CJKAnalyzer();
> > > }
> > > return analyzer;
> > >     }
> > > 
> > >     private static Directory makeIndex() throws IOException {
> > > Directory directory = new RAMDirectory();
> > > IndexWriter writer = new IndexWriter( directory, getAnalyzer(), true );
> > > for( int i = 0; i < contents.length; i++ ){
> > >     Document doc = new Document();
> > >     doc.add( new Field( FIELD_CONTENT, contents[i], 
> > Field.Store.YES, Field.Index.TOKENIZED ) );
> > >     writer.addDocument( doc );
> > > }
> > > writer.close();
> > > return directory;
> > >     }
> > > 
> > >     private static void search( Directory directory ) throws 
> > IOException, ParseException {
> > > IndexSearcher searcher = new IndexSearcher( directory );
> > > QueryParser parser = new QueryParser( FIELD_CONTENT, getAnalyzer() );
> > > Query query = parser.parse( phrase );
> > > System.out.println( "query = " + query );
> > > Hits hits = searcher.search( query );
> > > for( int i = 0; i < hits.length(); i++ )
> > >     System.out.println( "doc = " + hits.doc( i ).get( FIELD_CONTENT ) );
> > > searcher.close();
> > >     }
> > > }
> > > 
> > > 
> > > > -----Original Message-----
> > > > From: Youngho Cho [mailto:youngho@nannet.co.kr]
> > > > Sent: Thursday, October 27, 2005 8:18 AM
> > > > To: java-user@lucene.apache.org; Cheolgoo Kang
> > > > Subject: Re: korean and lucene
> > > > 
> > > > 
> > > > Hello Cheolgoo,
> > > > 
> > > > Now I updated my lucene version to 1.9 for using StandardAnalyzer 
> > > > for Korean.
> > > > And tested your patch which is already adopted in 1.9
> > > > 
> > > > http://issues.apache.org/jira/browse/LUCENE-444
> > > > 
> > > > But Still I have no good  results with Korean compare with 
> > CJKAnalyzer.
> > > > 
> > > > Single character is good match but more two character word 
> > > > doesn't match at all.
> > > > 
> > > > Am I something missing or still there need some more works ?
> > > > 
> > > > 
> > > > Thanks,
> > > > 
> > > > Youngho.
> > > >  
> > > > 
> > > > ----- Original Message ----- 
> > > > From: "Cheolgoo Kang" <ap...@gmail.com>
> > > > To: <ja...@lucene.apache.org>; "John Wang" <jo...@gmail.com>
> > > > Sent: Tuesday, October 04, 2005 10:11 AM
> > > > Subject: Re: korean and lucene
> > > > 
> > > > 
> > > > > StandardAnalyzer's JavaCC based StandardTokenizer.jj cannot read
> > > > > Korean part of Unicode character blocks.
> > > > > 
> > > > > You should 1) use CJKAnalyzer or 2) add Korean character
> > > > > block(0xAC00~0xD7AF) to the CJK token definition on the
> > > > > StandardTokenizer.jj file.
> > > > > 
> > > > > Hope it helps.
> > > > > 
> > > > > 
> > > > > On 10/4/05, John Wang <jo...@gmail.com> wrote:
> > > > > > Hi:
> > > > > >
> > > > > > We are running into problems with searching on korean 
> > > > documents. We are
> > > > > > using the StandardAnalyzer and everything works with Chinese 
> > > > and Japanese.
> > > > > > Are there known problems with Korean with Lucene?
> > > > > >
> > > > > > Thanks
> > > > > >
> > > > > > -John
> > > > > >
> > > > > >
> > > > > 
> > > > > 
> > > > > --
> > > > > Cheolgoo
> > > > > 
> > > > > 
> > ---------------------------------------------------------------------
> > > > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > > 
> > > 
> > > 
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > 
> 
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org

RE: korean and lucene

Posted by Koji Sekiguchi <ko...@m4.dion.ne.jp>.
Hello Youngho,

I don't understand why you couldn't get any hits in Japanese;
in any case, you had better check why the query was empty with the Korean data:

> For Korean
>      [echo] Running lia.analysis.i18n.KoreanDemo...
>      [java] phrase = 경
>      [java] query = 

The last line should be query = 경
in order to get hits. Can you check why StandardAnalyzer
removes "경" during tokenizing?

Koji

> -----Original Message-----
> From: Youngho Cho [mailto:youngho@nannet.co.kr]
> Sent: Thursday, October 27, 2005 11:37 AM
> To: java-user@lucene.apache.org
> Subject: Re: korean and lucene
> 
> 
> Hello Koji,
> 
> Thanks for your kind reply.
> 
> Yes, I used QueryParser. normaly I used
> Query = QueryParser.parse( ) method.
> 
> I put your sample code into lia.analysis.i18n package in LuceneAction
> and run JapaneseDemo using 1.4 and 1.9 
> 
> results are 
> 
>      [echo] Running lia.analysis.i18n.JapaneseDemo...
>      [java] query = content:ラ?メン屋
> 
> I can't get hits result.
> 
> For Korean
>      [echo] Running lia.analysis.i18n.KoreanDemo...
>      [java] phrase = 경
>      [java] query = 
> 
> I can't get query parse result.
> 
> Thanks,
> 
> Youngho
> 
> 
> 
> ----- Original Message ----- 
> From: "Koji Sekiguchi" <ko...@m4.dion.ne.jp>
> To: <ja...@lucene.apache.org>; "Youngho Cho" <yo...@nannet.co.kr>
> Sent: Thursday, October 27, 2005 9:48 AM
> Subject: RE: korean and lucene
> 
> 
> > Hi Youngho,
> > 
> > With regard to Japanese, using StandardAnalyzer,
> > I can search a word/phase.
> > 
> > Did you use QueryParser? StandardAnalyzer tokenizes
> > CJK characters into a stream of single character.
> > Use QueryParser to get a PhraseQuery and search the query.
> > 
> > Please see the following sample code. Replace Japanese
> > "contents" and (search target) "phrase" with Korean in the 
> program and run.
> > 
> > regards,
> > 
> > Koji
> > 
> > =============================================
> > import java.io.IOException;
> > import org.apache.lucene.analysis.Analyzer;
> > import org.apache.lucene.analysis.standard.StandardAnalyzer;
> > import org.apache.lucene.analysis.cjk.CJKAnalyzer;
> > import org.apache.lucene.store.Directory;
> > import org.apache.lucene.store.RAMDirectory;
> > import org.apache.lucene.index.IndexWriter;
> > import org.apache.lucene.document.Document;
> > import org.apache.lucene.document.Field;
> > import org.apache.lucene.search.IndexSearcher;
> > import org.apache.lucene.search.Hits;
> > import org.apache.lucene.search.Query;
> > import org.apache.lucene.queryParser.QueryParser;
> > import org.apache.lucene.queryParser.ParseException;
> > 
> > public class JapaneseByStandardAnalyzer {
> > 
> >     private static final String FIELD_CONTENT = "content";
> >     private static final String[] contents = {
> > "東京にはおいしいラーメン屋がたくさんあります。",
> > "北海道にもおいしいラーメン屋があります。"
> >     };
> >     private static final String phrase = "ラーメン屋";
> >     //private static final String phrase = "屋";
> >     private static Analyzer analyzer = null;
> > 
> >     public static void main( String[] args ) throws 
> IOException, ParseException {
> > Directory directory = makeIndex();
> > search( directory );
> > directory.close();
> >     }
> > 
> >     private static Analyzer getAnalyzer(){
> > if( analyzer == null ){
> >     analyzer = new StandardAnalyzer();
> >     //analyzer = new CJKAnalyzer();
> > }
> > return analyzer;
> >     }
> > 
> >     private static Directory makeIndex() throws IOException {
> > Directory directory = new RAMDirectory();
> > IndexWriter writer = new IndexWriter( directory, getAnalyzer(), true );
> > for( int i = 0; i < contents.length; i++ ){
> >     Document doc = new Document();
> >     doc.add( new Field( FIELD_CONTENT, contents[i], 
> Field.Store.YES, Field.Index.TOKENIZED ) );
> >     writer.addDocument( doc );
> > }
> > writer.close();
> > return directory;
> >     }
> > 
> >     private static void search( Directory directory ) throws 
> IOException, ParseException {
> > IndexSearcher searcher = new IndexSearcher( directory );
> > QueryParser parser = new QueryParser( FIELD_CONTENT, getAnalyzer() );
> > Query query = parser.parse( phrase );
> > System.out.println( "query = " + query );
> > Hits hits = searcher.search( query );
> > for( int i = 0; i < hits.length(); i++ )
> >     System.out.println( "doc = " + hits.doc( i ).get( FIELD_CONTENT ) );
> > searcher.close();
> >     }
> > }
> > 
> > 
> > > -----Original Message-----
> > > From: Youngho Cho [mailto:youngho@nannet.co.kr]
> > > Sent: Thursday, October 27, 2005 8:18 AM
> > > To: java-user@lucene.apache.org; Cheolgoo Kang
> > > Subject: Re: korean and lucene
> > > 
> > > 
> > > Hello Cheolgoo,
> > > 
> > > Now I updated my lucene version to 1.9 for using StandardAnalyzer 
> > > for Korean.
> > > And tested your patch which is already adopted in 1.9
> > > 
> > > http://issues.apache.org/jira/browse/LUCENE-444
> > > 
> > > But Still I have no good  results with Korean compare with 
> CJKAnalyzer.
> > > 
> > > Single character is good match but more two character word 
> > > doesn't match at all.
> > > 
> > > Am I something missing or still there need some more works ?
> > > 
> > > 
> > > Thanks,
> > > 
> > > Youngho.
> > >  
> > > 
> > > ----- Original Message ----- 
> > > From: "Cheolgoo Kang" <ap...@gmail.com>
> > > To: <ja...@lucene.apache.org>; "John Wang" <jo...@gmail.com>
> > > Sent: Tuesday, October 04, 2005 10:11 AM
> > > Subject: Re: korean and lucene
> > > 
> > > 
> > > > StandardAnalyzer's JavaCC based StandardTokenizer.jj cannot read
> > > > Korean part of Unicode character blocks.
> > > > 
> > > > You should 1) use CJKAnalyzer or 2) add Korean character
> > > > block(0xAC00~0xD7AF) to the CJK token definition on the
> > > > StandardTokenizer.jj file.
> > > > 
> > > > Hope it helps.
> > > > 
> > > > 
> > > > On 10/4/05, John Wang <jo...@gmail.com> wrote:
> > > > > Hi:
> > > > >
> > > > > We are running into problems with searching on korean 
> > > documents. We are
> > > > > using the StandardAnalyzer and everything works with Chinese 
> > > and Japanese.
> > > > > Are there known problems with Korean with Lucene?
> > > > >
> > > > > Thanks
> > > > >
> > > > > -John
> > > > >
> > > > >
> > > > 
> > > > 
> > > > --
> > > > Cheolgoo
> > > > 
> > > > 
> ---------------------------------------------------------------------
> > > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > 
> > 
> > 
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> 



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: korean and lucene

Posted by Youngho Cho <yo...@nannet.co.kr>.
Hello Koji,

Thanks for your kind reply.

Yes, I used QueryParser. Normally I use the
Query = QueryParser.parse() method.

I put your sample code into the lia.analysis.i18n package in Lucene in Action
and ran JapaneseDemo using 1.4 and 1.9.

The results are:

     [echo] Running lia.analysis.i18n.JapaneseDemo...
     [java] query = content:ラ?メン屋

I can't get any hits.

For Korean
     [echo] Running lia.analysis.i18n.KoreanDemo...
     [java] phrase = 경
     [java] query = 

I can't get a parsed query at all.

Thanks,

Youngho



----- Original Message ----- 
From: "Koji Sekiguchi" <ko...@m4.dion.ne.jp>
To: <ja...@lucene.apache.org>; "Youngho Cho" <yo...@nannet.co.kr>
Sent: Thursday, October 27, 2005 9:48 AM
Subject: RE: korean and lucene


> Hi Youngho,
> 
> With regard to Japanese, using StandardAnalyzer,
> I can search a word/phase.
> 
> Did you use QueryParser? StandardAnalyzer tokenizes
> CJK characters into a stream of single character.
> Use QueryParser to get a PhraseQuery and search the query.
> 
> Please see the following sample code. Replace Japanese
> "contents" and (search target) "phrase" with Korean in the program and run.
> 
> regards,
> 
> Koji
> 
> =============================================
> import java.io.IOException;
> import org.apache.lucene.analysis.Analyzer;
> import org.apache.lucene.analysis.standard.StandardAnalyzer;
> import org.apache.lucene.analysis.cjk.CJKAnalyzer;
> import org.apache.lucene.store.Directory;
> import org.apache.lucene.store.RAMDirectory;
> import org.apache.lucene.index.IndexWriter;
> import org.apache.lucene.document.Document;
> import org.apache.lucene.document.Field;
> import org.apache.lucene.search.IndexSearcher;
> import org.apache.lucene.search.Hits;
> import org.apache.lucene.search.Query;
> import org.apache.lucene.queryParser.QueryParser;
> import org.apache.lucene.queryParser.ParseException;
> 
> public class JapaneseByStandardAnalyzer {
> 
>     private static final String FIELD_CONTENT = "content";
>     private static final String[] contents = {
> "東京にはおいしいラーメン屋がたくさんあります。",
> "北海道にもおいしいラーメン屋があります。"
>     };
>     private static final String phrase = "ラーメン屋";
>     //private static final String phrase = "屋";
>     private static Analyzer analyzer = null;
> 
>     public static void main( String[] args ) throws IOException, ParseException {
> Directory directory = makeIndex();
> search( directory );
> directory.close();
>     }
> 
>     private static Analyzer getAnalyzer(){
> if( analyzer == null ){
>     analyzer = new StandardAnalyzer();
>     //analyzer = new CJKAnalyzer();
> }
> return analyzer;
>     }
> 
>     private static Directory makeIndex() throws IOException {
> Directory directory = new RAMDirectory();
> IndexWriter writer = new IndexWriter( directory, getAnalyzer(), true );
> for( int i = 0; i < contents.length; i++ ){
>     Document doc = new Document();
>     doc.add( new Field( FIELD_CONTENT, contents[i], Field.Store.YES, Field.Index.TOKENIZED ) );
>     writer.addDocument( doc );
> }
> writer.close();
> return directory;
>     }
> 
>     private static void search( Directory directory ) throws IOException, ParseException {
> IndexSearcher searcher = new IndexSearcher( directory );
> QueryParser parser = new QueryParser( FIELD_CONTENT, getAnalyzer() );
> Query query = parser.parse( phrase );
> System.out.println( "query = " + query );
> Hits hits = searcher.search( query );
> for( int i = 0; i < hits.length(); i++ )
>     System.out.println( "doc = " + hits.doc( i ).get( FIELD_CONTENT ) );
> searcher.close();
>     }
> }
> 
> 
> > -----Original Message-----
> > From: Youngho Cho [mailto:youngho@nannet.co.kr]
> > Sent: Thursday, October 27, 2005 8:18 AM
> > To: java-user@lucene.apache.org; Cheolgoo Kang
> > Subject: Re: korean and lucene
> > 
> > 
> > Hello Cheolgoo,
> > 
> > Now I updated my lucene version to 1.9 for using StandardAnalyzer 
> > for Korean.
> > And tested your patch which is already adopted in 1.9
> > 
> > http://issues.apache.org/jira/browse/LUCENE-444
> > 
> > But Still I have no good  results with Korean compare with CJKAnalyzer.
> > 
> > Single character is good match but more two character word 
> > doesn't match at all.
> > 
> > Am I something missing or still there need some more works ?
> > 
> > 
> > Thanks,
> > 
> > Youngho.
> >  
> > 
> > ----- Original Message ----- 
> > From: "Cheolgoo Kang" <ap...@gmail.com>
> > To: <ja...@lucene.apache.org>; "John Wang" <jo...@gmail.com>
> > Sent: Tuesday, October 04, 2005 10:11 AM
> > Subject: Re: korean and lucene
> > 
> > 
> > > StandardAnalyzer's JavaCC based StandardTokenizer.jj cannot read
> > > Korean part of Unicode character blocks.
> > > 
> > > You should 1) use CJKAnalyzer or 2) add Korean character
> > > block(0xAC00~0xD7AF) to the CJK token definition on the
> > > StandardTokenizer.jj file.
> > > 
> > > Hope it helps.
> > > 
> > > 
> > > On 10/4/05, John Wang <jo...@gmail.com> wrote:
> > > > Hi:
> > > >
> > > > We are running into problems with searching on korean 
> > documents. We are
> > > > using the StandardAnalyzer and everything works with Chinese 
> > and Japanese.
> > > > Are there known problems with Korean with Lucene?
> > > >
> > > > Thanks
> > > >
> > > > -John
> > > >
> > > >
> > > 
> > > 
> > > --
> > > Cheolgoo
> > > 
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > For additional commands, e-mail: java-user-help@lucene.apache.org
> 
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org

RE: korean and lucene

Posted by Koji Sekiguchi <ko...@m4.dion.ne.jp>.
Hi Youngho,

With regard to Japanese, using StandardAnalyzer,
I can search a word/phrase.

Did you use QueryParser? StandardAnalyzer tokenizes
CJK characters into a stream of single-character tokens.
Use QueryParser to get a PhraseQuery and search with that query.

Please see the following sample code. Replace the Japanese
"contents" and the (search-target) "phrase" with Korean in the program and run it.

regards,

Koji

=============================================
import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.cjk.CJKAnalyzer;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.Query;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.queryParser.ParseException;

public class JapaneseByStandardAnalyzer {

    private static final String FIELD_CONTENT = "content";
    private static final String[] contents = {
	"東京にはおいしいラーメン屋がたくさんあります。",
	"北海道にもおいしいラーメン屋があります。"
    };
    private static final String phrase = "ラーメン屋";
    //private static final String phrase = "屋";
    private static Analyzer analyzer = null;

    public static void main( String[] args ) throws IOException, ParseException {
	Directory directory = makeIndex();
	search( directory );
	directory.close();
    }

    private static Analyzer getAnalyzer(){
	if( analyzer == null ){
	    analyzer = new StandardAnalyzer();
	    //analyzer = new CJKAnalyzer();
	}
	return analyzer;
    }

    private static Directory makeIndex() throws IOException {
	Directory directory = new RAMDirectory();
	IndexWriter writer = new IndexWriter( directory, getAnalyzer(), true );
	for( int i = 0; i < contents.length; i++ ){
	    Document doc = new Document();
	    doc.add( new Field( FIELD_CONTENT, contents[i], Field.Store.YES, Field.Index.TOKENIZED ) );
	    writer.addDocument( doc );
	}
	writer.close();
	return directory;
    }

    private static void search( Directory directory ) throws IOException, ParseException {
	IndexSearcher searcher = new IndexSearcher( directory );
	QueryParser parser = new QueryParser( FIELD_CONTENT, getAnalyzer() );
	Query query = parser.parse( phrase );
	System.out.println( "query = " + query );
	Hits hits = searcher.search( query );
	for( int i = 0; i < hits.length(); i++ )
	    System.out.println( "doc = " + hits.doc( i ).get( FIELD_CONTENT ) );
	searcher.close();
    }
}
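
To make the "stream of single-character tokens" point concrete: for a two-character
Korean word such as 경기, QueryParser plus such an analyzer should build the equivalent
of the hand-made PhraseQuery below. This is a sketch for illustration only, not part
of the sample above; the class name is made up.

import org.apache.lucene.index.Term;
import org.apache.lucene.search.PhraseQuery;

public class KoreanPhraseQuerySketch {
    public static void main(String[] args) {
        // What QueryParser effectively builds for 경기 once the analyzer
        // has split it into the single-character tokens 경 and 기.
        PhraseQuery query = new PhraseQuery();
        query.add(new Term("content", "경"));
        query.add(new Term("content", "기"));
        System.out.println(query);   // prints something like: content:"경 기"
    }
}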


> -----Original Message-----
> From: Youngho Cho [mailto:youngho@nannet.co.kr]
> Sent: Thursday, October 27, 2005 8:18 AM
> To: java-user@lucene.apache.org; Cheolgoo Kang
> Subject: Re: korean and lucene
> 
> 
> Hello Cheolgoo,
> 
> Now I updated my lucene version to 1.9 for using StandardAnalyzer 
> for Korean.
> And tested your patch which is already adopted in 1.9
> 
> http://issues.apache.org/jira/browse/LUCENE-444
> 
> But Still I have no good  results with Korean compare with CJKAnalyzer.
> 
> Single character is good match but more two character word 
> doesn't match at all.
> 
> Am I something missing or still there need some more works ?
> 
> 
> Thanks,
> 
> Youngho.
>  
> 
> ----- Original Message ----- 
> From: "Cheolgoo Kang" <ap...@gmail.com>
> To: <ja...@lucene.apache.org>; "John Wang" <jo...@gmail.com>
> Sent: Tuesday, October 04, 2005 10:11 AM
> Subject: Re: korean and lucene
> 
> 
> > StandardAnalyzer's JavaCC based StandardTokenizer.jj cannot read
> > Korean part of Unicode character blocks.
> > 
> > You should 1) use CJKAnalyzer or 2) add Korean character
> > block(0xAC00~0xD7AF) to the CJK token definition on the
> > StandardTokenizer.jj file.
> > 
> > Hope it helps.
> > 
> > 
> > On 10/4/05, John Wang <jo...@gmail.com> wrote:
> > > Hi:
> > >
> > > We are running into problems with searching on korean 
> documents. We are
> > > using the StandardAnalyzer and everything works with Chinese 
> and Japanese.
> > > Are there known problems with Korean with Lucene?
> > >
> > > Thanks
> > >
> > > -John
> > >
> > >
> > 
> > 
> > --
> > Cheolgoo
> > 
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: korean and lucene

Posted by Youngho Cho <yo...@nannet.co.kr>.
Hello Cheolgoo,

I have now updated my Lucene version to 1.9 in order to use StandardAnalyzer for Korean,
and tested your patch, which has already been adopted in 1.9:

http://issues.apache.org/jira/browse/LUCENE-444

But I still get no good results with Korean compared with CJKAnalyzer.

A single character matches fine, but a word of two or more characters doesn't match at all.

Am I missing something, or is more work still needed?


Thanks,

Youngho.
 

----- Original Message ----- 
From: "Cheolgoo Kang" <ap...@gmail.com>
To: <ja...@lucene.apache.org>; "John Wang" <jo...@gmail.com>
Sent: Tuesday, October 04, 2005 10:11 AM
Subject: Re: korean and lucene


> StandardAnalyzer's JavaCC based StandardTokenizer.jj cannot read
> Korean part of Unicode character blocks.
> 
> You should 1) use CJKAnalyzer or 2) add Korean character
> block(0xAC00~0xD7AF) to the CJK token definition on the
> StandardTokenizer.jj file.
> 
> Hope it helps.
> 
> 
> On 10/4/05, John Wang <jo...@gmail.com> wrote:
> > Hi:
> >
> > We are running into problems with searching on korean documents. We are
> > using the StandardAnalyzer and everything works with Chinese and Japanese.
> > Are there known problems with Korean with Lucene?
> >
> > Thanks
> >
> > -John
> >
> >
> 
> 
> --
> Cheolgoo
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org

Re: korean and lucene

Posted by Cheolgoo Kang <ap...@gmail.com>.
StandardAnalyzer's JavaCC-based StandardTokenizer.jj cannot read
the Korean part of the Unicode character blocks.

You should either 1) use CJKAnalyzer, or 2) add the Korean character
block (0xAC00~0xD7AF) to the CJK token definition in the
StandardTokenizer.jj file.

Hope it helps.
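
As a quick way to see why that range matters, here is a tiny sketch (the class name is
hypothetical) confirming that Hangul syllables fall into the U+AC00..U+D7AF block that
the tokenizer grammar skips:

public class HangulBlockCheck {
    public static void main(String[] args) {
        String word = "경기";
        // Each character of the word lies in the Hangul Syllables block
        // (U+AC00..U+D7AF), which the CJK token definition does not cover.
        for (int i = 0; i < word.length(); i++) {
            char c = word.charAt(i);
            System.out.println(c + " = U+"
                    + Integer.toHexString(c).toUpperCase()
                    + " -> " + Character.UnicodeBlock.of(c));
        }
    }
}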


On 10/4/05, John Wang <jo...@gmail.com> wrote:
> Hi:
>
> We are running into problems with searching on korean documents. We are
> using the StandardAnalyzer and everything works with Chinese and Japanese.
> Are there known problems with Korean with Lucene?
>
> Thanks
>
> -John
>
>


--
Cheolgoo

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: korean and lucene

Posted by Youngho Cho <yo...@nannet.co.kr>.
Would you share what the problem is?

I have used CJKAnalyzer for Korean for over 2 years without any problem.
(I remember there was some query-result problem with StandardAnalyzer at that time.)
But I am trying to switch to StandardAnalyzer again.

Thanks,

Youngho


----- Original Message ----- 
From: "John Wang" <jo...@gmail.com>
To: <ja...@lucene.apache.org>
Sent: Tuesday, October 04, 2005 8:46 AM
Subject: korean and lucene


Hi:

We are running into problems with searching on korean documents. We are
using the StandardAnalyzer and everything works with Chinese and Japanese.
Are there known problems with Korean with Lucene?

Thanks

-John