Posted to java-user@lucene.apache.org by John Wang <jo...@gmail.com> on 2005/10/04 01:46:45 UTC

korean and lucene

Hi:

We are running into problems with searching on Korean documents. We are
using the StandardAnalyzer, and everything works with Chinese and Japanese.
Are there known problems with Korean in Lucene?

Thanks

-John

Re: korean and lucene

Posted by Cheolgoo Kang <ap...@gmail.com>.
On 11/8/05, Cheolgoo Kang <ap...@gmail.com> wrote:
> Hello,
>
> I've created a new JIRA issue for Korean analysis: StandardAnalyzer
> splits one word into several tokens of one character each. Because Korean
> is not a phonogram, one character in Korean

Sorry for the confusion. I meant 'ideographic characters', not 'phonogram'. :)

> has almost no meaning at all. So words in Korean should be preserved,
> unlike Chinese or Japanese.
>
> I've attached a patch to StandardTokenizer that does this; it passes the
> test case TestStandardAnalyzer (also patched to test Korean words).
>
> Thanks!
>
> On 10/27/05, Youngho Cho <yo...@nannet.co.kr> wrote:
> > Hello,
> >
> > OK, I've attached my test code for Korean, which is a slightly modified version of Koji's code.
> >
> > Just put it into the lia.analysis.i18n package in LuceneInAction
> > and run ant.
> >
> > Hopefully this helps someone.
> >
> > -------- build.xml  ---------
> >
> >   <target name="JapaneseDemo" depends="prepare"
> >           description="Examples of Japanese analysis">
> >     <info>
> >
> >       Japanese Test...
> >
> >     </info>
> >
> >     <run-main class="lia.analysis.i18n.JapaneseDemo"/>
> >   </target>
> >
> >   <target name="KoreanDemo" depends="prepare"
> >           description="Examples of Korean analysis">
> >     <info>
> >
> >       Korean Test...
> >
> >     </info>
> >
> >     <run-main class="lia.analysis.i18n.KoreanDemo"/>
> >   </target>
> >
> >
> > Thanks,
> >
> > Youngho
> >
> >
> > ----- Original Message -----
> > From: "Youngho Cho" <yo...@nannet.co.kr>
> > To: <ja...@lucene.apache.org>; "Youngho Cho" <yo...@nannet.co.kr>
> > Sent: Thursday, October 27, 2005 12:47 PM
> > Subject: Re: korean and lucene
> >
> >
> > > Hello all
> > > Please forgive my previous confusing message.
> > >
> > >      [echo] Running lia.analysis.i18n.KoreanDemo...
> > >      [java] [경] [기]  analyzer = org.apache.lucene.analysis.standard.StandardAnalyzer
> > >      [java] phrase = 경기
> > >      [java] query = "경 기"
> > >
> > > I got a good result.
> > >
> > > When I compiled, I had just renamed the old lucene-1.4.3.jar to lucene-1.4.3.jar_bak
> > > and added the new Lucene 1.9 jar, then built the test package.
> > > After I removed lucene-1.4.3.jar_bak from the lib directory completely,
> > > I got the expected result!
> > >
> > > I don't know the reason... (it looks like I slipped up somewhere...)
> > >
> > > Anyway thanks Koji and Cheolgoo
> > > I will further test now...
> > >
> > > Youngho
> > >
> > >
> > >
> > >
> > > ----- Original Message -----
> > > From: "Youngho Cho" <yo...@nannet.co.kr>
> > > To: <ja...@lucene.apache.org>
> > > Sent: Thursday, October 27, 2005 12:28 PM
> > > Subject: Re: korean and lucene
> > >
> > >
> > > > Hello Koji
> > > >
> > > > Here is the test result.
> > > > Japanese is OK!
> > > > Maybe ant clean had some effect.
> > > >
> > > > Anyway, please refer to the results using 1.9:
> > > >
> > > >      [echo] Running lia.analysis.i18n.JapaneseDemo...
> > > >      [java] [ラ] [メ] [ン] [屋]  analyzer = org.apache.lucene.analysis.standard.StandardAnalyzer
> > > >      [java] phrase = ラ?メン屋
> > > >      [java] query = content:ラ?メン屋
> > > >
> > > >     [echo] Running lia.analysis.i18n.KoreanDemo...
> > > >      [java]  analyzer = org.apache.lucene.analysis.standard.StandardAnalyzer
> > > >      [java] phrase = 경
> > > >      [java] query =
> > > >
> > > >      [echo] Running lia.analysis.i18n.JapaneseDemo...
> > > >      [java] [ラ] [メン] [ン屋]  analyzer = org.apache.lucene.analysis.cjk.CJKAnalyzer
> > > >      [java] phrase = ラ?メン屋
> > > >      [java] query = content:ラ?メン屋
> > > >
> > > >     [echo] Running lia.analysis.i18n.KoreanDemo...
> > > >      [java] [경]  analyzer = org.apache.lucene.analysis.cjk.CJKAnalyzer
> > > >      [java] phrase = 경
> > > >      [java] query = 경
> > > >
> > > >      [echo] Running lia.analysis.i18n.KoreanDemo...
> > > >      [java]  analyzer = org.apache.lucene.analysis.standard.StandardAnalyzer
> > > >      [java] phrase = 경기
> > > >      [java] query =
> > > >
> > > >      [echo] Running lia.analysis.i18n.KoreanDemo...
> > > >      [java] [경기]  analyzer = org.apache.lucene.analysis.cjk.CJKAnalyzer
> > > >      [java] phrase = 경기
> > > >      [java] query = 경기
> > > >
> > > >
> > > > StandardAnalyzer didn't tokenize the Korean characters at all....
> > > >
> > > > Ugh.... it looks like
> > > >  http://issues.apache.org/jira/browse/LUCENE-444
> > > >  had no effect at all for Korean.
> > > >
> > > >
> > > > Thanks
> > > >
> > > > Youngho
> > > >
> > > > ----- Original Message -----
> > > > From: "Koji Sekiguchi" <ko...@m4.dion.ne.jp>
> > > > To: <ja...@lucene.apache.org>; "Youngho Cho" <yo...@nannet.co.kr>
> > > > Sent: Thursday, October 27, 2005 11:47 AM
> > > > Subject: RE: korean and lucene
> > > >
> > > >
> > > > > Hello Youngho,
> > > > >
> > > > > I don't understand why you couldn't get hit results in Japanese;
> > > > > in any case, you had better check why the query was empty with the Korean data:
> > > > >
> > > > > > For Korean
> > > > > >      [echo] Running lia.analysis.i18n.KoreanDemo...
> > > > > >      [java] phrase = 경
> > > > > >      [java] query =
> > > > >
> > > > > The last line should be query = 경
> > > > > to get hit results. Can you check why StandardAnalyzer
> > > > > removes "경" during tokenizing?
> > > > >
> > > > > Koji
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: Youngho Cho [mailto:youngho@nannet.co.kr]
> > > > > > Sent: Thursday, October 27, 2005 11:37 AM
> > > > > > To: java-user@lucene.apache.org
> > > > > > Subject: Re: korean and lucene
> > > > > >
> > > > > >
> > > > > > Hello Koji,
> > > > > >
> > > > > > Thanks for your kind reply.
> > > > > >
> > > > > > Yes, I used QueryParser. Normally I use the
> > > > > > Query query = QueryParser.parse( ... ) method.
> > > > > >
> > > > > > I put your sample code into the lia.analysis.i18n package in LuceneInAction
> > > > > > and ran JapaneseDemo using 1.4 and 1.9.
> > > > > >
> > > > > > The results are:
> > > > > >
> > > > > >      [echo] Running lia.analysis.i18n.JapaneseDemo...
> > > > > >      [java] query = content:ラ?メン屋
> > > > > >
> > > > > > I can't get any hit results.
> > > > > >
> > > > > > For Korean
> > > > > >      [echo] Running lia.analysis.i18n.KoreanDemo...
> > > > > >      [java] phrase = 경
> > > > > >      [java] query =
> > > > > >
> > > > > > I can't get a query parse result.
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > Youngho
> > > > > >
> > > > > >
> > > > > >
> > > > > > ----- Original Message -----
> > > > > > From: "Koji Sekiguchi" <ko...@m4.dion.ne.jp>
> > > > > > To: <ja...@lucene.apache.org>; "Youngho Cho" <yo...@nannet.co.kr>
> > > > > > Sent: Thursday, October 27, 2005 9:48 AM
> > > > > > Subject: RE: korean and lucene
> > > > > >
> > > > > >
> > > > > > > Hi Youngho,
> > > > > > >
> > > > > > > With regard to Japanese, using StandardAnalyzer,
> > > > > > > I can search for a word/phrase.
> > > > > > >
> > > > > > > Did you use QueryParser? StandardAnalyzer tokenizes
> > > > > > > CJK characters into a stream of single characters.
> > > > > > > Use QueryParser to get a PhraseQuery and search with that query.
> > > > > > >
> > > > > > > Please see the following sample code. Replace the Japanese
> > > > > > > "contents" and (search target) "phrase" with Korean in the
> > > > > > > program and run it.
> > > > > > >
> > > > > > > regards,
> > > > > > >
> > > > > > > Koji
> > > > > > >
> > > > > > > =============================================
> > > > > > > import java.io.IOException;
> > > > > > > import org.apache.lucene.analysis.Analyzer;
> > > > > > > import org.apache.lucene.analysis.standard.StandardAnalyzer;
> > > > > > > import org.apache.lucene.analysis.cjk.CJKAnalyzer;
> > > > > > > import org.apache.lucene.store.Directory;
> > > > > > > import org.apache.lucene.store.RAMDirectory;
> > > > > > > import org.apache.lucene.index.IndexWriter;
> > > > > > > import org.apache.lucene.document.Document;
> > > > > > > import org.apache.lucene.document.Field;
> > > > > > > import org.apache.lucene.search.IndexSearcher;
> > > > > > > import org.apache.lucene.search.Hits;
> > > > > > > import org.apache.lucene.search.Query;
> > > > > > > import org.apache.lucene.queryParser.QueryParser;
> > > > > > > import org.apache.lucene.queryParser.ParseException;
> > > > > > >
> > > > > > > public class JapaneseByStandardAnalyzer {
> > > > > > >
> > > > > > >     private static final String FIELD_CONTENT = "content";
> > > > > > >     private static final String[] contents = {
> > > > > > > "東京にはおいしいラーメン屋がたくさんあります。",
> > > > > > > "北海道にもおいしいラーメン屋があります。"
> > > > > > >     };
> > > > > > >     private static final String phrase = "ラーメン屋";
> > > > > > >     //private static final String phrase = "屋";
> > > > > > >     private static Analyzer analyzer = null;
> > > > > > >
> > > > > > >     public static void main( String[] args ) throws
> > > > > > IOException, ParseException {
> > > > > > > Directory directory = makeIndex();
> > > > > > > search( directory );
> > > > > > > directory.close();
> > > > > > >     }
> > > > > > >
> > > > > > >     private static Analyzer getAnalyzer(){
> > > > > > > if( analyzer == null ){
> > > > > > >     analyzer = new StandardAnalyzer();
> > > > > > >     //analyzer = new CJKAnalyzer();
> > > > > > > }
> > > > > > > return analyzer;
> > > > > > >     }
> > > > > > >
> > > > > > >     private static Directory makeIndex() throws IOException {
> > > > > > > Directory directory = new RAMDirectory();
> > > > > > > IndexWriter writer = new IndexWriter( directory, getAnalyzer(), true );
> > > > > > > for( int i = 0; i < contents.length; i++ ){
> > > > > > >     Document doc = new Document();
> > > > > > >     doc.add( new Field( FIELD_CONTENT, contents[i],
> > > > > > Field.Store.YES, Field.Index.TOKENIZED ) );
> > > > > > >     writer.addDocument( doc );
> > > > > > > }
> > > > > > > writer.close();
> > > > > > > return directory;
> > > > > > >     }
> > > > > > >
> > > > > > >     private static void search( Directory directory ) throws
> > > > > > IOException, ParseException {
> > > > > > > IndexSearcher searcher = new IndexSearcher( directory );
> > > > > > > QueryParser parser = new QueryParser( FIELD_CONTENT, getAnalyzer() );
> > > > > > > Query query = parser.parse( phrase );
> > > > > > > System.out.println( "query = " + query );
> > > > > > > Hits hits = searcher.search( query );
> > > > > > > for( int i = 0; i < hits.length(); i++ )
> > > > > > >     System.out.println( "doc = " + hits.doc( i ).get( FIELD_CONTENT ) );
> > > > > > > searcher.close();
> > > > > > >     }
> > > > > > > }
> > > > > > >
> > > > > > >
> > > > > > > > -----Original Message-----
> > > > > > > > From: Youngho Cho [mailto:youngho@nannet.co.kr]
> > > > > > > > Sent: Thursday, October 27, 2005 8:18 AM
> > > > > > > > To: java-user@lucene.apache.org; Cheolgoo Kang
> > > > > > > > Subject: Re: korean and lucene
> > > > > > > >
> > > > > > > >
> > > > > > > > Hello Cheolgoo,
> > > > > > > >
> > > > > > > > Now I have updated my Lucene version to 1.9 in order to use
> > > > > > > > StandardAnalyzer for Korean,
> > > > > > > > and tested your patch, which is already included in 1.9:
> > > > > > > >
> > > > > > > > http://issues.apache.org/jira/browse/LUCENE-444
> > > > > > > >
> > > > > > > > But I still get no good results with Korean compared with
> > > > > > > > CJKAnalyzer.
> > > > > > > >
> > > > > > > > A single character matches fine, but a word of two or more
> > > > > > > > characters doesn't match at all.
> > > > > > > >
> > > > > > > > Am I missing something, or is more work still needed?
> > > > > > > >
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > >
> > > > > > > > Youngho.
> > > > > > > >
> > > > > > > >
> > > > > > > > ----- Original Message -----
> > > > > > > > From: "Cheolgoo Kang" <ap...@gmail.com>
> > > > > > > > To: <ja...@lucene.apache.org>; "John Wang" <jo...@gmail.com>
> > > > > > > > Sent: Tuesday, October 04, 2005 10:11 AM
> > > > > > > > Subject: Re: korean and lucene
> > > > > > > >
> > > > > > > >
> > > > > > > > > StandardAnalyzer's JavaCC-based StandardTokenizer.jj cannot read
> > > > > > > > > the Korean part of the Unicode character blocks.
> > > > > > > > >
> > > > > > > > > You should 1) use CJKAnalyzer, or 2) add the Korean character
> > > > > > > > > block (0xAC00~0xD7AF) to the CJK token definition in the
> > > > > > > > > StandardTokenizer.jj file.
> > > > > > > > >
> > > > > > > > > Hope it helps.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On 10/4/05, John Wang <jo...@gmail.com> wrote:
> > > > > > > > > > Hi:
> > > > > > > > > >
> > > > > > > > > > We are running into problems with searching on korean
> > > > > > > > documents. We are
> > > > > > > > > > using the StandardAnalyzer and everything works with Chinese
> > > > > > > > and Japanese.
> > > > > > > > > > Are there known problems with Korean with Lucene?
> > > > > > > > > >
> > > > > > > > > > Thanks
> > > > > > > > > >
> > > > > > > > > > -John
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > Cheolgoo
> > > > > > > > >
> > > > > > > > >
> > > > > > ---------------------------------------------------------------------
> > > > > > > > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > > > > > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > ---------------------------------------------------------------------
> > > > > > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > > > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > > ---------------------------------------------------------------------
> > > > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
> >
> >
>
>
> --
> Cheolgoo
>


--
Cheolgoo
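
As a rough illustration of the behaviour discussed in this post (keeping a whole run of Hangul syllables, 0xAC00~0xD7AF, as one token instead of one token per character), here is a minimal sketch against the Lucene 1.x analysis API. The class name KoreanWordTokenizer and the sample strings are invented for the example; this is not the actual LUCENE-444 patch, which modifies StandardTokenizer.jj instead.

=============================================
import java.io.Reader;
import java.io.StringReader;
import org.apache.lucene.analysis.CharTokenizer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;

// Hypothetical illustration: keep each run of Hangul syllables
// (U+AC00..U+D7AF, the block mentioned above) as a single token,
// which is roughly what the LUCENE-444 patch aims for.
public class KoreanWordTokenizer extends CharTokenizer {

    public KoreanWordTokenizer(Reader in) {
        super(in);
    }

    protected boolean isTokenChar(char c) {
        return c >= 0xAC00 && c <= 0xD7AF;   // Hangul Syllables block
    }

    public static void main(String[] args) throws Exception {
        TokenStream ts = new KoreanWordTokenizer(new StringReader("경기 시간을"));
        for (Token t = ts.next(); t != null; t = ts.next()) {
            System.out.println(t.termText());  // expected: 경기, 시간을
        }
    }
}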

Re: korean and lucene

Posted by Andrzej Bialecki <ab...@getopt.org>.
Cheolgoo Kang wrote:

>Thanks Bialecki,
>  
>

Bialecki is my last name, my first name is Andrzej. No problem, it's
similarly confusing for Europeans to decide between the first and last
name in Asian names... :-) Is your first name Kang?

>I'm trying to test your program, thanks a lot!
>
>Also, can you give me the papers you cited as [1] and [2]? I've
>googled (the entire web and Google Scholar) but found nothing.
>  
>

I got these two papers directly from the author, and he asked me not to
re-distribute them - please write him directly: Leo.G@seznam.cz to
obtain a copy.


I would be most interested in the results of your testing.

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com





Re: korean and lucene

Posted by Cheolgoo Kang <ap...@gmail.com>.
Thanks Bialecki,

I'm trying to test your program, thanks a lot!

Also, can you give me the papers you cited as [1] and [2]? I've
googled (the entire web and Google Scholar) but found nothing.

On 11/8/05, Andrzej Bialecki <ab...@getopt.org> wrote:
> KwonNam Son wrote:
>
> >First of all, I really appreciate your work on Lucene for Korean words.
> >
> >But if we cannot support a stemming analyzer for Korean words, I think one
> >token per Korean character is better.
> >
> >When we search for a word, we usually use "검색", not "검색하다" ("하다" is like
> >the "ed" of "searched").
> >If we cannot get any results from "검색", StandardAnalyzer is useless for
> >Korean, and I may have to go back to using CJKAnalyzer.
> >
> >How about leaving the StandardAnalyzer unchanged and adding a new
> >Analyzer for Korean words?
> >
> >
>
> Hello,
>
> My knowledge of Korean is near absolute zero... however, your example
> above looks like a typical stemming process for any Western language.
> The stem is not necessarily a valid dictionary word, just something that
> uniquely "labels" a group of related words created from the same root -
> and the transformation from inflected words to a stem can be expressed
> as a series of "patch commands" (insert/remove substring).
>
> I successfully used a Java package, originally created by Leon Galambos
> from Egothor project, to create an algorithmic stemmer for Polish
> (http://www.getopt.org/stempel). The advantage of this particular
> approach is that you don't have to encode specific grammar rules in the
> stemmer, the stemmer learns rules by itself from a training corpus. Such
> training corpus consists of pairs of inflected and base forms, and the
> library automatically learns these "patch commands", i.e. instructions
> for inserting/removing parts of an inflected word to arrive at the base
> form. This training process results in creating a stemmer table,
> reusable even for previously unseen words (based on the similarity of
> character patterns in input words).
>
> I suggest trying the code from the link above and testing how it works;
> even if you only have a moderately sized training corpus (~500 pairs),
> the results should be positive.
>
> --
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>


--
Cheolgoo

Re: korean and lucene

Posted by Andrzej Bialecki <ab...@getopt.org>.
KwonNam Son wrote:

>First of all, I really appreciate your work on Lucene for Korean words.
>
>But if we cannot support a stemming analyzer for Korean words, I think one
>token per Korean character is better.
>
>When we search for a word, we usually use "검색", not "검색하다" ("하다" is like
>the "ed" of "searched").
>If we cannot get any results from "검색", StandardAnalyzer is useless for
>Korean, and I may have to go back to using CJKAnalyzer.
>
>How about leaving the StandardAnalyzer unchanged and adding a new
>Analyzer for Korean words?
>  
>

Hello,

My knowledge of Korean is near absolute zero... however, your example 
above looks like a typical stemming process for any Western language. 
The stem is not necessarily a valid dictionary word, just something that 
uniquely "labels" a group of related words created from the same root - 
and the transformation from inflected words to a stem can be expressed 
as a series of "patch commands" (insert/remove substring).

I successfully used a Java package, originally created by Leon Galambos 
from the Egothor project, to create an algorithmic stemmer for Polish 
(http://www.getopt.org/stempel). The advantage of this particular 
approach is that you don't have to encode specific grammar rules in the 
stemmer, the stemmer learns rules by itself from a training corpus. Such 
training corpus consists of pairs of inflected and base forms, and the 
library automatically learns these "patch commands", i.e. instructions 
for inserting/removing parts of an inflected word to arrive at the base 
form. This training process results in creating a stemmer table, 
reusable even for previously unseen words (based on the similarity of 
character patterns in input words).

I suggest trying the code from the link above and testing how it works;
even if you only have a moderately sized training corpus (~500 pairs),
the results should be positive.
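
To make the training-pair idea above concrete, here is a deliberately simplified sketch. It is not the stempel API: the real trainer learns full insert/remove "patch commands" rather than plain suffix stripping, and every class and method name below is invented for the example.

=============================================
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;

// Conceptual sketch only (not the actual stempel API): learn a crude
// "strip this suffix" rule from (inflected form, base form) training pairs,
// then apply the longest matching rule to unseen words.
public class SuffixRuleStemmer {

    private final Set suffixes = new HashSet();

    public void learn(String inflected, String base) {
        // only the simple case where the base form is a prefix of the
        // inflected form, e.g. ("searched", "search") -> suffix "ed"
        if (inflected.startsWith(base) && inflected.length() > base.length()) {
            suffixes.add(inflected.substring(base.length()));
        }
    }

    public String stem(String word) {
        // strip the longest learned suffix that matches
        String best = null;
        for (Iterator it = suffixes.iterator(); it.hasNext();) {
            String suffix = (String) it.next();
            if (word.endsWith(suffix) && word.length() > suffix.length()
                    && (best == null || suffix.length() > best.length())) {
                best = suffix;
            }
        }
        return best == null ? word : word.substring(0, word.length() - best.length());
    }

    public static void main(String[] args) {
        SuffixRuleStemmer stemmer = new SuffixRuleStemmer();
        stemmer.learn("searched", "search");   // learns "-ed"
        stemmer.learn("searching", "search");  // learns "-ing"
        System.out.println(stemmer.stem("matched"));   // prints "match"
        System.out.println(stemmer.stem("matching"));  // prints "match"
    }
}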

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com





Re: korean and lucene

Posted by KwonNam Son <kw...@gmail.com>.
First of all, I really appreciate your work on Lucene for Korean words.

But if we cannot support a stemming analyzer for Korean words, I think one
token per Korean character is better.

When we search for a word, we usually use "검색", not "검색하다" ("하다" is like
the "ed" of "searched").
If we cannot get any results from "검색", StandardAnalyzer is useless for
Korean, and I may have to go back to using CJKAnalyzer.

How about leaving the StandardAnalyzer unchanged and adding a new
Analyzer for Korean words?

Thanks.
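
To make the comparison concrete, here is a minimal sketch (the class name KoreanTokenDump is invented for the example; the API is the Lucene 1.9-era one used elsewhere in this thread) that prints the tokens each analyzer emits for a Korean word. According to this thread, the patched StandardAnalyzer keeps 검색하다 as a single token, so a query for 검색 alone will not match it, while CJKAnalyzer's overlapping two-character tokens do include 검색.

=============================================
import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cjk.CJKAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

// Small sketch: print the tokens each analyzer produces for Korean text,
// to see why "검색" does or does not match a document containing "검색하다".
public class KoreanTokenDump {

    private static void dump(Analyzer analyzer, String text) throws Exception {
        System.out.print(analyzer.getClass().getName() + ": ");
        TokenStream ts = analyzer.tokenStream("content", new StringReader(text));
        for (Token t = ts.next(); t != null; t = ts.next()) {
            System.out.print("[" + t.termText() + "] ");
        }
        System.out.println();
    }

    public static void main(String[] args) throws Exception {
        String text = "검색하다";
        dump(new StandardAnalyzer(), text);  // one whole-word token with the LUCENE-444 patch (per this thread)
        dump(new CJKAnalyzer(), text);       // overlapping 2-character tokens, including 검색
    }
}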

2005/11/8, Cheolgoo Kang <ap...@gmail.com>:
> On 11/8/05, Youngho Cho <yo...@nannet.co.kr> wrote:
> > Hello,
> >
> > just simple test ...
> > If I compile the javacc correctly..
> > the patched version doesn't match some situation
> > for example
> > in text
> > '엔진박지성(맨체스터 유나이티드)이 주말 프리미어리그를 위해 벤치를 지키며 재충전의 시간을 가졌다.'
> > if query word is '시간'   than nothing match
> > but if query word is '시간을'  than good match.
>
> That's exactly what I wanted to do with this patch. You need a more
> sophisticated morphological analyzer to do what you want.
> AFAIK, I'm afraid there is no open source software for Korean
> morphological analysis.
>
> And also, if you have the Korean translation of Lucene in Action, there is a
> simple introduction to Korean stemming analysis in Appendix D. It is
> simple enough to hold a list of Korean word endings and check
> each Token for a matching ending. :)
>
> >
> > I think there is some tradeoff here.
> >
> > Maybe need some good stop filter for korean etc.......
> >
> >
> > Thanks
> >
> > Youngho
> >
> > ----- Original Message -----
> > From: "Youngho Cho" <yo...@nannet.co.kr>
> > To: <ja...@lucene.apache.org>
> > Sent: Tuesday, November 08, 2005 4:44 PM
> > Subject: Re: korean and lucene
> >
> >
> > > Hello Cheolgoo,
> > >
> > > I will test the patch.
> > >
> > >
> > > Thanks,
> > >
> > > Youngho
> > >
> > > ----- Original Message -----
> > > From: "Cheolgoo Kang" <ap...@gmail.com>
> > > To: <ja...@lucene.apache.org>; "Youngho Cho" <yo...@nannet.co.kr>
> > > Sent: Tuesday, November 08, 2005 4:06 PM
> > > Subject: Re: korean and lucene
> > >
> > >
> > > > Hello,
> > > >
> > > > I've created a new JIRA issue with Korean analysis that
> > > > StandardAnalyzer splits one word into several tokens each with one
> > > > character. Cause Korean is not a phonogram, one character in Korean
> > > > has almost no meaning at all. So word in Korean should be preserved
> > > > not like Chinese or Japanese.
> > > >
> > > > I've attached a patch to StandardTokenizer to do this and passed the
> > > > test case TestStandardAnalyzer(also patched to test Korean words).
> > > >
> > > > Thanks!
> > > >
> > > > On 10/27/05, Youngho Cho <yo...@nannet.co.kr> wrote:
> > > > > Hello,
> > > > >
> > > > > OK, I've attached my test code for Korean, which is a slightly modified version of Koji's code.
> > > > >
> > > > > Just put into the lia.analysis.i18n package at LuceneInAction
> > > > > and run ant.
> > > > >
> > > > > Hopefully this helps someone.
> > > > >
> > > > > -------- build.xml  ---------
> > > > >
> > > > >   <target name="JapaneseDemo" depends="prepare"
> > > > >           description="Examples of Japanese analysis">
> > > > >     <info>
> > > > >
> > > > >       Japanese Test...
> > > > >
> > > > >     </info>
> > > > >
> > > > >     <run-main class="lia.analysis.i18n.JapaneseDemo"/>
> > > > >   </target>
> > > > >
> > > > >   <target name="KoreanDemo" depends="prepare"
> > > > >           description="Examples of Korean analysis">
> > > > >     <info>
> > > > >
> > > > >       Korean Test...
> > > > >
> > > > >     </info>
> > > > >
> > > > >     <run-main class="lia.analysis.i18n.KoreanDemo"/>
> > > > >   </target>
> > > > >
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Youngho
> > > > >
> > > > >
> > > > > ----- Original Message -----
> > > > > From: "Youngho Cho" <yo...@nannet.co.kr>
> > > > > To: <ja...@lucene.apache.org>; "Youngho Cho" <yo...@nannet.co.kr>
> > > > > Sent: Thursday, October 27, 2005 12:47 PM
> > > > > Subject: Re: korean and lucene
> > > > >
> > > > >
> > > > > > Hello all
> > > > > > Please forgive my previous confusing message.
> > > > > >
> > > > > >      [echo] Running lia.analysis.i18n.KoreanDemo...
> > > > > >      [java] [경] [기]  analyzer = org.apache.lucene.analysis.standard.StandardAnalyzer
> > > > > >      [java] phrase = 경기
> > > > > >      [java] query = "경 기"
> > > > > >
> > > > > > I got the good result.
> > > > > >
> > > > > > When I compile I just rename old version lucene-1.4.3.jar to lucene-1.4.3.jar_bak
> > > > > > and all new 1.9 lucene. and build the test package.
> > > > > > After I remove lucene-1.4.3.jar_bak in lib directory completely
> > > > > > I got the expected result !!!.
> > > > > >
> > > > > > I don't know the reason... ( looks like my finger make some trouble... )
> > > > > >
> > > > > > Anyway thanks Koji and Cheolgoo
> > > > > > I will further test now...
> > > > > >
> > > > > > Youngho
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > ----- Original Message -----
> > > > > > From: "Youngho Cho" <yo...@nannet.co.kr>
> > > > > > To: <ja...@lucene.apache.org>
> > > > > > Sent: Thursday, October 27, 2005 12:28 PM
> > > > > > Subject: Re: korean and lucene
> > > > > >
> > > > > >
> > > > > > > Hello Koji
> > > > > > >
> > > > > > > Here is test result.
> > > > > > > Japanese is OK !.
> > > > > > > maybe ant clean  did some effect.
> > > > > > >
> > > > > > > Anyway please refer to the result using 1.9
> > > > > > >
> > > > > > >      [echo] Running lia.analysis.i18n.JapaneseDemo...
> > > > > > >      [java] [ラ] [メ] [ン] [屋]  analyzer = org.apache.lucene.analysis.standard.StandardAnalyzer
> > > > > > >      [java] phrase = ラ?メン屋
> > > > > > >      [java] query = content:ラ?メン屋
> > > > > > >
> > > > > > >     [echo] Running lia.analysis.i18n.KoreanDemo...
> > > > > > >      [java]  analyzer = org.apache.lucene.analysis.standard.StandardAnalyzer
> > > > > > >      [java] phrase = 경
> > > > > > >      [java] query =
> > > > > > >
> > > > > > >      [echo] Running lia.analysis.i18n.JapaneseDemo...
> > > > > > >      [java] [ラ] [メン] [ン屋]  analyzer = org.apache.lucene.analysis.cjk.CJKAnalyzer
> > > > > > >      [java] phrase = ラ?メン屋
> > > > > > >      [java] query = content:ラ?メン屋
> > > > > > >
> > > > > > >     [echo] Running lia.analysis.i18n.KoreanDemo...
> > > > > > >      [java] [경]  analyzer = org.apache.lucene.analysis.cjk.CJKAnalyzer
> > > > > > >      [java] phrase = 경
> > > > > > >      [java] query = 경
> > > > > > >
> > > > > > >      [echo] Running lia.analysis.i18n.KoreanDemo...
> > > > > > >      [java]  analyzer = org.apache.lucene.analysis.standard.StandardAnalyzer
> > > > > > >      [java] phrase = 경기
> > > > > > >      [java] query =
> > > > > > >
> > > > > > >      [echo] Running lia.analysis.i18n.KoreanDemo...
> > > > > > >      [java] [경기]  analyzer = org.apache.lucene.analysis.cjk.CJKAnalyzer
> > > > > > >      [java] phrase = 경기
> > > > > > >      [java] query = 경기
> > > > > > >
> > > > > > >
> > > > > > > Standard analyzer didn't tokenized the Korean Character at all....
> > > > > > >
> > > > > > > Ug....  look like
> > > > > > >  http://issues.apache.org/jira/browse/LUCENE-444
> > > > > > >  didn't effect at all for Korean.
> > > > > > >
> > > > > > >
> > > > > > > Thanks
> > > > > > >
> > > > > > > Youngho
> > > > > > >
> > > > > > > ----- Original Message -----
> > > > > > > From: "Koji Sekiguchi" <ko...@m4.dion.ne.jp>
> > > > > > > To: <ja...@lucene.apache.org>; "Youngho Cho" <yo...@nannet.co.kr>
> > > > > > > Sent: Thursday, October 27, 2005 11:47 AM
> > > > > > > Subject: RE: korean and lucene
> > > > > > >
> > > > > > >
> > > > > > > > Hello Youngho,
> > > > > > > >
> > > > > > > > I don't understand why you couldn't get hits result in Japanese,
> > > > > > > > though, you had better check why the query was empty with Korean data:
> > > > > > > >
> > > > > > > > > For Korean
> > > > > > > > >      [echo] Running lia.analysis.i18n.KoreanDemo...
> > > > > > > > >      [java] phrase = 경
> > > > > > > > >      [java] query =
> > > > > > > >
> > > > > > > > The last line should be query = 경
> > > > > > > > to get hits result. Can you check why StandardAnalyzer
> > > > > > > > removes "경" during tokenizing?
> > > > > > > >
> > > > > > > > Koji
> > > > > > > >
> > > > > > > > > -----Original Message-----
> > > > > > > > > From: Youngho Cho [mailto:youngho@nannet.co.kr]
> > > > > > > > > Sent: Thursday, October 27, 2005 11:37 AM
> > > > > > > > > To: java-user@lucene.apache.org
> > > > > > > > > Subject: Re: korean and lucene
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Hello Koji,
> > > > > > > > >
> > > > > > > > > Thanks for your kind reply.
> > > > > > > > >
> > > > > > > > > Yes, I used QueryParser. normaly I used
> > > > > > > > > Query = QueryParser.parse( ) method.
> > > > > > > > >
> > > > > > > > > I put your sample code into lia.analysis.i18n package in LuceneAction
> > > > > > > > > and run JapaneseDemo using 1.4 and 1.9
> > > > > > > > >
> > > > > > > > > results are
> > > > > > > > >
> > > > > > > > >      [echo] Running lia.analysis.i18n.JapaneseDemo...
> > > > > > > > >      [java] query = content:ラ?メン屋
> > > > > > > > >
> > > > > > > > > I can't get hits result.
> > > > > > > > >
> > > > > > > > > For Korean
> > > > > > > > >      [echo] Running lia.analysis.i18n.KoreanDemo...
> > > > > > > > >      [java] phrase = 경
> > > > > > > > >      [java] query =
> > > > > > > > >
> > > > > > > > > I can't get query parse result.
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > >
> > > > > > > > > Youngho
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > ----- Original Message -----
> > > > > > > > > From: "Koji Sekiguchi" <ko...@m4.dion.ne.jp>
> > > > > > > > > To: <ja...@lucene.apache.org>; "Youngho Cho" <yo...@nannet.co.kr>
> > > > > > > > > Sent: Thursday, October 27, 2005 9:48 AM
> > > > > > > > > Subject: RE: korean and lucene
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > Hi Youngho,
> > > > > > > > > >
> > > > > > > > > > With regard to Japanese, using StandardAnalyzer,
> > > > > > > > > > I can search a word/phase.
> > > > > > > > > >
> > > > > > > > > > Did you use QueryParser? StandardAnalyzer tokenizes
> > > > > > > > > > CJK characters into a stream of single character.
> > > > > > > > > > Use QueryParser to get a PhraseQuery and search the query.
> > > > > > > > > >
> > > > > > > > > > Please see the following sample code. Replace Japanese
> > > > > > > > > > "contents" and (search target) "phrase" with Korean in the
> > > > > > > > > program and run.
> > > > > > > > > >
> > > > > > > > > > regards,
> > > > > > > > > >
> > > > > > > > > > Koji
> > > > > > > > > >
> > > > > > > > > > =============================================
> > > > > > > > > > import java.io.IOException;
> > > > > > > > > > import org.apache.lucene.analysis.Analyzer;
> > > > > > > > > > import org.apache.lucene.analysis.standard.StandardAnalyzer;
> > > > > > > > > > import org.apache.lucene.analysis.cjk.CJKAnalyzer;
> > > > > > > > > > import org.apache.lucene.store.Directory;
> > > > > > > > > > import org.apache.lucene.store.RAMDirectory;
> > > > > > > > > > import org.apache.lucene.index.IndexWriter;
> > > > > > > > > > import org.apache.lucene.document.Document;
> > > > > > > > > > import org.apache.lucene.document.Field;
> > > > > > > > > > import org.apache.lucene.search.IndexSearcher;
> > > > > > > > > > import org.apache.lucene.search.Hits;
> > > > > > > > > > import org.apache.lucene.search.Query;
> > > > > > > > > > import org.apache.lucene.queryParser.QueryParser;
> > > > > > > > > > import org.apache.lucene.queryParser.ParseException;
> > > > > > > > > >
> > > > > > > > > > public class JapaneseByStandardAnalyzer {
> > > > > > > > > >
> > > > > > > > > >     private static final String FIELD_CONTENT = "content";
> > > > > > > > > >     private static final String[] contents = {
> > > > > > > > > > "東京にはおいしいラーメン屋がたくさんあります。",
> > > > > > > > > > "北海道にもおいしいラーメン屋があります。"
> > > > > > > > > >     };
> > > > > > > > > >     private static final String phrase = "ラーメン屋";
> > > > > > > > > >     //private static final String phrase = "屋";
> > > > > > > > > >     private static Analyzer analyzer = null;
> > > > > > > > > >
> > > > > > > > > >     public static void main( String[] args ) throws
> > > > > > > > > IOException, ParseException {
> > > > > > > > > > Directory directory = makeIndex();
> > > > > > > > > > search( directory );
> > > > > > > > > > directory.close();
> > > > > > > > > >     }
> > > > > > > > > >
> > > > > > > > > >     private static Analyzer getAnalyzer(){
> > > > > > > > > > if( analyzer == null ){
> > > > > > > > > >     analyzer = new StandardAnalyzer();
> > > > > > > > > >     //analyzer = new CJKAnalyzer();
> > > > > > > > > > }
> > > > > > > > > > return analyzer;
> > > > > > > > > >     }
> > > > > > > > > >
> > > > > > > > > >     private static Directory makeIndex() throws IOException {
> > > > > > > > > > Directory directory = new RAMDirectory();
> > > > > > > > > > IndexWriter writer = new IndexWriter( directory, getAnalyzer(), true );
> > > > > > > > > > for( int i = 0; i < contents.length; i++ ){
> > > > > > > > > >     Document doc = new Document();
> > > > > > > > > >     doc.add( new Field( FIELD_CONTENT, contents[i],
> > > > > > > > > Field.Store.YES, Field.Index.TOKENIZED ) );
> > > > > > > > > >     writer.addDocument( doc );
> > > > > > > > > > }
> > > > > > > > > > writer.close();
> > > > > > > > > > return directory;
> > > > > > > > > >     }
> > > > > > > > > >
> > > > > > > > > >     private static void search( Directory directory ) throws
> > > > > > > > > IOException, ParseException {
> > > > > > > > > > IndexSearcher searcher = new IndexSearcher( directory );
> > > > > > > > > > QueryParser parser = new QueryParser( FIELD_CONTENT, getAnalyzer() );
> > > > > > > > > > Query query = parser.parse( phrase );
> > > > > > > > > > System.out.println( "query = " + query );
> > > > > > > > > > Hits hits = searcher.search( query );
> > > > > > > > > > for( int i = 0; i < hits.length(); i++ )
> > > > > > > > > >     System.out.println( "doc = " + hits.doc( i ).get( FIELD_CONTENT ) );
> > > > > > > > > > searcher.close();
> > > > > > > > > >     }
> > > > > > > > > > }
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > -----Original Message-----
> > > > > > > > > > > From: Youngho Cho [mailto:youngho@nannet.co.kr]
> > > > > > > > > > > Sent: Thursday, October 27, 2005 8:18 AM
> > > > > > > > > > > To: java-user@lucene.apache.org; Cheolgoo Kang
> > > > > > > > > > > Subject: Re: korean and lucene
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Hello Cheolgoo,
> > > > > > > > > > >
> > > > > > > > > > > Now I updated my lucene version to 1.9 for using StandardAnalyzer
> > > > > > > > > > > for Korean.
> > > > > > > > > > > And tested your patch which is already adopted in 1.9
> > > > > > > > > > >
> > > > > > > > > > > http://issues.apache.org/jira/browse/LUCENE-444
> > > > > > > > > > >
> > > > > > > > > > > But Still I have no good  results with Korean compare with
> > > > > > > > > CJKAnalyzer.
> > > > > > > > > > >
> > > > > > > > > > > Single character is good match but more two character word
> > > > > > > > > > > doesn't match at all.
> > > > > > > > > > >
> > > > > > > > > > > Am I something missing or still there need some more works ?
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Thanks,
> > > > > > > > > > >
> > > > > > > > > > > Youngho.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > ----- Original Message -----
> > > > > > > > > > > From: "Cheolgoo Kang" <ap...@gmail.com>
> > > > > > > > > > > To: <ja...@lucene.apache.org>; "John Wang" <jo...@gmail.com>
> > > > > > > > > > > Sent: Tuesday, October 04, 2005 10:11 AM
> > > > > > > > > > > Subject: Re: korean and lucene
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > StandardAnalyzer's JavaCC based StandardTokenizer.jj cannot read
> > > > > > > > > > > > Korean part of Unicode character blocks.
> > > > > > > > > > > >
> > > > > > > > > > > > You should 1) use CJKAnalyzer or 2) add Korean character
> > > > > > > > > > > > block(0xAC00~0xD7AF) to the CJK token definition on the
> > > > > > > > > > > > StandardTokenizer.jj file.
> > > > > > > > > > > >
> > > > > > > > > > > > Hope it helps.
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > On 10/4/05, John Wang <jo...@gmail.com> wrote:
> > > > > > > > > > > > > Hi:
> > > > > > > > > > > > >
> > > > > > > > > > > > > We are running into problems with searching on korean
> > > > > > > > > > > documents. We are
> > > > > > > > > > > > > using the StandardAnalyzer and everything works with Chinese
> > > > > > > > > > > and Japanese.
> > > > > > > > > > > > > Are there known problems with Korean with Lucene?
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thanks
> > > > > > > > > > > > >
> > > > > > > > > > > > > -John
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > --
> > > > > > > > > > > > Cheolgoo
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > ---------------------------------------------------------------------
> > > > > > > > > > > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > > > > > > > > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > ---------------------------------------------------------------------
> > > > > > > > > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > > > > > > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > ---------------------------------------------------------------------
> > > > > > > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > > > > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > > > > ---------------------------------------------------------------------
> > > > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > > > >
> > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > > Cheolgoo
>
>
> --
> Cheolgoo
>

Re: korean and lucene

Posted by Youngho Cho <yo...@nannet.co.kr>.
----- Original Message ----- 
From: "Cheolgoo Kang" <ap...@gmail.com>
To: "Youngho Cho" <yo...@nannet.co.kr>
Cc: <ja...@lucene.apache.org>
Sent: Tuesday, November 08, 2005 5:53 PM
Subject: Re: korean and lucene


> On 11/8/05, Youngho Cho <yo...@nannet.co.kr> wrote:
> > Hello,
> >
> > just simple test ...
> > If I compile the javacc correctly..
> > the patched version doesn't match some situation
> > for example
> > in text
> > '엔진박지성(맨체스터 유나이티드)이 주말 프리미어리그를 위해 벤치를 지키며 재충전의 시간을 가졌다.'
> > if query word is '시간'   than nothing match
> > but if query word is '시간을'  than good match.
> 
> That's exactly what I wanted to do with this patch. You need a more
> sophisticated morphological analyzer to do what you want.
> AFAIK, I'm afraid there is no open source software for Korean
> morphological analysis.
>
I hope the Lucene community will support Korean better.

> And also, if you have the Korean translation of Lucene in Action, there is a
> simple introduction to Korean stemming analysis in Appendix D. It is
> simple enough to hold a list of Korean word endings and check
> each Token for a matching ending. :)
> 

Great !
You have at least one user.


Thanks.

Youngho



> >
> > I think there is some tradeoff here.
> >
> > Maybe need some good stop filter for korean etc.......
> >
> >
> > Thanks
> >
> > Youngho
> >
> > ----- Original Message -----
> > From: "Youngho Cho" <yo...@nannet.co.kr>
> > To: <ja...@lucene.apache.org>
> > Sent: Tuesday, November 08, 2005 4:44 PM
> > Subject: Re: korean and lucene
> >
> >
> > > Hello Cheolgoo,
> > >
> > > I will test the patch.
> > >
> > >
> > > Thanks,
> > >
> > > Youngho
> > >
> > > ----- Original Message -----
> > > From: "Cheolgoo Kang" <ap...@gmail.com>
> > > To: <ja...@lucene.apache.org>; "Youngho Cho" <yo...@nannet.co.kr>
> > > Sent: Tuesday, November 08, 2005 4:06 PM
> > > Subject: Re: korean and lucene
> > >
> > >
> > > > Hello,
> > > >
> > > > I've created a new JIRA issue with Korean analysis that
> > > > StandardAnalyzer splits one word into several tokens each with one
> > > > character. Cause Korean is not a phonogram, one character in Korean
> > > > has almost no meaning at all. So word in Korean should be preserved
> > > > not like Chinese or Japanese.
> > > >
> > > > I've attached a patch to StandardTokenizer to do this and passed the
> > > > test case TestStandardAnalyzer(also patched to test Korean words).
> > > >
> > > > Thanks!
> > > >
> > > > On 10/27/05, Youngho Cho <yo...@nannet.co.kr> wrote:
> > > > > Hello,
> > > > >
> > > > > OK, I've attached my test code for Korean, which is a slightly modified version of Koji's code.
> > > > >
> > > > > Just put into the lia.analysis.i18n package at LuceneInAction
> > > > > and run ant.
> > > > >
> > > > > Hopefully this helps someone.
> > > > >
> > > > > -------- build.xml  ---------
> > > > >
> > > > >   <target name="JapaneseDemo" depends="prepare"
> > >           description="Examples of Japanese analysis">
> > > > >     <info>
> > > > >
> > > > >       Japanese Test...
> > > > >
> > > > >     </info>
> > > > >
> > > > >     <run-main class="lia.analysis.i18n.JapaneseDemo"/>
> > > > >   </target>
> > > > >
> > > > >   <target name="KoreanDemo" depends="prepare"
> > > > >           description="Examples of Korean analysis">
> > > > >     <info>
> > > > >
> > > > >       Korean Test...
> > > > >
> > > > >     </info>
> > > > >
> > > > >     <run-main class="lia.analysis.i18n.KoreanDemo"/>
> > > > >   </target>
> > > > >
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Youngho
> > > > >
> > > > >
> > > > > ----- Original Message -----
> > > > > From: "Youngho Cho" <yo...@nannet.co.kr>
> > > > > To: <ja...@lucene.apache.org>; "Youngho Cho" <yo...@nannet.co.kr>
> > > > > Sent: Thursday, October 27, 2005 12:47 PM
> > > > > Subject: Re: korean and lucene
> > > > >
> > > > >
> > > > > > Hello all
> > > > > > Please forgive my previous confusing message.
> > > > > >
> > > > > >      [echo] Running lia.analysis.i18n.KoreanDemo...
> > > > > >      [java] [경] [기]  analyzer = org.apache.lucene.analysis.standard.StandardAnalyzer
> > > > > >      [java] phrase = 경기
> > > > > >      [java] query = "경 기"
> > > > > >
> > > > > > I got the good result.
> > > > > >
> > > > > > When I compile I just rename old version lucene-1.4.3.jar to lucene-1.4.3.jar_bak
> > > > > > and all new 1.9 lucene. and build the test package.
> > > > > > After I remove lucene-1.4.3.jar_bak in lib directory completely
> > > > > > I got the expected result !!!.
> > > > > >
> > > > > > I don't know the reason... ( looks like my finger make some trouble... )
> > > > > >
> > > > > > Anyway thanks Koji and Cheolgoo
> > > > > > I will further test now...
> > > > > >
> > > > > > Youngho
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > ----- Original Message -----
> > > > > > From: "Youngho Cho" <yo...@nannet.co.kr>
> > > > > > To: <ja...@lucene.apache.org>
> > > > > > Sent: Thursday, October 27, 2005 12:28 PM
> > > > > > Subject: Re: korean and lucene
> > > > > >
> > > > > >
> > > > > > > Hello Koji
> > > > > > >
> > > > > > > Here is test result.
> > > > > > > Japanese is OK !.
> > > > > > > maybe ant clean  did some effect.
> > > > > > >
> > > > > > > Anyway please refer to the result using 1.9
> > > > > > >
> > > > > > >      [echo] Running lia.analysis.i18n.JapaneseDemo...
> > > > > > >      [java] [ラ] [メ] [ン] [屋]  analyzer = org.apache.lucene.analysis.standard.StandardAnalyzer
> > > > > > >      [java] phrase = ラ?メン屋
> > > > > > >      [java] query = content:ラ?メン屋
> > > > > > >
> > > > > > >     [echo] Running lia.analysis.i18n.KoreanDemo...
> > > > > > >      [java]  analyzer = org.apache.lucene.analysis.standard.StandardAnalyzer
> > > > > > >      [java] phrase = 경
> > > > > > >      [java] query =
> > > > > > >
> > > > > > >      [echo] Running lia.analysis.i18n.JapaneseDemo...
> > > > > > >      [java] [ラ] [メン] [ン屋]  analyzer = org.apache.lucene.analysis.cjk.CJKAnalyzer
> > > > > > >      [java] phrase = ラ?メン屋
> > > > > > >      [java] query = content:ラ?メン屋
> > > > > > >
> > > > > > >     [echo] Running lia.analysis.i18n.KoreanDemo...
> > > > > > >      [java] [경]  analyzer = org.apache.lucene.analysis.cjk.CJKAnalyzer
> > > > > > >      [java] phrase = 경
> > > > > > >      [java] query = 경
> > > > > > >
> > > > > > >      [echo] Running lia.analysis.i18n.KoreanDemo...
> > > > > > >      [java]  analyzer = org.apache.lucene.analysis.standard.StandardAnalyzer
> > > > > > >      [java] phrase = 경기
> > > > > > >      [java] query =
> > > > > > >
> > > > > > >      [echo] Running lia.analysis.i18n.KoreanDemo...
> > > > > > >      [java] [경기]  analyzer = org.apache.lucene.analysis.cjk.CJKAnalyzer
> > > > > > >      [java] phrase = 경기
> > > > > > >      [java] query = 경기
> > > > > > >
> > > > > > >
> > > > > > > Standard analyzer didn't tokenized the Korean Character at all....
> > > > > > >
> > > > > > > Ug....  look like
> > > > > > >  http://issues.apache.org/jira/browse/LUCENE-444
> > > > > > >  didn't effect at all for Korean.
> > > > > > >
> > > > > > >
> > > > > > > Thanks
> > > > > > >
> > > > > > > Youngho
> > > > > > >
> > > > > > > ----- Original Message -----
> > > > > > > From: "Koji Sekiguchi" <ko...@m4.dion.ne.jp>
> > > > > > > To: <ja...@lucene.apache.org>; "Youngho Cho" <yo...@nannet.co.kr>
> > > > > > > Sent: Thursday, October 27, 2005 11:47 AM
> > > > > > > Subject: RE: korean and lucene
> > > > > > >
> > > > > > >
> > > > > > > > Hello Youngho,
> > > > > > > >
> > > > > > > > I don't understand why you couldn't get hits result in Japanese,
> > > > > > > > though, you had better check why the query was empty with Korean data:
> > > > > > > >
> > > > > > > > > For Korean
> > > > > > > > >      [echo] Running lia.analysis.i18n.KoreanDemo...
> > > > > > > > >      [java] phrase = 경
> > > > > > > > >      [java] query =
> > > > > > > >
> > > > > > > > The last line should be query = 경
> > > > > > > > to get hits result. Can you check why StandardAnalyzer
> > > > > > > > removes "경" during tokenizing?
> > > > > > > >
> > > > > > > > Koji
> > > > > > > >
> > > > > > > > > -----Original Message-----
> > > > > > > > > From: Youngho Cho [mailto:youngho@nannet.co.kr]
> > > > > > > > > Sent: Thursday, October 27, 2005 11:37 AM
> > > > > > > > > To: java-user@lucene.apache.org
> > > > > > > > > Subject: Re: korean and lucene
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Hello Koji,
> > > > > > > > >
> > > > > > > > > Thanks for your kind reply.
> > > > > > > > >
> > > > > > > > > Yes, I used QueryParser. normaly I used
> > > > > > > > > Query = QueryParser.parse( ) method.
> > > > > > > > >
> > > > > > > > > I put your sample code into lia.analysis.i18n package in LuceneAction
> > > > > > > > > and run JapaneseDemo using 1.4 and 1.9
> > > > > > > > >
> > > > > > > > > results are
> > > > > > > > >
> > > > > > > > >      [echo] Running lia.analysis.i18n.JapaneseDemo...
> > > > > > > > >      [java] query = content:ラ?メン屋
> > > > > > > > >
> > > > > > > > > I can't get hits result.
> > > > > > > > >
> > > > > > > > > For Korean
> > > > > > > > >      [echo] Running lia.analysis.i18n.KoreanDemo...
> > > > > > > > >      [java] phrase = 경
> > > > > > > > >      [java] query =
> > > > > > > > >
> > > > > > > > > I can't get query parse result.
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > >
> > > > > > > > > Youngho
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > ----- Original Message -----
> > > > > > > > > From: "Koji Sekiguchi" <ko...@m4.dion.ne.jp>
> > > > > > > > > To: <ja...@lucene.apache.org>; "Youngho Cho" <yo...@nannet.co.kr>
> > > > > > > > > Sent: Thursday, October 27, 2005 9:48 AM
> > > > > > > > > Subject: RE: korean and lucene
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > Hi Youngho,
> > > > > > > > > >
> > > > > > > > > > With regard to Japanese, using StandardAnalyzer,
> > > > > > > > > > I can search a word/phase.
> > > > > > > > > >
> > > > > > > > > > Did you use QueryParser? StandardAnalyzer tokenizes
> > > > > > > > > > CJK characters into a stream of single character.
> > > > > > > > > > Use QueryParser to get a PhraseQuery and search the query.
> > > > > > > > > >
> > > > > > > > > > Please see the following sample code. Replace Japanese
> > > > > > > > > > "contents" and (search target) "phrase" with Korean in the
> > > > > > > > > program and run.
> > > > > > > > > >
> > > > > > > > > > regards,
> > > > > > > > > >
> > > > > > > > > > Koji
> > > > > > > > > >
> > > > > > > > > > =============================================
> > > > > > > > > > import java.io.IOException;
> > > > > > > > > > import org.apache.lucene.analysis.Analyzer;
> > > > > > > > > > import org.apache.lucene.analysis.standard.StandardAnalyzer;
> > > > > > > > > > import org.apache.lucene.analysis.cjk.CJKAnalyzer;
> > > > > > > > > > import org.apache.lucene.store.Directory;
> > > > > > > > > > import org.apache.lucene.store.RAMDirectory;
> > > > > > > > > > import org.apache.lucene.index.IndexWriter;
> > > > > > > > > > import org.apache.lucene.document.Document;
> > > > > > > > > > import org.apache.lucene.document.Field;
> > > > > > > > > > import org.apache.lucene.search.IndexSearcher;
> > > > > > > > > > import org.apache.lucene.search.Hits;
> > > > > > > > > > import org.apache.lucene.search.Query;
> > > > > > > > > > import org.apache.lucene.queryParser.QueryParser;
> > > > > > > > > > import org.apache.lucene.queryParser.ParseException;
> > > > > > > > > >
> > > > > > > > > > public class JapaneseByStandardAnalyzer {
> > > > > > > > > >
> > > > > > > > > >     private static final String FIELD_CONTENT = "content";
> > > > > > > > > >     private static final String[] contents = {
> > > > > > > > > > "東京にはおいしいラーメン屋がたくさんあります。",
> > > > > > > > > > "北海道にもおいしいラーメン屋があります。"
> > > > > > > > > >     };
> > > > > > > > > >     private static final String phrase = "ラーメン屋";
> > > > > > > > > >     //private static final String phrase = "屋";
> > > > > > > > > >     private static Analyzer analyzer = null;
> > > > > > > > > >
> > > > > > > > > >     public static void main( String[] args ) throws
> > > > > > > > > IOException, ParseException {
> > > > > > > > > > Directory directory = makeIndex();
> > > > > > > > > > search( directory );
> > > > > > > > > > directory.close();
> > > > > > > > > >     }
> > > > > > > > > >
> > > > > > > > > >     private static Analyzer getAnalyzer(){
> > > > > > > > > > if( analyzer == null ){
> > > > > > > > > >     analyzer = new StandardAnalyzer();
> > > > > > > > > >     //analyzer = new CJKAnalyzer();
> > > > > > > > > > }
> > > > > > > > > > return analyzer;
> > > > > > > > > >     }
> > > > > > > > > >
> > > > > > > > > >     private static Directory makeIndex() throws IOException {
> > > > > > > > > > Directory directory = new RAMDirectory();
> > > > > > > > > > IndexWriter writer = new IndexWriter( directory, getAnalyzer(), true );
> > > > > > > > > > for( int i = 0; i < contents.length; i++ ){
> > > > > > > > > >     Document doc = new Document();
> > > > > > > > > >     doc.add( new Field( FIELD_CONTENT, contents[i],
> > > > > > > > > Field.Store.YES, Field.Index.TOKENIZED ) );
> > > > > > > > > >     writer.addDocument( doc );
> > > > > > > > > > }
> > > > > > > > > > writer.close();
> > > > > > > > > > return directory;
> > > > > > > > > >     }
> > > > > > > > > >
> > > > > > > > > >     private static void search( Directory directory ) throws
> > > > > > > > > IOException, ParseException {
> > > > > > > > > > IndexSearcher searcher = new IndexSearcher( directory );
> > > > > > > > > > QueryParser parser = new QueryParser( FIELD_CONTENT, getAnalyzer() );
> > > > > > > > > > Query query = parser.parse( phrase );
> > > > > > > > > > System.out.println( "query = " + query );
> > > > > > > > > > Hits hits = searcher.search( query );
> > > > > > > > > > for( int i = 0; i < hits.length(); i++ )
> > > > > > > > > >     System.out.println( "doc = " + hits.doc( i ).get( FIELD_CONTENT ) );
> > > > > > > > > > searcher.close();
> > > > > > > > > >     }
> > > > > > > > > > }
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > -----Original Message-----
> > > > > > > > > > > From: Youngho Cho [mailto:youngho@nannet.co.kr]
> > > > > > > > > > > Sent: Thursday, October 27, 2005 8:18 AM
> > > > > > > > > > > To: java-user@lucene.apache.org; Cheolgoo Kang
> > > > > > > > > > > Subject: Re: korean and lucene
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Hello Cheolgoo,
> > > > > > > > > > >
> > > > > > > > > > > Now I updated my lucene version to 1.9 for using StandardAnalyzer
> > > > > > > > > > > for Korean.
> > > > > > > > > > > And tested your patch which is already adopted in 1.9
> > > > > > > > > > >
> > > > > > > > > > > http://issues.apache.org/jira/browse/LUCENE-444
> > > > > > > > > > >
> > > > > > > > > > > But Still I have no good  results with Korean compare with
> > > > > > > > > CJKAnalyzer.
> > > > > > > > > > >
> > > > > > > > > > > Single character is good match but more two character word
> > > > > > > > > > > doesn't match at all.
> > > > > > > > > > >
> > > > > > > > > > > Am I something missing or still there need some more works ?
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Thanks,
> > > > > > > > > > >
> > > > > > > > > > > Youngho.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > ----- Original Message -----
> > > > > > > > > > > From: "Cheolgoo Kang" <ap...@gmail.com>
> > > > > > > > > > > To: <ja...@lucene.apache.org>; "John Wang" <jo...@gmail.com>
> > > > > > > > > > > Sent: Tuesday, October 04, 2005 10:11 AM
> > > > > > > > > > > Subject: Re: korean and lucene
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > StandardAnalyzer's JavaCC based StandardTokenizer.jj cannot read
> > > > > > > > > > > > Korean part of Unicode character blocks.
> > > > > > > > > > > >
> > > > > > > > > > > > You should 1) use CJKAnalyzer or 2) add Korean character
> > > > > > > > > > > > block(0xAC00~0xD7AF) to the CJK token definition on the
> > > > > > > > > > > > StandardTokenizer.jj file.
> > > > > > > > > > > >
> > > > > > > > > > > > Hope it helps.
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > On 10/4/05, John Wang <jo...@gmail.com> wrote:
> > > > > > > > > > > > > Hi:
> > > > > > > > > > > > >
> > > > > > > > > > > > > We are running into problems with searching on korean
> > > > > > > > > > > documents. We are
> > > > > > > > > > > > > using the StandardAnalyzer and everything works with Chinese
> > > > > > > > > > > and Japanese.
> > > > > > > > > > > > > Are there known problems with Korean with Lucene?
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thanks
> > > > > > > > > > > > >
> > > > > > > > > > > > > -John
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > --
> > > > > > > > > > > > Cheolgoo
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > ---------------------------------------------------------------------
> > > > > > > > > > > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > > > > > > > > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > ---------------------------------------------------------------------
> > > > > > > > > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > > > > > > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > ---------------------------------------------------------------------
> > > > > > > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > > > > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > > > > ---------------------------------------------------------------------
> > > > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > > > >
> > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > > Cheolgoo
> 
> 
> --
> Cheolgoo

Re: korean and lucene

Posted by Cheolgoo Kang <ap...@gmail.com>.
On 11/8/05, Youngho Cho <yo...@nannet.co.kr> wrote:
> Hello,
>
> Just a simple test...
> If I compiled the JavaCC grammar correctly,
> the patched version doesn't match in some situations.
> For example, in the text
> '엔진박지성(맨체스터 유나이티드)이 주말 프리미어리그를 위해 벤치를 지키며 재충전의 시간을 가졌다.'
> if the query word is '시간' then nothing matches,
> but if the query word is '시간을' then it matches.

That's exactly what I wanted to do with this patch. You need a more
sophisticated morphological analyzer to do what you want here.
AFAIK, I'm afraid there is no open-source software that does Korean
morphological analysis.

Also, if you have the Korean translation of Lucene in Action, there is a
simple introduction to Korean stemming analysis in Appendix D. The idea
is simple: keep a list of Korean word endings and check each Token
against that list, stripping any ending that matches. :)
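
If you want to try that approach without the book, here is a rough,
untested sketch of such a filter against the old 1.x TokenFilter API.
The ending list below is only an illustration I made up, not the list
from Appendix D, and a real filter would need a much larger list:

import java.io.IOException;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

public class SimpleKoreanEndingFilter extends TokenFilter {

    // Hypothetical sample of common word endings; extend as needed.
    private static final String[] ENDINGS = { "을", "를", "이", "가", "은", "는", "의", "에" };

    public SimpleKoreanEndingFilter(TokenStream input) {
        super(input);
    }

    public Token next() throws IOException {
        Token token = input.next();
        if (token == null) {
            return null;
        }
        String text = token.termText();
        for (int i = 0; i < ENDINGS.length; i++) {
            String ending = ENDINGS[i];
            // Strip the ending only if something is left of the word.
            if (text.length() > ending.length() && text.endsWith(ending)) {
                String stemmed = text.substring(0, text.length() - ending.length());
                return new Token(stemmed, token.startOffset(), token.endOffset(), token.type());
            }
        }
        return token;
    }
}

You would wrap it around the tokenizer in your Analyzer at both index
and query time, so that '시간을' and '시간' end up as the same term.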

>
> I think there is a tradeoff here.
>
> Maybe we need a good stop filter for Korean, etc.
>
>
> Thanks
>
> Youngho
>
> ----- Original Message -----
> From: "Youngho Cho" <yo...@nannet.co.kr>
> To: <ja...@lucene.apache.org>
> Sent: Tuesday, November 08, 2005 4:44 PM
> Subject: Re: korean and lucene
>
>
> > Hello Cheolgoo,
> >
> > I will test the patch.
> >
> >
> > Thanks,
> >
> > Youngho
> >
> > ----- Original Message -----
> > From: "Cheolgoo Kang" <ap...@gmail.com>
> > To: <ja...@lucene.apache.org>; "Youngho Cho" <yo...@nannet.co.kr>
> > Sent: Tuesday, November 08, 2005 4:06 PM
> > Subject: Re: korean and lucene
> >
> >
> > > Hello,
> > >
> > > I've created a new JIRA issue with Korean analysis that
> > > StandardAnalyzer splits one word into several tokens each with one
> > > character. Cause Korean is not a phonogram, one character in Korean
> > > has almost no meaning at all. So word in Korean should be preserved
> > > not like Chinese or Japanese.
> > >
> > > I've attached a patch to StandardTokenizer to do this and passed the
> > > test case TestStandardAnalyzer(also patched to test Korean words).
> > >
> > > Thanks!
> > >
> > > On 10/27/05, Youngho Cho <yo...@nannet.co.kr> wrote:
> > > > Hello,
> > > >
> > > > Ok , I've attached my test code for Korean which is slitely modified Koji's code.
> > > >
> > > > Just put into the lia.analysis.i18n package at LuceneInAction
> > > > and run ant.
> > > >
> > > > Hopely someone is helped.
> > > >
> > > > -------- build.xml  ---------
> > > >
> > > >   <target name="JapaneseDemo" depends="prepare"
> > > >           description="Examples of Jananese analysis">
> > > >     <info>
> > > >
> > > >       Japanese Test...
> > > >
> > > >     </info>
> > > >
> > > >     <run-main class="lia.analysis.i18n.JapaneseDemo"/>
> > > >   </target>
> > > >
> > > >   <target name="KoreanDemo" depends="prepare"
> > > >           description="Examples of Korean analysis">
> > > >     <info>
> > > >
> > > >       Korean Test...
> > > >
> > > >     </info>
> > > >
> > > >     <run-main class="lia.analysis.i18n.KoreanDemo"/>
> > > >   </target>
> > > >
> > > >
> > > > Thanks,
> > > >
> > > > Youngho
> > > >
> > > >
> > > > ----- Original Message -----
> > > > From: "Youngho Cho" <yo...@nannet.co.kr>
> > > > To: <ja...@lucene.apache.org>; "Youngho Cho" <yo...@nannet.co.kr>
> > > > Sent: Thursday, October 27, 2005 12:47 PM
> > > > Subject: Re: korean and lucene
> > > >
> > > >
> > > > > Hello all
> > > > > Plese forgive me pervious my stupid message
> > > > >
> > > > >      [echo] Running lia.analysis.i18n.KoreanDemo...
> > > > >      [java] [경] [기]  analyzer = org.apache.lucene.analysis.standard.StandardAnalyzer
> > > > >      [java] phrase = 경기
> > > > >      [java] query = "경 기"
> > > > >
> > > > > I got the good result.
> > > > >
> > > > > When I compile I just rename old version lucene-1.4.3.jar to lucene-1.4.3.jar_bak
> > > > > and all new 1.9 lucene. and build the test package.
> > > > > After I remove lucene-1.4.3.jar_bak in lib directory completely
> > > > > I got the expected result !!!.
> > > > >
> > > > > I don't know the reason... ( looks like my finger make some trouble... )
> > > > >
> > > > > Anyway thanks Koji and Cheolgoo
> > > > > I will further test now...
> > > > >
> > > > > Youngho
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > ----- Original Message -----
> > > > > From: "Youngho Cho" <yo...@nannet.co.kr>
> > > > > To: <ja...@lucene.apache.org>
> > > > > Sent: Thursday, October 27, 2005 12:28 PM
> > > > > Subject: Re: korean and lucene
> > > > >
> > > > >
> > > > > > Hello Koji
> > > > > >
> > > > > > Here is test result.
> > > > > > Japanese is OK !.
> > > > > > maybe ant clean  did some effect.
> > > > > >
> > > > > > Anyway please refer to the result using 1.9
> > > > > >
> > > > > >      [echo] Running lia.analysis.i18n.JapaneseDemo...
> > > > > >      [java] [ラ] [メ] [ン] [屋]  analyzer = org.apache.lucene.analysis.standard.StandardAnalyzer
> > > > > >      [java] phrase = ラ?メン屋
> > > > > >      [java] query = content:ラ?メン屋
> > > > > >
> > > > > >     [echo] Running lia.analysis.i18n.KoreanDemo...
> > > > > >      [java]  analyzer = org.apache.lucene.analysis.standard.StandardAnalyzer
> > > > > >      [java] phrase = 경
> > > > > >      [java] query =
> > > > > >
> > > > > >      [echo] Running lia.analysis.i18n.JapaneseDemo...
> > > > > >      [java] [ラ] [メン] [ン屋]  analyzer = org.apache.lucene.analysis.cjk.CJKAnalyzer
> > > > > >      [java] phrase = ラ?メン屋
> > > > > >      [java] query = content:ラ?メン屋
> > > > > >
> > > > > >     [echo] Running lia.analysis.i18n.KoreanDemo...
> > > > > >      [java] [경]  analyzer = org.apache.lucene.analysis.cjk.CJKAnalyzer
> > > > > >      [java] phrase = 경
> > > > > >      [java] query = 경
> > > > > >
> > > > > >      [echo] Running lia.analysis.i18n.KoreanDemo...
> > > > > >      [java]  analyzer = org.apache.lucene.analysis.standard.StandardAnalyzer
> > > > > >      [java] phrase = 경기
> > > > > >      [java] query =
> > > > > >
> > > > > >      [echo] Running lia.analysis.i18n.KoreanDemo...
> > > > > >      [java] [경기]  analyzer = org.apache.lucene.analysis.cjk.CJKAnalyzer
> > > > > >      [java] phrase = 경기
> > > > > >      [java] query = 경기
> > > > > >
> > > > > >
> > > > > > Standard analyzer didn't tokenized the Korean Character at all....
> > > > > >
> > > > > > Ug....  look like
> > > > > >  http://issues.apache.org/jira/browse/LUCENE-444
> > > > > >  didn't effect at all for Korean.
> > > > > >
> > > > > >
> > > > > > Thanks
> > > > > >
> > > > > > Youngho
> > > > > >
> > > > > > ----- Original Message -----
> > > > > > From: "Koji Sekiguchi" <ko...@m4.dion.ne.jp>
> > > > > > To: <ja...@lucene.apache.org>; "Youngho Cho" <yo...@nannet.co.kr>
> > > > > > Sent: Thursday, October 27, 2005 11:47 AM
> > > > > > Subject: RE: korean and lucene
> > > > > >
> > > > > >
> > > > > > > Hello Youngho,
> > > > > > >
> > > > > > > I don't understand why you couldn't get hits result in Japanese,
> > > > > > > though, you had better check why the query was empty with Korean data:
> > > > > > >
> > > > > > > > For Korean
> > > > > > > >      [echo] Running lia.analysis.i18n.KoreanDemo...
> > > > > > > >      [java] phrase = 경
> > > > > > > >      [java] query =
> > > > > > >
> > > > > > > The last line should be query = 경
> > > > > > > to get hits result. Can you check why StandardAnalyzer
> > > > > > > removes "경" during tokenizing?
> > > > > > >
> > > > > > > Koji
> > > > > > >
> > > > > > > > -----Original Message-----
> > > > > > > > From: Youngho Cho [mailto:youngho@nannet.co.kr]
> > > > > > > > Sent: Thursday, October 27, 2005 11:37 AM
> > > > > > > > To: java-user@lucene.apache.org
> > > > > > > > Subject: Re: korean and lucene
> > > > > > > >
> > > > > > > >
> > > > > > > > Hello Koji,
> > > > > > > >
> > > > > > > > Thanks for your kind reply.
> > > > > > > >
> > > > > > > > Yes, I used QueryParser. normaly I used
> > > > > > > > Query = QueryParser.parse( ) method.
> > > > > > > >
> > > > > > > > I put your sample code into lia.analysis.i18n package in LuceneAction
> > > > > > > > and run JapaneseDemo using 1.4 and 1.9
> > > > > > > >
> > > > > > > > results are
> > > > > > > >
> > > > > > > >      [echo] Running lia.analysis.i18n.JapaneseDemo...
> > > > > > > >      [java] query = content:ラ?メン屋
> > > > > > > >
> > > > > > > > I can't get hits result.
> > > > > > > >
> > > > > > > > For Korean
> > > > > > > >      [echo] Running lia.analysis.i18n.KoreanDemo...
> > > > > > > >      [java] phrase = 경
> > > > > > > >      [java] query =
> > > > > > > >
> > > > > > > > I can't get query parse result.
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > >
> > > > > > > > Youngho
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > ----- Original Message -----
> > > > > > > > From: "Koji Sekiguchi" <ko...@m4.dion.ne.jp>
> > > > > > > > To: <ja...@lucene.apache.org>; "Youngho Cho" <yo...@nannet.co.kr>
> > > > > > > > Sent: Thursday, October 27, 2005 9:48 AM
> > > > > > > > Subject: RE: korean and lucene
> > > > > > > >
> > > > > > > >
> > > > > > > > > Hi Youngho,
> > > > > > > > >
> > > > > > > > > With regard to Japanese, using StandardAnalyzer,
> > > > > > > > > I can search a word/phase.
> > > > > > > > >
> > > > > > > > > Did you use QueryParser? StandardAnalyzer tokenizes
> > > > > > > > > CJK characters into a stream of single character.
> > > > > > > > > Use QueryParser to get a PhraseQuery and search the query.
> > > > > > > > >
> > > > > > > > > Please see the following sample code. Replace Japanese
> > > > > > > > > "contents" and (search target) "phrase" with Korean in the
> > > > > > > > program and run.
> > > > > > > > >
> > > > > > > > > regards,
> > > > > > > > >
> > > > > > > > > Koji
> > > > > > > > >
> > > > > > > > > =============================================
> > > > > > > > > import java.io.IOException;
> > > > > > > > > import org.apache.lucene.analysis.Analyzer;
> > > > > > > > > import org.apache.lucene.analysis.standard.StandardAnalyzer;
> > > > > > > > > import org.apache.lucene.analysis.cjk.CJKAnalyzer;
> > > > > > > > > import org.apache.lucene.store.Directory;
> > > > > > > > > import org.apache.lucene.store.RAMDirectory;
> > > > > > > > > import org.apache.lucene.index.IndexWriter;
> > > > > > > > > import org.apache.lucene.document.Document;
> > > > > > > > > import org.apache.lucene.document.Field;
> > > > > > > > > import org.apache.lucene.search.IndexSearcher;
> > > > > > > > > import org.apache.lucene.search.Hits;
> > > > > > > > > import org.apache.lucene.search.Query;
> > > > > > > > > import org.apache.lucene.queryParser.QueryParser;
> > > > > > > > > import org.apache.lucene.queryParser.ParseException;
> > > > > > > > >
> > > > > > > > > public class JapaneseByStandardAnalyzer {
> > > > > > > > >
> > > > > > > > >     private static final String FIELD_CONTENT = "content";
> > > > > > > > >     private static final String[] contents = {
> > > > > > > > > "東京にはおいしいラーメン屋がたくさんあります。",
> > > > > > > > > "北海道にもおいしいラーメン屋があります。"
> > > > > > > > >     };
> > > > > > > > >     private static final String phrase = "ラーメン屋";
> > > > > > > > >     //private static final String phrase = "屋";
> > > > > > > > >     private static Analyzer analyzer = null;
> > > > > > > > >
> > > > > > > > >     public static void main( String[] args ) throws
> > > > > > > > IOException, ParseException {
> > > > > > > > > Directory directory = makeIndex();
> > > > > > > > > search( directory );
> > > > > > > > > directory.close();
> > > > > > > > >     }
> > > > > > > > >
> > > > > > > > >     private static Analyzer getAnalyzer(){
> > > > > > > > > if( analyzer == null ){
> > > > > > > > >     analyzer = new StandardAnalyzer();
> > > > > > > > >     //analyzer = new CJKAnalyzer();
> > > > > > > > > }
> > > > > > > > > return analyzer;
> > > > > > > > >     }
> > > > > > > > >
> > > > > > > > >     private static Directory makeIndex() throws IOException {
> > > > > > > > > Directory directory = new RAMDirectory();
> > > > > > > > > IndexWriter writer = new IndexWriter( directory, getAnalyzer(), true );
> > > > > > > > > for( int i = 0; i < contents.length; i++ ){
> > > > > > > > >     Document doc = new Document();
> > > > > > > > >     doc.add( new Field( FIELD_CONTENT, contents[i],
> > > > > > > > Field.Store.YES, Field.Index.TOKENIZED ) );
> > > > > > > > >     writer.addDocument( doc );
> > > > > > > > > }
> > > > > > > > > writer.close();
> > > > > > > > > return directory;
> > > > > > > > >     }
> > > > > > > > >
> > > > > > > > >     private static void search( Directory directory ) throws
> > > > > > > > IOException, ParseException {
> > > > > > > > > IndexSearcher searcher = new IndexSearcher( directory );
> > > > > > > > > QueryParser parser = new QueryParser( FIELD_CONTENT, getAnalyzer() );
> > > > > > > > > Query query = parser.parse( phrase );
> > > > > > > > > System.out.println( "query = " + query );
> > > > > > > > > Hits hits = searcher.search( query );
> > > > > > > > > for( int i = 0; i < hits.length(); i++ )
> > > > > > > > >     System.out.println( "doc = " + hits.doc( i ).get( FIELD_CONTENT ) );
> > > > > > > > > searcher.close();
> > > > > > > > >     }
> > > > > > > > > }
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > -----Original Message-----
> > > > > > > > > > From: Youngho Cho [mailto:youngho@nannet.co.kr]
> > > > > > > > > > Sent: Thursday, October 27, 2005 8:18 AM
> > > > > > > > > > To: java-user@lucene.apache.org; Cheolgoo Kang
> > > > > > > > > > Subject: Re: korean and lucene
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Hello Cheolgoo,
> > > > > > > > > >
> > > > > > > > > > Now I updated my lucene version to 1.9 for using StandardAnalyzer
> > > > > > > > > > for Korean.
> > > > > > > > > > And tested your patch which is already adopted in 1.9
> > > > > > > > > >
> > > > > > > > > > http://issues.apache.org/jira/browse/LUCENE-444
> > > > > > > > > >
> > > > > > > > > > But Still I have no good  results with Korean compare with
> > > > > > > > CJKAnalyzer.
> > > > > > > > > >
> > > > > > > > > > Single character is good match but more two character word
> > > > > > > > > > doesn't match at all.
> > > > > > > > > >
> > > > > > > > > > Am I something missing or still there need some more works ?
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > >
> > > > > > > > > > Youngho.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > ----- Original Message -----
> > > > > > > > > > From: "Cheolgoo Kang" <ap...@gmail.com>
> > > > > > > > > > To: <ja...@lucene.apache.org>; "John Wang" <jo...@gmail.com>
> > > > > > > > > > Sent: Tuesday, October 04, 2005 10:11 AM
> > > > > > > > > > Subject: Re: korean and lucene
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > StandardAnalyzer's JavaCC based StandardTokenizer.jj cannot read
> > > > > > > > > > > Korean part of Unicode character blocks.
> > > > > > > > > > >
> > > > > > > > > > > You should 1) use CJKAnalyzer or 2) add Korean character
> > > > > > > > > > > block(0xAC00~0xD7AF) to the CJK token definition on the
> > > > > > > > > > > StandardTokenizer.jj file.
> > > > > > > > > > >
> > > > > > > > > > > Hope it helps.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > On 10/4/05, John Wang <jo...@gmail.com> wrote:
> > > > > > > > > > > > Hi:
> > > > > > > > > > > >
> > > > > > > > > > > > We are running into problems with searching on korean
> > > > > > > > > > documents. We are
> > > > > > > > > > > > using the StandardAnalyzer and everything works with Chinese
> > > > > > > > > > and Japanese.
> > > > > > > > > > > > Are there known problems with Korean with Lucene?
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks
> > > > > > > > > > > >
> > > > > > > > > > > > -John
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > --
> > > > > > > > > > > Cheolgoo
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > ---------------------------------------------------------------------
> > > > > > > > > > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > > > > > > > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > ---------------------------------------------------------------------
> > > > > > > > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > > > > > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > ---------------------------------------------------------------------
> > > > > > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > > > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > > > ---------------------------------------------------------------------
> > > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > > >
> > > >
> > > >
> > >
> > >
> > > --
> > > Cheolgoo


--
Cheolgoo

Re: korean and lucene

Posted by Youngho Cho <yo...@nannet.co.kr>.
Hello,

Just a simple test...
If I compiled the JavaCC grammar correctly,
the patched version doesn't match in some situations.
For example, in the text
'엔진박지성(맨체스터 유나이티드)이 주말 프리미어리그를 위해 벤치를 지키며 재충전의 시간을 가졌다.'
if the query word is '시간' then nothing matches,
but if the query word is '시간을' then it matches.
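
A quick, untested way to see what the patched StandardAnalyzer actually
emits for part of that sentence (using the 1.9 tokenStream API):

import java.io.IOException;
import java.io.StringReader;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class KoreanTokenDump {
    public static void main(String[] args) throws IOException {
        // Print each token the analyzer produces for a small Korean fragment.
        TokenStream stream = new StandardAnalyzer().tokenStream(
                "content", new StringReader("재충전의 시간을 가졌다"));
        for (Token token = stream.next(); token != null; token = stream.next()) {
            System.out.print("[" + token.termText() + "] ");
        }
        System.out.println();
    }
}

If whole words such as [시간을] come out as single tokens, that explains
why the bare query '시간' finds nothing.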

I think there is a tradeoff here.

Maybe we need a good stop filter for Korean, etc.


Thanks

Youngho

----- Original Message ----- 
From: "Youngho Cho" <yo...@nannet.co.kr>
To: <ja...@lucene.apache.org>
Sent: Tuesday, November 08, 2005 4:44 PM
Subject: Re: korean and lucene


> Hello Cheolgoo,
> 
> I will test the patch.
> 
> 
> Thanks,
> 
> Youngho
> 
> ----- Original Message ----- 
> From: "Cheolgoo Kang" <ap...@gmail.com>
> To: <ja...@lucene.apache.org>; "Youngho Cho" <yo...@nannet.co.kr>
> Sent: Tuesday, November 08, 2005 4:06 PM
> Subject: Re: korean and lucene
> 
> 
> > Hello,
> > 
> > I've created a new JIRA issue with Korean analysis that
> > StandardAnalyzer splits one word into several tokens each with one
> > character. Cause Korean is not a phonogram, one character in Korean
> > has almost no meaning at all. So word in Korean should be preserved
> > not like Chinese or Japanese.
> > 
> > I've attached a patch to StandardTokenizer to do this and passed the
> > test case TestStandardAnalyzer(also patched to test Korean words).
> > 
> > Thanks!
> > 
> > On 10/27/05, Youngho Cho <yo...@nannet.co.kr> wrote:
> > > Hello,
> > >
> > > Ok , I've attached my test code for Korean which is slitely modified Koji's code.
> > >
> > > Just put into the lia.analysis.i18n package at LuceneInAction
> > > and run ant.
> > >
> > > Hopely someone is helped.
> > >
> > > -------- build.xml  ---------
> > >
> > >   <target name="JapaneseDemo" depends="prepare"
> > >           description="Examples of Jananese analysis">
> > >     <info>
> > >
> > >       Japanese Test...
> > >
> > >     </info>
> > >
> > >     <run-main class="lia.analysis.i18n.JapaneseDemo"/>
> > >   </target>
> > >
> > >   <target name="KoreanDemo" depends="prepare"
> > >           description="Examples of Korean analysis">
> > >     <info>
> > >
> > >       Korean Test...
> > >
> > >     </info>
> > >
> > >     <run-main class="lia.analysis.i18n.KoreanDemo"/>
> > >   </target>
> > >
> > >
> > > Thanks,
> > >
> > > Youngho
> > >
> > >
> > > ----- Original Message -----
> > > From: "Youngho Cho" <yo...@nannet.co.kr>
> > > To: <ja...@lucene.apache.org>; "Youngho Cho" <yo...@nannet.co.kr>
> > > Sent: Thursday, October 27, 2005 12:47 PM
> > > Subject: Re: korean and lucene
> > >
> > >
> > > > Hello all
> > > > Plese forgive me pervious my stupid message
> > > >
> > > >      [echo] Running lia.analysis.i18n.KoreanDemo...
> > > >      [java] [경] [기]  analyzer = org.apache.lucene.analysis.standard.StandardAnalyzer
> > > >      [java] phrase = 경기
> > > >      [java] query = "경 기"
> > > >
> > > > I got the good result.
> > > >
> > > > When I compile I just rename old version lucene-1.4.3.jar to lucene-1.4.3.jar_bak
> > > > and all new 1.9 lucene. and build the test package.
> > > > After I remove lucene-1.4.3.jar_bak in lib directory completely
> > > > I got the expected result !!!.
> > > >
> > > > I don't know the reason... ( looks like my finger make some trouble... )
> > > >
> > > > Anyway thanks Koji and Cheolgoo
> > > > I will further test now...
> > > >
> > > > Youngho
> > > >
> > > >
> > > >
> > > >
> > > > ----- Original Message -----
> > > > From: "Youngho Cho" <yo...@nannet.co.kr>
> > > > To: <ja...@lucene.apache.org>
> > > > Sent: Thursday, October 27, 2005 12:28 PM
> > > > Subject: Re: korean and lucene
> > > >
> > > >
> > > > > Hello Koji
> > > > >
> > > > > Here is test result.
> > > > > Japanese is OK !.
> > > > > maybe ant clean  did some effect.
> > > > >
> > > > > Anyway please refer to the result using 1.9
> > > > >
> > > > >      [echo] Running lia.analysis.i18n.JapaneseDemo...
> > > > >      [java] [ラ] [メ] [ン] [屋]  analyzer = org.apache.lucene.analysis.standard.StandardAnalyzer
> > > > >      [java] phrase = ラ?メン屋
> > > > >      [java] query = content:ラ?メン屋
> > > > >
> > > > >     [echo] Running lia.analysis.i18n.KoreanDemo...
> > > > >      [java]  analyzer = org.apache.lucene.analysis.standard.StandardAnalyzer
> > > > >      [java] phrase = 경
> > > > >      [java] query =
> > > > >
> > > > >      [echo] Running lia.analysis.i18n.JapaneseDemo...
> > > > >      [java] [ラ] [メン] [ン屋]  analyzer = org.apache.lucene.analysis.cjk.CJKAnalyzer
> > > > >      [java] phrase = ラ?メン屋
> > > > >      [java] query = content:ラ?メン屋
> > > > >
> > > > >     [echo] Running lia.analysis.i18n.KoreanDemo...
> > > > >      [java] [경]  analyzer = org.apache.lucene.analysis.cjk.CJKAnalyzer
> > > > >      [java] phrase = 경
> > > > >      [java] query = 경
> > > > >
> > > > >      [echo] Running lia.analysis.i18n.KoreanDemo...
> > > > >      [java]  analyzer = org.apache.lucene.analysis.standard.StandardAnalyzer
> > > > >      [java] phrase = 경기
> > > > >      [java] query =
> > > > >
> > > > >      [echo] Running lia.analysis.i18n.KoreanDemo...
> > > > >      [java] [경기]  analyzer = org.apache.lucene.analysis.cjk.CJKAnalyzer
> > > > >      [java] phrase = 경기
> > > > >      [java] query = 경기
> > > > >
> > > > >
> > > > > Standard analyzer didn't tokenized the Korean Character at all....
> > > > >
> > > > > Ug....  look like
> > > > >  http://issues.apache.org/jira/browse/LUCENE-444
> > > > >  didn't effect at all for Korean.
> > > > >
> > > > >
> > > > > Thanks
> > > > >
> > > > > Youngho
> > > > >
> > > > > ----- Original Message -----
> > > > > From: "Koji Sekiguchi" <ko...@m4.dion.ne.jp>
> > > > > To: <ja...@lucene.apache.org>; "Youngho Cho" <yo...@nannet.co.kr>
> > > > > Sent: Thursday, October 27, 2005 11:47 AM
> > > > > Subject: RE: korean and lucene
> > > > >
> > > > >
> > > > > > Hello Youngho,
> > > > > >
> > > > > > I don't understand why you couldn't get hits result in Japanese,
> > > > > > though, you had better check why the query was empty with Korean data:
> > > > > >
> > > > > > > For Korean
> > > > > > >      [echo] Running lia.analysis.i18n.KoreanDemo...
> > > > > > >      [java] phrase = 경
> > > > > > >      [java] query =
> > > > > >
> > > > > > The last line should be query = 경
> > > > > > to get hits result. Can you check why StandardAnalyzer
> > > > > > removes "경" during tokenizing?
> > > > > >
> > > > > > Koji
> > > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: Youngho Cho [mailto:youngho@nannet.co.kr]
> > > > > > > Sent: Thursday, October 27, 2005 11:37 AM
> > > > > > > To: java-user@lucene.apache.org
> > > > > > > Subject: Re: korean and lucene
> > > > > > >
> > > > > > >
> > > > > > > Hello Koji,
> > > > > > >
> > > > > > > Thanks for your kind reply.
> > > > > > >
> > > > > > > Yes, I used QueryParser. normaly I used
> > > > > > > Query = QueryParser.parse( ) method.
> > > > > > >
> > > > > > > I put your sample code into lia.analysis.i18n package in LuceneAction
> > > > > > > and run JapaneseDemo using 1.4 and 1.9
> > > > > > >
> > > > > > > results are
> > > > > > >
> > > > > > >      [echo] Running lia.analysis.i18n.JapaneseDemo...
> > > > > > >      [java] query = content:ラ?メン屋
> > > > > > >
> > > > > > > I can't get hits result.
> > > > > > >
> > > > > > > For Korean
> > > > > > >      [echo] Running lia.analysis.i18n.KoreanDemo...
> > > > > > >      [java] phrase = 경
> > > > > > >      [java] query =
> > > > > > >
> > > > > > > I can't get query parse result.
> > > > > > >
> > > > > > > Thanks,
> > > > > > >
> > > > > > > Youngho
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > ----- Original Message -----
> > > > > > > From: "Koji Sekiguchi" <ko...@m4.dion.ne.jp>
> > > > > > > To: <ja...@lucene.apache.org>; "Youngho Cho" <yo...@nannet.co.kr>
> > > > > > > Sent: Thursday, October 27, 2005 9:48 AM
> > > > > > > Subject: RE: korean and lucene
> > > > > > >
> > > > > > >
> > > > > > > > Hi Youngho,
> > > > > > > >
> > > > > > > > With regard to Japanese, using StandardAnalyzer,
> > > > > > > > I can search a word/phase.
> > > > > > > >
> > > > > > > > Did you use QueryParser? StandardAnalyzer tokenizes
> > > > > > > > CJK characters into a stream of single character.
> > > > > > > > Use QueryParser to get a PhraseQuery and search the query.
> > > > > > > >
> > > > > > > > Please see the following sample code. Replace Japanese
> > > > > > > > "contents" and (search target) "phrase" with Korean in the
> > > > > > > program and run.
> > > > > > > >
> > > > > > > > regards,
> > > > > > > >
> > > > > > > > Koji
> > > > > > > >
> > > > > > > > =============================================
> > > > > > > > import java.io.IOException;
> > > > > > > > import org.apache.lucene.analysis.Analyzer;
> > > > > > > > import org.apache.lucene.analysis.standard.StandardAnalyzer;
> > > > > > > > import org.apache.lucene.analysis.cjk.CJKAnalyzer;
> > > > > > > > import org.apache.lucene.store.Directory;
> > > > > > > > import org.apache.lucene.store.RAMDirectory;
> > > > > > > > import org.apache.lucene.index.IndexWriter;
> > > > > > > > import org.apache.lucene.document.Document;
> > > > > > > > import org.apache.lucene.document.Field;
> > > > > > > > import org.apache.lucene.search.IndexSearcher;
> > > > > > > > import org.apache.lucene.search.Hits;
> > > > > > > > import org.apache.lucene.search.Query;
> > > > > > > > import org.apache.lucene.queryParser.QueryParser;
> > > > > > > > import org.apache.lucene.queryParser.ParseException;
> > > > > > > >
> > > > > > > > public class JapaneseByStandardAnalyzer {
> > > > > > > >
> > > > > > > >     private static final String FIELD_CONTENT = "content";
> > > > > > > >     private static final String[] contents = {
> > > > > > > > "東京にはおいしいラーメン屋がたくさんあります。",
> > > > > > > > "北海道にもおいしいラーメン屋があります。"
> > > > > > > >     };
> > > > > > > >     private static final String phrase = "ラーメン屋";
> > > > > > > >     //private static final String phrase = "屋";
> > > > > > > >     private static Analyzer analyzer = null;
> > > > > > > >
> > > > > > > >     public static void main( String[] args ) throws
> > > > > > > IOException, ParseException {
> > > > > > > > Directory directory = makeIndex();
> > > > > > > > search( directory );
> > > > > > > > directory.close();
> > > > > > > >     }
> > > > > > > >
> > > > > > > >     private static Analyzer getAnalyzer(){
> > > > > > > > if( analyzer == null ){
> > > > > > > >     analyzer = new StandardAnalyzer();
> > > > > > > >     //analyzer = new CJKAnalyzer();
> > > > > > > > }
> > > > > > > > return analyzer;
> > > > > > > >     }
> > > > > > > >
> > > > > > > >     private static Directory makeIndex() throws IOException {
> > > > > > > > Directory directory = new RAMDirectory();
> > > > > > > > IndexWriter writer = new IndexWriter( directory, getAnalyzer(), true );
> > > > > > > > for( int i = 0; i < contents.length; i++ ){
> > > > > > > >     Document doc = new Document();
> > > > > > > >     doc.add( new Field( FIELD_CONTENT, contents[i],
> > > > > > > Field.Store.YES, Field.Index.TOKENIZED ) );
> > > > > > > >     writer.addDocument( doc );
> > > > > > > > }
> > > > > > > > writer.close();
> > > > > > > > return directory;
> > > > > > > >     }
> > > > > > > >
> > > > > > > >     private static void search( Directory directory ) throws
> > > > > > > IOException, ParseException {
> > > > > > > > IndexSearcher searcher = new IndexSearcher( directory );
> > > > > > > > QueryParser parser = new QueryParser( FIELD_CONTENT, getAnalyzer() );
> > > > > > > > Query query = parser.parse( phrase );
> > > > > > > > System.out.println( "query = " + query );
> > > > > > > > Hits hits = searcher.search( query );
> > > > > > > > for( int i = 0; i < hits.length(); i++ )
> > > > > > > >     System.out.println( "doc = " + hits.doc( i ).get( FIELD_CONTENT ) );
> > > > > > > > searcher.close();
> > > > > > > >     }
> > > > > > > > }
> > > > > > > >
> > > > > > > >
> > > > > > > > > -----Original Message-----
> > > > > > > > > From: Youngho Cho [mailto:youngho@nannet.co.kr]
> > > > > > > > > Sent: Thursday, October 27, 2005 8:18 AM
> > > > > > > > > To: java-user@lucene.apache.org; Cheolgoo Kang
> > > > > > > > > Subject: Re: korean and lucene
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Hello Cheolgoo,
> > > > > > > > >
> > > > > > > > > Now I updated my lucene version to 1.9 for using StandardAnalyzer
> > > > > > > > > for Korean.
> > > > > > > > > And tested your patch which is already adopted in 1.9
> > > > > > > > >
> > > > > > > > > http://issues.apache.org/jira/browse/LUCENE-444
> > > > > > > > >
> > > > > > > > > But Still I have no good  results with Korean compare with
> > > > > > > CJKAnalyzer.
> > > > > > > > >
> > > > > > > > > Single character is good match but more two character word
> > > > > > > > > doesn't match at all.
> > > > > > > > >
> > > > > > > > > Am I something missing or still there need some more works ?
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > >
> > > > > > > > > Youngho.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > ----- Original Message -----
> > > > > > > > > From: "Cheolgoo Kang" <ap...@gmail.com>
> > > > > > > > > To: <ja...@lucene.apache.org>; "John Wang" <jo...@gmail.com>
> > > > > > > > > Sent: Tuesday, October 04, 2005 10:11 AM
> > > > > > > > > Subject: Re: korean and lucene
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > StandardAnalyzer's JavaCC based StandardTokenizer.jj cannot read
> > > > > > > > > > Korean part of Unicode character blocks.
> > > > > > > > > >
> > > > > > > > > > You should 1) use CJKAnalyzer or 2) add Korean character
> > > > > > > > > > block(0xAC00~0xD7AF) to the CJK token definition on the
> > > > > > > > > > StandardTokenizer.jj file.
> > > > > > > > > >
> > > > > > > > > > Hope it helps.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On 10/4/05, John Wang <jo...@gmail.com> wrote:
> > > > > > > > > > > Hi:
> > > > > > > > > > >
> > > > > > > > > > > We are running into problems with searching on korean
> > > > > > > > > documents. We are
> > > > > > > > > > > using the StandardAnalyzer and everything works with Chinese
> > > > > > > > > and Japanese.
> > > > > > > > > > > Are there known problems with Korean with Lucene?
> > > > > > > > > > >
> > > > > > > > > > > Thanks
> > > > > > > > > > >
> > > > > > > > > > > -John
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > --
> > > > > > > > > > Cheolgoo
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > ---------------------------------------------------------------------
> > > > > > > > > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > > > > > > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > ---------------------------------------------------------------------
> > > > > > > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > > > > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > ---------------------------------------------------------------------
> > > > > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > >
> > >
> > >
> > 
> > 
> > --
> > Cheolgoo

Re: korean and lucene

Posted by Youngho Cho <yo...@nannet.co.kr>.
Hello Cheolgoo,

I will test the patch.


Thanks,

Youngho

----- Original Message ----- 
From: "Cheolgoo Kang" <ap...@gmail.com>
To: <ja...@lucene.apache.org>; "Youngho Cho" <yo...@nannet.co.kr>
Sent: Tuesday, November 08, 2005 4:06 PM
Subject: Re: korean and lucene


> Hello,
> 
> I've created a new JIRA issue with Korean analysis that
> StandardAnalyzer splits one word into several tokens each with one
> character. Cause Korean is not a phonogram, one character in Korean
> has almost no meaning at all. So word in Korean should be preserved
> not like Chinese or Japanese.
> 
> I've attached a patch to StandardTokenizer to do this and passed the
> test case TestStandardAnalyzer(also patched to test Korean words).
> 
> Thanks!
> 
> On 10/27/05, Youngho Cho <yo...@nannet.co.kr> wrote:
> > Hello,
> >
> > Ok , I've attached my test code for Korean which is slitely modified Koji's code.
> >
> > Just put into the lia.analysis.i18n package at LuceneInAction
> > and run ant.
> >
> > Hopely someone is helped.
> >
> > -------- build.xml  ---------
> >
> >   <target name="JapaneseDemo" depends="prepare"
> >           description="Examples of Jananese analysis">
> >     <info>
> >
> >       Japanese Test...
> >
> >     </info>
> >
> >     <run-main class="lia.analysis.i18n.JapaneseDemo"/>
> >   </target>
> >
> >   <target name="KoreanDemo" depends="prepare"
> >           description="Examples of Korean analysis">
> >     <info>
> >
> >       Korean Test...
> >
> >     </info>
> >
> >     <run-main class="lia.analysis.i18n.KoreanDemo"/>
> >   </target>
> >
> >
> > Thanks,
> >
> > Youngho
> >
> >
> > ----- Original Message -----
> > From: "Youngho Cho" <yo...@nannet.co.kr>
> > To: <ja...@lucene.apache.org>; "Youngho Cho" <yo...@nannet.co.kr>
> > Sent: Thursday, October 27, 2005 12:47 PM
> > Subject: Re: korean and lucene
> >
> >
> > > Hello all
> > > Plese forgive me pervious my stupid message
> > >
> > >      [echo] Running lia.analysis.i18n.KoreanDemo...
> > >      [java] [경] [기]  analyzer = org.apache.lucene.analysis.standard.StandardAnalyzer
> > >      [java] phrase = 경기
> > >      [java] query = "경 기"
> > >
> > > I got the good result.
> > >
> > > When I compile I just rename old version lucene-1.4.3.jar to lucene-1.4.3.jar_bak
> > > and all new 1.9 lucene. and build the test package.
> > > After I remove lucene-1.4.3.jar_bak in lib directory completely
> > > I got the expected result !!!.
> > >
> > > I don't know the reason... ( looks like my finger make some trouble... )
> > >
> > > Anyway thanks Koji and Cheolgoo
> > > I will further test now...
> > >
> > > Youngho
> > >
> > >
> > >
> > >
> > > ----- Original Message -----
> > > From: "Youngho Cho" <yo...@nannet.co.kr>
> > > To: <ja...@lucene.apache.org>
> > > Sent: Thursday, October 27, 2005 12:28 PM
> > > Subject: Re: korean and lucene
> > >
> > >
> > > > Hello Koji
> > > >
> > > > Here is test result.
> > > > Japanese is OK !.
> > > > maybe ant clean  did some effect.
> > > >
> > > > Anyway please refer to the result using 1.9
> > > >
> > > >      [echo] Running lia.analysis.i18n.JapaneseDemo...
> > > >      [java] [ラ] [メ] [ン] [屋]  analyzer = org.apache.lucene.analysis.standard.StandardAnalyzer
> > > >      [java] phrase = ラ?メン屋
> > > >      [java] query = content:ラ?メン屋
> > > >
> > > >     [echo] Running lia.analysis.i18n.KoreanDemo...
> > > >      [java]  analyzer = org.apache.lucene.analysis.standard.StandardAnalyzer
> > > >      [java] phrase = 경
> > > >      [java] query =
> > > >
> > > >      [echo] Running lia.analysis.i18n.JapaneseDemo...
> > > >      [java] [ラ] [メン] [ン屋]  analyzer = org.apache.lucene.analysis.cjk.CJKAnalyzer
> > > >      [java] phrase = ラ?メン屋
> > > >      [java] query = content:ラ?メン屋
> > > >
> > > >     [echo] Running lia.analysis.i18n.KoreanDemo...
> > > >      [java] [경]  analyzer = org.apache.lucene.analysis.cjk.CJKAnalyzer
> > > >      [java] phrase = 경
> > > >      [java] query = 경
> > > >
> > > >      [echo] Running lia.analysis.i18n.KoreanDemo...
> > > >      [java]  analyzer = org.apache.lucene.analysis.standard.StandardAnalyzer
> > > >      [java] phrase = 경기
> > > >      [java] query =
> > > >
> > > >      [echo] Running lia.analysis.i18n.KoreanDemo...
> > > >      [java] [경기]  analyzer = org.apache.lucene.analysis.cjk.CJKAnalyzer
> > > >      [java] phrase = 경기
> > > >      [java] query = 경기
> > > >
> > > >
> > > > Standard analyzer didn't tokenized the Korean Character at all....
> > > >
> > > > Ug....  look like
> > > >  http://issues.apache.org/jira/browse/LUCENE-444
> > > >  didn't effect at all for Korean.
> > > >
> > > >
> > > > Thanks
> > > >
> > > > Youngho
> > > >
> > > > ----- Original Message -----
> > > > From: "Koji Sekiguchi" <ko...@m4.dion.ne.jp>
> > > > To: <ja...@lucene.apache.org>; "Youngho Cho" <yo...@nannet.co.kr>
> > > > Sent: Thursday, October 27, 2005 11:47 AM
> > > > Subject: RE: korean and lucene
> > > >
> > > >
> > > > > Hello Youngho,
> > > > >
> > > > > I don't understand why you couldn't get hits result in Japanese,
> > > > > though, you had better check why the query was empty with Korean data:
> > > > >
> > > > > > For Korean
> > > > > >      [echo] Running lia.analysis.i18n.KoreanDemo...
> > > > > >      [java] phrase = 경
> > > > > >      [java] query =
> > > > >
> > > > > The last line should be query = 경
> > > > > to get hits result. Can you check why StandardAnalyzer
> > > > > removes "경" during tokenizing?
> > > > >
> > > > > Koji
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: Youngho Cho [mailto:youngho@nannet.co.kr]
> > > > > > Sent: Thursday, October 27, 2005 11:37 AM
> > > > > > To: java-user@lucene.apache.org
> > > > > > Subject: Re: korean and lucene
> > > > > >
> > > > > >
> > > > > > Hello Koji,
> > > > > >
> > > > > > Thanks for your kind reply.
> > > > > >
> > > > > > Yes, I used QueryParser. normaly I used
> > > > > > Query = QueryParser.parse( ) method.
> > > > > >
> > > > > > I put your sample code into lia.analysis.i18n package in LuceneAction
> > > > > > and run JapaneseDemo using 1.4 and 1.9
> > > > > >
> > > > > > results are
> > > > > >
> > > > > >      [echo] Running lia.analysis.i18n.JapaneseDemo...
> > > > > >      [java] query = content:ラ?メン屋
> > > > > >
> > > > > > I can't get hits result.
> > > > > >
> > > > > > For Korean
> > > > > >      [echo] Running lia.analysis.i18n.KoreanDemo...
> > > > > >      [java] phrase = 경
> > > > > >      [java] query =
> > > > > >
> > > > > > I can't get query parse result.
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > Youngho
> > > > > >
> > > > > >
> > > > > >
> > > > > > ----- Original Message -----
> > > > > > From: "Koji Sekiguchi" <ko...@m4.dion.ne.jp>
> > > > > > To: <ja...@lucene.apache.org>; "Youngho Cho" <yo...@nannet.co.kr>
> > > > > > Sent: Thursday, October 27, 2005 9:48 AM
> > > > > > Subject: RE: korean and lucene
> > > > > >
> > > > > >
> > > > > > > Hi Youngho,
> > > > > > >
> > > > > > > With regard to Japanese, using StandardAnalyzer,
> > > > > > > I can search a word/phase.
> > > > > > >
> > > > > > > Did you use QueryParser? StandardAnalyzer tokenizes
> > > > > > > CJK characters into a stream of single character.
> > > > > > > Use QueryParser to get a PhraseQuery and search the query.
> > > > > > >
> > > > > > > Please see the following sample code. Replace Japanese
> > > > > > > "contents" and (search target) "phrase" with Korean in the
> > > > > > program and run.
> > > > > > >
> > > > > > > regards,
> > > > > > >
> > > > > > > Koji
> > > > > > >
> > > > > > > =============================================
> > > > > > > import java.io.IOException;
> > > > > > > import org.apache.lucene.analysis.Analyzer;
> > > > > > > import org.apache.lucene.analysis.standard.StandardAnalyzer;
> > > > > > > import org.apache.lucene.analysis.cjk.CJKAnalyzer;
> > > > > > > import org.apache.lucene.store.Directory;
> > > > > > > import org.apache.lucene.store.RAMDirectory;
> > > > > > > import org.apache.lucene.index.IndexWriter;
> > > > > > > import org.apache.lucene.document.Document;
> > > > > > > import org.apache.lucene.document.Field;
> > > > > > > import org.apache.lucene.search.IndexSearcher;
> > > > > > > import org.apache.lucene.search.Hits;
> > > > > > > import org.apache.lucene.search.Query;
> > > > > > > import org.apache.lucene.queryParser.QueryParser;
> > > > > > > import org.apache.lucene.queryParser.ParseException;
> > > > > > >
> > > > > > > public class JapaneseByStandardAnalyzer {
> > > > > > >
> > > > > > >     private static final String FIELD_CONTENT = "content";
> > > > > > >     private static final String[] contents = {
> > > > > > > "東京にはおいしいラーメン屋がたくさんあります。",
> > > > > > > "北海道にもおいしいラーメン屋があります。"
> > > > > > >     };
> > > > > > >     private static final String phrase = "ラーメン屋";
> > > > > > >     //private static final String phrase = "屋";
> > > > > > >     private static Analyzer analyzer = null;
> > > > > > >
> > > > > > >     public static void main( String[] args ) throws
> > > > > > IOException, ParseException {
> > > > > > > Directory directory = makeIndex();
> > > > > > > search( directory );
> > > > > > > directory.close();
> > > > > > >     }
> > > > > > >
> > > > > > >     private static Analyzer getAnalyzer(){
> > > > > > > if( analyzer == null ){
> > > > > > >     analyzer = new StandardAnalyzer();
> > > > > > >     //analyzer = new CJKAnalyzer();
> > > > > > > }
> > > > > > > return analyzer;
> > > > > > >     }
> > > > > > >
> > > > > > >     private static Directory makeIndex() throws IOException {
> > > > > > > Directory directory = new RAMDirectory();
> > > > > > > IndexWriter writer = new IndexWriter( directory, getAnalyzer(), true );
> > > > > > > for( int i = 0; i < contents.length; i++ ){
> > > > > > >     Document doc = new Document();
> > > > > > >     doc.add( new Field( FIELD_CONTENT, contents[i],
> > > > > > Field.Store.YES, Field.Index.TOKENIZED ) );
> > > > > > >     writer.addDocument( doc );
> > > > > > > }
> > > > > > > writer.close();
> > > > > > > return directory;
> > > > > > >     }
> > > > > > >
> > > > > > >     private static void search( Directory directory ) throws
> > > > > > IOException, ParseException {
> > > > > > > IndexSearcher searcher = new IndexSearcher( directory );
> > > > > > > QueryParser parser = new QueryParser( FIELD_CONTENT, getAnalyzer() );
> > > > > > > Query query = parser.parse( phrase );
> > > > > > > System.out.println( "query = " + query );
> > > > > > > Hits hits = searcher.search( query );
> > > > > > > for( int i = 0; i < hits.length(); i++ )
> > > > > > >     System.out.println( "doc = " + hits.doc( i ).get( FIELD_CONTENT ) );
> > > > > > > searcher.close();
> > > > > > >     }
> > > > > > > }
> > > > > > >
> > > > > > >
> > > > > > > > -----Original Message-----
> > > > > > > > From: Youngho Cho [mailto:youngho@nannet.co.kr]
> > > > > > > > Sent: Thursday, October 27, 2005 8:18 AM
> > > > > > > > To: java-user@lucene.apache.org; Cheolgoo Kang
> > > > > > > > Subject: Re: korean and lucene
> > > > > > > >
> > > > > > > >
> > > > > > > > Hello Cheolgoo,
> > > > > > > >
> > > > > > > > Now I updated my lucene version to 1.9 for using StandardAnalyzer
> > > > > > > > for Korean.
> > > > > > > > And tested your patch which is already adopted in 1.9
> > > > > > > >
> > > > > > > > http://issues.apache.org/jira/browse/LUCENE-444
> > > > > > > >
> > > > > > > > But Still I have no good  results with Korean compare with
> > > > > > CJKAnalyzer.
> > > > > > > >
> > > > > > > > Single character is good match but more two character word
> > > > > > > > doesn't match at all.
> > > > > > > >
> > > > > > > > Am I something missing or still there need some more works ?
> > > > > > > >
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > >
> > > > > > > > Youngho.
> > > > > > > >
> > > > > > > >
> > > > > > > > ----- Original Message -----
> > > > > > > > From: "Cheolgoo Kang" <ap...@gmail.com>
> > > > > > > > To: <ja...@lucene.apache.org>; "John Wang" <jo...@gmail.com>
> > > > > > > > Sent: Tuesday, October 04, 2005 10:11 AM
> > > > > > > > Subject: Re: korean and lucene
> > > > > > > >
> > > > > > > >
> > > > > > > > > StandardAnalyzer's JavaCC based StandardTokenizer.jj cannot read
> > > > > > > > > Korean part of Unicode character blocks.
> > > > > > > > >
> > > > > > > > > You should 1) use CJKAnalyzer or 2) add Korean character
> > > > > > > > > block(0xAC00~0xD7AF) to the CJK token definition on the
> > > > > > > > > StandardTokenizer.jj file.
> > > > > > > > >
> > > > > > > > > Hope it helps.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On 10/4/05, John Wang <jo...@gmail.com> wrote:
> > > > > > > > > > Hi:
> > > > > > > > > >
> > > > > > > > > > We are running into problems with searching on korean
> > > > > > > > documents. We are
> > > > > > > > > > using the StandardAnalyzer and everything works with Chinese
> > > > > > > > and Japanese.
> > > > > > > > > > Are there known problems with Korean with Lucene?
> > > > > > > > > >
> > > > > > > > > > Thanks
> > > > > > > > > >
> > > > > > > > > > -John
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > Cheolgoo
> > > > > > > > >
> > > > > > > > >
> > > > > > ---------------------------------------------------------------------
> > > > > > > > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > > > > > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > ---------------------------------------------------------------------
> > > > > > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > > > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > > ---------------------------------------------------------------------
> > > > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
> >
> >
> 
> 
> --
> Cheolgoo

Re: korean and lucene

Posted by Cheolgoo Kang <ap...@gmail.com>.
Hello,

I've created a new JIRA issue about Korean analysis: StandardAnalyzer
splits one Korean word into several tokens of one character each.
Because Korean is not written with ideographic characters, a single
Korean character carries almost no meaning on its own, so a Korean word
should be preserved whole, unlike Chinese or Japanese.

I've attached a patch to StandardTokenizer that does this; it passes
the test case TestStandardAnalyzer (also patched to cover Korean words).
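
Not the attached patch or test itself, but a rough sketch of the kind of
check involved, with made-up example words (assuming a JUnit 3 style
TestCase, as the Lucene tests of that time use):

import java.io.IOException;
import java.io.StringReader;
import junit.framework.TestCase;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class KoreanTokenizationSketch extends TestCase {

    // A whole Korean word should come through StandardAnalyzer as one token.
    public void testKoreanWordIsOneToken() throws IOException {
        TokenStream stream = new StandardAnalyzer().tokenStream(
                "content", new StringReader("한글 분석"));
        Token first = stream.next();
        assertEquals("한글", first.termText());
        Token second = stream.next();
        assertEquals("분석", second.termText());
        assertNull(stream.next());
    }
}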

Thanks!

On 10/27/05, Youngho Cho <yo...@nannet.co.kr> wrote:
> Hello,
>
> Ok , I've attached my test code for Korean which is slitely modified Koji's code.
>
> Just put into the lia.analysis.i18n package at LuceneInAction
> and run ant.
>
> Hopely someone is helped.
>
> -------- build.xml  ---------
>
>   <target name="JapaneseDemo" depends="prepare"
>           description="Examples of Jananese analysis">
>     <info>
>
>       Japanese Test...
>
>     </info>
>
>     <run-main class="lia.analysis.i18n.JapaneseDemo"/>
>   </target>
>
>   <target name="KoreanDemo" depends="prepare"
>           description="Examples of Korean analysis">
>     <info>
>
>       Korean Test...
>
>     </info>
>
>     <run-main class="lia.analysis.i18n.KoreanDemo"/>
>   </target>
>
>
> Thanks,
>
> Youngho
>
>
> ----- Original Message -----
> From: "Youngho Cho" <yo...@nannet.co.kr>
> To: <ja...@lucene.apache.org>; "Youngho Cho" <yo...@nannet.co.kr>
> Sent: Thursday, October 27, 2005 12:47 PM
> Subject: Re: korean and lucene
>
>
> > Hello all
> > Plese forgive me pervious my stupid message
> >
> >      [echo] Running lia.analysis.i18n.KoreanDemo...
> >      [java] [경] [기]  analyzer = org.apache.lucene.analysis.standard.StandardAnalyzer
> >      [java] phrase = 경기
> >      [java] query = "경 기"
> >
> > I got the good result.
> >
> > When I compile I just rename old version lucene-1.4.3.jar to lucene-1.4.3.jar_bak
> > and all new 1.9 lucene. and build the test package.
> > After I remove lucene-1.4.3.jar_bak in lib directory completely
> > I got the expected result !!!.
> >
> > I don't know the reason... ( looks like my finger make some trouble... )
> >
> > Anyway thanks Koji and Cheolgoo
> > I will further test now...
> >
> > Youngho
> >
> >
> >
> >
> > ----- Original Message -----
> > From: "Youngho Cho" <yo...@nannet.co.kr>
> > To: <ja...@lucene.apache.org>
> > Sent: Thursday, October 27, 2005 12:28 PM
> > Subject: Re: korean and lucene
> >
> >
> > > Hello Koji
> > >
> > > Here is test result.
> > > Japanese is OK !.
> > > maybe ant clean  did some effect.
> > >
> > > Anyway please refer to the result using 1.9
> > >
> > >      [echo] Running lia.analysis.i18n.JapaneseDemo...
> > >      [java] [ラ] [メ] [ン] [屋]  analyzer = org.apache.lucene.analysis.standard.StandardAnalyzer
> > >      [java] phrase = ラ?メン屋
> > >      [java] query = content:ラ?メン屋
> > >
> > >     [echo] Running lia.analysis.i18n.KoreanDemo...
> > >      [java]  analyzer = org.apache.lucene.analysis.standard.StandardAnalyzer
> > >      [java] phrase = 경
> > >      [java] query =
> > >
> > >      [echo] Running lia.analysis.i18n.JapaneseDemo...
> > >      [java] [ラ] [メン] [ン屋]  analyzer = org.apache.lucene.analysis.cjk.CJKAnalyzer
> > >      [java] phrase = ラ?メン屋
> > >      [java] query = content:ラ?メン屋
> > >
> > >     [echo] Running lia.analysis.i18n.KoreanDemo...
> > >      [java] [경]  analyzer = org.apache.lucene.analysis.cjk.CJKAnalyzer
> > >      [java] phrase = 경
> > >      [java] query = 경
> > >
> > >      [echo] Running lia.analysis.i18n.KoreanDemo...
> > >      [java]  analyzer = org.apache.lucene.analysis.standard.StandardAnalyzer
> > >      [java] phrase = 경기
> > >      [java] query =
> > >
> > >      [echo] Running lia.analysis.i18n.KoreanDemo...
> > >      [java] [경기]  analyzer = org.apache.lucene.analysis.cjk.CJKAnalyzer
> > >      [java] phrase = 경기
> > >      [java] query = 경기
> > >
> > >
> > > Standard analyzer didn't tokenized the Korean Character at all....
> > >
> > > Ug....  look like
> > >  http://issues.apache.org/jira/browse/LUCENE-444
> > >  didn't effect at all for Korean.
> > >
> > >
> > > Thanks
> > >
> > > Youngho
> > >
> > > ----- Original Message -----
> > > From: "Koji Sekiguchi" <ko...@m4.dion.ne.jp>
> > > To: <ja...@lucene.apache.org>; "Youngho Cho" <yo...@nannet.co.kr>
> > > Sent: Thursday, October 27, 2005 11:47 AM
> > > Subject: RE: korean and lucene
> > >
> > >
> > > > Hello Youngho,
> > > >
> > > > I don't understand why you couldn't get hits result in Japanese,
> > > > though, you had better check why the query was empty with Korean data:
> > > >
> > > > > For Korean
> > > > >      [echo] Running lia.analysis.i18n.KoreanDemo...
> > > > >      [java] phrase = 경
> > > > >      [java] query =
> > > >
> > > > The last line should be query = 경
> > > > to get hits result. Can you check why StandardAnalyzer
> > > > removes "경" during tokenizing?
> > > >
> > > > Koji
> > > >
> > > > > -----Original Message-----
> > > > > From: Youngho Cho [mailto:youngho@nannet.co.kr]
> > > > > Sent: Thursday, October 27, 2005 11:37 AM
> > > > > To: java-user@lucene.apache.org
> > > > > Subject: Re: korean and lucene
> > > > >
> > > > >
> > > > > Hello Koji,
> > > > >
> > > > > Thanks for your kind reply.
> > > > >
> > > > > Yes, I used QueryParser. normaly I used
> > > > > Query = QueryParser.parse( ) method.
> > > > >
> > > > > I put your sample code into lia.analysis.i18n package in LuceneAction
> > > > > and run JapaneseDemo using 1.4 and 1.9
> > > > >
> > > > > results are
> > > > >
> > > > >      [echo] Running lia.analysis.i18n.JapaneseDemo...
> > > > >      [java] query = content:ラ?メン屋
> > > > >
> > > > > I can't get hits result.
> > > > >
> > > > > For Korean
> > > > >      [echo] Running lia.analysis.i18n.KoreanDemo...
> > > > >      [java] phrase = 경
> > > > >      [java] query =
> > > > >
> > > > > I can't get query parse result.
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Youngho
> > > > >
> > > > >
> > > > >
> > > > > ----- Original Message -----
> > > > > From: "Koji Sekiguchi" <ko...@m4.dion.ne.jp>
> > > > > To: <ja...@lucene.apache.org>; "Youngho Cho" <yo...@nannet.co.kr>
> > > > > Sent: Thursday, October 27, 2005 9:48 AM
> > > > > Subject: RE: korean and lucene
> > > > >
> > > > >
> > > > > > Hi Youngho,
> > > > > >
> > > > > > With regard to Japanese, using StandardAnalyzer,
> > > > > > I can search a word/phase.
> > > > > >
> > > > > > Did you use QueryParser? StandardAnalyzer tokenizes
> > > > > > CJK characters into a stream of single character.
> > > > > > Use QueryParser to get a PhraseQuery and search the query.
> > > > > >
> > > > > > Please see the following sample code. Replace Japanese
> > > > > > "contents" and (search target) "phrase" with Korean in the
> > > > > program and run.
> > > > > >
> > > > > > regards,
> > > > > >
> > > > > > Koji
> > > > > >
> > > > > > =============================================
> > > > > > import java.io.IOException;
> > > > > > import org.apache.lucene.analysis.Analyzer;
> > > > > > import org.apache.lucene.analysis.standard.StandardAnalyzer;
> > > > > > import org.apache.lucene.analysis.cjk.CJKAnalyzer;
> > > > > > import org.apache.lucene.store.Directory;
> > > > > > import org.apache.lucene.store.RAMDirectory;
> > > > > > import org.apache.lucene.index.IndexWriter;
> > > > > > import org.apache.lucene.document.Document;
> > > > > > import org.apache.lucene.document.Field;
> > > > > > import org.apache.lucene.search.IndexSearcher;
> > > > > > import org.apache.lucene.search.Hits;
> > > > > > import org.apache.lucene.search.Query;
> > > > > > import org.apache.lucene.queryParser.QueryParser;
> > > > > > import org.apache.lucene.queryParser.ParseException;
> > > > > >
> > > > > > public class JapaneseByStandardAnalyzer {
> > > > > >
> > > > > >     private static final String FIELD_CONTENT = "content";
> > > > > >     private static final String[] contents = {
> > > > > > "東京にはおいしいラーメン屋がたくさんあります。",
> > > > > > "北海道にもおいしいラーメン屋があります。"
> > > > > >     };
> > > > > >     private static final String phrase = "ラーメン屋";
> > > > > >     //private static final String phrase = "屋";
> > > > > >     private static Analyzer analyzer = null;
> > > > > >
> > > > > >     public static void main( String[] args ) throws
> > > > > IOException, ParseException {
> > > > > > Directory directory = makeIndex();
> > > > > > search( directory );
> > > > > > directory.close();
> > > > > >     }
> > > > > >
> > > > > >     private static Analyzer getAnalyzer(){
> > > > > > if( analyzer == null ){
> > > > > >     analyzer = new StandardAnalyzer();
> > > > > >     //analyzer = new CJKAnalyzer();
> > > > > > }
> > > > > > return analyzer;
> > > > > >     }
> > > > > >
> > > > > >     private static Directory makeIndex() throws IOException {
> > > > > > Directory directory = new RAMDirectory();
> > > > > > IndexWriter writer = new IndexWriter( directory, getAnalyzer(), true );
> > > > > > for( int i = 0; i < contents.length; i++ ){
> > > > > >     Document doc = new Document();
> > > > > >     doc.add( new Field( FIELD_CONTENT, contents[i],
> > > > > Field.Store.YES, Field.Index.TOKENIZED ) );
> > > > > >     writer.addDocument( doc );
> > > > > > }
> > > > > > writer.close();
> > > > > > return directory;
> > > > > >     }
> > > > > >
> > > > > >     private static void search( Directory directory ) throws
> > > > > IOException, ParseException {
> > > > > > IndexSearcher searcher = new IndexSearcher( directory );
> > > > > > QueryParser parser = new QueryParser( FIELD_CONTENT, getAnalyzer() );
> > > > > > Query query = parser.parse( phrase );
> > > > > > System.out.println( "query = " + query );
> > > > > > Hits hits = searcher.search( query );
> > > > > > for( int i = 0; i < hits.length(); i++ )
> > > > > >     System.out.println( "doc = " + hits.doc( i ).get( FIELD_CONTENT ) );
> > > > > > searcher.close();
> > > > > >     }
> > > > > > }
> > > > > >
> > > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: Youngho Cho [mailto:youngho@nannet.co.kr]
> > > > > > > Sent: Thursday, October 27, 2005 8:18 AM
> > > > > > > To: java-user@lucene.apache.org; Cheolgoo Kang
> > > > > > > Subject: Re: korean and lucene
> > > > > > >
> > > > > > >
> > > > > > > Hello Cheolgoo,
> > > > > > >
> > > > > > > Now I updated my lucene version to 1.9 for using StandardAnalyzer
> > > > > > > for Korean.
> > > > > > > And tested your patch which is already adopted in 1.9
> > > > > > >
> > > > > > > http://issues.apache.org/jira/browse/LUCENE-444
> > > > > > >
> > > > > > > But Still I have no good  results with Korean compare with
> > > > > CJKAnalyzer.
> > > > > > >
> > > > > > > Single character is good match but more two character word
> > > > > > > doesn't match at all.
> > > > > > >
> > > > > > > Am I something missing or still there need some more works ?
> > > > > > >
> > > > > > >
> > > > > > > Thanks,
> > > > > > >
> > > > > > > Youngho.
> > > > > > >
> > > > > > >
> > > > > > > ----- Original Message -----
> > > > > > > From: "Cheolgoo Kang" <ap...@gmail.com>
> > > > > > > To: <ja...@lucene.apache.org>; "John Wang" <jo...@gmail.com>
> > > > > > > Sent: Tuesday, October 04, 2005 10:11 AM
> > > > > > > Subject: Re: korean and lucene
> > > > > > >
> > > > > > >
> > > > > > > > StandardAnalyzer's JavaCC based StandardTokenizer.jj cannot read
> > > > > > > > Korean part of Unicode character blocks.
> > > > > > > >
> > > > > > > > You should 1) use CJKAnalyzer or 2) add Korean character
> > > > > > > > block(0xAC00~0xD7AF) to the CJK token definition on the
> > > > > > > > StandardTokenizer.jj file.
> > > > > > > >
> > > > > > > > Hope it helps.
> > > > > > > >
> > > > > > > >
> > > > > > > > On 10/4/05, John Wang <jo...@gmail.com> wrote:
> > > > > > > > > Hi:
> > > > > > > > >
> > > > > > > > > We are running into problems with searching on korean
> > > > > > > documents. We are
> > > > > > > > > using the StandardAnalyzer and everything works with Chinese
> > > > > > > and Japanese.
> > > > > > > > > Are there known problems with Korean with Lucene?
> > > > > > > > >
> > > > > > > > > Thanks
> > > > > > > > >
> > > > > > > > > -John
> > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > > Cheolgoo
> > > > > > > >
> > > > > > > >
> > > > > ---------------------------------------------------------------------
> > > > > > > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > > > > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > > > > >
> > > > > >
> > > > > >
> > > > > > ---------------------------------------------------------------------
> > > > > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > > > >
> > > >
> > > >
> > > >
> > > > ---------------------------------------------------------------------
> > > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > For additional commands, e-mail: java-user-help@lucene.apache.org
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
>


--
Cheolgoo

Re: korean and lucene

Posted by Youngho Cho <yo...@nannet.co.kr>.
Hello,

OK, I've attached my test code for Korean, which is a slightly modified version of Koji's code.

Just put it into the lia.analysis.i18n package of Lucene in Action
and run ant.

Hopefully it helps someone.

-------- build.xml  ---------

  <target name="JapaneseDemo" depends="prepare"
          description="Examples of Jananese analysis">
    <info>

      Japanese Test...
   
    </info>

    <run-main class="lia.analysis.i18n.JapaneseDemo"/>
  </target>  

  <target name="KoreanDemo" depends="prepare"
          description="Examples of Korean analysis">
    <info>

      Korean Test...
   
    </info>

    <run-main class="lia.analysis.i18n.KoreanDemo"/>
  </target>  
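
The attached Java sources themselves are not preserved in this archive. As a rough
guide, here is a minimal sketch of what KoreanDemo might look like, adapted from
Koji's sample further down this thread; the class layout, test phrase, and output
format are illustrative assumptions, not the original attachment.

-------- KoreanDemo.java (sketch)  ---------

package lia.analysis.i18n;

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

public class KoreanDemo {

    private static final String FIELD = "content";
    private static final String phrase = "경기";

    public static void main(String[] args) throws IOException, ParseException {
        Analyzer analyzer = new StandardAnalyzer();

        // Dump the tokens the analyzer produces for the Korean phrase.
        TokenStream stream = analyzer.tokenStream(FIELD, new StringReader(phrase));
        for (Token token = stream.next(); token != null; token = stream.next()) {
            System.out.print("[" + token.termText() + "] ");
        }
        stream.close();
        System.out.println(" analyzer = " + analyzer.getClass().getName());
        System.out.println("phrase = " + phrase);

        // Parse the same phrase, so a multi-character word shows up as a PhraseQuery.
        QueryParser parser = new QueryParser(FIELD, analyzer);
        Query query = parser.parse(phrase);
        System.out.println("query = " + query.toString(FIELD));
    }
}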


Thanks,

Youngho


----- Original Message ----- 
From: "Youngho Cho" <yo...@nannet.co.kr>
To: <ja...@lucene.apache.org>; "Youngho Cho" <yo...@nannet.co.kr>
Sent: Thursday, October 27, 2005 12:47 PM
Subject: Re: korean and lucene


> Hello all
> Plese forgive me pervious my stupid message
> 
>      [echo] Running lia.analysis.i18n.KoreanDemo...
>      [java] [경] [기]  analyzer = org.apache.lucene.analysis.standard.StandardAnalyzer
>      [java] phrase = 경기
>      [java] query = "경 기"
> 
> I got the good result.
> 
> When I compile I just rename old version lucene-1.4.3.jar to lucene-1.4.3.jar_bak
> and all new 1.9 lucene. and build the test package.
> After I remove lucene-1.4.3.jar_bak in lib directory completely
> I got the expected result !!!.
> 
> I don't know the reason... ( looks like my finger make some trouble... )
> 
> Anyway thanks Koji and Cheolgoo
> I will further test now...
> 
> Youngho
> 
> 
> 
> 
> ----- Original Message ----- 
> From: "Youngho Cho" <yo...@nannet.co.kr>
> To: <ja...@lucene.apache.org>
> Sent: Thursday, October 27, 2005 12:28 PM
> Subject: Re: korean and lucene
> 
> 
> > Hello Koji
> > 
> > Here is test result.
> > Japanese is OK !.
> > maybe ant clean  did some effect.
> > 
> > Anyway please refer to the result using 1.9
> > 
> >      [echo] Running lia.analysis.i18n.JapaneseDemo...
> >      [java] [ラ] [メ] [ン] [屋]  analyzer = org.apache.lucene.analysis.standard.StandardAnalyzer
> >      [java] phrase = ラ?メン屋
> >      [java] query = content:ラ?メン屋
> >   
> >     [echo] Running lia.analysis.i18n.KoreanDemo...
> >      [java]  analyzer = org.apache.lucene.analysis.standard.StandardAnalyzer
> >      [java] phrase = 경
> >      [java] query =  
> >   
> >      [echo] Running lia.analysis.i18n.JapaneseDemo...
> >      [java] [ラ] [メン] [ン屋]  analyzer = org.apache.lucene.analysis.cjk.CJKAnalyzer
> >      [java] phrase = ラ?メン屋
> >      [java] query = content:ラ?メン屋
> >   
> >     [echo] Running lia.analysis.i18n.KoreanDemo...
> >      [java] [경]  analyzer = org.apache.lucene.analysis.cjk.CJKAnalyzer
> >      [java] phrase = 경
> >      [java] query = 경 
> > 
> >      [echo] Running lia.analysis.i18n.KoreanDemo...
> >      [java]  analyzer = org.apache.lucene.analysis.standard.StandardAnalyzer
> >      [java] phrase = 경기
> >      [java] query = 
> >   
> >      [echo] Running lia.analysis.i18n.KoreanDemo...
> >      [java] [경기]  analyzer = org.apache.lucene.analysis.cjk.CJKAnalyzer
> >      [java] phrase = 경기
> >      [java] query = 경기
> >   
> > 
> > Standard analyzer didn't tokenized the Korean Character at all....
> > 
> > Ug....  look like 
> >  http://issues.apache.org/jira/browse/LUCENE-444
> >  didn't effect at all for Korean.
> > 
> > 
> > Thanks 
> > 
> > Youngho
> > 
> > ----- Original Message ----- 
> > From: "Koji Sekiguchi" <ko...@m4.dion.ne.jp>
> > To: <ja...@lucene.apache.org>; "Youngho Cho" <yo...@nannet.co.kr>
> > Sent: Thursday, October 27, 2005 11:47 AM
> > Subject: RE: korean and lucene
> > 
> > 
> > > Hello Youngho,
> > > 
> > > I don't understand why you couldn't get hits result in Japanese,
> > > though, you had better check why the query was empty with Korean data:
> > > 
> > > > For Korean
> > > >      [echo] Running lia.analysis.i18n.KoreanDemo...
> > > >      [java] phrase = 경
> > > >      [java] query = 
> > > 
> > > The last line should be query = 경
> > > to get hits result. Can you check why StandardAnalyzer
> > > removes "경" during tokenizing?
> > > 
> > > Koji
> > > 
> > > > -----Original Message-----
> > > > From: Youngho Cho [mailto:youngho@nannet.co.kr]
> > > > Sent: Thursday, October 27, 2005 11:37 AM
> > > > To: java-user@lucene.apache.org
> > > > Subject: Re: korean and lucene
> > > > 
> > > > 
> > > > Hello Koji,
> > > > 
> > > > Thanks for your kind reply.
> > > > 
> > > > Yes, I used QueryParser. normaly I used
> > > > Query = QueryParser.parse( ) method.
> > > > 
> > > > I put your sample code into lia.analysis.i18n package in LuceneAction
> > > > and run JapaneseDemo using 1.4 and 1.9 
> > > > 
> > > > results are 
> > > > 
> > > >      [echo] Running lia.analysis.i18n.JapaneseDemo...
> > > >      [java] query = content:ラ?メン屋
> > > > 
> > > > I can't get hits result.
> > > > 
> > > > For Korean
> > > >      [echo] Running lia.analysis.i18n.KoreanDemo...
> > > >      [java] phrase = 경
> > > >      [java] query = 
> > > > 
> > > > I can't get query parse result.
> > > > 
> > > > Thanks,
> > > > 
> > > > Youngho
> > > > 
> > > > 
> > > > 
> > > > ----- Original Message ----- 
> > > > From: "Koji Sekiguchi" <ko...@m4.dion.ne.jp>
> > > > To: <ja...@lucene.apache.org>; "Youngho Cho" <yo...@nannet.co.kr>
> > > > Sent: Thursday, October 27, 2005 9:48 AM
> > > > Subject: RE: korean and lucene
> > > > 
> > > > 
> > > > > Hi Youngho,
> > > > > 
> > > > > With regard to Japanese, using StandardAnalyzer,
> > > > > I can search a word/phase.
> > > > > 
> > > > > Did you use QueryParser? StandardAnalyzer tokenizes
> > > > > CJK characters into a stream of single character.
> > > > > Use QueryParser to get a PhraseQuery and search the query.
> > > > > 
> > > > > Please see the following sample code. Replace Japanese
> > > > > "contents" and (search target) "phrase" with Korean in the 
> > > > program and run.
> > > > > 
> > > > > regards,
> > > > > 
> > > > > Koji
> > > > > 
> > > > > =============================================
> > > > > import java.io.IOException;
> > > > > import org.apache.lucene.analysis.Analyzer;
> > > > > import org.apache.lucene.analysis.standard.StandardAnalyzer;
> > > > > import org.apache.lucene.analysis.cjk.CJKAnalyzer;
> > > > > import org.apache.lucene.store.Directory;
> > > > > import org.apache.lucene.store.RAMDirectory;
> > > > > import org.apache.lucene.index.IndexWriter;
> > > > > import org.apache.lucene.document.Document;
> > > > > import org.apache.lucene.document.Field;
> > > > > import org.apache.lucene.search.IndexSearcher;
> > > > > import org.apache.lucene.search.Hits;
> > > > > import org.apache.lucene.search.Query;
> > > > > import org.apache.lucene.queryParser.QueryParser;
> > > > > import org.apache.lucene.queryParser.ParseException;
> > > > > 
> > > > > public class JapaneseByStandardAnalyzer {
> > > > > 
> > > > >     private static final String FIELD_CONTENT = "content";
> > > > >     private static final String[] contents = {
> > > > > "東京にはおいしいラーメン屋がたくさんあります。",
> > > > > "北海道にもおいしいラーメン屋があります。"
> > > > >     };
> > > > >     private static final String phrase = "ラーメン屋";
> > > > >     //private static final String phrase = "屋";
> > > > >     private static Analyzer analyzer = null;
> > > > > 
> > > > >     public static void main( String[] args ) throws 
> > > > IOException, ParseException {
> > > > > Directory directory = makeIndex();
> > > > > search( directory );
> > > > > directory.close();
> > > > >     }
> > > > > 
> > > > >     private static Analyzer getAnalyzer(){
> > > > > if( analyzer == null ){
> > > > >     analyzer = new StandardAnalyzer();
> > > > >     //analyzer = new CJKAnalyzer();
> > > > > }
> > > > > return analyzer;
> > > > >     }
> > > > > 
> > > > >     private static Directory makeIndex() throws IOException {
> > > > > Directory directory = new RAMDirectory();
> > > > > IndexWriter writer = new IndexWriter( directory, getAnalyzer(), true );
> > > > > for( int i = 0; i < contents.length; i++ ){
> > > > >     Document doc = new Document();
> > > > >     doc.add( new Field( FIELD_CONTENT, contents[i], 
> > > > Field.Store.YES, Field.Index.TOKENIZED ) );
> > > > >     writer.addDocument( doc );
> > > > > }
> > > > > writer.close();
> > > > > return directory;
> > > > >     }
> > > > > 
> > > > >     private static void search( Directory directory ) throws 
> > > > IOException, ParseException {
> > > > > IndexSearcher searcher = new IndexSearcher( directory );
> > > > > QueryParser parser = new QueryParser( FIELD_CONTENT, getAnalyzer() );
> > > > > Query query = parser.parse( phrase );
> > > > > System.out.println( "query = " + query );
> > > > > Hits hits = searcher.search( query );
> > > > > for( int i = 0; i < hits.length(); i++ )
> > > > >     System.out.println( "doc = " + hits.doc( i ).get( FIELD_CONTENT ) );
> > > > > searcher.close();
> > > > >     }
> > > > > }
> > > > > 
> > > > > 
> > > > > > -----Original Message-----
> > > > > > From: Youngho Cho [mailto:youngho@nannet.co.kr]
> > > > > > Sent: Thursday, October 27, 2005 8:18 AM
> > > > > > To: java-user@lucene.apache.org; Cheolgoo Kang
> > > > > > Subject: Re: korean and lucene
> > > > > > 
> > > > > > 
> > > > > > Hello Cheolgoo,
> > > > > > 
> > > > > > Now I updated my lucene version to 1.9 for using StandardAnalyzer 
> > > > > > for Korean.
> > > > > > And tested your patch which is already adopted in 1.9
> > > > > > 
> > > > > > http://issues.apache.org/jira/browse/LUCENE-444
> > > > > > 
> > > > > > But Still I have no good  results with Korean compare with 
> > > > CJKAnalyzer.
> > > > > > 
> > > > > > Single character is good match but more two character word 
> > > > > > doesn't match at all.
> > > > > > 
> > > > > > Am I something missing or still there need some more works ?
> > > > > > 
> > > > > > 
> > > > > > Thanks,
> > > > > > 
> > > > > > Youngho.
> > > > > >  
> > > > > > 
> > > > > > ----- Original Message ----- 
> > > > > > From: "Cheolgoo Kang" <ap...@gmail.com>
> > > > > > To: <ja...@lucene.apache.org>; "John Wang" <jo...@gmail.com>
> > > > > > Sent: Tuesday, October 04, 2005 10:11 AM
> > > > > > Subject: Re: korean and lucene
> > > > > > 
> > > > > > 
> > > > > > > StandardAnalyzer's JavaCC based StandardTokenizer.jj cannot read
> > > > > > > Korean part of Unicode character blocks.
> > > > > > > 
> > > > > > > You should 1) use CJKAnalyzer or 2) add Korean character
> > > > > > > block(0xAC00~0xD7AF) to the CJK token definition on the
> > > > > > > StandardTokenizer.jj file.
> > > > > > > 
> > > > > > > Hope it helps.
> > > > > > > 
> > > > > > > 
> > > > > > > On 10/4/05, John Wang <jo...@gmail.com> wrote:
> > > > > > > > Hi:
> > > > > > > >
> > > > > > > > We are running into problems with searching on korean 
> > > > > > documents. We are
> > > > > > > > using the StandardAnalyzer and everything works with Chinese 
> > > > > > and Japanese.
> > > > > > > > Are there known problems with Korean with Lucene?
> > > > > > > >
> > > > > > > > Thanks
> > > > > > > >
> > > > > > > > -John
> > > > > > > >
> > > > > > > >
> > > > > > > 
> > > > > > > 
> > > > > > > --
> > > > > > > Cheolgoo
> > > > > > > 
> > > > > > > 
> > > > ---------------------------------------------------------------------
> > > > > > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > > > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > > > > 
> > > > > 
> > > > > 
> > > > > ---------------------------------------------------------------------
> > > > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > > > 
> > > 
> > > 
> > > 
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > For additional commands, e-mail: java-user-help@lucene.apache.org

Re: korean and lucene

Posted by Youngho Cho <yo...@nannet.co.kr>.
Hello all
Please forgive my previous, mistaken message.

     [echo] Running lia.analysis.i18n.KoreanDemo...
     [java] [경] [기]  analyzer = org.apache.lucene.analysis.standard.StandardAnalyzer
     [java] phrase = 경기
     [java] query = "경 기"

I got the right result.

When I compiled, I had just renamed the old lucene-1.4.3.jar to lucene-1.4.3.jar_bak,
added the new 1.9 Lucene jar, and built the test package.
After I removed lucene-1.4.3.jar_bak from the lib directory completely,
I got the expected result!

I don't know the reason... (it looks like I slipped up somewhere.)

Anyway, thanks Koji and Cheolgoo.
I will test further now...

Youngho




----- Original Message ----- 
From: "Youngho Cho" <yo...@nannet.co.kr>
To: <ja...@lucene.apache.org>
Sent: Thursday, October 27, 2005 12:28 PM
Subject: Re: korean and lucene


> Hello Koji
> 
> Here is test result.
> Japanese is OK !.
> maybe ant clean  did some effect.
> 
> Anyway please refer to the result using 1.9
> 
>      [echo] Running lia.analysis.i18n.JapaneseDemo...
>      [java] [ラ] [メ] [ン] [屋]  analyzer = org.apache.lucene.analysis.standard.StandardAnalyzer
>      [java] phrase = ラ?メン屋
>      [java] query = content:ラ?メン屋
>   
>     [echo] Running lia.analysis.i18n.KoreanDemo...
>      [java]  analyzer = org.apache.lucene.analysis.standard.StandardAnalyzer
>      [java] phrase = 경
>      [java] query =  
>   
>      [echo] Running lia.analysis.i18n.JapaneseDemo...
>      [java] [ラ] [メン] [ン屋]  analyzer = org.apache.lucene.analysis.cjk.CJKAnalyzer
>      [java] phrase = ラ?メン屋
>      [java] query = content:ラ?メン屋
>   
>     [echo] Running lia.analysis.i18n.KoreanDemo...
>      [java] [경]  analyzer = org.apache.lucene.analysis.cjk.CJKAnalyzer
>      [java] phrase = 경
>      [java] query = 경 
> 
>      [echo] Running lia.analysis.i18n.KoreanDemo...
>      [java]  analyzer = org.apache.lucene.analysis.standard.StandardAnalyzer
>      [java] phrase = 경기
>      [java] query = 
>   
>      [echo] Running lia.analysis.i18n.KoreanDemo...
>      [java] [경기]  analyzer = org.apache.lucene.analysis.cjk.CJKAnalyzer
>      [java] phrase = 경기
>      [java] query = 경기
>   
> 
> Standard analyzer didn't tokenized the Korean Character at all....
> 
> Ug....  look like 
>  http://issues.apache.org/jira/browse/LUCENE-444
>  didn't effect at all for Korean.
> 
> 
> Thanks 
> 
> Youngho
> 
> ----- Original Message ----- 
> From: "Koji Sekiguchi" <ko...@m4.dion.ne.jp>
> To: <ja...@lucene.apache.org>; "Youngho Cho" <yo...@nannet.co.kr>
> Sent: Thursday, October 27, 2005 11:47 AM
> Subject: RE: korean and lucene
> 
> 
> > Hello Youngho,
> > 
> > I don't understand why you couldn't get hits result in Japanese,
> > though, you had better check why the query was empty with Korean data:
> > 
> > > For Korean
> > >      [echo] Running lia.analysis.i18n.KoreanDemo...
> > >      [java] phrase = 경
> > >      [java] query = 
> > 
> > The last line should be query = 경
> > to get hits result. Can you check why StandardAnalyzer
> > removes "경" during tokenizing?
> > 
> > Koji
> > 
> > > -----Original Message-----
> > > From: Youngho Cho [mailto:youngho@nannet.co.kr]
> > > Sent: Thursday, October 27, 2005 11:37 AM
> > > To: java-user@lucene.apache.org
> > > Subject: Re: korean and lucene
> > > 
> > > 
> > > Hello Koji,
> > > 
> > > Thanks for your kind reply.
> > > 
> > > Yes, I used QueryParser. normaly I used
> > > Query = QueryParser.parse( ) method.
> > > 
> > > I put your sample code into lia.analysis.i18n package in LuceneAction
> > > and run JapaneseDemo using 1.4 and 1.9 
> > > 
> > > results are 
> > > 
> > >      [echo] Running lia.analysis.i18n.JapaneseDemo...
> > >      [java] query = content:ラ?メン屋
> > > 
> > > I can't get hits result.
> > > 
> > > For Korean
> > >      [echo] Running lia.analysis.i18n.KoreanDemo...
> > >      [java] phrase = 경
> > >      [java] query = 
> > > 
> > > I can't get query parse result.
> > > 
> > > Thanks,
> > > 
> > > Youngho
> > > 
> > > 
> > > 
> > > ----- Original Message ----- 
> > > From: "Koji Sekiguchi" <ko...@m4.dion.ne.jp>
> > > To: <ja...@lucene.apache.org>; "Youngho Cho" <yo...@nannet.co.kr>
> > > Sent: Thursday, October 27, 2005 9:48 AM
> > > Subject: RE: korean and lucene
> > > 
> > > 
> > > > Hi Youngho,
> > > > 
> > > > With regard to Japanese, using StandardAnalyzer,
> > > > I can search a word/phase.
> > > > 
> > > > Did you use QueryParser? StandardAnalyzer tokenizes
> > > > CJK characters into a stream of single character.
> > > > Use QueryParser to get a PhraseQuery and search the query.
> > > > 
> > > > Please see the following sample code. Replace Japanese
> > > > "contents" and (search target) "phrase" with Korean in the 
> > > program and run.
> > > > 
> > > > regards,
> > > > 
> > > > Koji
> > > > 
> > > > =============================================
> > > > import java.io.IOException;
> > > > import org.apache.lucene.analysis.Analyzer;
> > > > import org.apache.lucene.analysis.standard.StandardAnalyzer;
> > > > import org.apache.lucene.analysis.cjk.CJKAnalyzer;
> > > > import org.apache.lucene.store.Directory;
> > > > import org.apache.lucene.store.RAMDirectory;
> > > > import org.apache.lucene.index.IndexWriter;
> > > > import org.apache.lucene.document.Document;
> > > > import org.apache.lucene.document.Field;
> > > > import org.apache.lucene.search.IndexSearcher;
> > > > import org.apache.lucene.search.Hits;
> > > > import org.apache.lucene.search.Query;
> > > > import org.apache.lucene.queryParser.QueryParser;
> > > > import org.apache.lucene.queryParser.ParseException;
> > > > 
> > > > public class JapaneseByStandardAnalyzer {
> > > > 
> > > >     private static final String FIELD_CONTENT = "content";
> > > >     private static final String[] contents = {
> > > > "東京にはおいしいラーメン屋がたくさんあります。",
> > > > "北海道にもおいしいラーメン屋があります。"
> > > >     };
> > > >     private static final String phrase = "ラーメン屋";
> > > >     //private static final String phrase = "屋";
> > > >     private static Analyzer analyzer = null;
> > > > 
> > > >     public static void main( String[] args ) throws 
> > > IOException, ParseException {
> > > > Directory directory = makeIndex();
> > > > search( directory );
> > > > directory.close();
> > > >     }
> > > > 
> > > >     private static Analyzer getAnalyzer(){
> > > > if( analyzer == null ){
> > > >     analyzer = new StandardAnalyzer();
> > > >     //analyzer = new CJKAnalyzer();
> > > > }
> > > > return analyzer;
> > > >     }
> > > > 
> > > >     private static Directory makeIndex() throws IOException {
> > > > Directory directory = new RAMDirectory();
> > > > IndexWriter writer = new IndexWriter( directory, getAnalyzer(), true );
> > > > for( int i = 0; i < contents.length; i++ ){
> > > >     Document doc = new Document();
> > > >     doc.add( new Field( FIELD_CONTENT, contents[i], 
> > > Field.Store.YES, Field.Index.TOKENIZED ) );
> > > >     writer.addDocument( doc );
> > > > }
> > > > writer.close();
> > > > return directory;
> > > >     }
> > > > 
> > > >     private static void search( Directory directory ) throws 
> > > IOException, ParseException {
> > > > IndexSearcher searcher = new IndexSearcher( directory );
> > > > QueryParser parser = new QueryParser( FIELD_CONTENT, getAnalyzer() );
> > > > Query query = parser.parse( phrase );
> > > > System.out.println( "query = " + query );
> > > > Hits hits = searcher.search( query );
> > > > for( int i = 0; i < hits.length(); i++ )
> > > >     System.out.println( "doc = " + hits.doc( i ).get( FIELD_CONTENT ) );
> > > > searcher.close();
> > > >     }
> > > > }
> > > > 
> > > > 
> > > > > -----Original Message-----
> > > > > From: Youngho Cho [mailto:youngho@nannet.co.kr]
> > > > > Sent: Thursday, October 27, 2005 8:18 AM
> > > > > To: java-user@lucene.apache.org; Cheolgoo Kang
> > > > > Subject: Re: korean and lucene
> > > > > 
> > > > > 
> > > > > Hello Cheolgoo,
> > > > > 
> > > > > Now I updated my lucene version to 1.9 for using StandardAnalyzer 
> > > > > for Korean.
> > > > > And tested your patch which is already adopted in 1.9
> > > > > 
> > > > > http://issues.apache.org/jira/browse/LUCENE-444
> > > > > 
> > > > > But Still I have no good  results with Korean compare with 
> > > CJKAnalyzer.
> > > > > 
> > > > > Single character is good match but more two character word 
> > > > > doesn't match at all.
> > > > > 
> > > > > Am I something missing or still there need some more works ?
> > > > > 
> > > > > 
> > > > > Thanks,
> > > > > 
> > > > > Youngho.
> > > > >  
> > > > > 
> > > > > ----- Original Message ----- 
> > > > > From: "Cheolgoo Kang" <ap...@gmail.com>
> > > > > To: <ja...@lucene.apache.org>; "John Wang" <jo...@gmail.com>
> > > > > Sent: Tuesday, October 04, 2005 10:11 AM
> > > > > Subject: Re: korean and lucene
> > > > > 
> > > > > 
> > > > > > StandardAnalyzer's JavaCC based StandardTokenizer.jj cannot read
> > > > > > Korean part of Unicode character blocks.
> > > > > > 
> > > > > > You should 1) use CJKAnalyzer or 2) add Korean character
> > > > > > block(0xAC00~0xD7AF) to the CJK token definition on the
> > > > > > StandardTokenizer.jj file.
> > > > > > 
> > > > > > Hope it helps.
> > > > > > 
> > > > > > 
> > > > > > On 10/4/05, John Wang <jo...@gmail.com> wrote:
> > > > > > > Hi:
> > > > > > >
> > > > > > > We are running into problems with searching on korean 
> > > > > documents. We are
> > > > > > > using the StandardAnalyzer and everything works with Chinese 
> > > > > and Japanese.
> > > > > > > Are there known problems with Korean with Lucene?
> > > > > > >
> > > > > > > Thanks
> > > > > > >
> > > > > > > -John
> > > > > > >
> > > > > > >
> > > > > > 
> > > > > > 
> > > > > > --
> > > > > > Cheolgoo
> > > > > > 
> > > > > > 
> > > ---------------------------------------------------------------------
> > > > > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > > > 
> > > > 
> > > > 
> > > > ---------------------------------------------------------------------
> > > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > > 
> > 
> > 
> > 
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org

Re: korean and lucene

Posted by Youngho Cho <yo...@nannet.co.kr>.
Hello Koji

Here is the test result.
Japanese is OK!
Maybe ant clean had some effect.

Anyway, please refer to the results using 1.9:

     [echo] Running lia.analysis.i18n.JapaneseDemo...
     [java] [ラ] [メ] [ン] [屋]  analyzer = org.apache.lucene.analysis.standard.StandardAnalyzer
     [java] phrase = ラ?メン屋
     [java] query = content:ラ?メン屋
  
    [echo] Running lia.analysis.i18n.KoreanDemo...
     [java]  analyzer = org.apache.lucene.analysis.standard.StandardAnalyzer
     [java] phrase = 경
     [java] query =  
  
     [echo] Running lia.analysis.i18n.JapaneseDemo...
     [java] [ラ] [メン] [ン屋]  analyzer = org.apache.lucene.analysis.cjk.CJKAnalyzer
     [java] phrase = ラ?メン屋
     [java] query = content:ラ?メン屋
  
    [echo] Running lia.analysis.i18n.KoreanDemo...
     [java] [경]  analyzer = org.apache.lucene.analysis.cjk.CJKAnalyzer
     [java] phrase = 경
     [java] query = 경 

     [echo] Running lia.analysis.i18n.KoreanDemo...
     [java]  analyzer = org.apache.lucene.analysis.standard.StandardAnalyzer
     [java] phrase = 경기
     [java] query = 
  
     [echo] Running lia.analysis.i18n.KoreanDemo...
     [java] [경기]  analyzer = org.apache.lucene.analysis.cjk.CJKAnalyzer
     [java] phrase = 경기
     [java] query = 경기
  

StandardAnalyzer didn't tokenize the Korean characters at all...

Ugh... it looks like
 http://issues.apache.org/jira/browse/LUCENE-444
had no effect for Korean.
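
For reference, the bracketed CJKAnalyzer lines above are its overlapping two-character
(bigram) tokens, which is why メン屋 shows up as [メン] [ン屋]. Below is a small sketch
of that kind of dump; the three-character word 경기도 is made-up input, and the expected
output is my reading of CJKTokenizer's bigram scheme, not taken from the actual test run.

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cjk.CJKAnalyzer;

public class CjkBigramDump {
    public static void main(String[] args) throws IOException {
        // CJKAnalyzer turns a run of CJK characters into overlapping
        // two-character tokens, so a three-character word gives two tokens.
        TokenStream stream = new CJKAnalyzer().tokenStream("content",
                new StringReader("경기도"));
        for (Token token = stream.next(); token != null; token = stream.next()) {
            System.out.print("[" + token.termText() + "] ");
        }
        stream.close();
        System.out.println();   // presumably prints: [경기] [기도]
    }
}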


Thanks 

Youngho

----- Original Message ----- 
From: "Koji Sekiguchi" <ko...@m4.dion.ne.jp>
To: <ja...@lucene.apache.org>; "Youngho Cho" <yo...@nannet.co.kr>
Sent: Thursday, October 27, 2005 11:47 AM
Subject: RE: korean and lucene


> Hello Youngho,
> 
> I don't understand why you couldn't get hits result in Japanese,
> though, you had better check why the query was empty with Korean data:
> 
> > For Korean
> >      [echo] Running lia.analysis.i18n.KoreanDemo...
> >      [java] phrase = 경
> >      [java] query = 
> 
> The last line should be query = 경
> to get hits result. Can you check why StandardAnalyzer
> removes "경" during tokenizing?
> 
> Koji
> 
> > -----Original Message-----
> > From: Youngho Cho [mailto:youngho@nannet.co.kr]
> > Sent: Thursday, October 27, 2005 11:37 AM
> > To: java-user@lucene.apache.org
> > Subject: Re: korean and lucene
> > 
> > 
> > Hello Koji,
> > 
> > Thanks for your kind reply.
> > 
> > Yes, I used QueryParser. normaly I used
> > Query = QueryParser.parse( ) method.
> > 
> > I put your sample code into lia.analysis.i18n package in LuceneAction
> > and run JapaneseDemo using 1.4 and 1.9 
> > 
> > results are 
> > 
> >      [echo] Running lia.analysis.i18n.JapaneseDemo...
> >      [java] query = content:ラ?メン屋
> > 
> > I can't get hits result.
> > 
> > For Korean
> >      [echo] Running lia.analysis.i18n.KoreanDemo...
> >      [java] phrase = 경
> >      [java] query = 
> > 
> > I can't get query parse result.
> > 
> > Thanks,
> > 
> > Youngho
> > 
> > 
> > 
> > ----- Original Message ----- 
> > From: "Koji Sekiguchi" <ko...@m4.dion.ne.jp>
> > To: <ja...@lucene.apache.org>; "Youngho Cho" <yo...@nannet.co.kr>
> > Sent: Thursday, October 27, 2005 9:48 AM
> > Subject: RE: korean and lucene
> > 
> > 
> > > Hi Youngho,
> > > 
> > > With regard to Japanese, using StandardAnalyzer,
> > > I can search a word/phase.
> > > 
> > > Did you use QueryParser? StandardAnalyzer tokenizes
> > > CJK characters into a stream of single character.
> > > Use QueryParser to get a PhraseQuery and search the query.
> > > 
> > > Please see the following sample code. Replace Japanese
> > > "contents" and (search target) "phrase" with Korean in the 
> > program and run.
> > > 
> > > regards,
> > > 
> > > Koji
> > > 
> > > =============================================
> > > import java.io.IOException;
> > > import org.apache.lucene.analysis.Analyzer;
> > > import org.apache.lucene.analysis.standard.StandardAnalyzer;
> > > import org.apache.lucene.analysis.cjk.CJKAnalyzer;
> > > import org.apache.lucene.store.Directory;
> > > import org.apache.lucene.store.RAMDirectory;
> > > import org.apache.lucene.index.IndexWriter;
> > > import org.apache.lucene.document.Document;
> > > import org.apache.lucene.document.Field;
> > > import org.apache.lucene.search.IndexSearcher;
> > > import org.apache.lucene.search.Hits;
> > > import org.apache.lucene.search.Query;
> > > import org.apache.lucene.queryParser.QueryParser;
> > > import org.apache.lucene.queryParser.ParseException;
> > > 
> > > public class JapaneseByStandardAnalyzer {
> > > 
> > >     private static final String FIELD_CONTENT = "content";
> > >     private static final String[] contents = {
> > > "東京にはおいしいラーメン屋がたくさんあります。",
> > > "北海道にもおいしいラーメン屋があります。"
> > >     };
> > >     private static final String phrase = "ラーメン屋";
> > >     //private static final String phrase = "屋";
> > >     private static Analyzer analyzer = null;
> > > 
> > >     public static void main( String[] args ) throws 
> > IOException, ParseException {
> > > Directory directory = makeIndex();
> > > search( directory );
> > > directory.close();
> > >     }
> > > 
> > >     private static Analyzer getAnalyzer(){
> > > if( analyzer == null ){
> > >     analyzer = new StandardAnalyzer();
> > >     //analyzer = new CJKAnalyzer();
> > > }
> > > return analyzer;
> > >     }
> > > 
> > >     private static Directory makeIndex() throws IOException {
> > > Directory directory = new RAMDirectory();
> > > IndexWriter writer = new IndexWriter( directory, getAnalyzer(), true );
> > > for( int i = 0; i < contents.length; i++ ){
> > >     Document doc = new Document();
> > >     doc.add( new Field( FIELD_CONTENT, contents[i], 
> > Field.Store.YES, Field.Index.TOKENIZED ) );
> > >     writer.addDocument( doc );
> > > }
> > > writer.close();
> > > return directory;
> > >     }
> > > 
> > >     private static void search( Directory directory ) throws 
> > IOException, ParseException {
> > > IndexSearcher searcher = new IndexSearcher( directory );
> > > QueryParser parser = new QueryParser( FIELD_CONTENT, getAnalyzer() );
> > > Query query = parser.parse( phrase );
> > > System.out.println( "query = " + query );
> > > Hits hits = searcher.search( query );
> > > for( int i = 0; i < hits.length(); i++ )
> > >     System.out.println( "doc = " + hits.doc( i ).get( FIELD_CONTENT ) );
> > > searcher.close();
> > >     }
> > > }
> > > 
> > > 
> > > > -----Original Message-----
> > > > From: Youngho Cho [mailto:youngho@nannet.co.kr]
> > > > Sent: Thursday, October 27, 2005 8:18 AM
> > > > To: java-user@lucene.apache.org; Cheolgoo Kang
> > > > Subject: Re: korean and lucene
> > > > 
> > > > 
> > > > Hello Cheolgoo,
> > > > 
> > > > Now I updated my lucene version to 1.9 for using StandardAnalyzer 
> > > > for Korean.
> > > > And tested your patch which is already adopted in 1.9
> > > > 
> > > > http://issues.apache.org/jira/browse/LUCENE-444
> > > > 
> > > > But Still I have no good  results with Korean compare with 
> > CJKAnalyzer.
> > > > 
> > > > Single character is good match but more two character word 
> > > > doesn't match at all.
> > > > 
> > > > Am I something missing or still there need some more works ?
> > > > 
> > > > 
> > > > Thanks,
> > > > 
> > > > Youngho.
> > > >  
> > > > 
> > > > ----- Original Message ----- 
> > > > From: "Cheolgoo Kang" <ap...@gmail.com>
> > > > To: <ja...@lucene.apache.org>; "John Wang" <jo...@gmail.com>
> > > > Sent: Tuesday, October 04, 2005 10:11 AM
> > > > Subject: Re: korean and lucene
> > > > 
> > > > 
> > > > > StandardAnalyzer's JavaCC based StandardTokenizer.jj cannot read
> > > > > Korean part of Unicode character blocks.
> > > > > 
> > > > > You should 1) use CJKAnalyzer or 2) add Korean character
> > > > > block(0xAC00~0xD7AF) to the CJK token definition on the
> > > > > StandardTokenizer.jj file.
> > > > > 
> > > > > Hope it helps.
> > > > > 
> > > > > 
> > > > > On 10/4/05, John Wang <jo...@gmail.com> wrote:
> > > > > > Hi:
> > > > > >
> > > > > > We are running into problems with searching on korean 
> > > > documents. We are
> > > > > > using the StandardAnalyzer and everything works with Chinese 
> > > > and Japanese.
> > > > > > Are there known problems with Korean with Lucene?
> > > > > >
> > > > > > Thanks
> > > > > >
> > > > > > -John
> > > > > >
> > > > > >
> > > > > 
> > > > > 
> > > > > --
> > > > > Cheolgoo
> > > > > 
> > > > > 
> > ---------------------------------------------------------------------
> > > > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > > 
> > > 
> > > 
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > 
> 
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org

RE: korean and lucene

Posted by Koji Sekiguchi <ko...@m4.dion.ne.jp>.
Hello Youngho,

I don't understand why you couldn't get any hits in Japanese;
in any case, you had better check why the query was empty with the Korean data:

> For Korean
>      [echo] Running lia.analysis.i18n.KoreanDemo...
>      [java] phrase = 경
>      [java] query = 

The last line should be query = 경
in order to get hits. Can you check why StandardAnalyzer
removes "경" during tokenizing?

Koji

> -----Original Message-----
> From: Youngho Cho [mailto:youngho@nannet.co.kr]
> Sent: Thursday, October 27, 2005 11:37 AM
> To: java-user@lucene.apache.org
> Subject: Re: korean and lucene
> 
> 
> Hello Koji,
> 
> Thanks for your kind reply.
> 
> Yes, I used QueryParser. normaly I used
> Query = QueryParser.parse( ) method.
> 
> I put your sample code into lia.analysis.i18n package in LuceneAction
> and run JapaneseDemo using 1.4 and 1.9 
> 
> results are 
> 
>      [echo] Running lia.analysis.i18n.JapaneseDemo...
>      [java] query = content:ラ?メン屋
> 
> I can't get hits result.
> 
> For Korean
>      [echo] Running lia.analysis.i18n.KoreanDemo...
>      [java] phrase = 경
>      [java] query = 
> 
> I can't get query parse result.
> 
> Thanks,
> 
> Youngho
> 
> 
> 
> ----- Original Message ----- 
> From: "Koji Sekiguchi" <ko...@m4.dion.ne.jp>
> To: <ja...@lucene.apache.org>; "Youngho Cho" <yo...@nannet.co.kr>
> Sent: Thursday, October 27, 2005 9:48 AM
> Subject: RE: korean and lucene
> 
> 
> > Hi Youngho,
> > 
> > With regard to Japanese, using StandardAnalyzer,
> > I can search a word/phase.
> > 
> > Did you use QueryParser? StandardAnalyzer tokenizes
> > CJK characters into a stream of single character.
> > Use QueryParser to get a PhraseQuery and search the query.
> > 
> > Please see the following sample code. Replace Japanese
> > "contents" and (search target) "phrase" with Korean in the 
> program and run.
> > 
> > regards,
> > 
> > Koji
> > 
> > =============================================
> > import java.io.IOException;
> > import org.apache.lucene.analysis.Analyzer;
> > import org.apache.lucene.analysis.standard.StandardAnalyzer;
> > import org.apache.lucene.analysis.cjk.CJKAnalyzer;
> > import org.apache.lucene.store.Directory;
> > import org.apache.lucene.store.RAMDirectory;
> > import org.apache.lucene.index.IndexWriter;
> > import org.apache.lucene.document.Document;
> > import org.apache.lucene.document.Field;
> > import org.apache.lucene.search.IndexSearcher;
> > import org.apache.lucene.search.Hits;
> > import org.apache.lucene.search.Query;
> > import org.apache.lucene.queryParser.QueryParser;
> > import org.apache.lucene.queryParser.ParseException;
> > 
> > public class JapaneseByStandardAnalyzer {
> > 
> >     private static final String FIELD_CONTENT = "content";
> >     private static final String[] contents = {
> > "東京にはおいしいラーメン屋がたくさんあります。",
> > "北海道にもおいしいラーメン屋があります。"
> >     };
> >     private static final String phrase = "ラーメン屋";
> >     //private static final String phrase = "屋";
> >     private static Analyzer analyzer = null;
> > 
> >     public static void main( String[] args ) throws 
> IOException, ParseException {
> > Directory directory = makeIndex();
> > search( directory );
> > directory.close();
> >     }
> > 
> >     private static Analyzer getAnalyzer(){
> > if( analyzer == null ){
> >     analyzer = new StandardAnalyzer();
> >     //analyzer = new CJKAnalyzer();
> > }
> > return analyzer;
> >     }
> > 
> >     private static Directory makeIndex() throws IOException {
> > Directory directory = new RAMDirectory();
> > IndexWriter writer = new IndexWriter( directory, getAnalyzer(), true );
> > for( int i = 0; i < contents.length; i++ ){
> >     Document doc = new Document();
> >     doc.add( new Field( FIELD_CONTENT, contents[i], 
> Field.Store.YES, Field.Index.TOKENIZED ) );
> >     writer.addDocument( doc );
> > }
> > writer.close();
> > return directory;
> >     }
> > 
> >     private static void search( Directory directory ) throws 
> IOException, ParseException {
> > IndexSearcher searcher = new IndexSearcher( directory );
> > QueryParser parser = new QueryParser( FIELD_CONTENT, getAnalyzer() );
> > Query query = parser.parse( phrase );
> > System.out.println( "query = " + query );
> > Hits hits = searcher.search( query );
> > for( int i = 0; i < hits.length(); i++ )
> >     System.out.println( "doc = " + hits.doc( i ).get( FIELD_CONTENT ) );
> > searcher.close();
> >     }
> > }
> > 
> > 
> > > -----Original Message-----
> > > From: Youngho Cho [mailto:youngho@nannet.co.kr]
> > > Sent: Thursday, October 27, 2005 8:18 AM
> > > To: java-user@lucene.apache.org; Cheolgoo Kang
> > > Subject: Re: korean and lucene
> > > 
> > > 
> > > Hello Cheolgoo,
> > > 
> > > Now I updated my lucene version to 1.9 for using StandardAnalyzer 
> > > for Korean.
> > > And tested your patch which is already adopted in 1.9
> > > 
> > > http://issues.apache.org/jira/browse/LUCENE-444
> > > 
> > > But Still I have no good  results with Korean compare with 
> CJKAnalyzer.
> > > 
> > > Single character is good match but more two character word 
> > > doesn't match at all.
> > > 
> > > Am I something missing or still there need some more works ?
> > > 
> > > 
> > > Thanks,
> > > 
> > > Youngho.
> > >  
> > > 
> > > ----- Original Message ----- 
> > > From: "Cheolgoo Kang" <ap...@gmail.com>
> > > To: <ja...@lucene.apache.org>; "John Wang" <jo...@gmail.com>
> > > Sent: Tuesday, October 04, 2005 10:11 AM
> > > Subject: Re: korean and lucene
> > > 
> > > 
> > > > StandardAnalyzer's JavaCC based StandardTokenizer.jj cannot read
> > > > Korean part of Unicode character blocks.
> > > > 
> > > > You should 1) use CJKAnalyzer or 2) add Korean character
> > > > block(0xAC00~0xD7AF) to the CJK token definition on the
> > > > StandardTokenizer.jj file.
> > > > 
> > > > Hope it helps.
> > > > 
> > > > 
> > > > On 10/4/05, John Wang <jo...@gmail.com> wrote:
> > > > > Hi:
> > > > >
> > > > > We are running into problems with searching on korean 
> > > documents. We are
> > > > > using the StandardAnalyzer and everything works with Chinese 
> > > and Japanese.
> > > > > Are there known problems with Korean with Lucene?
> > > > >
> > > > > Thanks
> > > > >
> > > > > -John
> > > > >
> > > > >
> > > > 
> > > > 
> > > > --
> > > > Cheolgoo
> > > > 
> > > > 
> ---------------------------------------------------------------------
> > > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > 
> > 
> > 
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> 



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: korean and lucene

Posted by Youngho Cho <yo...@nannet.co.kr>.
Hello Koji,

Thanks for your kind reply.

Yes, I used QueryParser. Normally I use the
Query = QueryParser.parse() method.

I put your sample code into the lia.analysis.i18n package in Lucene in Action
and ran JapaneseDemo using 1.4 and 1.9.

The results are:

     [echo] Running lia.analysis.i18n.JapaneseDemo...
     [java] query = content:ラ?メン屋

I can't get any hits.

For Korean
     [echo] Running lia.analysis.i18n.KoreanDemo...
     [java] phrase = 경
     [java] query = 

I can't get a parsed query at all.

Thanks,

Youngho



----- Original Message ----- 
From: "Koji Sekiguchi" <ko...@m4.dion.ne.jp>
To: <ja...@lucene.apache.org>; "Youngho Cho" <yo...@nannet.co.kr>
Sent: Thursday, October 27, 2005 9:48 AM
Subject: RE: korean and lucene


> Hi Youngho,
> 
> With regard to Japanese, using StandardAnalyzer,
> I can search a word/phase.
> 
> Did you use QueryParser? StandardAnalyzer tokenizes
> CJK characters into a stream of single character.
> Use QueryParser to get a PhraseQuery and search the query.
> 
> Please see the following sample code. Replace Japanese
> "contents" and (search target) "phrase" with Korean in the program and run.
> 
> regards,
> 
> Koji
> 
> =============================================
> import java.io.IOException;
> import org.apache.lucene.analysis.Analyzer;
> import org.apache.lucene.analysis.standard.StandardAnalyzer;
> import org.apache.lucene.analysis.cjk.CJKAnalyzer;
> import org.apache.lucene.store.Directory;
> import org.apache.lucene.store.RAMDirectory;
> import org.apache.lucene.index.IndexWriter;
> import org.apache.lucene.document.Document;
> import org.apache.lucene.document.Field;
> import org.apache.lucene.search.IndexSearcher;
> import org.apache.lucene.search.Hits;
> import org.apache.lucene.search.Query;
> import org.apache.lucene.queryParser.QueryParser;
> import org.apache.lucene.queryParser.ParseException;
> 
> public class JapaneseByStandardAnalyzer {
> 
>     private static final String FIELD_CONTENT = "content";
>     private static final String[] contents = {
> "東京にはおいしいラーメン屋がたくさんあります。",
> "北海道にもおいしいラーメン屋があります。"
>     };
>     private static final String phrase = "ラーメン屋";
>     //private static final String phrase = "屋";
>     private static Analyzer analyzer = null;
> 
>     public static void main( String[] args ) throws IOException, ParseException {
> Directory directory = makeIndex();
> search( directory );
> directory.close();
>     }
> 
>     private static Analyzer getAnalyzer(){
> if( analyzer == null ){
>     analyzer = new StandardAnalyzer();
>     //analyzer = new CJKAnalyzer();
> }
> return analyzer;
>     }
> 
>     private static Directory makeIndex() throws IOException {
> Directory directory = new RAMDirectory();
> IndexWriter writer = new IndexWriter( directory, getAnalyzer(), true );
> for( int i = 0; i < contents.length; i++ ){
>     Document doc = new Document();
>     doc.add( new Field( FIELD_CONTENT, contents[i], Field.Store.YES, Field.Index.TOKENIZED ) );
>     writer.addDocument( doc );
> }
> writer.close();
> return directory;
>     }
> 
>     private static void search( Directory directory ) throws IOException, ParseException {
> IndexSearcher searcher = new IndexSearcher( directory );
> QueryParser parser = new QueryParser( FIELD_CONTENT, getAnalyzer() );
> Query query = parser.parse( phrase );
> System.out.println( "query = " + query );
> Hits hits = searcher.search( query );
> for( int i = 0; i < hits.length(); i++ )
>     System.out.println( "doc = " + hits.doc( i ).get( FIELD_CONTENT ) );
> searcher.close();
>     }
> }
> 
> 
> > -----Original Message-----
> > From: Youngho Cho [mailto:youngho@nannet.co.kr]
> > Sent: Thursday, October 27, 2005 8:18 AM
> > To: java-user@lucene.apache.org; Cheolgoo Kang
> > Subject: Re: korean and lucene
> > 
> > 
> > Hello Cheolgoo,
> > 
> > Now I updated my lucene version to 1.9 for using StandardAnalyzer 
> > for Korean.
> > And tested your patch which is already adopted in 1.9
> > 
> > http://issues.apache.org/jira/browse/LUCENE-444
> > 
> > But Still I have no good  results with Korean compare with CJKAnalyzer.
> > 
> > Single character is good match but more two character word 
> > doesn't match at all.
> > 
> > Am I something missing or still there need some more works ?
> > 
> > 
> > Thanks,
> > 
> > Youngho.
> >  
> > 
> > ----- Original Message ----- 
> > From: "Cheolgoo Kang" <ap...@gmail.com>
> > To: <ja...@lucene.apache.org>; "John Wang" <jo...@gmail.com>
> > Sent: Tuesday, October 04, 2005 10:11 AM
> > Subject: Re: korean and lucene
> > 
> > 
> > > StandardAnalyzer's JavaCC based StandardTokenizer.jj cannot read
> > > Korean part of Unicode character blocks.
> > > 
> > > You should 1) use CJKAnalyzer or 2) add Korean character
> > > block(0xAC00~0xD7AF) to the CJK token definition on the
> > > StandardTokenizer.jj file.
> > > 
> > > Hope it helps.
> > > 
> > > 
> > > On 10/4/05, John Wang <jo...@gmail.com> wrote:
> > > > Hi:
> > > >
> > > > We are running into problems with searching on korean 
> > documents. We are
> > > > using the StandardAnalyzer and everything works with Chinese 
> > and Japanese.
> > > > Are there known problems with Korean with Lucene?
> > > >
> > > > Thanks
> > > >
> > > > -John
> > > >
> > > >
> > > 
> > > 
> > > --
> > > Cheolgoo
> > > 
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > For additional commands, e-mail: java-user-help@lucene.apache.org
> 
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org

RE: korean and lucene

Posted by Koji Sekiguchi <ko...@m4.dion.ne.jp>.
Hi Youngho,

With regard to Japanese, using StandardAnalyzer,
I can search a word/phrase.

Did you use QueryParser? StandardAnalyzer tokenizes
CJK characters into a stream of single-character tokens.
Use QueryParser to get a PhraseQuery and search with that query.

Please see the following sample code. Replace the Japanese
"contents" and the (search-target) "phrase" with Korean in the program and run it.

regards,

Koji

=============================================
import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.cjk.CJKAnalyzer;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.Query;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.queryParser.ParseException;

public class JapaneseByStandardAnalyzer {

    private static final String FIELD_CONTENT = "content";
    private static final String[] contents = {
	"東京にはおいしいラーメン屋がたくさんあります。",
	"北海道にもおいしいラーメン屋があります。"
    };
    private static final String phrase = "ラーメン屋";
    //private static final String phrase = "屋";
    private static Analyzer analyzer = null;

    public static void main( String[] args ) throws IOException, ParseException {
	Directory directory = makeIndex();
	search( directory );
	directory.close();
    }

    private static Analyzer getAnalyzer(){
	if( analyzer == null ){
	    analyzer = new StandardAnalyzer();
	    //analyzer = new CJKAnalyzer();
	}
	return analyzer;
    }

    private static Directory makeIndex() throws IOException {
	Directory directory = new RAMDirectory();
	IndexWriter writer = new IndexWriter( directory, getAnalyzer(), true );
	for( int i = 0; i < contents.length; i++ ){
	    Document doc = new Document();
	    doc.add( new Field( FIELD_CONTENT, contents[i], Field.Store.YES, Field.Index.TOKENIZED ) );
	    writer.addDocument( doc );
	}
	writer.close();
	return directory;
    }

    private static void search( Directory directory ) throws IOException, ParseException {
	IndexSearcher searcher = new IndexSearcher( directory );
	QueryParser parser = new QueryParser( FIELD_CONTENT, getAnalyzer() );
	Query query = parser.parse( phrase );
	System.out.println( "query = " + query );
	Hits hits = searcher.search( query );
	for( int i = 0; i < hits.length(); i++ )
	    System.out.println( "doc = " + hits.doc( i ).get( FIELD_CONTENT ) );
	searcher.close();
    }
}
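
To make the "stream of single-character tokens" point concrete: for a two-character
Korean word such as 경기, QueryParser plus such an analyzer should build the equivalent
of the hand-made PhraseQuery below. This is a sketch for illustration only, not part
of the sample above; the class name is made up.

import org.apache.lucene.index.Term;
import org.apache.lucene.search.PhraseQuery;

public class KoreanPhraseQuerySketch {
    public static void main(String[] args) {
        // What QueryParser effectively builds for 경기 once the analyzer
        // has split it into the single-character tokens 경 and 기.
        PhraseQuery query = new PhraseQuery();
        query.add(new Term("content", "경"));
        query.add(new Term("content", "기"));
        System.out.println(query);   // prints something like: content:"경 기"
    }
}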


> -----Original Message-----
> From: Youngho Cho [mailto:youngho@nannet.co.kr]
> Sent: Thursday, October 27, 2005 8:18 AM
> To: java-user@lucene.apache.org; Cheolgoo Kang
> Subject: Re: korean and lucene
> 
> 
> Hello Cheolgoo,
> 
> Now I updated my lucene version to 1.9 for using StandardAnalyzer 
> for Korean.
> And tested your patch which is already adopted in 1.9
> 
> http://issues.apache.org/jira/browse/LUCENE-444
> 
> But Still I have no good  results with Korean compare with CJKAnalyzer.
> 
> Single character is good match but more two character word 
> doesn't match at all.
> 
> Am I something missing or still there need some more works ?
> 
> 
> Thanks,
> 
> Youngho.
>  
> 
> ----- Original Message ----- 
> From: "Cheolgoo Kang" <ap...@gmail.com>
> To: <ja...@lucene.apache.org>; "John Wang" <jo...@gmail.com>
> Sent: Tuesday, October 04, 2005 10:11 AM
> Subject: Re: korean and lucene
> 
> 
> > StandardAnalyzer's JavaCC based StandardTokenizer.jj cannot read
> > Korean part of Unicode character blocks.
> > 
> > You should 1) use CJKAnalyzer or 2) add Korean character
> > block(0xAC00~0xD7AF) to the CJK token definition on the
> > StandardTokenizer.jj file.
> > 
> > Hope it helps.
> > 
> > 
> > On 10/4/05, John Wang <jo...@gmail.com> wrote:
> > > Hi:
> > >
> > > We are running into problems with searching on korean 
> documents. We are
> > > using the StandardAnalyzer and everything works with Chinese 
> and Japanese.
> > > Are there known problems with Korean with Lucene?
> > >
> > > Thanks
> > >
> > > -John
> > >
> > >
> > 
> > 
> > --
> > Cheolgoo
> > 
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: korean and lucene

Posted by Youngho Cho <yo...@nannet.co.kr>.
Hello Cheolgoo,

I have now updated my Lucene version to 1.9 in order to use StandardAnalyzer for Korean,
and tested your patch, which has already been adopted in 1.9:

http://issues.apache.org/jira/browse/LUCENE-444

But I still get no good results with Korean compared with CJKAnalyzer.

A single character matches fine, but a word of two or more characters doesn't match at all.

Am I missing something, or is more work still needed?


Thanks,

Youngho.
 

----- Original Message ----- 
From: "Cheolgoo Kang" <ap...@gmail.com>
To: <ja...@lucene.apache.org>; "John Wang" <jo...@gmail.com>
Sent: Tuesday, October 04, 2005 10:11 AM
Subject: Re: korean and lucene


> StandardAnalyzer's JavaCC based StandardTokenizer.jj cannot read
> Korean part of Unicode character blocks.
> 
> You should 1) use CJKAnalyzer or 2) add Korean character
> block(0xAC00~0xD7AF) to the CJK token definition on the
> StandardTokenizer.jj file.
> 
> Hope it helps.
> 
> 
> On 10/4/05, John Wang <jo...@gmail.com> wrote:
> > Hi:
> >
> > We are running into problems with searching on korean documents. We are
> > using the StandardAnalyzer and everything works with Chinese and Japanese.
> > Are there known problems with Korean with Lucene?
> >
> > Thanks
> >
> > -John
> >
> >
> 
> 
> --
> Cheolgoo
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org

Re: korean and lucene

Posted by Cheolgoo Kang <ap...@gmail.com>.
StandardAnalyzer's JavaCC-based StandardTokenizer.jj cannot read
the Korean part of the Unicode character blocks.

You should either 1) use CJKAnalyzer, or 2) add the Korean character
block (0xAC00~0xD7AF) to the CJK token definition in the
StandardTokenizer.jj file.

Hope it helps.
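
As a quick way to see why that range matters, here is a tiny sketch (the class name is
hypothetical) confirming that Hangul syllables fall into the U+AC00..U+D7AF block that
the tokenizer grammar skips:

public class HangulBlockCheck {
    public static void main(String[] args) {
        String word = "경기";
        // Each character of the word lies in the Hangul Syllables block
        // (U+AC00..U+D7AF), which the CJK token definition does not cover.
        for (int i = 0; i < word.length(); i++) {
            char c = word.charAt(i);
            System.out.println(c + " = U+"
                    + Integer.toHexString(c).toUpperCase()
                    + " -> " + Character.UnicodeBlock.of(c));
        }
    }
}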


On 10/4/05, John Wang <jo...@gmail.com> wrote:
> Hi:
>
> We are running into problems with searching on korean documents. We are
> using the StandardAnalyzer and everything works with Chinese and Japanese.
> Are there known problems with Korean with Lucene?
>
> Thanks
>
> -John
>
>


--
Cheolgoo

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: korean and lucene

Posted by Youngho Cho <yo...@nannet.co.kr>.
Would you share what the problem is?

I have used CJKAnalyzer for Korean for over 2 years without any problem.
(I remember there was some query-result problem with StandardAnalyzer at that time.)
But I am trying to switch to StandardAnalyzer again.

Thanks,

Youngho


----- Original Message ----- 
From: "John Wang" <jo...@gmail.com>
To: <ja...@lucene.apache.org>
Sent: Tuesday, October 04, 2005 8:46 AM
Subject: korean and lucene


Hi:

We are running into problems with searching on korean documents. We are
using the StandardAnalyzer and everything works with Chinese and Japanese.
Are there known problems with Korean with Lucene?

Thanks

-John