Posted to java-user@lucene.apache.org by Weiwei Wang <ww...@gmail.com> on 2009/12/15 14:26:05 UTC

How to do alias(Pinyin) search in Lucene

Hi, guys,
     I'm implementing a search engine for Chinese based on Lucene, and I want
to support pinyin search as Google China does.

e.g.
    “中国” means "China" in English
    this word's pinyin input is "zhongguo"
The feature I want to implement: when a user types zhongguo, the results
should include documents containing "中国" or even the English word "China".

Does anybody here know how to achieve this?

-- 
Weiwei Wang
Alex Wang
王巍巍
Room 403, Mengmin Wei Building
Computer Science Department
Gulou Campus of Nanjing University
Nanjing, P.R.China, 210093

Homepage: http://cs.nju.edu.cn/rl/weiweiwang

Re: How to do alias(Pinyin) search in Lucene

Posted by fulin tang <ta...@gmail.com>.

Another way to do this is pinyin4j.

You can convert all the Chinese words to their pinyin form first and index the
pinyin form as a field; then you can search on that field.

see: http://www.slideshare.net/tangfl/ss-2364878

in which we implement a pinyin search for our music search
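
A rough sketch of the "index the pinyin form as a field" idea, using only the
JDK. Everything here is an assumption made for illustration: the two-entry
lookup table stands in for a real converter such as pinyin4j, the HashMap
stands in for a Lucene field, and the class and method names are hypothetical.
Each title is indexed as a single token, so only whole-title pinyin matches.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Sketch only: a tiny lookup table replaces pinyin4j and a HashMap replaces
// a Lucene field. None of these names come from a real library.
class PinyinFieldSketch {
    // Hypothetical table covering just U+4E2D (zhong) and U+56FD (guo).
    private static final Map<Character, String> PINYIN =
            Map.of('\u4e2d', "zhong", '\u56fd', "guo");

    // The "pinyin field": pinyin form of a title -> ids of matching documents.
    private final Map<String, Set<Integer>> pinyinField = new HashMap<>();

    // Converts each known Han character to pinyin; other characters pass through.
    static String toPinyin(String text) {
        StringBuilder sb = new StringBuilder();
        for (char c : text.toCharArray()) {
            String p = PINYIN.get(c);
            sb.append(p != null ? p : String.valueOf(c));
        }
        return sb.toString();
    }

    // Index time: store the document id under the pinyin form of its title.
    void add(int docId, String title) {
        pinyinField.computeIfAbsent(toPinyin(title), k -> new TreeSet<>()).add(docId);
    }

    // Query time: look the typed pinyin up in the pinyin field.
    Set<Integer> searchPinyin(String query) {
        return pinyinField.getOrDefault(query, Set.of());
    }

    public static void main(String[] args) {
        PinyinFieldSketch index = new PinyinFieldSketch();
        index.add(1, "\u4e2d\u56fd"); // the title from the original question
        index.add(2, "Lucene");
        System.out.println(index.searchPinyin("zhongguo")); // contains doc 1
    }
}
```

In a real index you would keep the original text in its own field as well, so
that both the Chinese form and the pinyin form can be searched.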

2009/12/16 Weiwei Wang <ww...@gmail.com>:
> Thanks Erick, I'll study that carefully.



-- 
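The dream's beginning struggles at the edge of the city
The heart's far horizon holds fast in the instant of a footstep
My fate has buried an eternity of loneliness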


Re: How to do alias(Pinyin) search in Lucene

Posted by Weiwei Wang <ww...@gmail.com>.

Thanks Erick, I'll study that carefully.

2009/12/16 Erick Erickson <er...@gmail.com>

> If your queries are still slow, make sure you're not measuring
> the *first* query on a newly opened searcher. There are
> other tips here that might be useful. These are general searching
> tips complementary to Robert's suggestions.
>
> http://wiki.apache.org/lucene-java/ImproveSearchingSpeed
>
> HTH
> Erick




Re: How to do alias(Pinyin) search in Lucene

Posted by Erick Erickson <er...@gmail.com>.

If your queries are still slow, make sure you're not measuring
the *first* query on a newly opened searcher. There are
other tips here that might be useful. These are general searching
tips complementary to Robert's suggestions.

http://wiki.apache.org/lucene-java/ImproveSearchingSpeed

HTH
Erick
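
One way to avoid timing the first query is to run a few untimed warm-up
queries and average only the later ones. The sketch below is plain JDK and is
only an illustration: the class name, the method name and the placeholder
"query" are assumptions, and in practice the Supplier would wrap a real search
call on your searcher.

```java
import java.util.function.Supplier;

// Sketch only: a small timing helper that ignores warm-up runs.
class WarmSearchTiming {
    // Runs the query `warmups` times untimed, then returns the average
    // duration in nanoseconds over `runs` timed executions.
    static <T> long averageNanos(Supplier<T> query, int warmups, int runs) {
        if (runs <= 0) {
            throw new IllegalArgumentException("runs must be positive");
        }
        for (int i = 0; i < warmups; i++) {
            query.get(); // result and cost of warm-up runs are discarded
        }
        long total = 0;
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            query.get();
            total += System.nanoTime() - start;
        }
        return total / runs;
    }

    public static void main(String[] args) {
        // Placeholder work; replace with a call that runs a real search.
        Supplier<Integer> fakeQuery = () -> "zhongguo".hashCode();
        System.out.println(averageNanos(fakeQuery, 3, 10) + " ns per query after warm-up");
    }
}
```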

2009/12/15 Weiwei Wang <ww...@gmail.com>

> Thanks Robert, I've learned a lot from you :-)

Re: How to do alias(Pinyin) search in Lucene

Posted by Weiwei Wang <ww...@gmail.com>.

Thanks Robert, I've learned a lot from you :-)

On Wed, Dec 16, 2009 at 11:53 AM, Robert Muir <rc...@gmail.com> wrote:

> Hi, just one more thought for you.
>
> I think even more important than anything I said before, you should ensure
> you implement reusableTokenStream in your analyzer.
> this becomes a necessity if you are using expensive objects like this.
>
> 2009/12/15 Weiwei Wang <ww...@gmail.com>
>
> > I finally got it to run; however, it is very slow.
> >
> > 2009/12/15 Weiwei Wang <ww...@gmail.com>
> >
> > > Got it, thanks, Robert.
> > >
> > > On Tue, Dec 15, 2009 at 10:19 PM, Robert Muir <rc...@gmail.com> wrote:
> > >
> > > > If you have the Lucene 2.9 or 3.0 source code, just run
> > > > patch -p0 < /path/to/LUCENE-XXYY.patch from the Lucene source code
> > > > root directory. It should create the necessary directory and files.
> > > > Then run 'ant'; in this case it should create a lucene-icu jar file
> > > > in the build directory.
> > > >
> > > > The patch doesn't include the ICU dependency itself, so you need to
> > > > get that jar file from www.icu-project.org and have it on your
> > > > classpath as well.
> > > >
> > > > Sorry for the trouble; I hope to integrate some of this soon for a
> > > > future release.
> > > >
> > > > On Tue, Dec 15, 2009 at 9:13 AM, Weiwei Wang <ww...@gmail.com> wrote:
> > > >
> > > > > Yes, I found the patch file LUCENE-1488.patch, and there's no icu
> > > > > directory in my downloaded contrib directory.
> > > > >
> > > > > I'm new to using patch. I'm currently in the contrib dir; could
> > > > > anybody tell me how to run the patch command to generate the
> > > > > relevant directory and source files?
> > > > >
> > > > > On Tue, Dec 15, 2009 at 9:51 PM, Robert Muir <rc...@gmail.com> wrote:
> > > > >
> > > > > > Look at the latest patch file attached to the issue; it should
> > > > > > work with Lucene 2.9 or greater (I think).
> > > > > >
> > > > > > 2009/12/15 Weiwei Wang <ww...@gmail.com>
> > > > > >
> > > > > > > Where can I find the source code?
> > > > > > >
> > > > > > > On Tue, Dec 15, 2009 at 9:40 PM, Robert Muir <rc...@gmail.com> wrote:
> > > > > > >
> > > > > > > > There is an ICU transform token filter in the patch here:
> > > > > > > > http://issues.apache.org/jira/browse/LUCENE-1488
> > > > > > > >
> > > > > > > >    Transliterator pinyin = Transliterator.getInstance("Han-Latin");
> > > > > > > >    Tokenizer tokenizer = new KeywordTokenizer(new StringReader("中国"));
> > > > > > > >    ICUTransformFilter filter = new ICUTransformFilter(tokenizer, pinyin);
> > > > > > > >    assertTokenStreamContents(filter, new String[] { "zhōng guó" });
> > > > > > > >
> > > > > > > > Note that it will add tone marks and insert a space between
> > > > > > > > syllables by default. If you do not want this, you need to do
> > > > > > > > some cleanup:
> > > > > > > >
> > > > > > > >    Transliterator pinyin = Transliterator.getInstance(
> > > > > > > >        "Han-Latin; NFD; [[:NonspacingMark:][:Space:]] Remove");
> > > > > > > >    Tokenizer tokenizer = new KeywordTokenizer(new StringReader("中国"));
> > > > > > > >    ICUTransformFilter filter = new ICUTransformFilter(tokenizer, pinyin);
> > > > > > > >    assertTokenStreamContents(filter, new String[] { "zhongguo" });
>
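
For anyone who wants to see what the cleanup step in Robert's quoted message
does without pulling in ICU: the "NFD; [[:NonspacingMark:][:Space:]] Remove"
part can be approximated with the JDK's java.text.Normalizer. This is only a
sketch under that assumption. The class and method names are hypothetical, it
does not transliterate Han characters at all, and Java's \s covers fewer
whitespace characters than ICU's [:Space:].

```java
import java.text.Normalizer;

// Sketch only: folds toned, space-separated pinyin into plain ASCII pinyin.
class PinyinFold {
    static String fold(String pinyin) {
        // NFD splits a toned vowel into a base letter plus a combining mark.
        String decomposed = Normalizer.normalize(pinyin, Normalizer.Form.NFD);
        // Drop the combining marks (category Mn) and any whitespace.
        return decomposed.replaceAll("[\\p{Mn}\\s]+", "");
    }

    public static void main(String[] args) {
        // Input is "zhong guo" with tone marks, written with Unicode escapes.
        System.out.println(fold("zh\u014dng gu\u00f3")); // prints zhongguo
    }
}
```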




Re: How to do alias(Pinyin) search in Lucene

Posted by Robert Muir <rc...@gmail.com>.

Hi, just one more thought for you.

I think even more important than anything I said before, you should ensure
you implement reusableTokenStream in your analyzer.
this becomes a necessity if you are using expensive objects like this.
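
The idea behind reusableTokenStream is to build the expensive analysis chain
once per thread and only reset it for each new piece of text, instead of
constructing it again on every call. The sketch below is a plain-JDK analogue
of that pattern, not Lucene code: Pipeline, analyze and the build counter are
hypothetical names, and lowercasing stands in for the real (costly) work.

```java
import java.util.Locale;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch only: caches one expensive object per thread and resets it per call.
class ReusablePipelineSketch {
    // Counts how many times the expensive object was constructed.
    static final AtomicInteger BUILDS = new AtomicInteger();

    // Stand-in for a costly component such as a tokenizer plus transliterator.
    static final class Pipeline {
        private String input = "";

        Pipeline() {
            BUILDS.incrementAndGet(); // imagine slow setup work here
        }

        // Points the existing pipeline at new text instead of rebuilding it.
        void reset(String newInput) {
            input = newInput;
        }

        String run() {
            return input.toLowerCase(Locale.ROOT);
        }
    }

    // One pipeline per thread, created lazily on first use.
    private static final ThreadLocal<Pipeline> CACHED =
            ThreadLocal.withInitial(Pipeline::new);

    static String analyze(String text) {
        Pipeline p = CACHED.get(); // reused after the first call on this thread
        p.reset(text);
        return p.run();
    }

    public static void main(String[] args) {
        analyze("Zhongguo");
        analyze("Lucene");
        System.out.println("pipelines built: " + BUILDS.get());
    }
}
```

Per-thread caching keeps the reused object from being shared between threads,
which matters when the cached component is not thread-safe.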


-- 
Robert Muir
rcmuir@gmail.com

Re: How to do alias(Pinyin) search in Lucene

Posted by Robert Muir <rc...@gmail.com>.

Hello, this was mainly to show you a quick-and-dirty way to solve the
problem.

If you have a lot of text, here are some ways to optimize:
1. The 'cleanup' step I showed you is an extremely inefficient way to remove
the spaces and diacritics. For your case you can perhaps avoid the
normalization (NFD) by using something more efficient, such as
ASCIIFoldingFilter, to remove the diacritics.
2. If most of your text is not Chinese, you can put a filter in your
transform spec to ensure that it only operates on Chinese text. I think
something like this should work: instead of
Transliterator.getInstance("Han-Latin") it would be
Transliterator.getInstance("[:Ideographic:] Han-Latin"). Such a filter cannot
be determined automatically (see the comments in the code) because Han-Latin
is a composite transform: it calls Han-SpacedHan to add the spaces.
3. If you really care about speed, you can take the rules yourself from CLDR
and customize them to your needs (for example, remove the diacritics from the
rules and/or do not call Han-SpacedHan, which adds the spaces). This would
probably also fix #2 above automatically. The rules are based on this XML
file: http://unicode.org/repos/cldr/trunk/common/transforms/Han-Latin.xml,
but you need to preprocess it into plain text for it to work as a custom
transform. See here for more details:
http://userguide.icu-project.org/transforms/general
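
The cleanup in point 1 and the Chinese-only check in point 2 can also be
approximated outside ICU. Below is a minimal sketch that uses only the JDK,
assuming Java 8 or later. PinyinCleanup, stripTonesAndSpaces and containsHan
are names invented for this example and are not part of Lucene or ICU, and
the code has not been benchmarked against the ICU transform or
ASCIIFoldingFilter. Non-ASCII characters are written as Unicode escapes so
the file compiles regardless of source encoding.

```java
import java.text.Normalizer;

public class PinyinCleanup {
    /** Decompose to NFD, then drop nonspacing marks (category Mn) and whitespace. */
    static String stripTonesAndSpaces(String pinyin) {
        String decomposed = Normalizer.normalize(pinyin, Normalizer.Form.NFD);
        return decomposed.replaceAll("[\\p{Mn}\\s]+", "");
    }

    /** True if the text contains at least one code point in the Han script. */
    static boolean containsHan(String text) {
        return text.codePoints().anyMatch(
                cp -> Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN);
    }

    public static void main(String[] args) {
        // "zh\u014Dng gu\u00F3" is "zhong guo" with tone marks on both vowels.
        System.out.println(stripTonesAndSpaces("zh\u014Dng gu\u00F3")); // prints zhongguo
        // "\u4E2D\u56FD" is the two-character word from the original question.
        System.out.println(containsHan("\u4E2D\u56FD")); // prints true
        System.out.println(containsHan("lucene"));       // prints false
    }
}
```

A check like containsHan could be used to skip transliteration entirely for
text that has no Chinese characters, which is the same goal as the
[:Ideographic:] filter in point 2.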




-- 
Robert Muir
rcmuir@gmail.com

Re: How to do alias(Pinyin) search in Lucene

Posted by Weiwei Wang <ww...@gmail.com>.

I finally got it running; however, it is very slow.


-- 
Weiwei Wang
Alex Wang
王巍巍
Room 403, Mengmin Wei Building
Computer Science Department
Gulou Campus of Nanjing University
Nanjing, P.R.China, 210093

Homepage: http://cs.nju.edu.cn/rl/weiweiwang

Re: How to do alias(Pinyin) search in Lucene

Posted by Weiwei Wang <ww...@gmail.com>.

Got it, thanks, Robert.


-- 
Weiwei Wang
Alex Wang
王巍巍
Room 403, Mengmin Wei Building
Computer Science Department
Gulou Campus of Nanjing University
Nanjing, P.R.China, 210093

Homepage: http://cs.nju.edu.cn/rl/weiweiwang

Re: How to do alias(Pinyin) search in Lucene

Posted by Robert Muir <rc...@gmail.com>.

If you have the Lucene 2.9 or 3.0 source code, just run
patch -p0 < /path/to/LUCENE-XXYY.patch
from the root directory of the Lucene source tree. It should create the
necessary directories and files.
Then run 'ant'; in this case it should create a lucene-icu jar file in the
build directory.

The patch doesn't include the ICU dependency itself, so you also need to get
that jar file from www.icu-project.org and have it on your classpath.

Sorry for the trouble; I hope to integrate some of this soon for a future
release.


-- 
Robert Muir
rcmuir@gmail.com

Re: How to do alias(Pinyin) search in Lucene

Posted by Erick Erickson <er...@gmail.com>.

If you're using an IDE, there should be an "apply patch" command somewhere.
In Eclipse, right-click on the project >> Team >> Apply Patch.

In IntelliJ, it's something like Version Control >> (subversion???) >> Apply
Patch...

Or do as Robert suggests from the command line.

HTH
Erick


Re: How to do alias(Pinyin) search in Lucene

Posted by Weiwei Wang <ww...@gmail.com>.

Yes, I found the patch file LUCENE-1488.patch, but there's no icu directory
in my downloaded contrib directory.

I'm new to using patch. I'm currently in the contrib dir; could anybody tell
me how to run the patch command to generate the relevant directories and
source files?


-- 
Weiwei Wang
Alex Wang
王巍巍
Room 403, Mengmin Wei Building
Computer Science Department
Gulou Campus of Nanjing University
Nanjing, P.R.China, 210093

Homepage: http://cs.nju.edu.cn/rl/weiweiwang

Re: How to do alias(Pinyin) search in Lucene

Posted by Robert Muir <rc...@gmail.com>.

Look at the latest patch file attached to the issue; it should work with
Lucene 2.9 or greater (I think).

2009/12/15 Weiwei Wang <ww...@gmail.com>

> Where can I find the source code?
>
> On Tue, Dec 15, 2009 at 9:40 PM, Robert Muir <rc...@gmail.com> wrote:
>
> > There is an ICU transform token filter (ICUTransformFilter) in the patch here:
> > http://issues.apache.org/jira/browse/LUCENE-1488
> >
> >    Transliterator pinyin = Transliterator.getInstance("Han-Latin");
> >    Tokenizer tokenizer = new KeywordTokenizer(new StringReader("中国"));
> >    ICUTransformFilter filter = new ICUTransformFilter(tokenizer, pinyin);
> >    assertTokenStreamContents(filter, new String[] { "zhōng guó" } );
> >
> > Note that by default it adds tone marks and inserts a space between
> > syllables. If you do not want this, you need to do some cleanup, for
> > example by chaining removal rules onto the transliterator ID:
> >
> >    Transliterator pinyin = Transliterator.getInstance(
> >        "Han-Latin; NFD; [[:NonspacingMark:][:Space:]] Remove");
> >    Tokenizer tokenizer = new KeywordTokenizer(new StringReader("中国"));
> >    ICUTransformFilter filter = new ICUTransformFilter(tokenizer, pinyin);
> >    assertTokenStreamContents(filter, new String[] { "zhongguo" } );



-- 
Robert Muir
rcmuir@gmail.com

Re: How to do alias(Pinyin) search in Lucene

Posted by Weiwei Wang <ww...@gmail.com>.

Where can I find the source code?

On Tue, Dec 15, 2009 at 9:40 PM, Robert Muir <rc...@gmail.com> wrote:

> There is an ICU transform token filter (ICUTransformFilter) in the patch here:
> http://issues.apache.org/jira/browse/LUCENE-1488
>
>    Transliterator pinyin = Transliterator.getInstance("Han-Latin");
>    Tokenizer tokenizer = new KeywordTokenizer(new StringReader("中国"));
>    ICUTransformFilter filter = new ICUTransformFilter(tokenizer, pinyin);
>    assertTokenStreamContents(filter, new String[] { "zhōng guó" } );
>
> Note that by default it adds tone marks and inserts a space between
> syllables. If you do not want this, you need to do some cleanup, for
> example by chaining removal rules onto the transliterator ID:
>
>    Transliterator pinyin = Transliterator.getInstance(
>        "Han-Latin; NFD; [[:NonspacingMark:][:Space:]] Remove");
>    Tokenizer tokenizer = new KeywordTokenizer(new StringReader("中国"));
>    ICUTransformFilter filter = new ICUTransformFilter(tokenizer, pinyin);
>    assertTokenStreamContents(filter, new String[] { "zhongguo" } );



-- 
Weiwei Wang
Alex Wang
王巍巍
Room 403, Mengmin Wei Building
Computer Science Department
Gulou Campus of Nanjing University
Nanjing, P.R.China, 210093

Homepage: http://cs.nju.edu.cn/rl/weiweiwang

Re: How to do alias(Pinyin) search in Lucene

Posted by Robert Muir <rc...@gmail.com>.

There is an ICU transform token filter (ICUTransformFilter) in the patch here:
http://issues.apache.org/jira/browse/LUCENE-1488

    Transliterator pinyin = Transliterator.getInstance("Han-Latin");
    Tokenizer tokenizer = new KeywordTokenizer(new StringReader("中国"));
    ICUTransformFilter filter = new ICUTransformFilter(tokenizer, pinyin);
    assertTokenStreamContents(filter, new String[] { "zhōng guó" } );

Note that by default it adds tone marks and inserts a space between
syllables. If you do not want this, you need to do some cleanup, for
example by chaining removal rules onto the transliterator ID:

    Transliterator pinyin = Transliterator.getInstance(
        "Han-Latin; NFD; [[:NonspacingMark:][:Space:]] Remove");
    Tokenizer tokenizer = new KeywordTokenizer(new StringReader("中国"));
    ICUTransformFilter filter = new ICUTransformFilter(tokenizer, pinyin);
    assertTokenStreamContents(filter, new String[] { "zhongguo" } );
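
For anyone who wants to see what the cleanup part of that ID does without
applying the patch, here is a rough JDK-only sketch of the same idea:
decompose (NFD), then delete nonspacing marks and whitespace. It is not the
ICUTransformFilter and does not convert Han characters; it assumes the input
is already toned pinyin such as the "zhōng guó" output above. The class and
method names are made up for illustration.

```java
import java.text.Normalizer;

// Hypothetical helper (not part of Lucene or ICU): mimics the
// "NFD; [[:NonspacingMark:][:Space:]] Remove" cleanup using only the JDK.
final class PinyinCleanup {

    // Decompose accented letters, then delete combining marks and whitespace.
    static String stripTonesAndSpaces(String tonedPinyin) {
        // NFD splits a letter with a tone mark into base letter + combining mark.
        String decomposed = Normalizer.normalize(tonedPinyin, Normalizer.Form.NFD);
        // \p{Mn} matches nonspacing (combining) marks, \s matches whitespace.
        return decomposed.replaceAll("[\\p{Mn}\\s]", "");
    }

    public static void main(String[] args) {
        // The literal spells the toned pinyin with Unicode escapes
        // so that the source file stays ASCII-only.
        System.out.println(stripTonesAndSpaces("zh\u014Dng gu\u00F3")); // prints zhongguo
    }
}
```

One thing to keep in mind with this kind of cleanup: the diaeresis on ü is
also a nonspacing mark after NFD, so "lü" would come out as "lu". Whether
that is acceptable depends on how users are expected to type their queries.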


2009/12/15 Weiwei Wang <ww...@gmail.com>

> Hi, guys,
>     I'm implementing a search engine based on Lucene for Chinese. So I want
> to support pinyin search as Google China do.
>
> e.g.
>    “中国”  means Chinese in English
>    this word's pinyin input is "zhongguo"
> The feature i want to implement is when user type zhongguo the results will
> include documents containing "中国" or even Chinese
>
> Anybody here know how to achieve this?



-- 
Robert Muir
rcmuir@gmail.com