Posted to java-user@lucene.apache.org by Paco Avila <mo...@gmail.com> on 2012/06/27 12:19:52 UTC

Question about chinese and WildcardQuery

Hi there,

I have to index Chinese content and I don't get the expected results when
searching. It seems that WildcardQuery does not work properly with Chinese
characters. See the attached sample code.

I store the string "专项信息管理.doc" using the StandardAnalyzer and then
search for "专项信*", but no result is returned. AFAIK it should match the
string "专项信息管理.doc", but it doesn't :(

NOTE: Using Lucene 3.1.0

Regards.
-- 
http://www.openkm.com
http://www.guia-ubuntu.org

Re: Question about chinese and WildcardQuery

Posted by Paco Avila <mo...@gmail.com>.
How do I enable this? I didn't know this needed to be enabled anywhere.

2012/6/27 齐保元 <qi...@126.com>

> enable prefixquery feature


-- 
OpenKM
http://www.openkm.com
http://www.guia-ubuntu.org

Re: Question about chinese and WildcardQuery

Posted by 齐保元 <qi...@126.com>.

Maybe you did not enable the prefix-query feature.


Re: Question about chinese and WildcardQuery

Posted by Paco Avila <pa...@openkm.com>.
Thanks for the info.

2012/6/28 Li Li <fa...@gmail.com>

> In Chinese there are no word boundaries between words; text is written like
> "Iamok", and you have to tokenize it into "I am ok".
> If you want searches like *amo* to work, you have to treat "I am ok" as one
> token. In Chinese, fuzzy search is not very useful. Even with
> StandardAnalyzer it's fine to use a BooleanQuery: "Iamok" is tokenized as
> "I a m o k", so the boolean query +a +m +o matches. Chinese has many
> characters (more than 3000 in common use) and words are very short (most
> have only 2 characters).


-- 
OpenKM
http://www.openkm.com
http://facebook.com/OpenKM.DMS

Re: Question about chinese and WildcardQuery

Posted by Li Li <fa...@gmail.com>.
In Chinese there are no word boundaries between words; text is written like
"Iamok", and you have to tokenize it into "I am ok".
If you want searches like *amo* to work, you have to treat "I am ok" as one
token. In Chinese, fuzzy search is not very useful. Even with
StandardAnalyzer it's fine to use a BooleanQuery: "Iamok" is tokenized as
"I a m o k", so the boolean query +a +m +o matches. Chinese has many
characters (more than 3000 in common use) and words are very short (most
have only 2 characters).
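The per-character workaround described above can be sketched in plain Java. This is a simulation of the tokenization and matching only, not Lucene's actual API; the class and method names are illustrative:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Sketch: with per-character tokens, a BooleanQuery requiring every
// character of the search string (+专 +项 +信) still finds the document,
// even though a multi-character wildcard prefix cannot.
public class BooleanCharQuerySketch {
    // Does the document's token set contain every character of the query?
    public static boolean matchesAllChars(Set<String> docTokens, String query) {
        return query.codePoints()
                .mapToObj(cp -> new String(Character.toChars(cp)))
                .allMatch(docTokens::contains);
    }

    public static void main(String[] args) {
        // Per-character tokens, as StandardAnalyzer produces for CJK text.
        Set<String> doc = new HashSet<>(Arrays.asList("专", "项", "信", "息", "管", "理"));
        System.out.println(matchesAllChars(doc, "专项信")); // true: +专 +项 +信 all present
        System.out.println(matchesAllChars(doc, "专项错")); // false: 错 is not in the document
    }
}
```

Note that this simplification ignores term positions; a real BooleanQuery of required TermQuery clauses also ignores order, which is why it can return false positives compared to a phrase match.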


On Thu, Jun 28, 2012 at 2:31 PM, Paco Avila <mo...@gmail.com> wrote:
> Thanks, using WhitespaceAnalyzer works, but I don't understand why
> StandardAnalyzer does not work, given that the ChineseAnalyzer deprecation
> note says I should use StandardAnalyzer:
>
> @deprecated Use {@link StandardAnalyzer} instead, which has the same
> functionality.
>
> It's very annoying.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Question about chinese and WildcardQuery

Posted by wangjing <pp...@gmail.com>.
It's best to keep the Analyzer used for searching consistent with the Analyzer used to build the index.
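The point about using the same analyzer on both sides can be sketched in plain Java. This is an illustrative simulation, not Lucene's API; the simplified "term match" here checks only token membership, not positions:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Sketch: tokens only match if the query is analyzed the same way as the
// indexed text was.
public class AnalyzerConsistencySketch {
    // One token per character, as StandardAnalyzer does for CJK text.
    static final Function<String, List<String>> PER_CHAR = s -> {
        List<String> out = new ArrayList<>();
        s.codePoints().forEach(cp -> out.add(new String(Character.toChars(cp))));
        return out;
    };
    // The whole string as a single token, as WhitespaceAnalyzer does for
    // text without whitespace.
    static final Function<String, List<String>> WHOLE = s -> List.of(s);

    // Simplified term match: every analyzed query token must be an indexed token.
    public static boolean termMatch(Function<String, List<String>> indexAnalyzer,
                                    Function<String, List<String>> queryAnalyzer,
                                    String text, String query) {
        return indexAnalyzer.apply(text).containsAll(queryAnalyzer.apply(query));
    }

    public static void main(String[] args) {
        // Mismatched analyzers: indexed per character, queried as one whole term.
        System.out.println(termMatch(PER_CHAR, WHOLE, "专项信息", "专项"));    // false
        // Same analyzer on both sides: the query finds its terms.
        System.out.println(termMatch(PER_CHAR, PER_CHAR, "专项信息", "专项")); // true
    }
}
```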

On Thu, Jun 28, 2012 at 2:31 PM, Paco Avila <mo...@gmail.com> wrote:
> Thanks, using WhitespaceAnalyzer works, but I don't understand why
> StandardAnalyzer does not work, given that the ChineseAnalyzer deprecation
> note says I should use StandardAnalyzer:
>
> @deprecated Use {@link StandardAnalyzer} instead, which has the same
> functionality.
>
> It's very annoying.



Re: Question about chinese and WildcardQuery

Posted by Paco Avila <mo...@gmail.com>.
Thanks, using WhitespaceAnalyzer works, but I don't understand why
StandardAnalyzer does not work, given that the ChineseAnalyzer deprecation
note says I should use StandardAnalyzer:

@deprecated Use {@link StandardAnalyzer} instead, which has the same
functionality.

It's very annoying.
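Why WhitespaceAnalyzer fixes the wildcard search can be sketched in plain Java. This is a simulation, not Lucene's API: WhitespaceAnalyzer keeps "专项信息管理.doc" as a single indexed term, so a prefix wildcard can match it.

```java
import java.util.Arrays;
import java.util.List;

// Sketch: whitespace tokenization keeps the whole filename as one term.
public class WhitespaceWildcardSketch {
    // WhitespaceAnalyzer splits only on whitespace.
    public static List<String> whitespaceTokens(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    // A wildcard query like "专项信*" matches any indexed term with that prefix.
    public static boolean wildcardPrefixMatches(List<String> terms, String prefix) {
        return terms.stream().anyMatch(t -> t.startsWith(prefix));
    }

    public static void main(String[] args) {
        List<String> terms = whitespaceTokens("专项信息管理.doc");
        System.out.println(terms);                                 // [专项信息管理.doc]
        System.out.println(wildcardPrefixMatches(terms, "专项信")); // true: one term carries the prefix
    }
}
```

The trade-off, as noted elsewhere in the thread, is that a single untokenized term gives up normal word-level search over the same field.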

2012/6/27 Li Li <fa...@gmail.com>

> StandardAnalyzer segments each character into its own token; you should use
> WhitespaceAnalyzer, or your own analyzer that keeps the whole string as one
> token, if you want wildcard search to work.



-- 
OpenKM
http://www.openkm.com
http://www.guia-ubuntu.org

Re: Question about chinese and WildcardQuery

Posted by Li Li <fa...@gmail.com>.
StandardAnalyzer segments each character into its own token; you should use
WhitespaceAnalyzer, or your own analyzer that keeps the whole string as one
token, if you want wildcard search to work.
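What per-character segmentation means for WildcardQuery can be sketched in plain Java. This is a simulation of the behavior, not Lucene's API; for simplicity it tokenizes only the Chinese part of the filename (StandardAnalyzer would also index "doc" as a separate term):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: StandardAnalyzer in Lucene 3.x emits one token per CJK character.
public class CjkTokenizerSketch {
    // Split a string into single-character tokens.
    public static List<String> perCharTokens(String text) {
        List<String> tokens = new ArrayList<>();
        text.codePoints().forEach(cp -> tokens.add(new String(Character.toChars(cp))));
        return tokens;
    }

    // A WildcardQuery matches whole indexed terms, so a multi-character
    // prefix can only match if some single term starts with it.
    public static boolean wildcardPrefixMatches(List<String> indexTerms, String prefix) {
        return indexTerms.stream().anyMatch(t -> t.startsWith(prefix));
    }

    public static void main(String[] args) {
        List<String> terms = perCharTokens("专项信息管理");
        System.out.println(terms);                               // [专, 项, 信, 息, 管, 理]
        System.out.println(wildcardPrefixMatches(terms, "专项信")); // false: every term is one character
        System.out.println(wildcardPrefixMatches(terms, "专"));    // true
    }
}
```

This is why "专项信*" finds nothing against a StandardAnalyzer-indexed field: no indexed term is longer than one character, so no term can carry the three-character prefix.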