Posted to java-user@lucene.apache.org by weidong sun <lm...@gmail.com> on 2009/05/14 16:11:46 UTC

Question wrt Lucene analyzer for different language

Hello,

I am a newbie in the Lucene world, so I might ask some obvious questions
that I unfortunately don't know the answers to. Please help me 'grow'.

We have a project that intends to use the Lucene search engine to search
some user info stored in our system. The user info might not be in English,
even though it will be stored in UTF-8 encoding.

My question is: if I use one particular Lucene analyzer for a language other
than English (e.g. ChineseAnalyzer or ArabicAnalyzer), will it still be able
to handle the text correctly if the user info is mixed with English
characters/words?

I would really appreciate any answers.

:-)

Re: Question wrt Lucene analyzer for different language

Posted by weidong sun <lm...@gmail.com>.

For exactly the same reason, our assumption is that users' profile
information mixes only that particular language and English, so we'll have
to use the analyzer for that particular language for both indexing and
searching. That's why I asked this analyzer question. :-)

On Thu, May 14, 2009 at 11:27 AM, Uwe Schindler <uw...@thetaphi.de> wrote:

> There are two problems:
>
> a) Currently there is no such analyzer (I have this problem too; I would
> also like to autodetect the language of a text, as M$ Word does, and
> switch analyzers accordingly).
> b) Even if such an autodetecting analyzer existed, you would have a problem
> on the search side, because you should almost always use the same analyzer
> for indexing and searching. Search queries are normally very short, so
> autodetection is hardly possible. If somebody enters a query that the query
> parser runs through this auto-language analyzer, the detection may fail,
> and the wrongly stemmed tokens will hit no results.
>
> Uwe

RE: Question wrt Lucene analyzer for different language

Posted by Uwe Schindler <uw...@thetaphi.de>.

There are two problems:

a) Currently there is no such analyzer (I have this problem too; I would
also like to autodetect the language of a text, as M$ Word does, and
switch analyzers accordingly).
b) Even if such an autodetecting analyzer existed, you would have a problem
on the search side, because you should almost always use the same analyzer
for indexing and searching. Search queries are normally very short, so
autodetection is hardly possible. If somebody enters a query that the query
parser runs through this auto-language analyzer, the detection may fail,
and the wrongly stemmed tokens will hit no results.

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de

> -----Original Message-----
> From: weidong sun [mailto:lmcwesu@gmail.com]
> Sent: Thursday, May 14, 2009 5:19 PM
> To: java-user@lucene.apache.org
> Subject: Re: Question wrt Lucene analyzer for different language
> 
> Thanks for the surprisingly quick response. :-)
> 
> What I mean by "correctly" here is that the specific analyzer can tokenize
> text that mixes English and that specific language, for example
> "12345 ????" or "????Text???" (where '?' is a character of that specific
> language and "12345" and "Text" are English characters), so that "12345"
> and "Text" are treated as tokens and indexed as well.
> BTW, I don't see a need for stemming so far, since the information our
> project deals with is just users' profile info.
> 
> For ChineseAnalyzer in particular, can it do that?


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
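
Point (b) in the message above can be illustrated with a toy inverted index
in plain Java. This is only a minimal sketch and not Lucene code: the class
AnalyzerMismatch and the two "stemmers" TOY_ENGLISH and TOY_GERMAN are
hypothetical stand-ins that merely strip a suffix or two. The sketch indexes
one document with one analysis function, then looks the same word up once
with the same function and once with a different one.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.function.UnaryOperator;

class AnalyzerMismatch {
    // Toy stand-ins for language-specific stemmers (assumption: these are
    // NOT real stemming algorithms, just enough to produce different terms).
    static final UnaryOperator<String> TOY_ENGLISH = t -> {
        String s = t.toLowerCase(Locale.ROOT);
        if (s.endsWith("ing")) return s.substring(0, s.length() - 3);
        if (s.endsWith("s")) return s.substring(0, s.length() - 1);
        return s;
    };
    static final UnaryOperator<String> TOY_GERMAN = t -> {
        String s = t.toLowerCase(Locale.ROOT);
        if (s.endsWith("en")) return s.substring(0, s.length() - 2);
        return s;
    };

    // term -> ids of the documents that contain it
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Split on whitespace and store each analyzed token for this document.
    void index(int docId, String text, UnaryOperator<String> analyzer) {
        for (String raw : text.trim().split("\\s+")) {
            if (raw.isEmpty()) continue;
            postings.computeIfAbsent(analyzer.apply(raw), k -> new HashSet<>()).add(docId);
        }
    }

    // Analyze the query term, then look up the resulting term.
    Set<Integer> search(String term, UnaryOperator<String> analyzer) {
        return postings.getOrDefault(analyzer.apply(term), Collections.emptySet());
    }

    public static void main(String[] args) {
        AnalyzerMismatch idx = new AnalyzerMismatch();
        idx.index(1, "searching user profiles", TOY_ENGLISH);
        // Same analysis on both sides: the stored and queried terms agree.
        System.out.println(idx.search("searching", TOY_ENGLISH));
        // Different analysis at query time: the queried term was never stored.
        System.out.println(idx.search("searching", TOY_GERMAN));
    }
}
```

With the toy English function on both sides, "searching" is reduced to the
same term at index and query time and the document is found. With the toy
German function at query time the word is left unstemmed, so the lookup finds
nothing, which is the failure mode described for a misdetected query language.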

Re: Question wrt Lucene analyzer for different language

Posted by weidong sun <lm...@gmail.com>.

Thanks for the surprisingly quick response. :-)

What I mean by "correctly" here is that the specific analyzer can tokenize
text that mixes English and that specific language, for example
"12345 ????" or "????Text???" (where '?' is a character of that specific
language and "12345" and "Text" are English characters), so that "12345"
and "Text" are treated as tokens and indexed as well.
BTW, I don't see a need for stemming so far, since the information our
project deals with is just users' profile info.

For ChineseAnalyzer in particular, can it do that?

On Thu, May 14, 2009 at 10:37 AM, Erick Erickson <er...@gmail.com> wrote:

> No. What is "correctly"? Are you stemming? In that case, using the same
> analyzer on different languages will not work.
>
> This topic has been discussed frequently on the user list, so if you
> searched that archive (see:
> http://wiki.apache.org/lucene-java/MailingListArchives) you'd find a wealth
> of information quickly...
>
> Best
> Erick

Re: Question wrt Lucene analyzer for different language

Posted by Erick Erickson <er...@gmail.com>.

No. What is "correctly"? Are you stemming? In that case, using the same
analyzer on different languages will not work.

This topic has been discussed frequently on the user list, so if you
searched that archive (see:
http://wiki.apache.org/lucene-java/MailingListArchives) you'd find a wealth
of information quickly...

Best
Erick


RE: Question wrt Lucene analyzer for different language

Posted by Uwe Schindler <uw...@thetaphi.de>.

> Thanks for the quick answer. :-)
> 
> So can I say that ArabicAnalyzer can generally tokenize content that mixes
> Arabic and English? :-)
> 
> I am not really familiar with the Arabic language. What do you mean by
> "change Arabic tokens"? Does Arabic have something like upper/lower case,
> as English does?

For the Arabic analyzer this works, because you can easily detect the
"language" from the characters used. But then it stems only the Arabic part;
the English part is simply left untouched.

But an analyzer that automatically detects English, French, or German on
both the index and search side (see my previous email) is almost impossible
to create.

Uwe
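
Detecting the "language" from the characters used can be sketched in plain
Java with the standard Character.UnicodeScript class. This is only an
illustration of the idea and not how ArabicAnalyzer or any other Lucene
analyzer is implemented; the class name ScriptRuns and its tokenize method
are hypothetical. It splits text on non-alphanumeric characters and also
starts a new token whenever the script changes, so Latin-script words
embedded in Arabic or Chinese text come out as separate tokens.

```java
import java.lang.Character.UnicodeScript;
import java.util.ArrayList;
import java.util.List;

class ScriptRuns {
    // Splits text into runs of letters/digits, starting a new token whenever
    // the Unicode script changes (for example from HAN to LATIN).
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        UnicodeScript currentScript = null;
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (!Character.isLetterOrDigit(cp)) {
                // Whitespace and punctuation always end the current token.
                flush(tokens, current);
                currentScript = null;
                continue;
            }
            UnicodeScript script = UnicodeScript.of(cp);
            // ASCII digits belong to the COMMON script; let them join any run.
            boolean neutral = script == UnicodeScript.COMMON
                    || script == UnicodeScript.INHERITED;
            if (!neutral) {
                if (currentScript != null && script != currentScript) {
                    flush(tokens, current);
                }
                currentScript = script;
            }
            current.appendCodePoint(cp);
        }
        flush(tokens, current);
        return tokens;
    }

    private static void flush(List<String> tokens, StringBuilder current) {
        if (current.length() > 0) {
            tokens.add(current.toString());
            current.setLength(0);
        }
    }

    public static void main(String[] args) {
        // "12345" followed by a space and an Arabic word
        System.out.println(tokenize("12345 \u0645\u0631\u062D\u0628\u0627"));
        // Two Chinese characters, the Latin word "Text", two Chinese characters
        System.out.println(tokenize("\u4E2D\u6587Text\u4E2D\u6587"));
    }
}
```

This works for Arabic or Chinese mixed with English because the scripts
differ. English, French, and German all use the Latin script, so a
script-based check like this one cannot tell them apart.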



Re: Question wrt Lucene analyzer for different language

Posted by Robert Muir <rc...@gmail.com>.

I would say yes, in general.

When I say 'change Arabic text', I mean the Arabic analyzer will standardize
and stem Arabic words, but it won't modify any of your English words.

And no, there is no case in Arabic. This is why, if you are handling mixed
Arabic/English text, I recommend creating a custom analyzer that also does
some basic processing of the English part, such as applying LowerCaseFilter.

On Thu, May 14, 2009 at 11:26 AM, weidong sun <lm...@gmail.com> wrote:

> Thanks for the quick answer. :-)
>
> So can I say that ArabicAnalyzer can generally tokenize content that mixes
> Arabic and English? :-)
>
> I am not really familiar with the Arabic language. What do you mean by
> "change Arabic tokens"? Does Arabic have something like upper/lower case,
> as English does?



-- 
Robert Muir
rcmuir@gmail.com

Re: Question wrt Lucene analyzer for different language

Posted by weidong sun <lm...@gmail.com>.

Thanks for the quick answer. :-)

So can I say that ArabicAnalyzer can generally tokenize content that mixes
Arabic and English? :-)

I am not really familiar with the Arabic language. What do you mean by
"change Arabic tokens"? Does Arabic have something like upper/lower case,
as English does?


On Thu, May 14, 2009 at 10:47 AM, Robert Muir <rc...@gmail.com> wrote:

> In the case of ArabicAnalyzer, it will only change Arabic tokens and will
> leave English words as-is (it will not convert them to lowercase or
> anything like that).
>
> So if you want good behavior for both Arabic and English, you would want to
> create a custom analyzer that looks like ArabicAnalyzer but also invokes
> LowerCaseFilter, perhaps also an English stemmer, and so on.

Re: Question wrt Lucene analyzer for different language

Posted by Robert Muir <rc...@gmail.com>.

In the case of ArabicAnalyzer, it will only change Arabic tokens and will
leave English words as-is (it will not convert them to lowercase or anything
like that).

So if you want good behavior for both Arabic and English, you would want to
create a custom analyzer that looks like ArabicAnalyzer but also invokes
LowerCaseFilter, perhaps also an English stemmer, and so on.




-- 
Robert Muir
rcmuir@gmail.com
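
The effect of adding a lowercasing step for the English part can be sketched
in plain Java. This is a hedged illustration, not a Lucene analyzer: the
class MixedTextNormalizer and its analyze method are hypothetical, and the
whitespace split stands in for a real tokenizer. In Lucene itself the
equivalent step would be a LowerCaseFilter in the custom analyzer's filter
chain, as suggested in the message above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class MixedTextNormalizer {
    // Splits on whitespace and lowercases every token. Arabic letters have no
    // upper/lower case, so only the Latin-script tokens actually change.
    static List<String> analyze(String text) {
        List<String> out = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (token.isEmpty()) continue;
            out.add(token.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    public static void main(String[] args) {
        // "Lucene", an Arabic word, and "TEXT" in one string
        System.out.println(analyze("Lucene \u0645\u0631\u062D\u0628\u0627 TEXT"));
    }
}
```

Applying the same step at query time means a search for "text" matches
content indexed as "TEXT" or "Text", while the Arabic tokens pass through
unchanged.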