Posted to java-user@lucene.apache.org by KK <di...@gmail.com> on 2009/05/23 08:23:13 UTC

Which analyzer to use for non-english unicoded text?

Hi All,
I've been trying to index some non-English text [Indian languages] in Unicode
UTF-8. For these languages we don't have any stemmers, tokenizers, etc.
To keep searching simple, I'd like to be able to do exact word
searches/matches as a first step. I'd like to know which is the
simplest working analyzer to use for both indexing and
searching [the Lucene wiki says both should be the same, else you might not
get search results, right?]

Many people must have indexed non-English text for which there
are no standard analyzers. I'd appreciate their ideas on this. Along
with this, I would also like to do hit highlighting irrespective of language.
Ideas on this will be equally helpful.

Is SimpleAnalyzer good enough for indexing and searching?
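
One thing worth checking before relying on a letter-based analyzer: as far as I know, SimpleAnalyzer breaks text at every character that is not a letter, and Java's Character.isLetter returns false for the combining vowel signs and virama used in Indic scripts. The sketch below is plain Java, not Lucene code. The class and method names are made up for illustration, and it only approximates that splitting rule to compare it with plain whitespace splitting.

```java
import java.util.ArrayList;
import java.util.List;

class LetterSplitDemo {
    // Hypothetical helper: ends a token at every code point that
    // Character.isLetter rejects (an approximation of letter-based splitting).
    static List<String> splitOnNonLetters(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (Character.isLetter(cp)) {
                current.appendCodePoint(cp);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
            i += Character.charCount(cp);
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    // Hypothetical helper: splits on runs of whitespace only, so combining
    // marks stay attached to their base consonants.
    static List<String> splitOnWhitespace(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Two Devanagari words ("hindi" and "bhasha"), written as escapes.
        String hindi = "\u0939\u093F\u0928\u094D\u0926\u0940 \u092D\u093E\u0937\u093E";
        // The vowel signs and virama are not letters, so each word falls
        // apart into bare consonants: 5 single-consonant fragments.
        System.out.println(splitOnNonLetters(hindi).size());
        // Whitespace splitting keeps the 2 words whole.
        System.out.println(splitOnWhitespace(hindi).size());
    }
}
```

If this matches what the real analyzer does with your text, a whitespace-based analyzer may be the safer starting point for exact word matches, at the cost of leaving punctuation attached to words.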

Thanks,
KK

Re: Which analyzer to use for non-english unicoded text?

Posted by Erick Erickson <er...@gmail.com>.

I don't think there's anything you can use out of the box, but if you
search the mailing list (see the searchable archives) for a thread
titled "Hebrew and Hindi analyzers" you might find something
useful.

Not much help I know, but perhaps a place to start.

And yes, you should use the same analyzer for indexing and
searching if at all possible. The reason is that the job of an
analyzer is to break the incoming stream up into meaningful
units (usually words). You wouldn't want the analyzer used
for indexing to, say, remove stopwords and then search with a different
analyzer that did NOT remove stopwords (or lowercase, or stem, or...).
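
To make the mismatch concrete, here is a toy sketch in plain Java. It does not use the Lucene API: the class, the two "analyzers", the stopword list and the tiny in-memory index are all invented for illustration. It indexes one title with an analyzer that lowercases and drops stopwords, then runs the same query through each analyzer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class AnalyzerMismatchDemo {
    // Illustrative stopword list, not the one Lucene ships.
    static final Set<String> STOPWORDS =
            new HashSet<>(Arrays.asList("the", "a", "of"));

    // Toy analyzer A: whitespace split, lowercase, drop stopwords.
    static List<String> analyzeWithStopwords(String text) {
        List<String> terms = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            String lower = t.toLowerCase(Locale.ROOT);
            if (!lower.isEmpty() && !STOPWORDS.contains(lower)) {
                terms.add(lower);
            }
        }
        return terms;
    }

    // Toy analyzer B: whitespace split only, case and stopwords kept.
    static List<String> analyzeWhitespaceOnly(String text) {
        List<String> terms = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }

    // Builds term -> document ids, always using analyzer A.
    static Map<String, Set<Integer>> buildIndex(List<String> docs) {
        Map<String, Set<Integer>> index = new HashMap<>();
        for (int id = 0; id < docs.size(); id++) {
            for (String term : analyzeWithStopwords(docs.get(id))) {
                index.computeIfAbsent(term, k -> new TreeSet<>()).add(id);
            }
        }
        return index;
    }

    // Returns the documents that contain every query term (AND semantics).
    static Set<Integer> search(Map<String, Set<Integer>> index,
                               List<String> queryTerms) {
        Set<Integer> hits = null;
        for (String term : queryTerms) {
            Set<Integer> postings =
                    index.getOrDefault(term, Collections.emptySet());
            if (hits == null) {
                hits = new TreeSet<>(postings);
            } else {
                hits.retainAll(postings);
            }
        }
        return hits == null ? new TreeSet<>() : hits;
    }

    public static void main(String[] args) {
        Map<String, Set<Integer>> index =
                buildIndex(Arrays.asList("The Art of Search"));
        // Same analyzer on both sides: the query becomes [art] and matches.
        System.out.println(search(index, analyzeWithStopwords("The Art")));
        // Different analyzer at query time: [The, Art] matches nothing,
        // because the index only holds "art" and "search".
        System.out.println(search(index, analyzeWhitespaceOnly("The Art")));
    }
}
```

The second search comes back empty even though the document plainly contains the words, which is the kind of silent miss you get when the two sides disagree.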

And certainly many people have indexed and searched non-English
documents, and many have contributed the resulting
Analyzers back to the Lucene community. If you find that you have to
write your own, please consider contributing it.

HTH
Erick
