Posted to java-user@lucene.apache.org by Hacking Bear <ha...@gmail.com> on 2005/09/06 05:15:15 UTC
Multi-lang analyzer? Re: Multiple Language Indexing and Searching
Hi,
I have a similar problem to deal with. In fact, a lot of the time, the
documents do not carry any language information, or they may contain text in
multiple languages. Further, the user would not always like to supply this
information. Also, the user may very well be interested in documents in
multiple languages.
I think Google and other search engines allow indexing multi-language
documents. For example, if you google "Java", there will be many matching
documents in languages other than English.
The only assumption we can make is that the document text is converted to
Unicode before being fed to Lucene.
So I think the solution should be: (1) create one index for all languages;
(2) add an advisory attribute like "lang" to specify the language of the
document; if the language is unknown, just leave it empty or set it to "ANY";
(3) based on the Unicode script of the upcoming characters, automatically
switch among different analyzers to index the fragments of the
text; (4) during search, unless the user explicitly requests documents in a
certain language, return all matches regardless of language.
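As a rough illustration of step (3), here is a sketch (not from any Lucene code; class and method names are invented) that segments text into runs by Unicode script using the standard java.lang.Character.UnicodeScript API. An indexer could then hand each run to a script-appropriate analyzer:

```java
import java.util.ArrayList;
import java.util.List;

public class ScriptSegmenter {
    // A (start, end, script) run of consecutive characters sharing one script.
    public record Run(int start, int end, Character.UnicodeScript script) {}

    public static List<Run> segment(String text) {
        List<Run> runs = new ArrayList<>();
        int runStart = 0;
        Character.UnicodeScript current = null;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            Character.UnicodeScript s = Character.UnicodeScript.of(cp);
            // Treat COMMON (spaces, digits, punctuation) as part of the current run.
            if (s != Character.UnicodeScript.COMMON) {
                if (current != null && s != current) {
                    runs.add(new Run(runStart, i, current));
                    runStart = i;
                }
                current = s;
            }
            i += Character.charCount(cp);
        }
        if (current != null) runs.add(new Run(runStart, text.length(), current));
        return runs;
    }
}
```

For example, segment("Hello 世界 world") yields a LATIN run, a HAN run, and another LATIN run. A real analyzer-switching tokenizer would need more care (COMMON punctuation at run boundaries, mixed-script words, etc.), but the script lookup itself is this simple.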
I have browsed through the Lucene and contributed source code, but I
cannot tell which analyzer is suitable for use in (3). While the logic for
such an analyzer is probably not too complicated, it seems to demand quite
some Unicode knowledge to create one.
Is my approach the right one? Is there an analyzer suitable to use?
Thanks.
- HB
On 9/5/05, Olivier Jaquemet <ol...@jalios.com> wrote:
>
> Hi,
>
> I'd like to go into detail regarding issues that occur when you want to
> index and search content in multiple languages.
>
> I have read the Lucene in Action book and many threads on this mailing list,
> the most interesting so far being this one:
>
> http://mail-archives.apache.org/mod_mbox/lucene-java-user/200505.mbox/%3c19ADCC0B9D4CAD4582BB9900BBCE35740194503C@tayexc13.americas.cpqcorp.net%3e
>
> The solution chosen/recommended by Doug Cutting in this message:
>
> http://mail-archives.apache.org/mod_mbox/lucene-java-user/200506.mbox/%3c42A0841C.2090202@apache.org%3e
> is number '2/':
> having one index for all languages, one Document per content's language,
> with a field specifying its language, and using a query filter when
> searching.
>
> While I think it is a good solution:
> - If you have N languages and you search for something in 1 language,
> you are going to search an index N times too large.
> Wouldn't it be better to have N indices for N languages? That way, each
> index could benefit from its specialized analyzer, and if you need to
> search in multiple languages, you just need to merge the results of those
> different analyzers.
> - If you have content in multiple languages like we do (and by that I
> don't mean multiple contents each having its own language, but
> multiple contents, each being in many languages), you are going to
> have an N to 1 Document/content relation in the index.
> As far as update, delete, and search in multiple languages are concerned,
> wouldn't it be simpler to always keep a 1 to 1 Document/content relation
> in an index?
>
> As you may have guessed, my original thought, even before I read those
> threads, was that solution number 3 might be more flexible/modular
> than the others; of course it also has its drawbacks:
> - performance issues when doing a multiple-language search, especially when
> merging results from different indexes.
> - more complex to code
> - others?
>
> Can you clarify on this?
> What solutions have all of you chosen until now regarding indexing and
> searching of multiple contents in multiple languages?
>
> Thanks!
>
> Olivier
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
Re: Multiple Language Indexing and Searching
Posted by Chris Hostetter <ho...@fucit.org>.
: I don't know if the developers of Lucene would agree, but from what
: I've been browsing in the ML archives, those multiple-language issues
: seem to arise quite often on the mailing list, and maybe some articles
: like "best practices", "do's and don'ts" or "Lucene Architecture in a
: multiple-language environment" might be really nice to see :) If some
: of you have the time and the experience to write them I'll be really
: thankful! :)
There is already a document in the HowTo section of the wiki called
"IndexingOtherLanguages" if the topic of multiple languages interests
you, and if you've already familiarized yourself with some of the
suggestions from this thread (and other threads you've researched from the
archives), then perhaps you could compile a list in the wiki yourself?
Don't worry about making a master document that describes a single perfect
solution -- or even describing a solution that you know will work; if
nothing else, just compiling a collection of ideas people have suggested
into a single document will be helpful. Over time, other people can/will
revise it based on which ideas work well and which don't.
You don't have to have time and experience to write documentation -- time
is good enough to start with, experience can come along later :)
http://wiki.apache.org/jakarta-lucene/HowTo
http://wiki.apache.org/jakarta-lucene/IndexingOtherLanguages
-Hoss
Re: Multiple Language Indexing and Searching
Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Sep 6, 2005, at 7:15 AM, Hacking Bear wrote:
> On 9/6/05, Olivier Jaquemet <ol...@jalios.com> wrote:
>
>>
>> As far as your usage is concerned, it seems to be the right approach,
>> and I think the StandardAnalyzer does the job pretty well when it has
>> to deal with whatever language you want.
>>
>
> I should look into exactly what it does. Does this StandardAnalyzer
> handle non-European languages like Chinese?
StandardAnalyzer recognizes the CJK range of characters and emits
each of them as an individual token. This is not an ideal way to deal
with Chinese (不好 :) - but it does at least preserve the characters
rather than throwing them away or blurring consecutive characters
together.
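The behavior described above can be mimicked in plain Java (a toy sketch, not the actual StandardAnalyzer code): each Han character becomes its own token, while other text splits on whitespace.

```java
import java.util.ArrayList;
import java.util.List;

public class CjkOneCharTokens {
    // Emits whitespace-separated words for non-CJK text, but one token per
    // Han character, mimicking the one-character-per-token behavior.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (Character.isWhitespace(cp)) {
                if (word.length() > 0) { tokens.add(word.toString()); word.setLength(0); }
            } else if (Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN) {
                // Flush any pending word, then emit this character alone.
                if (word.length() > 0) { tokens.add(word.toString()); word.setLength(0); }
                tokens.add(new String(Character.toChars(cp)));
            } else {
                word.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        if (word.length() > 0) tokens.add(word.toString());
        return tokens;
    }
}
```

So tokenize("Lucene 不好 rocks") produces ["Lucene", "不", "好", "rocks"] -- useful for recall, but it loses Chinese word boundaries, which is why character-pair (bigram) or dictionary-based segmentation is usually preferred.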
Erik
Re: Multiple Language Indexing and Searching
Posted by Hacking Bear <ha...@gmail.com>.
On 9/6/05, Olivier Jaquemet <ol...@jalios.com> wrote:
>
> As far as your usage is concerned, it seems to be the right approach,
> and I think the StandardAnalyzer does the job pretty well when it has
> to deal with whatever language you want.
I should look into exactly what it does. Does this StandardAnalyzer handle
non-European languages like Chinese?
> Though, note that it won't deal with all languages' stop words but the
> English ones, unless specified at index time. But then, if you change the
> stop words at index time, what should you use at query time? Some queries
> won't work well.
I think we can easily create our own combined stop-word list by copying from
whatever other languages' stop-word lists we can find.
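A minimal sketch of that idea (the word lists here are tiny invented samples, not real analyzer stop lists):

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class MergedStopWords {
    // Illustrative samples only; a real list would come from each
    // language analyzer's published stop-word file.
    static final List<String> ENGLISH = List.of("the", "a", "of", "and");
    static final List<String> FRENCH  = List.of("le", "la", "de", "et");
    static final List<String> GERMAN  = List.of("der", "die", "und");

    // Union of all per-language stop-word lists into one set.
    public static Set<String> merged() {
        Set<String> all = new HashSet<>();
        all.addAll(ENGLISH);
        all.addAll(FRENCH);
        all.addAll(GERMAN);
        return all;
    }
}
```

One caveat with such a merged list: a stop word in one language can be a content word in another (German "die" is an article, but English "die" is a verb), so the union removes more than any single-language list would.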
> But as far as I am concerned, each content (content in the sense of a
> CMS) is known to have multiple languages, and each of these languages
> *can* be indexed separately with no problem at all, and therefore a
> dedicated analyzer could be used. So I was wondering whether my approach
> could be the right one or if it was overly complex and could introduce
> some problem I could not see... (My approach being: one index per
> language)
My suggestion would be to create one index for all languages, with each
document having a 'lang' attribute. Lucene is quite scalable, right? So this
should not be an issue.
During search, you can either default to turning the 'lang' attribute
condition on, or default to off, depending on what your users want most
often. Either way, it will be very easy to search documents in multiple
languages.
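A sketch of that search-time filtering (plain Java with hypothetical types, just to show the logic; in Lucene itself this would be a query filter on the 'lang' field). An empty 'lang' plays the role of the unknown/"ANY" case and always matches:

```java
import java.util.List;
import java.util.stream.Collectors;

public class LangFilterSketch {
    public record Doc(String id, String lang) {}

    // If the user asked for a language, keep only matching docs (plus docs
    // whose language is unknown); otherwise return everything.
    public static List<Doc> filter(List<Doc> hits, String wantedLang) {
        if (wantedLang == null) return hits;
        return hits.stream()
                .filter(d -> d.lang().isEmpty() || d.lang().equals(wantedLang))
                .collect(Collectors.toList());
    }
}
```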
> I don't know if the developers of Lucene would agree, but from what
> I've been browsing in the ML archives, those multiple-language issues
> seem to arise quite often on the mailing list, and maybe some articles
> like "best practices", "do's and don'ts" or "Lucene Architecture in a
> multiple-language environment" might be really nice to see :) If some
> of you have the time and the experience to write them I'll be really
> thankful! :)
What keywords do you use to search? Somehow, I cannot find any discussion
about multiple languages in the ML archive. I even tried Google! :-) Or maybe
I was giving the keywords in the wrong language? :-)
Re: Multiple Language Indexing and Searching
Posted by Olivier Jaquemet <ol...@jalios.com>.
As far as your usage is concerned, it seems to be the right approach,
and I think the StandardAnalyzer does the job pretty well when it has
to deal with whatever language you want.
Though, note that it won't deal with all languages' stop words but the
English ones, unless specified at index time. But then, if you change the
stop words at index time, what should you use at query time? Some queries
won't work well.
But as far as I am concerned, each content (content in the sense of a
CMS) is known to have multiple languages, and each of these languages
*can* be indexed separately with no problem at all, and therefore a
dedicated analyzer could be used. So I was wondering whether my approach
could be the right one or if it was overly complex and could introduce
some problem I could not see... (My approach being: one index per language)
Advantages are:
- You always have the same analyzer for one index, so if you want to benefit
from some indexing capabilities in one language (stemmer, filter..
whatever), you can!
- Should you need to search in all the languages, you just need to run the
query on every single index, and you still benefit from each analyzer.
Drawbacks are:
- You have to deal with as many indices as you have languages, but then
again, if you do a search in only one language, it becomes a
performance advantage, I think.
- You have to merge results from different indexes; this is a problem
when dealing with scores. Any suggestions?
- Unless I'm wrong, you cannot use a MultiSearcher, because only one
analyzer can be specified, not one analyzer per searcher (someone
please correct me if I'm wrong..)
- others ??
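On the score-merging drawback above: one naive approach (sketched below as plain Java with invented types; not a recommendation, since Lucene scores are generally not comparable across indexes) is to normalize each result list by its own top score before interleaving:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class MergeByNormalizedScore {
    public record Hit(String docId, float score) {}

    // Scale each list so its best hit maps to 1.0, then sort the union by
    // the rescaled score. Crude: it assumes the top hit of every index is
    // equally relevant, which is often false.
    public static List<Hit> merge(List<Hit> a, List<Hit> b) {
        List<Hit> out = new ArrayList<>();
        for (List<Hit> hits : List.of(a, b)) {
            float max = hits.stream().map(Hit::score).max(Float::compare).orElse(1f);
            for (Hit h : hits) out.add(new Hit(h.docId(), h.score() / max));
        }
        out.sort(Comparator.comparing(Hit::score).reversed());
        return out;
    }
}
```

More principled options (normalizing by query weight, or re-ranking with a shared scoring model) exist, but even this sketch shows why merging is awkward: the absolute scores carry no cross-index meaning.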
I don't know if the developers of Lucene would agree, but from what
I've been browsing in the ML archives, those multiple-language issues
seem to arise quite often on the mailing list, and maybe some articles
like "best practices", "do's and don'ts" or "Lucene Architecture in a
multiple-language environment" might be really nice to see :) If some
of you have the time and the experience to write them I'll be really
thankful! :)
Olivier
Hacking Bear wrote:
> [...]
--
Olivier Jaquemet <ol...@jalios.com>
R&D Engineer, Jalios S.A.
Tel: 01.39.23.92.83
http://www.jalios.com/
http://support.jalios.com/