Posted to java-user@lucene.apache.org by Hacking Bear <ha...@gmail.com> on 2005/09/06 05:15:15 UTC

Multi-lang analyzer? Re: Multiple Language Indexing and Searching

Hi,
 I have a similar problem to deal with. In fact, a lot of the time, the 
documents do not have any language information, or they may contain text in 
multiple languages. Further, the user would not like to always supply this 
information. Also, the user may very well be interested in documents in 
multiple languages.
 I think Google and other search engines allow indexing multi-language 
documents. For example, if you google "Java", there will be many matching 
documents in languages other than English.
 The only assumption we can make is that the document text is converted to 
Unicode before being fed to Lucene.

So I think the solution should be: (1) create one index for all languages; (2) 
add an advisory attribute like "lang" to specify the language of the 
document; if the language is unknown, just leave it empty or set it to "ANY"; 
(3) based on the Unicode block of each upcoming character, we 
automatically switch among different analyzers to index the fragments of the 
text; (4) during search, unless the user explicitly requests documents in a 
certain language, we return all matches regardless of language.
 I have browsed through the Lucene and contributed source code, but I 
cannot tell which analyzer is suitable for use in step (3). While the logic for 
such an analyzer is probably not too complicated, it seems to demand quite 
some Unicode knowledge to create one.
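For what it's worth, the block-based switching in (3) could start from java.lang.Character.UnicodeBlock. The sketch below (the scriptOf labels and the run-splitting helper are my own illustrative assumptions, not anything shipped with Lucene) splits text into same-script runs that could each be handed to a different analyzer:

```java
import java.util.ArrayList;
import java.util.List;

public class ScriptRuns {
    /** Coarse script label for one character, based on its Unicode block. */
    static String scriptOf(char c) {
        Character.UnicodeBlock b = Character.UnicodeBlock.of(c);
        if (b == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS) return "CJK";
        if (b == Character.UnicodeBlock.HIRAGANA
                || b == Character.UnicodeBlock.KATAKANA) return "JA";
        if (b == Character.UnicodeBlock.BASIC_LATIN
                || b == Character.UnicodeBlock.LATIN_1_SUPPLEMENT) return "LATIN";
        return "OTHER";
    }

    /** Split text into maximal runs whose characters share one script label. */
    static List<String[]> runs(String text) {
        List<String[]> out = new ArrayList<String[]>();
        int start = 0;
        for (int i = 1; i <= text.length(); i++) {
            if (i == text.length()
                    || !scriptOf(text.charAt(i)).equals(scriptOf(text.charAt(start)))) {
                out.add(new String[] { scriptOf(text.charAt(start)),
                                       text.substring(start, i) });
                start = i;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        for (String[] run : runs("Java不好")) {
            System.out.println(run[0] + ": " + run[1]);
        }
    }
}
```

For example, runs("Java不好") yields a LATIN run "Java" followed by a CJK run "不好"; a real analyzer would then dispatch each run accordingly.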
 Is my approach the right one? Is there an analyzer suitable to use?
 Thanks.
- HB

 On 9/5/05, Olivier Jaquemet <ol...@jalios.com> wrote: 
> 
> Hi,
> 
> I'd like to go into detail regarding issues that occur when you want to
> index and search content in multiple languages.
> 
> I have read the Lucene in Action book, and many threads on this mailing
> list, the most interesting so far being this one:
> 
> http://mail-archives.apache.org/mod_mbox/lucene-java-user/200505.mbox/%3c19ADCC0B9D4CAD4582BB9900BBCE35740194503C@tayexc13.americas.cpqcorp.net%3e
> 
> The solution chosen/recommended by Doug Cutting in this message:
> 
> http://mail-archives.apache.org/mod_mbox/lucene-java-user/200506.mbox/%3c42A0841C.2090202@apache.org%3e
> is number '2/':
> having one index for all languages, one Document per content's language,
> with a field specifying its language, and using a query filter when 
> searching.
> 
> While I think it is a good solution:
> - If you have N languages and you search for something in 1 language,
> you are going to search an index N times too large.
> Wouldn't it be better to have N indices for N languages? That way, each
> index could benefit from its specialized analyzer, and if you need to
> search in multiple languages, you just need to merge the results from
> those different analyzers.
> - If you have contents in multiple languages like we do (and by that I
> don't mean multiple contents, each one having its own language, but
> multiple contents, each one being in many languages), you are going to
> have an N to 1 Document/content relation in the index.
> As far as update, delete, and search in multiple languages are concerned,
> wouldn't it be simpler to always keep a 1 to 1 Document/content relation
> in an index?
> 
> As you may have guessed, my original thought, even before I read those
> threads, was that solution number 3 might be more flexible/modular
> than the others. Of course, it also has its drawbacks:
> - performance issues when doing a multiple language search, especially
> when merging results from different indices
> - more complex to code
> - others?
> 
> Can you clarify this?
> What solutions have all of you chosen till now regarding indexing and
> searching of multiple contents in multiple languages?
> 
> Thanks!
> 
> Olivier
> 
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
> 
>

Re: Multiple Language Indexing and Searching

Posted by Chris Hostetter <ho...@fucit.org>.
: I don't know if the developers of Lucene would agree, but from what
: I've been browsing in the ML archives, these multiple language issues
: seem to arise quite often on the mailing list, and maybe some articles
: like "best practices", "do's and don'ts" or "Lucene Architecture in a
: multiple language environment" might be really nice to see :) If some
: of you have the time and the experience to write them, I'll be really
: thankful! :)


There is already a document in the HowTo section of the wiki called
"IndexingOtherLanguages". If the topic of multiple languages interests
you, and if you've already familiarized yourself with some of the
suggestions from this thread (and other threads you've researched in the
archives), then perhaps you could compile a list in the wiki yourself?

Don't worry about writing a master document that describes a single perfect
solution -- or even describing a solution that you know will work; if
nothing else, just compiling a collection of ideas people have suggested
into a single document will be helpful.  Over time, other people can and
will revise it based on which ideas work well and which don't.

You don't have to have time and experience to write documentation -- time
is good enough to start with; experience can come along later :)

http://wiki.apache.org/jakarta-lucene/HowTo
http://wiki.apache.org/jakarta-lucene/IndexingOtherLanguages


-Hoss




Re: Multiple Language Indexing and Searching

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Sep 6, 2005, at 7:15 AM, Hacking Bear wrote:

> On 9/6/05, Olivier Jaquemet <ol...@jalios.com> wrote:
>
>>
>> As far as your usage is concerned, it seems to be the right approach,
>> and I think the StandardAnalyzer does the job pretty well when it has
>> to deal with whatever language you want.
>>
>
>  I should look into exactly what it does. Does this StandardAnalyzer
> handle non-European languages like Chinese?

StandardAnalyzer recognizes the CJK range of characters and emits  
them each as individual tokens.  This is not an ideal way to deal  
with Chinese (不好 :) - but it does at least maintain the characters  
rather than throwing them away or blurring consecutive characters  
together.
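That behaviour can be imitated outside Lucene. The toy tokenizer below is not StandardAnalyzer (whose real grammar does much more); it only mimics the one property described above: Latin words kept whole, and each CJK ideograph emitted as its own token.

```java
import java.util.ArrayList;
import java.util.List;

public class CjkSplit {
    static boolean isCjk(char c) {
        return Character.UnicodeBlock.of(c)
                == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS;
    }

    /** Latin-letter runs become one token; each CJK ideograph is its own token. */
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<String>();
        StringBuilder word = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (isCjk(c)) {
                if (word.length() > 0) { out.add(word.toString()); word.setLength(0); }
                out.add(String.valueOf(c));   // one token per ideograph
            } else if (Character.isLetter(c)) {
                word.append(c);               // accumulate a non-CJK word
            } else if (word.length() > 0) {
                out.add(word.toString());     // whitespace/punctuation ends the word
                word.setLength(0);
            }
        }
        if (word.length() > 0) out.add(word.toString());
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens("Lucene不好"));   // [Lucene, 不, 好]
    }
}
```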

     Erik




Re: Multiple Language Indexing and Searching

Posted by Hacking Bear <ha...@gmail.com>.
On 9/6/05, Olivier Jaquemet <ol...@jalios.com> wrote: 
> 
> As far as your usage is concerned, it seems to be the right approach,
> and I think the StandardAnalyzer does the job pretty well when it has
> to deal with whatever language you want.

 I should look into exactly what it does. Does this StandardAnalyzer handle 
non-European languages like Chinese?

> Though, note that it won't deal with any stop words but the English
> ones, unless they are specified at index time. But then, if you change
> the stop words at index time, what should you use at query time? Some
> queries won't work well.

 I think we can easily create our own super stop-word list by combining 
whatever other languages' stop word lists we can find.
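Assuming the merged list is just the union of the per-language lists, the mechanics are trivial (the two word arrays below are tiny placeholders, not the real Lucene stop lists):

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

public class StopWords {
    // Tiny placeholder lists; real ones would be the full per-language stop lists.
    static final String[] ENGLISH = { "the", "a", "of" };
    static final String[] FRENCH  = { "le", "la", "de" };

    /** Union of several stop lists, duplicates removed, insertion order kept. */
    static String[] merge(String[]... lists) {
        Set<String> merged = new LinkedHashSet<String>();
        for (String[] list : lists) merged.addAll(Arrays.asList(list));
        return merged.toArray(new String[0]);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(merge(ENGLISH, FRENCH)));
    }
}
```

One caveat with such a combined list: a stop word in one language can be a content word in another (German "die" vs. English "die"), so a merged list will over-filter.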

> But as far as I am concerned, each content (content in the sense of a
> CMS) is known to be in multiple languages, and each of these languages
> *can* be indexed separately with no problem at all, and therefore a
> dedicated analyzer could be used. So I was wondering whether my approach
> could be the right one or if it was overly complex, and could introduce
> some problems I could not see... (My approach being: one index per 
> language)

 My suggestion would be to create one index for all languages, with each 
document having a 'lang' attribute. Lucene is quite scalable, right? So this 
should not be an issue.
 During search, you can either turn the 'lang' attribute condition on by 
default or leave it off by default, depending on what your users want most 
often. Either way, it will be very easy to search documents in multiple 
languages.
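In Lucene's query-parser syntax, where a leading '+' marks a required clause, the default-on restriction could be as simple as wrapping whatever the user typed (the restrict helper and the 'lang' field name here are assumptions taken from this thread, not a Lucene API):

```java
public class LangQuery {
    /** Wrap the user's query and, when a language is given, require lang:<code>. */
    static String restrict(String userQuery, String lang) {
        if (lang == null || lang.length() == 0) {
            return userQuery;                  // no restriction: match any language
        }
        return "+(" + userQuery + ") +lang:" + lang;
    }

    public static void main(String[] args) {
        System.out.println(restrict("multi language indexing", "en"));
        System.out.println(restrict("multi language indexing", null));
    }
}
```

The resulting string would then be handed to the query parser as usual; with lang empty or null, the user's query passes through untouched.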
 
> I don't know if the developers of Lucene would agree, but from what
> I've been browsing in the ML archives, these multiple language issues
> seem to arise quite often on the mailing list, and maybe some articles
> like "best practices", "do's and don'ts" or "Lucene Architecture in a
> multiple language environment" might be really nice to see :) If some
> of you have the time and the experience to write them, I'll be really
> thankful! :)

 What keywords did you use to search? Somehow, I cannot find any discussion 
about multiple languages in the ML archives. I even tried Google! :-) Or maybe 
I was giving the keywords in the wrong language? :-)

Re: Multiple Language Indexing and Searching

Posted by Olivier Jaquemet <ol...@jalios.com>.
As far as your usage is concerned, it seems to be the right approach, 
and I think the StandardAnalyzer does the job pretty well when it has 
to deal with whatever language you want.
Though, note that it won't deal with any stop words but the English 
ones, unless they are specified at index time. But then, if you change the 
stop words at index time, what should you use at query time? Some queries 
won't work well.

But as far as I am concerned, each content (content in the sense of a 
CMS) is known to be in multiple languages, and each of these languages 
*can* be indexed separately with no problem at all, and therefore a 
dedicated analyzer could be used. So I was wondering whether my approach 
could be the right one or if it was overly complex, and could introduce 
some problems I could not see... (My approach being: one index per language)
Advantages are:
- You always have the same analyzer for one index, so if you want to 
benefit from some indexing capabilities in one language (stemmer, 
filter... whatever), you can!
- Should you need to search in all the languages, you just need to run 
the query on every single index, and you still benefit from each analyzer.
Disadvantages are:
- You have to deal with as many indices as you have languages, but then 
again, if you do a search in only one language, it becomes a 
performance advantage, I think.
- You have to merge results from different indices; this is a problem 
when dealing with scores. Any suggestions?
- Unless I'm wrong, you cannot use a MultiSearcher, because only one 
analyzer can be specified, and not one analyzer per searcher (someone 
please correct me if I'm wrong...)
- Others?
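On the score-merging point: scores from different indices are not directly comparable, and one rough workaround (an illustrative sketch, not something Lucene does for you) is to normalize each index's scores by that index's top hit before sorting the combined list:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class MergeResults {
    static class Hit {
        final String id;
        final double score;
        Hit(String id, double score) { this.id = id; this.score = score; }
    }

    /** Scale each index's scores by its own top score, then sort descending. */
    static List<Hit> merge(List<List<Hit>> perIndex) {
        List<Hit> out = new ArrayList<Hit>();
        for (List<Hit> hits : perIndex) {
            if (hits.isEmpty()) continue;
            double top = hits.get(0).score;   // each list assumed sorted, best first
            for (Hit h : hits) out.add(new Hit(h.id, h.score / top));
        }
        Collections.sort(out, new Comparator<Hit>() {
            public int compare(Hit a, Hit b) {
                return Double.compare(b.score, a.score);
            }
        });
        return out;
    }

    public static void main(String[] args) {
        List<Hit> fr = new ArrayList<Hit>();
        fr.add(new Hit("fr-1", 0.9));
        fr.add(new Hit("fr-2", 0.3));
        List<Hit> en = new ArrayList<Hit>();
        en.add(new Hit("en-1", 4.0));
        en.add(new Hit("en-2", 2.0));
        List<List<Hit>> all = new ArrayList<List<Hit>>();
        all.add(fr);
        all.add(en);
        for (Hit h : merge(all)) System.out.println(h.id + " " + h.score);
    }
}
```

This is crude (a weak top hit in one index gets inflated to 1.0), but it shows the shape of the problem; any smarter scheme still has to put the per-index scores on a common scale.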

I don't know if the developers of Lucene would agree, but from what 
I've been browsing in the ML archives, these multiple language issues 
seem to arise quite often on the mailing list, and maybe some articles 
like "best practices", "do's and don'ts" or "Lucene Architecture in a 
multiple language environment" might be really nice to see :) If some 
of you have the time and the experience to write them, I'll be really 
thankful! :)

Olivier

Hacking Bear wrote:

> [...]


-- 
Olivier Jaquemet <ol...@jalios.com>
R&D Engineer, Jalios S.A.
Tel: 01.39.23.92.83
http://www.jalios.com/
http://support.jalios.com/



