Posted to java-user@lucene.apache.org by Paul Cowan <co...@aconex.com> on 2006/03/16 04:49:48 UTC

Multiple languages - possible approach

Hi everyone,

We are currently using Lucene to index correspondence between various 
people, who may or may not use the same language in their discussions to 
each other. Think an email system where participants might use the 
language that seems most appropriate to the thought at the time, just as 
they would in conversation.

An example (CN = some Chinese text. Use your imagination!):

	From: Someone in the UK
	To: Someone in China
	Subject: Re: CNCNCNCNCNCNCNCNCNCNCN

	> CNCNCNCNCNCNCNCN

	Yes, I think that's fine. I'm OK with that as long as Bob is.

	> CNCNCNCNCNCN

	CNCN?

	> Tuesday OK?

	I need it by Monday, sorry. CNCN!

We need to index that, and be able to search on it -- for both the 
Chinese and English text. Note that stemming is not a particular need of 
ours -- we're happy to search for literal tokens, but of course that may 
not apply to other languages where stemming is expected behaviour, not 
just a 'nicety'.

Anyway: so far, fine -- StandardAnalyzer is perfectly suitable to our 
needs. The problem is, the next language out of the starting blocks is 
Arabic, which StandardAnalyzer doesn't seem to be up to.

I've looked into previous discussions about this on the various lists, 
and it seems to me there are a few options:

1) Maintain multiple indexes (StandardAnalyzer-analyzed, 
ArabicAnalyzer-analyzed, LanguageXXXAnalyzer-analyzed) and search across 
all of them, merging results

2) Maintain multiple indexes, ask the user which one to use at search-time:
	Search for the [Arabic \/] text: [______________________]

3) Use StandardAnalyzer and hope for the best.

4) Write a new... "Super Analyzer" that tries to deal with this. This is 
POSSIBLY the best idea -- and, of course, almost certainly the hardest!

Basically, what we're considering is writing some sort of new 
CompositeAnalyzer class which applies the following algorithm (in very 
simple terms):

a) Start reading the stream

b) Look at the next character

c) Use some sort of Character.UnicodeBlock (or Character.Subset 
generally) -> Analyzer mapping to work out which Analyzer we want to 
use. e.g. find a member of Character.UnicodeBlock.GREEK, load a 
GreekAnalyzer.

d) Keep reading until we hit something that makes us think we need to 
change analyzers (either end-of-stream or something incongruous -- e.g. 
something from Character.UnicodeBlock.CYRILLIC). Then bundle up what 
we've got, hand it to the GreekAnalyzer, and then start the process 
again with a RussianAnalyzer (or whatever).

Obviously the best way to do this would be to have these mappings 
dynamic, not set in stone -- some people might like all 
CJK_COMPATIBILITY to be handed to the CJKAnalyzer, some to the 
ChineseAnalyzer, some might like to use their own, etc. Of course 
there's no reason default mappings can't be supplied.
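
In very rough code, the block-to-analyzer lookup and the run-splitting 
of steps (a)-(d) might look something like this (AnalyzerRunSplitter is 
a made-up class, and a real implementation would still have to feed 
each run to its analyzer and splice the resulting TokenStreams back 
together):

    import java.util.*;
    import org.apache.lucene.analysis.Analyzer;

    // Hypothetical sketch: split text into runs of characters that all
    // map to the same Analyzer, based on their Character.UnicodeBlock.
    public class AnalyzerRunSplitter {

        private final Map blockToAnalyzer; // Character.UnicodeBlock -> Analyzer
        private final Analyzer fallback;   // e.g. a StandardAnalyzer

        public AnalyzerRunSplitter(Map blockToAnalyzer, Analyzer fallback) {
            this.blockToAnalyzer = blockToAnalyzer;
            this.fallback = fallback;
        }

        private Analyzer analyzerFor(char c) {
            Analyzer a =
                (Analyzer) blockToAnalyzer.get(Character.UnicodeBlock.of(c));
            return (a != null) ? a : fallback;
        }

        /** Returns a List of Object[]{ Analyzer, String } pairs. */
        public List split(String text) {
            List runs = new ArrayList();
            if (text.length() == 0) return runs;
            int start = 0;
            Analyzer current = analyzerFor(text.charAt(0));
            for (int i = 1; i < text.length(); i++) {
                char c = text.charAt(i);
                // Whitespace/punctuation shouldn't force a switch; wait
                // for a letter that maps to a different analyzer.
                if (!Character.isLetter(c)) continue;
                Analyzer next = analyzerFor(c);
                if (next != current) {
                    runs.add(new Object[] { current, text.substring(start, i) });
                    start = i;
                    current = next;
                }
            }
            runs.add(new Object[] { current, text.substring(start) });
            return runs;
        }
    }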

I guess the basic question is -- what does everyone think? Is this 
useful/workable/are there any fatal flaws with it? Obviously the biggie 
is that sometimes Unicode ranges are not sufficient to determine which 
analyzer to use -- for example, we may want to specifically use the 
GermanAnalyzer for German text, but that is basically impossible to tell 
from English purely based on the Unicode block of the next character. At 
least this way, though, we'd have the OPTION of farming off to more 
specific Analyzers based on character set; being able to have an 
Analyzer which can tell Urdu from Arabic is something of a separate issue; 
at least the "CompositeAnalyzer" would bring us a bit closer to the 
goal. It may be rudimentary but I think the 'pluggable' architecture 
could be useful -- certainly more useful in our case than just running 
the StandardAnalyzer over everything.

If this project goes ahead, it's possible (even likely) that it would be 
contributed back to the Lucene sandbox. As such, I'm very interested to 
hear about any suggestions, criticisms, or other feedback you might have.

Cheers,

Paul Cowan

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Multiple languages - possible approach

Posted by Paul Cowan <co...@aconex.com>.
Hi Grant and Otis,

Thanks for the feedback, I appreciate it. You've given some good ideas.

> Sounds like a really interesting system!  I am curious, are your users 
> fluent in multiple languages or are you using some type of translation 
> component?

The former. We're talking about construction projects, where English is 
(generally) something of a lingua franca, as it were (a really big 
construction project these days might use Australian architects, British 
managers and UAE-based engineers on a project in Shanghai). So we might 
have an architect forwarding a message to an engineer in English; she 
forwards it on to the ground team in Shanghai, still in English; but they 
then discuss it amongst themselves in Chinese... all in the space of one 
forwarded email.

> How are you querying?  Are users entering mixed-language queries too?  
<snip>

Good question(s). Automatically detecting the indexing language doesn't 
NECESSARILY help us with the searching, as we'll have a lot less text to 
work with. On the plus side, we can always ASK, with a drop-down or 
something, what language the text they're searching for is in; we can't 
really ask what language their correspondence is in, as it may be mixed.

Multiple indexes are an option, but we're very concerned about 
performance and size -- we're talking many, many millions of things to 
index, and having English/Chinese/Arabic/who-knows-what-else indexes 
could be a nightmare.

> Also, is the text as finely delineated as in your example?  We 
> sometimes run across the case where foreign-language text will use 
> other languages (mostly English) mid-sentence, and it makes things 
> quite ugly.  Approach 4 should handle this, though.

Yeah, that's one of our worries. People often can't find the right word 
for what they want to say, etc., so they slip back into another language.

Anyway, thanks for that and the rest of the ideas. We think that 
StandardAnalyzer will do us for now (Chinese only); when we hit more 
complicated languages I'll come up with a plan/design for the "Super 
Analyzer" and post it to this list for discussion and/or flamewar.

Cheers,

Paul

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Multiple languages - possible approach

Posted by Grant Ingersoll <gs...@syr.edu>.
Hi Paul,

Sounds like a really interesting system!  I am curious, are your users 
fluent in multiple languages or are you using some type of translation 
component?

Some comments below and a few thoughts here.

How are you querying?  Are users entering mixed-language queries too?  
Do you have a cross-language component too?  Or is it the case that if 
they enter an English query they only want English results?  If this is 
true, having multiple indexes (or multiple fields) would make things 
easier, as you could simply detect (or know, based on user profile 
information) the query language and then select the appropriate 
index/field, and you wouldn't need the extra complexity of #4.

Also, is the text as finely delineated as in your example?  We sometimes 
run across the case where foreign-language text will use other languages 
(mostly English) mid-sentence, and it makes things quite ugly.  Approach 
4 should handle this, though.

It seems no matter which approach you take, except for #3, you have to 
have a way of delineating the languages.

Also, you could use the PerFieldAnalyzerWrapper and have one field per 
language per document; this way you wouldn't have to manage multiple 
indexes.  You would have to demarcate your text before indexing, I 
suppose, so you would have to process it twice, but that may not be a 
big deal for you.
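
For example (the field names here are made up, and CJKAnalyzer is the 
sandbox one):

    import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
    import org.apache.lucene.analysis.cjk.CJKAnalyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;

    // Default to StandardAnalyzer, but route the Chinese field to the
    // CJKAnalyzer; hand 'wrapper' to your IndexWriter.
    PerFieldAnalyzerWrapper wrapper =
        new PerFieldAnalyzerWrapper(new StandardAnalyzer());
    wrapper.addAnalyzer("body_zh", new CJKAnalyzer());
    // At index time, put the pre-demarcated English text into "body_en"
    // (which falls through to the default) and the Chinese into "body_zh".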

-Grant

Paul Cowan wrote:
> Hi everyone,
>
> We are currently using Lucene to index correspondence between various 
> people, who may or may not use the same language in their discussions 
> to each other. Think an email system where participants might use the 
> language that seems most appropriate to the thought at the time, just 
> as they would in conversation.
>
> An example (CN = some Chinese text. Use your imagination!):
>
>     From: Someone in the UK
>     To: Someone in China
>     Subject: Re: CNCNCNCNCNCNCNCNCNCNCN
>
>     > CNCNCNCNCNCNCNCN
>
>     Yes, I think that's fine. I'm OK with that as long as Bob is.
>
>     > CNCNCNCNCNCN
>
>     CNCN?
>
>     > Tuesday OK?
>
>     I need it by Monday, sorry. CNCN!
>
> We need to index that, and be able to search on it -- for both the 
> Chinese and English text. Note that stemming is not a particular need 
> of ours -- we're happy to search for literal tokens, but of course 
> that may not apply to other languages where stemming is expected 
> behaviour, not just a 'nicety'.
>
> Anyway: so far, fine -- StandardAnalyzer is perfectly suitable to our 
> needs. The problem is, the next language out of the starting blocks is 
> Arabic, which StandardAnalyzer doesn't seem to be up to.
>
> I've looked into previous discussions about this on the various lists, 
> and it seems to me there are a few options:
>
> 1) Maintain multiple indexes (StandardAnalyzer-analyzed, 
> ArabicAnalyzer-analyzed, LanguageXXXAnalyzer-analyzed) and search 
> across all of them, merging results
>
It is not always straightforward to merge results, as scores do not 
translate well across indexes.
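
For what it's worth, MultiSearcher takes care of the mechanics of 
searching across several indexes at once; it's the scoring side that 
is the problem.  Something like (the index paths and the query term 
are made up):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.*;

    // Merge hits from separately-analyzed indexes.
    Searcher searcher = new MultiSearcher(new Searchable[] {
        new IndexSearcher("/indexes/english"),
        new IndexSearcher("/indexes/arabic")
    });
    Hits hits = searcher.search(new TermQuery(new Term("body", "monday")));
    // The merged Hits come back ranked by score, but scores computed
    // against differently-analyzed indexes aren't really comparable.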

> 2) Maintain multiple indexes, ask the user which one to use at 
> search-time:
>     Search for the [Arabic \/] text: [______________________]

>
> 3) Use StandardAnalyzer and hope for the best.
I don't think this is viable.  You are probably safe to use 
StandardAnalyzer as a default case for #4.

>
> 4) Write a new... "Super Analyzer" that tries to deal with this. This 
> is POSSIBLY the best idea -- and, of course, almost certainly the 
> hardest!
>
> Basically, what we're considering is writing some sort of new 
> CompositeAnalyzer class which applies the following algorithm (in very 
> simple terms):
>
> a) Start reading the stream
>
> b) Look at the next character
>
> c) Use some sort of Character.UnicodeBlock (or Character.Subset 
> generally) -> Analyzer mapping to work out which Analyzer we want to 
> use. e.g. find a member of Character.UnicodeBlock.GREEK, load a 
> GreekAnalyzer.
>
> d) Keep reading until we hit something that makes us think we need to 
> change analyzers (either end-of-stream or something incongruous -- 
> e.g. something from Character.UnicodeBlock.CYRILLIC). Then bundle up 
> what we've got, hand it to the GreekAnalyzer, and then start the 
> process again with a RussianAnalyzer (or whatever).
>
> Obviously the best way to do this would be to have these mappings 
> dynamic, not set in stone -- some people might like all 
> CJK_COMPATIBILITY to be handed to the CJKAnalyzer, some to the 
> ChineseAnalyzer, some might like to use their own, etc. Of course 
> there's no reason default mappings can't be supplied.
>
> I guess the basic question is -- what does everyone think? Is this 
> useful/workable/are there any fatal flaws with it? Obviously the 
> biggie is that sometimes Unicode ranges are not sufficient to 
> determine which analyzer to use -- for example, we may want to 
> specifically use the GermanAnalyzer for German text, but that is 
> basically impossible to tell from English purely based on the Unicode 
> block of the next character. At least this way, though, we'd have the 
> OPTION of farming off to more specific Analyzers based on character 
> set; being able to have an Analyzer which can tell Urdu from Arabic is 
> something of a separate issue; at least the "CompositeAnalyzer" would 
> bring us a bit closer to the goal. It may be rudimentary but I think 
> the 'pluggable' architecture could be useful -- certainly more useful 
> in our case than just running the StandardAnalyzer over everything.
>

This sounds like a reasonable approach.  I wonder a bit about how the 
TokenStream mechanism will work, considering tokenization can be quite 
different for Chinese and some of the other Asian languages as compared 
to Latin-based languages.  Essentially, as things come into the 
Tokenizer, you will need to indicate to the analyzer which Filter to 
apply.  I guess this could be done by setting the Type property on the 
Token and having a Filter that wraps all of your other Filters and, 
based on Type, hands each token off to the appropriate filter for that 
language.
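
Roughly (TypeRoutingFilter and TokenProcessor are made-up names; since 
real TokenFilters wrap whole streams, each per-language step here has 
to accept one Token at a time):

    import java.io.IOException;
    import java.util.Map;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;

    // Hypothetical sketch: route each Token, keyed by its type() string,
    // to a per-language processing step.
    public class TypeRoutingFilter extends TokenFilter {

        public interface TokenProcessor { Token process(Token t); }

        private final Map processors; // type String -> TokenProcessor

        public TypeRoutingFilter(TokenStream in, Map processors) {
            super(in);
            this.processors = processors;
        }

        public Token next() throws IOException {
            Token t = input.next();
            if (t == null) return null;
            TokenProcessor p = (TokenProcessor) processors.get(t.type());
            return (p != null) ? p.process(t) : t;
        }
    }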


> If this project goes ahead, it's possible (even likely) that it would 
> be contributed back to the Lucene sandbox. As such, I'm very 
> interested to hear about any suggestions, criticisms, or other 
> feedback you might have.
>
> Cheers,
>
> Paul Cowan
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

-- 

Grant Ingersoll 
Sr. Software Engineer 
Center for Natural Language Processing 
Syracuse University 
School of Information Studies 
335 Hinds Hall 
Syracuse, NY 13244 

http://www.cnlp.org 
Voice:  315-443-5484 
Fax: 315-443-6886 


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Multiple languages - possible approach

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Hi Paul,

I don't have any first-hand experience with this, but your suggestion 
about pluggable analyzers sounds both reasonable and interesting to me.  
One thing you did not mention as a mechanism for figuring out which 
analyzer to use is language identification (like the one you can find 
among the Nutch plugins).  If you can't tell which analyzer to use by 
looking at characters and Unicode ranges, perhaps you can (also) read 
in/ahead a few tokens and pass them to a language identifier before 
selecting the best analyzer.
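
Something along these lines, say ('identifier' is a stand-in here -- 
check the actual API of whichever language identifier you pick):

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.de.GermanAnalyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;

    // Sniff a short sample and pick an analyzer; 'identifier' is
    // hypothetical, standing in for e.g. the Nutch language identifier.
    String sample = text.substring(0, Math.min(200, text.length()));
    String lang = identifier.identify(sample);  // e.g. "de", "en", ...
    Analyzer analyzer = "de".equals(lang)
        ? (Analyzer) new GermanAnalyzer()
        : (Analyzer) new StandardAnalyzer();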

This would be a great contribution, of course! :)

Otis


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org