Posted to java-user@lucene.apache.org by Paul Cowan <co...@aconex.com> on 2006/03/16 04:49:48 UTC
Multiple languages - possible approach
Hi everyone,
We are currently using Lucene to index correspondence between various
people, who may or may not use the same language in their discussions to
each other. Think an email system where participants might use the
language that seems most appropriate to the thought at the time, just as
they would in conversation.
An example (CN = some Chinese text. Use your imagination!):
From: Someone in the UK
To: Someone in China
Subject: Re: CNCNCNCNCNCNCNCNCNCNCN
> CNCNCNCNCNCNCNCN
Yes, I think that's fine. I'm OK with that as long as Bob is.
> CNCNCNCNCNCN
CNCN?
> Tuesday OK?
I need it by Monday, sorry. CNCN!
We need to index that, and be able to search on it -- for both the
Chinese and English text. Note that stemming is not a particular need of
ours -- we're happy to search for literal tokens, but of course that may
not apply to other languages where stemming is expected behaviour, not
just a 'nicety'.
Anyway: so far, fine -- StandardAnalyzer is perfectly suitable to our
needs. The problem is, the next language out of the starting blocks is
Arabic, which StandardAnalyzer doesn't seem to be up to.
I've looked into previous discussions about this on the various lists,
and it seems to me there are a few options:
1) Maintain multiple indexes (StandardAnalyzer-analyzed,
ArabicAnalyzer-analyzed, LanguageXXXAnalyzer-analyzed) and search across
all of them, merging results
2) Maintain multiple indexes, ask the user which one to use at search-time:
Search for the [Arabic \/] text: [______________________]
3) Use StandardAnalyzer and hope for the best.
4) Write a new... "Super Analyzer" that tries to deal with this. This is
POSSIBLY the best idea -- and, of course, almost certainly the hardest!
Basically, what we're considering is writing some sort of new
CompositeAnalyzer class which applies the following algorithm (in very
simple terms):
a) Start reading the stream
b) Look at the next character
c) Use some sort of Character.UnicodeBlock (or Character.Subset
generally) -> Analyzer mapping to work out which Analyzer we want to
use. e.g. find a member of Character.UnicodeBlock.GREEK, load a
GreekAnalyzer.
d) Keep reading until we hit something that makes us think we need to
change analyzers (either end-of-stream or something incongruous -- e.g.
something from Character.UnicodeBlock.CYRILLIC). Then bundle up what
we've got, hand it to the GreekAnalyzer, and then start the process
again with a RussianAnalyzer (or whatever).
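Steps (a) to (d) above can be sketched with nothing but the JDK. This is only a rough sketch, not a Lucene Analyzer: the names BlockRunSplitter and Run are made up for illustration, and it assumes that non-letters (spaces, digits, punctuation) simply stay with whatever run is currently open. It shows where the boundaries would fall; handing each run to a real analyzer is left out.

```java
import java.lang.Character.UnicodeBlock;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits text into runs whose letters share one Unicode block.
class BlockRunSplitter {

    /** A stretch of text plus the Unicode block of its letters (null if it has none). */
    static final class Run {
        final UnicodeBlock block;
        final String text;

        Run(UnicodeBlock block, String text) {
            this.block = block;
            this.text = text;
        }
    }

    static List<Run> split(String input) {
        List<Run> runs = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        UnicodeBlock currentBlock = null;

        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);           // (b) look at the next character
            i += Character.charCount(cp);

            if (Character.isLetter(cp)) {
                UnicodeBlock block = UnicodeBlock.of(cp);   // (c) which block is it in?
                if (currentBlock != null && block != currentBlock) {
                    // (d) something incongruous: bundle up what we have so far
                    runs.add(new Run(currentBlock, current.toString()));
                    current.setLength(0);
                }
                currentBlock = block;
            }
            // Assumption: non-letters never trigger a switch on their own.
            current.appendCodePoint(cp);
        }
        if (current.length() > 0) {
            runs.add(new Run(currentBlock, current.toString()));  // end of stream
        }
        return runs;
    }

    public static void main(String[] args) {
        for (Run run : split("Tuesday OK? \u4e2d\u6587 \u03b1\u03b2\u03b3")) {
            System.out.println(run.block + " -> [" + run.text + "]");
        }
    }
}
```

One caveat this makes visible: accented Latin letters such as é sit in LATIN_1_SUPPLEMENT rather than BASIC_LATIN, so a real mapping would probably want to treat several blocks as one group instead of switching on every block change.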
Obviously the best way to do this would be to have these mappings
dynamic, not set in stone -- some people might like all
CJK_COMPATIBILITY to be handed to the CJKAnalyzer, some to the
ChineseAnalyzer, some might like to use their own, etc. Of course
there's no reason default mappings can't be supplied.
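A dynamic mapping with supplied defaults might look roughly like the following. AnalyzerRegistry is a hypothetical name, and the type parameter A stands in for Lucene's Analyzer so that the sketch compiles without Lucene on the classpath; in main, plain strings stand in for analyzer instances.

```java
import java.lang.Character.UnicodeBlock;
import java.util.HashMap;
import java.util.Map;

// Hypothetical registry: Unicode block -> analyzer, with a fallback for unmapped blocks.
class AnalyzerRegistry<A> {
    private final Map<UnicodeBlock, A> byBlock = new HashMap<>();
    private final A fallback;

    AnalyzerRegistry(A fallback) {
        this.fallback = fallback;
    }

    /** Maps each given block to the analyzer; later calls override earlier ones. */
    AnalyzerRegistry<A> register(A analyzer, UnicodeBlock... blocks) {
        for (UnicodeBlock block : blocks) {
            byBlock.put(block, analyzer);
        }
        return this;
    }

    /** Returns the analyzer mapped to the code point's block, or the fallback. */
    A forCodePoint(int codePoint) {
        // UnicodeBlock.of returns null for code points outside any defined block;
        // HashMap.get(null) then returns null and we use the fallback.
        A analyzer = byBlock.get(UnicodeBlock.of(codePoint));
        return analyzer != null ? analyzer : fallback;
    }

    public static void main(String[] args) {
        // Default mappings, as a library might ship them (assumed, not from Lucene).
        AnalyzerRegistry<String> registry = new AnalyzerRegistry<String>("StandardAnalyzer")
                .register("CJKAnalyzer",
                        UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS,
                        UnicodeBlock.CJK_COMPATIBILITY)
                .register("GreekAnalyzer", UnicodeBlock.GREEK);

        // A user who prefers a different analyzer overrides a single block.
        registry.register("ChineseAnalyzer", UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS);

        System.out.println(registry.forCodePoint(0x4E2D)); // a CJK ideograph
        System.out.println(registry.forCodePoint('a'));    // unmapped, so fallback
    }
}
```

Keeping the fallback separate from the map means StandardAnalyzer (or whatever the default is) covers every block nobody has registered, so partial configurations still work.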
I guess the basic question is -- what does everyone think? Is this
useful/workable/are there any fatal flaws with it? Obviously the biggie
is that sometimes Unicode ranges are not sufficient to determine which
analyzer to use -- for example, we may want to specifically use the
GermanAnalyzer for German text, but that is basically impossible to tell
from English purely based on the Unicode block of the next character. At
least this way, though, we'd have the OPTION of farming off to more
specific Analyzers based on character set; being able to have an
Analyzer which can tell Urdu from Arabic is something of a separate issue;
at least the "CompositeAnalyzer" would bring us a bit closer to the
goal. It may be rudimentary but I think the 'pluggable' architecture
could be useful -- certainly more useful in our case than just running
the StandardAnalyzer over everything.
If this project goes ahead, it's possible (even likely) that it would be
contributed back to the Lucene sandbox. As such, I'm very interested to
hear about any suggestions, criticisms, or other feedback you might have.
Cheers,
Paul Cowan
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: Multiple languages - possible approach
Posted by Paul Cowan <co...@aconex.com>.
Hi Grant and Otis,
Thanks for the feedback, I appreciate it. You've given some good ideas.
> Sounds like a really interesting system! I am curious, are your users
> fluent in multiple languages or are you using some type of translation
> component?
The former. We're talking about construction projects, where English is
(generally) something of a Lingua Franca, as it were (a really big
construction project these days might use Australian architects, British
managers and UAE-based engineers on a project in Shanghai). So we might
have an architect forwarding a message on to an engineer in English, she
forwards it to the ground team in Shanghai in English, but they then
discuss it amongst themselves in Chinese... all in the space of one
forwarded email.
> How are you querying? Are users entering mixed language queries too?
<snip>
Good question(s). Automatically detecting the indexing language doesn't
NECESSARILY help us with the searching, as we'll have a lot less text to
work with. On the plus side, we can always ASK which language the search
text is in, with a drop-down or something; we can't really ask what
language their correspondence is in, as it may be mixed.
Multiple indexes is an option but we're very concerned about performance
and size -- we're talking many, many millions of things to index, and having
English/Chinese/Arabic/who-knows-what-else indexes could be a nightmare.
> Also, is the text so finely delineated as your example? We sometimes
> run across the case where foreign languages will use other languages
> (mostly English) mid-sentence and it makes things quite ugly. Approach
> 4 should handle this, though.
Yeah, that's one of our worries. People often can't find the right word
for what they want to say, etc., so they slip back into another language.
Anyway, thanks for that and the rest of the ideas. We think that
StandardAnalyzer will do us for now (Chinese only); when we hit more
complicated languages I'll come up with a plan/design for the "Super
Analyzer" and post it to this list for discussion and/or flamewar.
Cheers,
Paul
Re: Multiple languages - possible approach
Posted by Grant Ingersoll <gs...@syr.edu>.
Hi Paul,
Sounds like a really interesting system! I am curious, are your users
fluent in multiple languages or are you using some type of translation
component?
Some comments below and a few thoughts here.
How are you querying? Are users entering mixed language queries too?
Do you have a cross language component too? Or is it the case that if
they enter an English query they only want English searches? If this is
true, having multiple indexes would make things easier (or multiple
fields) as you could simply detect (or know, based on user profile
information) the query language and then select the appropriate
index/field and you wouldn't need the extra complexity of #4.
Also, is the text so finely delineated as your example? We sometimes
run across the case where foreign languages will use other languages
(mostly English) mid-sentence and it makes things quite ugly. Approach
4 should handle this, though.
It seems no matter which approach you take, except for #3, you have to
have a way of delineating the languages.
Also, you could use the PerFieldAnalyzerWrapper and have one field per
language per document; this way you wouldn't have to manage multiple
indexes. You would have to demarcate your text before indexing, I
suppose, so you would have to process it twice, but that may not be a
big deal for you.
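The demarcation pass could be sketched like this, using Character.UnicodeScript (available since Java 7). Everything here is an assumption for illustration: the class name FieldDemarcator, the "body_<script>" field naming, and the rule that script-neutral characters (spaces, digits, punctuation) follow the most recent script. The Lucene side, where each resulting field would be tied to its own analyzer through PerFieldAnalyzerWrapper, is not shown.

```java
import java.lang.Character.UnicodeScript;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical pre-indexing pass: buckets text into one field per script.
class FieldDemarcator {

    /** Assumed naming scheme, e.g. HAN -> "body_han". */
    static String fieldFor(UnicodeScript script) {
        return "body_" + script.name().toLowerCase(Locale.ROOT);
    }

    static Map<String, String> demarcate(String text) {
        Map<String, StringBuilder> buffers = new LinkedHashMap<>();
        UnicodeScript current = UnicodeScript.LATIN;   // assumed starting script

        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);

            UnicodeScript script = UnicodeScript.of(cp);
            // COMMON/INHERITED/UNKNOWN carry no script information of their own,
            // so they stay with whichever script came last.
            if (script != UnicodeScript.COMMON
                    && script != UnicodeScript.INHERITED
                    && script != UnicodeScript.UNKNOWN) {
                current = script;
            }
            buffers.computeIfAbsent(fieldFor(current), k -> new StringBuilder())
                   .appendCodePoint(cp);
        }

        Map<String, String> fields = new LinkedHashMap<>();
        buffers.forEach((name, buffer) -> fields.put(name, buffer.toString()));
        return fields;
    }

    public static void main(String[] args) {
        demarcate("I need it by Monday, sorry. \u4e2d\u6587!")
                .forEach((field, value) -> System.out.println(field + " = " + value));
    }
}
```

Note that concatenating per script throws away the original ordering between languages, so phrase queries that span a language switch would not match; whether that matters depends on how people actually search.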
-Grant
Paul Cowan wrote:
> Hi everyone,
>
> We are currently using Lucene to index correspondence between various
> people, who may or may not use the same language in their discussions
> to each other. Think an email system where participants might use the
> language that seems most appropriate to the thought at the time, just
> as they would in conversation.
>
> An example (CN = some Chinese text. Use your imagination!):
>
> From: Someone in the UK
> To: Someone in China
> Subject: Re: CNCNCNCNCNCNCNCNCNCNCN
>
> > CNCNCNCNCNCNCNCN
>
> Yes, I think that's fine. I'm OK with that as long as Bob is.
>
> > CNCNCNCNCNCN
>
> CNCN?
>
> > Tuesday OK?
>
> I need it by Monday, sorry. CNCN!
>
> We need to index that, and be able to search on it -- for both the
> Chinese and English text. Note that stemming is not a particular need
> of ours -- we're happy to search for literal tokens, but of course
> that may not apply to other languages where stemming is expected
> behaviour, not just a 'nicety'.
>
> Anyway: so far, fine -- StandardAnalyzer is perfectly suitable to our
> needs. The problem is, the next language out of the starting blocks is
> Arabic, which StandardAnalyzer doesn't seem to be up to.
>
> I've looked into previous discussions about this on the various lists,
> and it seems to me there are a few options:
>
> 1) Maintain multiple indexes (StandardAnalyzer-analyzed,
> ArabicAnalyzer-analyzed, LanguageXXXAnalyzer-analyzed) and search
> across all of them, merging results
>
It is not always straightforward to merge results, as scores do not
translate well across indexes.
> 2) Maintain multiple indexes, ask the user which one to use at
> search-time:
> Search for the [Arabic \/] text: [______________________]
>
> 3) Use StandardAnalyzer and hope for the best.
I don't think this is viable. You are probably safe to use
StandardAnalyzer as a default case for #4.
>
> 4) Write a new... "Super Analyzer" that tries to deal with this. This
> is POSSIBLY the best idea -- and, of course, almost certainly the
> hardest!
>
> Basically, what we're considering is writing some sort of new
> CompositeAnalyzer class which applies the following algorithm (in very
> simple terms):
>
> a) Start reading the stream
>
> b) Look at the next character
>
> c) Use some sort of Character.UnicodeBlock (or Character.Subset
> generally) -> Analyzer mapping to work out which Analyzer we want to
> use. e.g. find a member of Character.UnicodeBlock.GREEK, load a
> GreekAnalyzer.
>
> d) Keep reading until we hit something that makes us think we need to
> change analyzers (either end-of-stream or something incongruous --
> e.g. something from Character.UnicodeBlock.CYRILLIC). Then bundle up
> what we've got, hand it to the GreekAnalyzer, and then start the
> process again with a RussianAnalyzer (or whatever).
>
> Obviously the best way to do this would be to have these mappings
> dynamic, not set in stone -- some people might like all
> CJK_COMPATIBILITY to be handed to the CJKAnalyzer, some to the
> ChineseAnalyzer, some might like to use their own, etc. Of course
> there's no reason default mappings can't be supplied.
>
> I guess the basic question is -- what does everyone think? Is this
> useful/workable/are there any fatal flaws with it? Obviously the
> biggie is that sometimes Unicode ranges are not sufficient to
> determine which analyzer to use -- for example, we may want to
> specifically use the GermanAnalyzer for German text, but that is
> basically impossible to tell from English purely based on the Unicode
> block of the next character. At least this way, though, we'd have the
> OPTION of farming off to more specific Analyzers based on Character
> set; being able to have an Analyzer which can tell Urdu from Arabic is
> something of a separate issue; at least the "CompositeAnalyzer" would
> bring us a bit closer to the goal. It may be rudimentary but I think
> the 'pluggable' architecture could be useful -- certainly more useful
> in our case than just running the StandardAnalyzer over everything.
>
This sounds like a reasonable approach. I wonder a bit about how the
TokenStream mechanism will work, considering tokenization can be quite
different for Chinese and some of the other Asian languages as compared
to Latin-based languages. Essentially, as things come into the
Tokenizer, you will need to indicate to the analyzer which Filter to
apply. I guess this could be done by setting the Type property on the
Token and having a Filter that wraps all of your other Filters and,
based on Type, hands it off to the appropriate filter for that language.
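The type-based hand-off could look something like this in outline. To be clear, this is not Lucene's Token or TokenFilter API: Token and TypeDispatchingFilter here are stand-in classes made up for the sketch, and each per-language "filter" is reduced to a function from token text to token text.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical wrapper filter: picks a per-language filter from the token's type.
class TypeDispatchingFilter {

    /** Stand-in for a token whose type was set by the tokenizer (e.g. "latin", "han"). */
    static final class Token {
        final String text;
        final String type;

        Token(String text, String type) {
            this.text = text;
            this.type = type;
        }
    }

    private final Map<String, UnaryOperator<String>> filtersByType = new HashMap<>();

    /** Registers the filter to run for tokens of the given type. */
    TypeDispatchingFilter on(String type, UnaryOperator<String> filter) {
        filtersByType.put(type, filter);
        return this;
    }

    /** Runs the filter registered for the token's type; unknown types pass through. */
    Token apply(Token token) {
        UnaryOperator<String> filter =
                filtersByType.getOrDefault(token.type, UnaryOperator.identity());
        return new Token(filter.apply(token.text), token.type);
    }

    public static void main(String[] args) {
        TypeDispatchingFilter filter = new TypeDispatchingFilter()
                // Lower-casing makes sense for Latin text but is a no-op for Han.
                .on("latin", s -> s.toLowerCase(Locale.ROOT));

        Token english = filter.apply(new Token("Monday", "latin"));
        Token chinese = filter.apply(new Token("\u4e2d\u6587", "han"));
        System.out.println(english.text + " / " + chinese.text);
    }
}
```

The wrapper only solves the filtering half; the tokenizer still has to cut tokens correctly for each script and set the type in the first place.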
> If this project goes ahead, it's possible (even likely) that it would
> be contributed back to the Lucene sandbox. As such, I'm very
> interested to hear about any suggestions, criticisms, or other
> feedback you might have.
>
> Cheers,
>
> Paul Cowan
--
Grant Ingersoll
Sr. Software Engineer
Center for Natural Language Processing
Syracuse University
School of Information Studies
335 Hinds Hall
Syracuse, NY 13244
http://www.cnlp.org
Voice: 315-443-5484
Fax: 315-443-6886
Re: Multiple languages - possible approach
Posted by Otis Gospodnetic <ot...@yahoo.com>.
Hi Paul,
I don't have any first-hand experience with this, but your suggestion about pluggable analyzers sounds both reasonable and interesting to me. One mechanism you did not mention for figuring out which analyzer to use is language identification (like the one you can find among the Nutch plugins). If you can't tell which analyzer to use by looking at characters and Unicode ranges, perhaps you can (also) read ahead a few tokens and pass them to a language identifier before selecting the best analyzer.
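As a very rough illustration of the read-ahead idea, the sketch below looks at the first few tokens and counts stopword hits per language. This is a toy stand-in and not the Nutch language identifier: the class name, the tiny stopword lists and the "no clear winner means null" rule are all assumptions, and a real identifier would use something far more robust.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Toy language guesser: counts stopword hits in the first few tokens.
class LookaheadLanguageGuess {

    // Deliberately tiny, illustrative stopword lists (assumed, not from any library).
    static final Map<String, Set<String>> STOPWORDS = new HashMap<>();
    static {
        STOPWORDS.put("en", new HashSet<>(Arrays.asList("the", "and", "is", "that", "with")));
        STOPWORDS.put("de", new HashSet<>(Arrays.asList("der", "die", "und", "ist", "nicht")));
    }

    /**
     * Returns a language code based on the first `lookahead` tokens,
     * or null when there is no evidence or the languages tie.
     */
    static String guess(List<String> tokens, int lookahead) {
        Map<String, Integer> hits = new HashMap<>();
        int limit = Math.min(lookahead, tokens.size());
        for (int i = 0; i < limit; i++) {
            String token = tokens.get(i).toLowerCase(Locale.ROOT);
            for (Map.Entry<String, Set<String>> entry : STOPWORDS.entrySet()) {
                if (entry.getValue().contains(token)) {
                    hits.merge(entry.getKey(), 1, Integer::sum);
                }
            }
        }

        String best = null;
        int bestCount = 0;
        boolean tie = false;
        for (Map.Entry<String, Integer> entry : hits.entrySet()) {
            if (entry.getValue() > bestCount) {
                best = entry.getKey();
                bestCount = entry.getValue();
                tie = false;
            } else if (entry.getValue() == bestCount) {
                tie = true;
            }
        }
        return tie ? null : best;   // caller falls back to a default analyzer on null
    }

    public static void main(String[] args) {
        System.out.println(guess(Arrays.asList("Das", "ist", "nicht", "der", "Plan"), 5));
        System.out.println(guess(Arrays.asList("I", "need", "it", "by", "Monday"), 5));
    }
}
```

Returning null rather than a best guess keeps the decision with the caller, which can then stay with the block-based choice or the default analyzer when a short run gives too little to go on.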
This would be a great contribution, of course! :)
Otis