Posted to java-user@lucene.apache.org by Eric Chow <er...@gmail.com> on 2005/04/11 03:21:23 UTC

Multi-analyzer ?

Hello,

If I don't know the language of the input terms, how can I choose
among different analyzers to search them?

For example, the input box accepts UTF-8 search text, which can be
anything: Chinese, Japanese, English, Russian, Deutsch, etc. How
can I search any of them, or all of them, with Lucene?

Any example, please?


Best Regards,
Eric

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Multi-analyzer ?

Posted by Andy Roberts <ma...@andy-roberts.net>.

On Tuesday 12 Apr 2005 00:53, Eric Chow wrote:
> But what if one document contains two or more different languages?
>
>
> Eric

If you're indexing many documents that contain multiple languages, then it's
probably better to use a SimpleAnalyzer rather than one that does any
language-specific stemming or stop-word removal.

If there are documents where one language is clearly more dominant than the
other, then it would probably be OK to use an Analyzer for that language and
hope it doesn't affect the indexing of the other language too much. However,
it's clear that you can't really accommodate multi-language documents. It
would be much easier to ensure all docs were in a single language before
indexing.

Andy

Re: Multi-analyzer ?

Posted by Ernesto De Santis <er...@colaborativa.net>.

Maybe you can use PerFieldAnalyzerWrapper.
(I have never used it myself.)

Ernesto.

Eric Chow wrote:

>But what if one document contains two or more different languages?
>
>
>Eric

-- 
Ernesto De Santis - Colaborativa.net
Córdoba 1147 Piso 6 Oficinas 3 y 4
(S2000AWO) Rosario, SF, Argentina.



Re: Multi-analyzer ?

Posted by Eric Chow <er...@gmail.com>.

But what if one document contains two or more different languages?


Eric

Re: Multi-analyzer ?

Posted by Andy Roberts <ma...@andy-roberts.net>.

On Monday 11 Apr 2005 14:55, Mike Baranczak wrote:
> Your example with Arabic wouldn't work reliably either - there are
> several other languages that use the Arabic script (Persian for
> example).

Good point. You could try a simple test for the additional characters
that exist in Persian but not in Arabic, although this again is not
foolproof. A letter-model approach would be better but is rather
time-consuming.
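
A minimal sketch of that character test, using only the JDK. The class and method names are hypothetical, and the letter set (Pe, Che, Zhe, Gaf) is an assumption about what counts as "additional": those four letters are not part of standard Arabic, but other languages written in Arabic script (Urdu, for example) use them too, and a Persian query need not contain any of them. A match is therefore only a hint, and a miss says nothing.

```java
/** Hint check for letters used in Persian but not in standard Arabic. */
final class PersianLetterHint {

    // Pe (U+067E), Che (U+0686), Zhe (U+0698), Gaf (U+06AF).
    // Assumption: these four are treated as the "Persian extras"; other
    // languages written in Arabic script use them as well.
    private static final String PERSIAN_EXTRAS = "\u067E\u0686\u0698\u06AF";

    /** Returns true if the text contains at least one of the extra letters. */
    static boolean hasPersianExtras(String text) {
        return text.chars().anyMatch(c -> PERSIAN_EXTRAS.indexOf(c) >= 0);
    }

    public static void main(String[] args) {
        String withChe = "\u0686\u0627\u06CC";                 // contains Che
        String arabicOnly = "\u0645\u0631\u062D\u0628\u0627";  // no extra letters
        System.out.println(hasPersianExtras(withChe));
        System.out.println(hasPersianExtras(arabicOnly));
    }
}
```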

>
> This is the sort of problem that the end user can solve much better
> than the software can.
>

I completely agree, which is why I originally suggested prompting the user for
this info. It may be the case that for the majority of queries, English is
the usual language. And it is probably more feasible to do a test to
determine whether the query is English or not (still very tricky, mind). If
not, then prompt the user to specify their input language because otherwise
results will be poor.

Andy Roberts

Re: Multi-analyzer ?

Posted by Mike Baranczak <mb...@twcny.rr.com>.

Your example with Arabic wouldn't work reliably either - there are 
several other languages that use the Arabic script (Persian for 
example).

You could also try to pick out characters that are unique to a 
particular language - for example, Ę or Ż only occur in Polish (as far 
as I know...). Of course, you have no guarantee that a Polish-language 
query will actually contain any of those characters - so this method 
would only work as a supplement to another method.

And don't forget that some words are written the same in several 
different languages.

This is the sort of problem that the end user can solve much better 
than the software can.

-MB


Re: Multi-analyzer ?

Posted by Andy Roberts <ma...@andy-roberts.net>.

Can you not provide the user with an option list to specify their input
language?

Language identification can be a pretty tricky field. There are some tricks
you can do with Unicode to identify language, e.g., \u0600 - \u06FF contains
the Arabic characters, so if your input contains lots of chars within this
range, you can guess that the input is Arabic, for example.
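
The range check described above can be sketched with the JDK alone. The class name, method names and the "more than half of the letters" threshold are assumptions made for illustration. Note that the result only indicates Arabic script, not the Arabic language.

```java
/** Guesses whether a query is in Arabic script by counting letters in U+0600..U+06FF. */
final class ArabicRangeGuess {

    /** Fraction of the letters in the text that fall inside the Arabic block. */
    static double arabicFraction(String text) {
        long letters = text.codePoints().filter(Character::isLetter).count();
        if (letters == 0) {
            return 0.0; // no letters at all, so nothing to guess from
        }
        long arabic = text.codePoints()
                .filter(Character::isLetter)
                .filter(cp -> cp >= 0x0600 && cp <= 0x06FF)
                .count();
        return (double) arabic / letters;
    }

    /** Hypothetical threshold: more than half of the letters are in range. */
    static boolean looksLikeArabicScript(String text) {
        return arabicFraction(text) > 0.5;
    }

    public static void main(String[] args) {
        System.out.println(looksLikeArabicScript("\u0645\u0631\u062D\u0628\u0627"));
        System.out.println(looksLikeArabicScript("hello world"));
    }
}
```

The boolean result could then drive the choice of Analyzer, or trigger a prompt asking the user for the language when the guess is weak.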

The problem comes with differentiating between the languages that use a Latin 
alphabet. Again, there are multiple approaches, although the only one I know 
of that worked pretty well for identifying European languages was to build a 
model based on character bigrams (that is, sequences of two letters) [1].
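
As a rough illustration of the bigram idea (not the method from [1]), the sketch below counts letter pairs in a training sample per language and scores a query by summed log-probabilities. Everything named here is hypothetical: the class, the add-one smoothing, the assumed count of 26 x 26 possible bigrams, and the tiny training sentences, which are far too small for real use.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Toy language guesser built on character bigram counts. */
final class BigramLanguageGuesser {

    // Assumption: size of the bigram "vocabulary" used for add-one smoothing.
    private static final double ASSUMED_BIGRAM_TYPES = 26 * 26;

    private final Map<String, Map<String, Integer>> countsByLanguage = new HashMap<>();
    private final Map<String, Integer> totalsByLanguage = new HashMap<>();

    /** Adds the bigrams of a sample text to the model for one language. */
    void train(String language, String sampleText) {
        Map<String, Integer> counts =
                countsByLanguage.computeIfAbsent(language, k -> new HashMap<>());
        List<String> found = bigrams(sampleText);
        for (String bigram : found) {
            counts.merge(bigram, 1, Integer::sum);
        }
        totalsByLanguage.merge(language, found.size(), Integer::sum);
    }

    /** Returns the trained language whose model gives the text the highest score. */
    String guess(String text) {
        if (countsByLanguage.isEmpty()) {
            throw new IllegalStateException("no language has been trained");
        }
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String language : countsByLanguage.keySet()) {
            double score = score(language, text);
            if (score > bestScore) {
                bestScore = score;
                best = language;
            }
        }
        return best; // ties (e.g. text without bigrams) resolve arbitrarily
    }

    /** Sum of smoothed log-probabilities of the text's bigrams under one model. */
    private double score(String language, String text) {
        Map<String, Integer> counts = countsByLanguage.get(language);
        int total = totalsByLanguage.get(language);
        double logProbability = 0.0;
        for (String bigram : bigrams(text)) {
            int count = counts.getOrDefault(bigram, 0);
            logProbability += Math.log((count + 1.0) / (total + ASSUMED_BIGRAM_TYPES));
        }
        return logProbability;
    }

    /** Lower-cases the text and returns every pair of adjacent letters. */
    static List<String> bigrams(String text) {
        String normalized = text.toLowerCase(Locale.ROOT);
        List<String> result = new ArrayList<>();
        for (int i = 0; i + 1 < normalized.length(); i++) {
            char first = normalized.charAt(i);
            char second = normalized.charAt(i + 1);
            if (Character.isLetter(first) && Character.isLetter(second)) {
                result.add("" + first + second);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        BigramLanguageGuesser guesser = new BigramLanguageGuesser();
        guesser.train("en", "the quick brown fox jumps over the lazy dog");
        guesser.train("de", "der schnelle braune fuchs springt nicht und das ist gut");
        System.out.println(guesser.guess("the dog"));
        System.out.println(guesser.guess("und das"));
    }
}
```

Short queries give such a model very little to work with, which is one reason the guess is better treated as a default that the user can override.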

At the end of the day, Lucene cannot help you in choosing the correct language 
as it doesn't know, and so it'll be up to you to add the necessary logic to 
tell Lucene which Analyzers to utilise. :(

Andy

[1] Churcher, G. E.; Hayes, J.; Hughes, J. S.; Johnson, S.; Souter, C.
Bigram and trigram models for language identification and classification.
In: Evett, L. & Rose, T. (editors), Computational Linguistics for Speech and
Handwriting Recognition, AISB'94 Workshop, University of Leeds/AISB, 1994.

Re: Multi-analyzer ?

Posted by Karl Øie <ka...@gan.no>.

I don't think you can figure out the language from the input box value
alone; I can't see any way to select the correct language analyzer at
that point. What you can do is put the Chinese, Japanese, English and
Dutch content in separate indexes and use MultiSearcher to search
all of them; then you would know which languages return hits.

Best regards, Karl Øie
