Posted to solr-user@lucene.apache.org by Juan Antonio Farré Basurte <ju...@reviewpro.com> on 2011/06/02 09:57:49 UTC

Multilingual text analysis

Hello,
Some of the analyzers that can be applied to a text field depend on the language of the text and can be configured for a specific language.
In my case, the text fields can be in many different languages, but each document also includes a field stating the language of its text fields.
Is it possible to configure the analyzers to use the appropriate language for each document, based on the value of the language field?
Thanks,

Juan

Re: Multilingual text analysis

Posted by Paul Libbrecht <pa...@hoplahup.net>.


On 2 June 2011 at 16:27, Juan Antonio Farré Basurte wrote:

> Paul, what do you mean when you say it would make sense to start a page at the solr website?

I meant the solr wiki.

> I was just wondering whether it is possible to parameterize the analyzers based on the value of a field. I think this would be a very elegant solution for many needs. Maybe it could be a possible improvement for future versions of Solr :)

Honestly, I think it is of utmost importance for a CMS manager to know "how much stemming" is wanted, so explicitly configuring which analyzer is used for which language is, I think, really useful, and the schema makes that easy to write.

In one of my search projects, I have a series of unit tests that all fail because the analyzer for, say, Arabic or Hungarian was not "good enough". This always happens, and it's better to be aware of it.

paul

Re: Multilingual text analysis

Posted by Juan Antonio Farré Basurte <ju...@reviewpro.com>.

Thank you both, Paul and Lee, for your answers.
Luckily, in my case the language is known at index time, and we don't really have to worry about the language of the query either, since users can specify the language they are interested in.
So I guess our solution will be to use separate optional fields, one for each language, and that should be good enough.
I was just wondering whether it is possible to parameterize the analyzers based on the value of a field. I think this would be a very elegant solution for many needs. Maybe it could be a possible improvement for future versions of Solr :)

Paul, what do you mean when you say it would make sense to start a page at the solr website?

Thanks again,

Juan

On 2 June 2011 at 16:06, Paul Libbrecht wrote:

> Juan,
> 
> An easy way in Solr, I think, is indeed to use different fields at index time and expand the query over multiple fields at query time.
> I believe using wildcards in field names allows you to specify a different analyzer per language when doing this.
> 
> There have been long discussions on the java-user@lucene.apache.org mailing list about the best design for multilingual indexing and searching. One of the key questions was whether you can reliably detect the language of a query, which is generally very hard.
> 
> It would make sense to start a page at the solr website...
> 
> paul
> 
> 
> On 2 June 2011 at 12:52, lee carroll wrote:
> 
>> Juan
>> 
>> I don't think so.
>> 
>> you can try indexing fields like myfield_en, myfield_fr, myfield_xx
>> if you know what language you are dealing with at index and query time.
>> 
>> you can also have separate cores for your documents for each language
>> if you don't want to complicate your schema
>> again you will need to know language at index and query time
>> 
>> 
>> 
>> On 2 June 2011 08:57, Juan Antonio Farré Basurte
>> <ju...@reviewpro.com> wrote:
>>> Hello,
>>> Some of the analyzers that can be applied to a text field depend on the language of the text and can be configured for a specific language.
>>> In my case, the text fields can be in many different languages, but each document also includes a field stating the language of its text fields.
>>> Is it possible to configure the analyzers to use the appropriate language for each document, based on the value of the language field?
>>> Thanks,
>>> 
>>> Juan
>

Re: Multilingual text analysis

Posted by Paul Libbrecht <pa...@hoplahup.net>.

Juan,

An easy way in Solr, I think, is indeed to use different fields at index time and expand the query over multiple fields at query time.
I believe using wildcards in field names allows you to specify a different analyzer per language when doing this.
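
To illustrate the query-time expansion, here is a minimal sketch in Python that only builds the query string. The function name expand_query and the base field name "myfield" are hypothetical, and the terms are assumed to be already escaped for the query parser; it is not meant as the only way to do this.

```python
def expand_query(terms, languages, base_field="myfield"):
    """Build one clause per language-specific field and OR them together.

    `terms` is assumed to be already escaped for the query parser.
    The field naming scheme (myfield_en, myfield_fr, ...) is an assumption.
    """
    clauses = [f"{base_field}_{lang}:({terms})" for lang in languages]
    return " OR ".join(clauses)


# Example: search the English and French variants of the field.
query = expand_query("hotel", ["en", "fr"])
print(query)
```

When the user states the language of interest, the list shrinks to a single entry and only that field's analyzer is applied to the query.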

There have been long discussions on the java-user@lucene.apache.org mailing list about the best design for multilingual indexing and searching. One of the key questions was whether you can reliably detect the language of a query, which is generally very hard.

It would make sense to start a page at the solr website...

paul

On 2 June 2011 at 12:52, lee carroll wrote:

> Juan
> 
> I don't think so.
> 
> you can try indexing fields like myfield_en, myfield_fr, myfield_xx
> if you know what language you are dealing with at index and query time.
> 
> you can also have separate cores for your documents for each language
> if you don't want to complicate your schema
> again you will need to know language at index and query time
> 
> 
> 
> On 2 June 2011 08:57, Juan Antonio Farré Basurte
> <ju...@reviewpro.com> wrote:
>> Hello,
>> Some of the analyzers that can be applied to a text field depend on the language of the text and can be configured for a specific language.
>> In my case, the text fields can be in many different languages, but each document also includes a field stating the language of its text fields.
>> Is it possible to configure the analyzers to use the appropriate language for each document, based on the value of the language field?
>> Thanks,
>> 
>> Juan

Re: Multilingual text analysis

Posted by lee carroll <le...@googlemail.com>.

Juan

I don't think so.

you can try indexing fields like myfield_en, myfield_fr, myfield_xx
if you know what language you are dealing with at index and query time.

you can also have separate cores for your documents for each language
if you don't want to complicate your schema
again you will need to know language at index and query time
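
A minimal sketch of the per-language field idea, done on the client side before sending documents to the index. The input keys "language" and "text", the set of supported languages, and the function name to_index_document are assumptions for illustration; only the myfield_xx naming comes from the suggestion above.

```python
# Hypothetical set of languages that have their own analyzed field.
SUPPORTED_LANGUAGES = {"en", "fr", "es"}


def to_index_document(doc, base_field="myfield", fallback="xx"):
    """Move doc["text"] into a language-suffixed field such as myfield_en.

    The language is read from doc["language"]; unknown or missing
    languages are routed to the generic fallback field (myfield_xx).
    """
    lang = doc.get("language", "").lower()
    if lang not in SUPPORTED_LANGUAGES:
        lang = fallback
    # Keep every other field unchanged and drop the generic "text" key.
    indexed = {key: value for key, value in doc.items() if key != "text"}
    indexed[f"{base_field}_{lang}"] = doc["text"]
    return indexed


print(to_index_document({"id": "1", "language": "fr", "text": "bonjour"}))
```

The same language value could instead be used to pick a core per language, which keeps the schema simple at the cost of managing several cores.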



On 2 June 2011 08:57, Juan Antonio Farré Basurte
<ju...@reviewpro.com> wrote:
> Hello,
> Some of the analyzers that can be applied to a text field depend on the language of the text and can be configured for a specific language.
> In my case, the text fields can be in many different languages, but each document also includes a field stating the language of its text fields.
> Is it possible to configure the analyzers to use the appropriate language for each document, based on the value of the language field?
> Thanks,
>
> Juan