Posted to solr-user@lucene.apache.org by Andrew McCombe <eu...@gmail.com> on 2009/07/22 12:12:49 UTC

Best approach to multiple languages

Hi

We have a dataset that contains productname, category and descriptions.  The
descriptions can be in one or more different languages.  What would be the
recommended way of indexing these?

My initial thoughts are to index each description as a separate field and
append the language identifier to the field name, for example, three fields
with description_en, description_de, description_fr.  Is this the best
approach or is there a better way?

Regards
Andrew McCombe
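
A minimal sketch of the field-per-language idea in the message above, in
Python. The helper name to_solr_doc, the "id" field and the sample product
are hypothetical and only illustrate how the language identifier could be
appended to the field name; nothing here is taken from an actual schema.

```python
def to_solr_doc(product_id, productname, category, descriptions):
    """Flatten one product into a dict with one description field per language.

    `descriptions` maps a language code (e.g. "en") to the description text.
    """
    doc = {"id": product_id, "productname": productname, "category": category}
    for lang, text in sorted(descriptions.items()):
        # Append the language identifier to the field name: description_en, ...
        doc["description_" + lang] = text
    return doc


# Hypothetical sample product with three descriptions.
doc = to_solr_doc(
    "p1",
    "Kettle",
    "Kitchen",
    {"en": "Electric kettle", "de": "Wasserkocher", "fr": "Bouilloire"},
)
print(sorted(doc))
```

Each resulting field name would still have to be declared in schema.xml,
either one by one or through a dynamic field pattern.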

Re: Best approach to multiple languages

Posted by Julian Davchev <jm...@drun.net>.

Hi,

We have a similar case: we don't want to search all of those languages at
once, just one of them.
So we took the approach of a separate index for each language. From
what I know, it also helps avoid skewing the relevance statistics
(how often terms are used in an index, etc.).

If you dig through the mailing list, you'll find this has been discussed
many times.

Re: Best approach to multiple languages

Posted by Andrew McCombe <eu...@gmail.com>.

Hi

Thanks for posting this. Helps a lot with my application.

Andrew

2009/7/22 Ed Summers <eh...@pobox.com>

> On Wed, Jul 22, 2009 at 11:35 AM, Grant Ingersoll<gs...@apache.org> wrote:
> >> My initial thoughts are to index each description as a separate field and
> >> append the language identifier to the field name, for example, three fields
> >> with description_en, description_de, description_fr.  Is this the best
> >> approach or is there a better way?
>
> FWIW, this approach is essentially what we did at the Library of
> Congress to support multi-lingual fulltext search in the World Digital
> Library [1] webapp. It seems to have paid off pretty well, since we
> were able to configure analysis on a per-language basis.
>
> In case you are curious I've attached a copy of our schema.xml to give
> you an idea of what we did.
>
> //Ed
>
> [1] http://www.wdl.org/
>

Re: Best approach to multiple languages

Posted by aniljayanti <an...@yahoo.co.in>.

Hi 

Thanks for your post. I am looking for this type of multiple-language
indexing and searching in Solr. Below is my post on the Lucene list. Can you
please help me out with this?

http://lucene.472066.n3.nabble.com/Indexing-Multiple-Languages-with-solr-Arabic-amp-English-td4104580.html

thanks in advance,

aniljayanti

Re: Best approach to multiple languages

Posted by Olivier Dobberkau <ol...@dkd.de>.

On 22.07.2009 at 18:31, Ed Summers wrote:

> In case you are curious I've attached a copy of our schema.xml to give
> you an idea of what we did.


Thanks for sharing!

--
Olivier Dobberkau

Re: Best approach to multiple languages

Posted by Ed Summers <eh...@pobox.com>.

On Wed, Jul 22, 2009 at 11:35 AM, Grant Ingersoll<gs...@apache.org> wrote:
>> My initial thoughts are to index each description as a separate field and
>> append the language identifier to the field name, for example, three fields
>> with description_en, description_de, description_fr.  Is this the best
>> approach or is there a better way?

FWIW, this approach is essentially what we did at the Library of
Congress to support multi-lingual fulltext search in the World Digital
Library [1] webapp. It seems to have paid off pretty well, since we
were able to configure analysis on a per-language basis.

In case you are curious I've attached a copy of our schema.xml to give
you an idea of what we did.

//Ed

[1] http://www.wdl.org/
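
To illustrate what "analysis on a per-language basis" can buy you, here is a
toy sketch in Python. It is not the attached schema.xml and not Solr's
analyzer API: the stopword lists and the analyze helper are hypothetical
stand-ins for the analyzer chains that would be configured per field type.

```python
import re

# Hypothetical, deliberately tiny stopword lists; a real setup would rely on
# the analyzers configured for each language's field type in schema.xml.
STOPWORDS = {
    "en": {"the", "a", "of", "and"},
    "de": {"der", "die", "das", "und"},
    "fr": {"le", "la", "les", "et"},
}


def analyze(text, lang):
    """Lowercase, split on non-word characters, drop that language's stopwords."""
    tokens = [t for t in re.split(r"\W+", text.lower()) if t]
    stop = STOPWORDS.get(lang, set())
    return [t for t in tokens if t not in stop]


print(analyze("The history of the library", "en"))
print(analyze("Die Geschichte der Bibliothek", "de"))
```

The same token can need different treatment depending on the field it lands
in: "die" is dropped as a stopword for German here but kept for English,
which is only possible because each language has its own field.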

Re: Best approach to multiple languages

Posted by Grant Ingersoll <gs...@apache.org>.

Typically, people go with one of three options:

1. Put 'em all in one big field
2. Split fields (as you and others have described). Not sure why no one
   ever splits on documents, which is viable too, but comes with
   repeated data
3. Split indexes

For your case, #1 isn't going to work since you want to search in a
specific language.  I'd likely go with #2, but #3 has its merits
too.  #3 allows for managing the languages separately (you can update
the Spanish document w/o affecting the English version, and you can also
take a whole collection offline w/o affecting the other
indexes), which can sometimes be helpful, but the cost is more
operational complexity, etc.

-Grant
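
A rough sketch, in Python, of how options #2 and #3 could differ at query
time when the user's language is known up front. The base URL, the
products_<lang> core naming and both helper names are assumptions made for
illustration; only the general /select?q=... request shape is standard Solr.

```python
from urllib.parse import urlencode

# Assumed base URL and naming conventions, for illustration only.
SOLR_BASE = "http://localhost:8983/solr"


def split_fields_url(terms, lang):
    """Option #2: one index, query the description field for that language."""
    params = {"q": "description_%s:(%s)" % (lang, terms)}
    return "%s/select?%s" % (SOLR_BASE, urlencode(params))


def split_indexes_url(terms, lang):
    """Option #3: one core per language, each with a plain description field."""
    params = {"q": "description:(%s)" % terms}
    return "%s/products_%s/select?%s" % (SOLR_BASE, lang, urlencode(params))


print(split_fields_url("kettle", "en"))
print(split_indexes_url("kettle", "de"))
```

With split fields, the language only changes the field name inside the
query; with split indexes, it changes which core receives the request.
Raw user input would also need query-syntax escaping, which this sketch
leaves out.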

On Jul 22, 2009, at 12:39 PM, Andrew McCombe wrote:

> Hi
>
> We will know the user's language choice before searching.
>
> Regards
> Andrew
>
> 2009/7/22 Grant Ingersoll <gs...@apache.org>
>
>> How do you want to search those descriptions?  Do you know the query
>> language going in?

Re: Best approach to multiple languages

Posted by Andrew McCombe <eu...@gmail.com>.

Hi

We will know the user's language choice before searching.

Regards
Andrew

2009/7/22 Grant Ingersoll <gs...@apache.org>

> How do you want to search those descriptions?  Do you know the query
> language going in?

Re: Best approach to multiple languages

Posted by Grant Ingersoll <gs...@apache.org>.

How do you want to search those descriptions?  Do you know the query  
language going in?

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search