Posted to user@lucenenet.apache.org by "Douglas Smith (DataSmithy)" <da...@googlemail.com> on 2007/08/29 05:46:21 UTC

Best Practices for Multiple Languages

Hi All,

Thank you all for your comments earlier on using Lucene as a web service.
I am looking at Solr, and it does have some potential for my application.

In the meantime I have this question: What are the recommended best
practices for using Lucene to index multiple languages? Particularly,
would I be better off using a separate index for each language?

Here is our scenario: We have a database that has several text fields
that will be translated to multiple languages. The text will only be
incrementally translated, and it could take years to get it completely
translated. Also, new untranslated data is always being added. Also, we
may add new languages to be translated at any time. When a user selects
to view our web application in a foreign language, we want the user to
be able to search in either their language, or in English (in order to
guarantee that they can find all data). I probably won't know which
language they actually entered for the text search. I want to search Lucene
in both the English and the selected language, and return any results
that are found.

FYI, I will be using Lucene to return a list of IDs that are unique to
our data, and then joining back to our data, using SQL. I will use our
database to show a mix of translated and untranslated data. That is,
translated data/fields are shown if we have them, otherwise the default
English is shown. So I don't need to get the text itself from Lucene,
just a list of IDs that I can use in my SQL query. I can pull out our
data easily in either language, or a mix, in order to create Lucene
indexes.

If I can mix languages in a single index, I would like to add a Language
column to query on, and query on both the English and the foreign
language text. If not, I can see it working to run two separate
Lucene queries on two separate indexes, and combine the resulting ID
lists into a single list (making it unique, if needed).
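
The two-index fallback described above, combining the ID lists from two
queries into one unique list, could be sketched roughly as follows. This
is only a minimal illustration using the Java standard library: the
class name IdMerger, the method mergeUnique, and the sample IDs are all
hypothetical, and the lists stand in for whatever IDs the two Lucene
queries would return.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Lucene.
public class IdMerger {

    /**
     * Combines the ID lists from two searches into one list without
     * duplicates. IDs keep the order in which they were first seen,
     * so IDs from the first list come first.
     */
    public static List<Integer> mergeUnique(List<Integer> englishIds,
                                            List<Integer> translatedIds) {
        // LinkedHashSet drops duplicates but remembers insertion order.
        Set<Integer> seen = new LinkedHashSet<>(englishIds);
        seen.addAll(translatedIds);
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) {
        // Made-up IDs standing in for the results of the two queries.
        List<Integer> english = Arrays.asList(101, 205, 307);
        List<Integer> translated = Arrays.asList(205, 412);
        System.out.println(mergeUnique(english, translated));
        // prints [101, 205, 307, 412]
    }
}
```

The merged list can then be used as the ID list in the SQL query that
joins back to the database.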

If you have any comments, or feedback from experience doing anything
like this, it would be much appreciated!

Douglas Smith

Unsubscribe

Posted by "Shepherd, Shane" <Sh...@tylertech.com>.
How do I unsubscribe from the mailing list?

RE: Best Practices for Multiple Languages

Posted by George Aroush <ge...@aroush.net>.
Hi Douglas,

Definitely use one Lucene index per language.  This gives you the
simplicity of maintaining and managing a separate index per language, and
better performance per language, since each per-language index will be
much smaller than one Lucene index holding all of your data.

If in the future you need to search across multiple languages, just use a
MultiSearcher.
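
As a rough picture of what searching several per-language indexes at
once involves, here is a conceptual sketch in plain Java. It does not
use the Lucene API: the Hit class and the mergeByScore method are
hypothetical names, and the sketch assumes that scores from the
separate indexes are directly comparable, which may not hold in
practice.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Conceptual illustration only; none of these types come from Lucene.
public class CrossIndexSearch {

    /** One scored match from a single per-language index. */
    public static final class Hit {
        public final int id;
        public final double score;

        public Hit(int id, double score) {
            this.id = id;
            this.score = score;
        }
    }

    /**
     * Merges the hit lists returned by several per-language searches
     * into one list, ordered from highest score to lowest.
     */
    public static List<Hit> mergeByScore(List<List<Hit>> perIndexHits) {
        List<Hit> all = new ArrayList<>();
        for (List<Hit> hits : perIndexHits) {
            all.addAll(hits);
        }
        // Highest score first; ties keep their original relative order.
        all.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return all;
    }
}
```

Each inner list would come from one language's index, for example the
English index and the index for the user's selected language.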

Regards,

-- George
