Posted to solr-user@lucene.apache.org by Alejandro Valdez <al...@gmail.com> on 2009/01/27 21:05:40 UTC

Indexing documents in multiple languages

Hi, I plan to use Solr to index a large number of documents extracted
from email bodies. These documents could be in different languages,
and a single document could be in more than one language. Likewise,
the query string could contain words in different languages.

I read that a common approach to indexing multilingual documents is to
use some algorithm (e.g. character n-grams) to detect each document's
language, then apply a stemmer for that language, and finally index the
document in a separate index per language.
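
For reference, the n-gram idea usually means building a character n-gram
frequency profile per language from sample text and comparing each incoming
document against those profiles. Below is a minimal, self-contained Java
sketch of that technique; the training samples and language codes are made
up for illustration, and a real setup would use a dedicated
language-identification library trained on much more text.

import java.util.HashMap;
import java.util.Map;

// Toy character-trigram language guesser; illustration only.
public class NGramLanguageGuesser {

    private final Map<String, Map<String, Integer>> profiles = new HashMap<>();

    // Count character trigrams in a piece of text.
    private static Map<String, Integer> trigrams(String text) {
        Map<String, Integer> counts = new HashMap<>();
        String t = " " + text.toLowerCase() + " ";
        for (int i = 0; i + 3 <= t.length(); i++) {
            counts.merge(t.substring(i, i + 3), 1, Integer::sum);
        }
        return counts;
    }

    // Register a language from some sample text.
    public void train(String language, String sampleText) {
        profiles.put(language, trigrams(sampleText));
    }

    // Return the language whose trigram profile overlaps the document most.
    public String guess(String document) {
        Map<String, Integer> doc = trigrams(document);
        String best = "unknown";
        long bestScore = 0;
        for (Map.Entry<String, Map<String, Integer>> lang : profiles.entrySet()) {
            long score = 0;
            for (Map.Entry<String, Integer> g : doc.entrySet()) {
                score += (long) g.getValue() * lang.getValue().getOrDefault(g.getKey(), 0);
            }
            if (score > bestScore) {
                bestScore = score;
                best = lang.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        NGramLanguageGuesser guesser = new NGramLanguageGuesser();
        // Tiny made-up training samples; real profiles need far more text.
        guesser.train("en", "the quick brown fox jumps over the lazy dog and the cat");
        guesser.train("es", "el rapido zorro marron salta sobre el perro perezoso y el gato");
        System.out.println(guesser.guess("the dog and the fox"));  // likely "en"
        System.out.println(guesser.guess("el perro y el zorro"));  // likely "es"
    }
}

Documents that mix languages are exactly where this breaks down: the profile
overlap gets split between languages, which is why some setups detect per
paragraph or keep an unstemmed fallback field.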

Since neither the document language nor the query language can be
detected reliably, I think it makes no sense to use a stemmer on them,
because a stemmer is tied to a specific language.

My plan is to index all the documents in the same index, without any
stemming (users will have to search for the exact words they are
looking for).
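
Just to make the trade-off concrete: with no stemming the index only stores
the surface forms that appear in the text, so a query token has to match
exactly. A toy illustration in plain Java (not Solr code, just a hypothetical
inverted index over whitespace tokens):

import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy inverted index over surface forms only (no stemming); illustration only.
public class ExactTermDemo {
    public static void main(String[] args) {
        String[] docs = {
            "the users were running the report",          // doc 0, English
            "los usuarios estaban corriendo el informe"   // doc 1, Spanish
        };
        Map<String, Set<Integer>> postings = new HashMap<>();
        for (int id = 0; id < docs.length; id++) {
            for (String token : docs[id].toLowerCase().split("\\s+")) {
                postings.computeIfAbsent(token, k -> new HashSet<>()).add(id);
            }
        }
        Set<Integer> none = Collections.emptySet();
        // Exact surface forms match:
        System.out.println(postings.getOrDefault("running", none));    // [0]
        System.out.println(postings.getOrDefault("corriendo", none));  // [1]
        // Without stemming, other inflections of the same word do not:
        System.out.println(postings.getOrDefault("run", none));        // []
        System.out.println(postings.getOrDefault("corre", none));      // []
    }
}

The upside is that one analysis chain works for every language; the downside
is that "run" never finds "running" in any of them.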

But I'm not sure whether this approach will make the index too big or too
slow, or whether there is a better way to index this kind of document.

Any suggestion will be much appreciated.

Re: Indexing documents in multiple languages

Posted by Erick Erickson <er...@gmail.com>.
First, I'd search the mailing list archive for the topic of languages;
it's been discussed often, and there's a wealth of information there
that might be of benefit, far more than I can remember.

As to whether your approach will be "too big, too slow...", you
really haven't given enough information to go on. Here are a few
of the questions whose answers would help: How many e-mails are
you indexing? Are you indexing attachments? How many users do you
expect to be using this system? What are your target response
times? What queries-per-second are you designing for? How dynamic
is the index (that is, how many e-mails do you expect to add per
day, and what latency can you live with between the time an e-mail
is indexed and when it's searchable)?

If you're indexing 10,000 e-mails, it's one thing. If you're indexing
1,000,000,000 e-mails it's another.

Best
Erick


Re: Indexing documents in multiple languages

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Alejandro,

What you really want to do is identify the language of the email, store that in the index, and apply the appropriate analyzer. At query time you also want to know the language of the query (either by detecting it or asking the user or ...).
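
A rough sketch of what that can look like, assuming the language has already
been detected upstream. The field names (language, body_en, body_es, body_txt)
are made up for illustration; in Solr each of them would get its own analyzer
chain in schema.xml (or via something like PerFieldAnalyzerWrapper in plain
Lucene):

import java.util.HashMap;
import java.util.Map;

// Sketch: store the detected language and route the body into a
// per-language field so each field can use the matching analyzer.
// All field names here are hypothetical.
public class LanguageRouting {

    private static final Map<String, String> FIELD_FOR_LANGUAGE = new HashMap<>();
    static {
        FIELD_FOR_LANGUAGE.put("en", "body_en");   // field analyzed with an English chain
        FIELD_FOR_LANGUAGE.put("es", "body_es");   // field analyzed with a Spanish chain
    }

    // Build the field/value pairs that would be sent to the index for one email.
    public static Map<String, String> toIndexFields(String body, String detectedLanguage) {
        Map<String, String> fields = new HashMap<>();
        fields.put("language", detectedLanguage);  // stored for filtering/faceting
        String field = FIELD_FOR_LANGUAGE.getOrDefault(detectedLanguage, "body_txt"); // unstemmed fallback
        fields.put(field, body);
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(toIndexFields("el zorro salta sobre el perro", "es"));
        System.out.println(toIndexFields("the fox jumps over the dog", "en"));
    }
}

Query time is the mirror image: detect (or ask for) the query language, search
the matching field, and fall back to the unstemmed field when detection is
uncertain.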

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


