Posted to java-user@lucene.apache.org by Alexander Mashtakov <am...@gmail.com> on 2006/07/09 13:49:09 UTC

Indexing and searching multiple languages

Hi folks,
I'd like to ask your advice on how to organize an index for documents
in multiple languages.

The input:

 A database holds the document metadata. Each document consists of
 language-neutral attributes, such as document_id, date, and category mapping,
 and language-dependent attributes, such as title, author, abstract, etc.

 Each document has a default-language record ("EN") and may have several
 records with the language-dependent attributes translated into other
 languages, for example "Russian" (one record per language, with an FK to
 document_id).

 Each document has a list of "attachments" (PDF, MS Office files), each with a
 language "indicator". An attachment's language is selected from a controlled
 vocabulary during file upload; the vocabulary includes Western and Eastern
 European languages.

The search should run over both the document metadata and the attachments,
in -ALL- languages (i.e. the user just types in a search term and clicks a
button; how to detect the input language in order to apply the appropriate
analyzer to QueryParser is probably a separate topic).


At the moment I'm considering the following alternatives:


1. The simple one: create one record per document with the basic
   metadata structure and include all languages for a given attribute
   in a single field. For example, title will contain -ALL- translations
   (EN, RU, etc.). The "Contents" field will hold the text of -ALL-
   attachments for a given document.

2. Create a single record for each metadata language in one index. Create
   a second index with the attachments, one record per document.
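
To make the two alternatives concrete, here is a rough sketch of the record
layouts in plain Java. It does not use the Lucene API: ordinary maps stand in
for index records, and the class name IndexLayouts, the method names, and the
"language" field are hypothetical, chosen only for illustration. Only the
title attribute is shown; author and abstract would be handled the same way.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: plain maps stand in for index records.
class IndexLayouts {

    // Alternative 1: one record per document. All translations of an
    // attribute are concatenated into one field, and all attachment
    // texts are concatenated into "contents".
    static Map<String, String> mergedRecord(String documentId,
                                            Map<String, String> titlesByLanguage,
                                            List<String> attachmentTexts) {
        Map<String, String> record = new LinkedHashMap<>();
        record.put("document_id", documentId);
        record.put("title", String.join(" ", titlesByLanguage.values()));
        record.put("contents", String.join(" ", attachmentTexts));
        return record;
    }

    // Alternative 2 (metadata index only): one record per document per
    // language, each carrying document_id so results can be grouped later.
    static List<Map<String, String>> perLanguageRecords(String documentId,
                                                        Map<String, String> titlesByLanguage) {
        List<Map<String, String>> records = new ArrayList<>();
        for (Map.Entry<String, String> entry : titlesByLanguage.entrySet()) {
            Map<String, String> record = new LinkedHashMap<>();
            record.put("document_id", documentId);
            record.put("language", entry.getKey());
            record.put("title", entry.getValue());
            records.add(record);
        }
        return records;
    }
}
```

With the per-language layout, each record could in principle be analyzed with
an analyzer matching its language, which the merged layout does not allow.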


The first approach is easier, but I'm not sure whether the score will be
calculated correctly. In the second approach, I don't know how to "join" the
results from MultiQuery, and I don't know how it will affect performance
(sorry, I've just started to experiment with Lucene).
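
By "join" I mean something like the following sketch, again in plain Java with
no Lucene calls. The Hit type and the ResultJoiner class are hypothetical, and
keeping the highest score per document_id is only an assumption on my part;
scores coming from two separate indexes may not be directly comparable.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: merge hits from the metadata index and the
// attachments index into one ranked list keyed by document_id.
class ResultJoiner {

    // A search hit reduced to the two values the join needs.
    static final class Hit {
        final String documentId;
        final float score;

        Hit(String documentId, float score) {
            this.documentId = documentId;
            this.score = score;
        }
    }

    static List<Hit> join(List<Hit> metadataHits, List<Hit> attachmentHits) {
        Map<String, Float> best = new LinkedHashMap<>();
        collect(metadataHits, best);
        collect(attachmentHits, best);

        List<Hit> joined = new ArrayList<>();
        for (Map.Entry<String, Float> entry : best.entrySet()) {
            joined.add(new Hit(entry.getKey(), entry.getValue()));
        }
        // Highest score first.
        joined.sort((a, b) -> Float.compare(b.score, a.score));
        return joined;
    }

    // Assumption: a document is represented by its best-scoring hit,
    // whichever language record or index that hit came from.
    private static void collect(List<Hit> hits, Map<String, Float> best) {
        for (Hit hit : hits) {
            Float previous = best.get(hit.documentId);
            if (previous == null || hit.score > previous) {
                best.put(hit.documentId, hit.score);
            }
        }
    }
}
```

This collapses several language records of the same document into one result,
but it runs over every hit from both indexes, which is where my performance
concern comes from.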

Any ideas or suggestions?

Thank you,
/Alexander