Posted to java-user@lucene.apache.org by Alexander Mashtakov <am...@gmail.com> on 2006/07/09 13:49:09 UTC
Indexing and searching multiple languages
Hi folks,
I'd like to ask your advice about how to organize an index for documents
in multiple languages.
The input is a database that holds the document metadata. Each document
consists of language-neutral attributes, such as document_id, date and
category mappings, and language-dependent attributes, such as title,
author, abstract, etc.
Each document has a default language record ("EN") and may have several
more records with the language-dependent attributes translated into
other languages, for example "Russian" (one record per language, with
an FK to document_id).
Each document has a list of "attachments" (PDF, MS Office files), each
with a language "indicator". An attachment's language is selected from a
controlled vocabulary during file upload; the vocabulary includes
Western and Eastern European languages.
The search should cover both the document metadata and the attachments,
in -ALL- languages (i.e. the user just types in a search term and clicks
a button; how to detect the input language in order to apply the
appropriate analyzer to QueryParser is probably a separate topic).
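As a rough illustration of the input-language question, one cheap first
step is to look at the script of the typed term. The sketch below uses
only the JDK (no Lucene calls); the class and method names are
hypothetical. It can only narrow the choice: Cyrillic hints at Russian,
but Latin does not distinguish English from German or French, so this is
an assumption-laden heuristic and not real language detection.

```java
import java.lang.Character.UnicodeScript;
import java.util.EnumMap;
import java.util.Map;

// Hypothetical helper: guesses the dominant script of a search term.
final class QueryScriptGuesser {

    /** Returns the most frequent script among the letters, or COMMON if there are none. */
    static UnicodeScript dominantScript(String text) {
        Map<UnicodeScript, Integer> counts = new EnumMap<>(UnicodeScript.class);
        // Count only letters; digits, spaces and punctuation carry no script hint.
        text.codePoints()
            .filter(Character::isLetter)
            .forEach(cp -> counts.merge(UnicodeScript.of(cp), 1, Integer::sum));

        UnicodeScript best = UnicodeScript.COMMON;
        int bestCount = 0;
        for (Map.Entry<UnicodeScript, Integer> e : counts.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }
}
```

The result could then be mapped to an analyzer choice, with a fallback
for scripts shared by several languages.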
At the moment I'm considering the following alternatives:
1. The simple one: create one record per document with the basic
metadata structure and include all languages for a given attribute
in a single field. For example, title will contain -ALL- translations
(EN, RU, etc.).
The "Contents" field will hold -ALL- attachment texts for a given
document.
2. Create a single record for each metadata language in one index.
Create a second index with attachments, one record per document.
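To make the two layouts concrete, here is a minimal sketch that models
the records as plain maps of field name to text. It deliberately uses no
Lucene classes; the class name, the method names and the "language"
field are assumptions for illustration only. The input maps a language
code to that language's translated attributes.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of the two index layouts, using plain maps as records.
final class IndexLayoutSketch {

    /** Alternative 1: one record per document; each field holds all translations. */
    static Map<String, String> combinedRecord(String documentId,
                                              Map<String, Map<String, String>> byLanguage) {
        Map<String, String> record = new LinkedHashMap<>();
        record.put("document_id", documentId);
        for (Map<String, String> attrs : byLanguage.values()) {
            // Append each translation to the shared field, separated by a space.
            attrs.forEach((field, value) ->
                    record.merge(field, value, (a, b) -> a + " " + b));
        }
        return record;
    }

    /** Alternative 2 (metadata index): one record per document per language. */
    static List<Map<String, String>> perLanguageRecords(String documentId,
                                                        Map<String, Map<String, String>> byLanguage) {
        List<Map<String, String>> records = new ArrayList<>();
        byLanguage.forEach((language, attrs) -> {
            Map<String, String> record = new LinkedHashMap<>();
            record.put("document_id", documentId);
            record.put("language", language); // allows a per-language analyzer later
            record.putAll(attrs);
            records.add(record);
        });
        return records;
    }
}
```

In the second layout every record keeps document_id, so hits from the
metadata index and the attachment index can be matched up afterwards.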
The first approach is easier, but I'm not sure whether the score will be
calculated correctly.
In the second approach, I don't know how to "join" the results from
MultiQuery, and I don't know how it will affect performance (sorry, I've
just started to experiment with Lucene).
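For the "join" in the second approach, one possible reading is a merge
of the two hit lists on document_id. The sketch below is plain Java, not
Lucene API: Hit is a hypothetical (document_id, score) pair, and keeping
the higher of the two scores is just one assumed policy. Scores coming
from two different indexes may not be directly comparable, so the
resulting order should be treated with caution.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical join of hits from the metadata index and the attachment index.
final class HitJoinSketch {

    /** A search hit reduced to document_id and score; not a Lucene class. */
    static final class Hit {
        final String documentId;
        final float score;

        Hit(String documentId, float score) {
            this.documentId = documentId;
            this.score = score;
        }
    }

    /** Keeps the best score per document_id and sorts by descending score. */
    static List<Hit> join(List<Hit> metadataHits, List<Hit> attachmentHits) {
        Map<String, Float> best = new LinkedHashMap<>();
        collectBest(best, metadataHits);
        collectBest(best, attachmentHits);

        List<Hit> joined = new ArrayList<>();
        best.forEach((id, score) -> joined.add(new Hit(id, score)));
        joined.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return joined;
    }

    private static void collectBest(Map<String, Float> best, List<Hit> hits) {
        for (Hit h : hits) {
            // Several records (languages) may map to the same document.
            best.merge(h.documentId, h.score, Float::max);
        }
    }
}
```

Because the merge happens in memory over the returned hits, its cost
grows with the number of hits fetched from each index, not with the
index size.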
Any ideas or suggestions?
Thank you,
/Alexander