Posted to java-user@lucene.apache.org by Melanie Langlois <Me...@tradingscreen.com> on 2007/03/22 07:03:03 UTC
indexing rss feeds in multiple languages
Hi,
I saw that there are many posts on the mailing list about indexing in multiple languages, so I will try not to post a duplicate question. In my case, I want to index RSS feeds, where one feed contains several items in different languages plus some data common to all the items (date, source...). After reading the different posts, I think I will create one document per item, index them all in the same index using a language-specific analyzer each time, and store a lang field so that searches can be restricted by language. But I'm wondering how I should handle the common fields. It seems I have two options:
1: Store the common data in each item. What happens if duplicate information is entered? Is it duplicated in the index?
2: Create a separate document for the common data. In this case I will need to link this data to all the underlying items by storing some IDs. The issue is that I would need to search the index twice if the search is done only by date, because I would then need to retrieve the items' contents.
Thanks in advance for your help.
Mélanie
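To make the trade-off between the two options concrete, here is a minimal plain-Java sketch. It is only a model and does not use Lucene's API: the maps stand in for index documents, the loops stand in for searches, and the field names (type, feedId, lang, date, source, content) and sample values are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Plain-Java model of the two layouts; the maps stand in for index documents. */
public class CommonFieldLayoutSketch {

    /** Option 1: every item document carries its own copy of the feed's date. */
    public static List<String> searchDenormalized(List<Map<String, String>> docs, String date) {
        List<String> contents = new ArrayList<>();
        for (Map<String, String> doc : docs) {   // single pass over the item documents
            if (date.equals(doc.get("date"))) {
                contents.add(doc.get("content"));
            }
        }
        return contents;
    }

    /** Option 2: the date lives on a feed document; items only hold a feedId. */
    public static List<String> searchLinked(List<Map<String, String>> docs, String date) {
        List<String> feedIds = new ArrayList<>();
        for (Map<String, String> doc : docs) {   // first search: feeds matching the date
            if ("feed".equals(doc.get("type")) && date.equals(doc.get("date"))) {
                feedIds.add(doc.get("feedId"));
            }
        }
        List<String> contents = new ArrayList<>();
        for (Map<String, String> doc : docs) {   // second search: items of those feeds
            if ("item".equals(doc.get("type")) && feedIds.contains(doc.get("feedId"))) {
                contents.add(doc.get("content"));
            }
        }
        return contents;
    }

    /** Hypothetical sample data for option 1 (common fields repeated per item). */
    public static List<Map<String, String>> denormalizedDocs() {
        return List.of(
            Map.of("lang", "en", "date", "2007-03-22", "source", "example-feed", "content", "Oil prices rise"),
            Map.of("lang", "fr", "date", "2007-03-22", "source", "example-feed", "content", "Les prix montent"),
            Map.of("lang", "en", "date", "2007-03-21", "source", "other-feed", "content", "Older item"));
    }

    /** Hypothetical sample data for option 2 (common fields on a separate feed document). */
    public static List<Map<String, String>> linkedDocs() {
        return List.of(
            Map.of("type", "feed", "feedId", "f1", "date", "2007-03-22", "source", "example-feed"),
            Map.of("type", "item", "feedId", "f1", "lang", "en", "content", "Oil prices rise"),
            Map.of("type", "item", "feedId", "f1", "lang", "fr", "content", "Les prix montent"),
            Map.of("type", "feed", "feedId", "f2", "date", "2007-03-21", "source", "other-feed"),
            Map.of("type", "item", "feedId", "f2", "lang", "en", "content", "Older item"));
    }

    public static void main(String[] args) {
        System.out.println(searchDenormalized(denormalizedDocs(), "2007-03-22"));
        System.out.println(searchLinked(linkedDocs(), "2007-03-22"));
    }
}
```

Both methods return the same item contents for a given date, but the linked layout needs two passes over the documents, which is the double search described in option 2.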
Re: indexing rss feeds in multiple languages
Posted by Antony Bowesman <ad...@teamware.com>.
Melanie Langlois wrote:
> Well, thanks, sounds like the best option to me. Does anybody use the
> PerFieldAnalyzerWrapper? I'm just curious to know if there is any impact on
> performance when using different analyzers.
I've not done any specific comparisons between using a single Analyzer and
multiple Analyzers with PFAW, but our indexes are typically 20-25 fields, each of
which can have a different analyzer depending on language or field type,
although in practice about 8-10 fields may use the non-default analyzer.
Performance is pretty good in any case and there's not been any noticeable
degradation when tweaking analyzers.
Antony
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
RE: indexing rss feeds in multiple languages
Posted by Melanie Langlois <Me...@tradingscreen.com>.
Well, thanks, sounds like the best option to me. Does anybody use the PerFieldAnalyzerWrapper? I'm just curious to know if there is any impact on performance when using different analyzers.
Mélanie
-----Original Message-----
From: Doron Cohen [mailto:DORONC@il.ibm.com]
Sent: Thursday, March 22, 2007 3:56 PM
To: java-user@lucene.apache.org
Subject: Re: indexing rss feeds in multiple languages
If language is known also at search time, PerFieldAnalyzerWrapper seems a
nice third option: single document per feed, with a separate field for each
language, additional field(s) for the common data; using
PerFieldAnalyzerWrapper at both indexing and search; using FieldSelector
at search to retrieve only the relevant field(s) for matched documents.
(never done this myself though.)
- Doron
Re: indexing rss feeds in multiple languages
Posted by Doron Cohen <DO...@il.ibm.com>.
If language is known also at search time, PerFieldAnalyzerWrapper seems a
nice third option: single document per feed, with a separate field for each
language, additional field(s) for the common data; using
PerFieldAnalyzerWrapper at both indexing and search; using FieldSelector
at search to retrieve only the relevant field(s) for matched documents.
(never done this myself though.)
- Doron
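A rough way to picture what a per-field analyzer wrapper does is a lookup from field name to analyzer, with a default as fallback. The sketch below is a plain-Java model of that idea, not Lucene's own classes: ToyAnalyzer, PerFieldWrapper, the field names content_en / content_fr / source, and the tiny stop-word lists are all hypothetical and chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

/** Plain-Java model of per-field analyzer dispatch; not Lucene's own classes. */
public class PerFieldDispatchSketch {

    /** Hypothetical stand-in for an analyzer: text in, list of index terms out. */
    public interface ToyAnalyzer extends Function<String, List<String>> {}

    /** Lowercases, splits on non-letter characters, and drops the given stop words. */
    public static ToyAnalyzer stopWordAnalyzer(Set<String> stopWords) {
        return text -> {
            List<String> terms = new ArrayList<>();
            for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
                if (!token.isEmpty() && !stopWords.contains(token)) {
                    terms.add(token);
                }
            }
            return terms;
        };
    }

    /** Picks an analyzer by field name, falling back to a default analyzer. */
    public static final class PerFieldWrapper {
        private final ToyAnalyzer defaultAnalyzer;
        private final Map<String, ToyAnalyzer> byField = new HashMap<>();

        public PerFieldWrapper(ToyAnalyzer defaultAnalyzer) {
            this.defaultAnalyzer = defaultAnalyzer;
        }

        public void addAnalyzer(String field, ToyAnalyzer analyzer) {
            byField.put(field, analyzer);
        }

        public List<String> analyze(String field, String text) {
            return byField.getOrDefault(field, defaultAnalyzer).apply(text);
        }
    }

    /** One language-specific field per language; common fields use the default. */
    public static PerFieldWrapper buildWrapper() {
        PerFieldWrapper wrapper = new PerFieldWrapper(stopWordAnalyzer(Set.of()));
        wrapper.addAnalyzer("content_en", stopWordAnalyzer(Set.of("the", "a", "of")));
        wrapper.addAnalyzer("content_fr", stopWordAnalyzer(Set.of("le", "la", "les", "de")));
        return wrapper;
    }

    public static void main(String[] args) {
        PerFieldWrapper wrapper = buildWrapper();
        System.out.println(wrapper.analyze("content_en", "The price of oil"));
        System.out.println(wrapper.analyze("content_fr", "Le prix de la baguette"));
        System.out.println(wrapper.analyze("source", "The Wire"));
    }
}
```

Passing the same wrapper to both the indexing side and the query side keeps query terms consistent with the indexed terms of each language field, which is presumably why the suggestion is to use it at both indexing and search.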