You are viewing a plain text version of this content. The canonical link for it is here.

Posted to java-user@lucene.apache.org by Melanie Langlois <Me...@tradingscreen.com> on 2007/03/22 07:03:03 UTC

indexing rss feeds in multiple languages

Hi,

 

I saw that there are many post on the mailing list about indexing in multiple language, so I will try to not post duplicate question. In my case, I want to index rss feeds, so one feed contains several items in different languages, and some common data for all the items (date, source..).  After reading the different posts, I think I will create a document per item, index them in the same index using each time a language specific analyzer, and store lang field for specific search. But I'm wondering how I should handle the common fields, it seems I have two options:

1 : store the common data in each item. What happen if duplicate information are entered, are they duplicate in the index ?

 

2 : create a separate document for the common data. In this case I will need to link these data to all underlying items storing some ids. The issue is that I would need to search the index twice if the search is done only per date, because I would need to retrieve the items contents. 

 

Thank in advance for your help.

 

Mélanie

Re: indexing rss feeds in multiple languages

Posted by Antony Bowesman <ad...@teamware.com>.

Melanie Langlois wrote:
> Well, thanks, sounds like the best option to me. Does anybody use the
> PerFieldAnalyzerWrapper? I'm just curious to know if there is any impact on
> the performances when using different analyzers.

I've not done any specifc comparisons between using a single Analyzer and 
multiple Analyzer with PFAW, but our indexes are typically 20-25 fields, each of 
which can have a different analyzer depending on language or field type, 
although in practice about 8-10 fields may use the non-default analyzer.

Performance is pretty good in any case and there's not been any noticeable 
degradtion when tweaking analyzers.
Antony

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

RE: indexing rss feeds in multiple languages

Posted by Melanie Langlois <Me...@tradingscreen.com>.

Well, thanks, sounds like the best option to me. Does anybody use the PerFieldAnalyzerWrapper? I'm just curious to know if there is any impact on the performances when using different analyzers. 

Mélanie 

-----Original Message-----
From: Doron Cohen [mailto:DORONC@il.ibm.com] 
Sent: Thursday, March 22, 2007 3:56 PM
To: java-user@lucene.apache.org
Subject: Re: indexing rss feeds in multiple languages

If language is known also at search time, PerFieldAnalyzerWrapper seems a
nice third option: single document per feed, with a separate field for each
language, additional field(s) for the common data;  using
PerFieldAnalyzerWrapper at both indexing and search;  using FieldSelector
at search to retrieve only the relevant field(s) for matched documents.
(never done this myself though.)
- Doron

"Melanie Langlois" <Me...@tradingscreen.com> wrote on 21/03/2007
23:03:03:

> Hi,
>
>
>
> I saw that there are many post on the mailing list about indexing in
> multiple language, so I will try to not post duplicate question. In
> my case, I want to index rss feeds, so one feed contains several
> items in different languages, and some common data for all the items
> (date, source..).  After reading the different posts, I think I will
> create a document per item, index them in the same index using each
> time a language specific analyzer, and store lang field for specific
> search. But I'm wondering how I should handle the common fields, it
> seems I have two options:
>
> 1 : store the common data in each item. What happen if duplicate
> information are entered, are they duplicate in the index ?
>
>
>
> 2 : create a separate document for the common data. In this case I
> will need to link these data to all underlying items storing some
> ids. The issue is that I would need to search the index twice if the
> search is done only per date, because I would need to retrieve the
> items contents.
>
>
>
> Thank in advance for your help.
>
>
>
> Mélanie
>
>
>


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: indexing rss feeds in multiple languages

Posted by Doron Cohen <DO...@il.ibm.com>.

If language is known also at search time, PerFieldAnalyzerWrapper seems a
nice third option: single document per feed, with a separate field for each
language, additional field(s) for the common data;  using
PerFieldAnalyzerWrapper at both indexing and search;  using FieldSelector
at search to retrieve only the relevant field(s) for matched documents.
(never done this myself though.)
- Doron

"Melanie Langlois" <Me...@tradingscreen.com> wrote on 21/03/2007
23:03:03:

> Hi,
>
>
>
> I saw that there are many post on the mailing list about indexing in
> multiple language, so I will try to not post duplicate question. In
> my case, I want to index rss feeds, so one feed contains several
> items in different languages, and some common data for all the items
> (date, source..).  After reading the different posts, I think I will
> create a document per item, index them in the same index using each
> time a language specific analyzer, and store lang field for specific
> search. But I'm wondering how I should handle the common fields, it
> seems I have two options:
>
> 1 : store the common data in each item. What happen if duplicate
> information are entered, are they duplicate in the index ?
>
>
>
> 2 : create a separate document for the common data. In this case I
> will need to link these data to all underlying items storing some
> ids. The issue is that I would need to search the index twice if the
> search is done only per date, because I would need to retrieve the
> items contents.
>
>
>
> Thank in advance for your help.
>
>
>
> Mélanie
>
>
>


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org