You are viewing a plain text version of this content. The canonical link for it is here.
Posted to java-user@lucene.apache.org by Dino Korah <dc...@gmail.com> on 2007/10/01 15:26:05 UTC

mixing analyzer

Hi,

 

I am working on a lucene email indexing system which potentially can get
documents in various languages. Currently I am using StandardAnalyzer, which
works for English but not for many of the other languages. One of the
requirements for the search interface is that they have to search without
selecting languages.

 

Is it possible to search across multiple indexes that has been created with
different analyzers; ie, when some one search for say “earth” lucene will
search for “earth” over CJK Analyzed IndexCJK and StandardAnalyzer analyzed
IndexSA in one search() call? If not is there a way of combining the result
from multiple search() call.

 

Thanks in advance.

 

d i n o    k o r a h
Tel: +44 795 66 65 283
--------------------------------
51°21'52"N  0°5' 14.16"

 


Re: mixing analyzer

Posted by Erick Erickson <er...@gmail.com>.
The whole question of multilingual indexing has been discusses
at length, you might find some ideas if you search the archive...

Erick

On 10/1/07, Dino Korah <dc...@gmail.com> wrote:
>
> Thanks Erick.
>
> The PerFieldAnalyzerWrapper could fit in but in the current world of
> multilingual anywhere, (even in programming languages.. %$£%#@), almost
> any
> field in an email (addresses, subject, body, attachment filenames, ... )
> document could be multilingual.
>
> I will have a go anyway.
>
> -----Original Message-----
> From: Erick Erickson [mailto:erickerickson@gmail.com]
> Sent: 01 October 2007 14:35
> To: java-user@lucene.apache.org
> Subject: Re: mixing analyzer
>
> Sure, but there's a time/space tradeoff. Isn't there always <G>....
>
> PerFieldAnalyzerWrapper is your friend. It would require that your
> index be built on a per-language basis. Say indexing
> text from French documents in a field "french_text", Chinese
> documents in a field chinese_text. You'd construct your query
> something like "chinese_text:blah OR french_text:blah" and
> your PerFieldAnalyzer would just handle things for you.
>
> But this may not fit your problem space ideally....
>
> Best
> Erick
>
>
> On 10/1/07, Dino Korah <dc...@gmail.com> wrote:
> >
> > Hi,
> >
> >
> >
> > I am working on a lucene email indexing system which potentially can get
> > documents in various languages. Currently I am using StandardAnalyzer,
> > which
> > works for English but not for many of the other languages. One of the
> > requirements for the search interface is that they have to search
> without
> > selecting languages.
> >
> >
> >
> > Is it possible to search across multiple indexes that has been created
> > with
> > different analyzers; ie, when some one search for say "earth" lucene
> will
> > search for "earth" over CJK Analyzed IndexCJK and StandardAnalyzer
> > analyzed
> > IndexSA in one search() call? If not is there a way of combining the
> > result
> > from multiple search() call.
> >
> >
> >
> > Thanks in advance.
> >
> >
> >
> > d i n o    k o r a h
> > Tel: +44 795 66 65 283
> > --------------------------------
> > 51°21'52"N  0°5' 14.16"
> >
> >
> >
> >
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

RE: mixing analyzer

Posted by Dino Korah <dc...@gmail.com>.
Thanks Erick.

The PerFieldAnalyzerWrapper could fit in but in the current world of
multilingual anywhere, (even in programming languages.. %$£%#@), almost any
field in an email (addresses, subject, body, attachment filenames, ... )
document could be multilingual. 

I will have a go anyway.

-----Original Message-----
From: Erick Erickson [mailto:erickerickson@gmail.com] 
Sent: 01 October 2007 14:35
To: java-user@lucene.apache.org
Subject: Re: mixing analyzer

Sure, but there's a time/space tradeoff. Isn't there always <G>....

PerFieldAnalyzerWrapper is your friend. It would require that your
index be built on a per-language basis. Say indexing
text from French documents in a field "french_text", Chinese
documents in a field chinese_text. You'd construct your query
something like "chinese_text:blah OR french_text:blah" and
your PerFieldAnalyzer would just handle things for you.

But this may not fit your problem space ideally....

Best
Erick


On 10/1/07, Dino Korah <dc...@gmail.com> wrote:
>
> Hi,
>
>
>
> I am working on a lucene email indexing system which potentially can get
> documents in various languages. Currently I am using StandardAnalyzer,
> which
> works for English but not for many of the other languages. One of the
> requirements for the search interface is that they have to search without
> selecting languages.
>
>
>
> Is it possible to search across multiple indexes that has been created
> with
> different analyzers; ie, when some one search for say "earth" lucene will
> search for "earth" over CJK Analyzed IndexCJK and StandardAnalyzer
> analyzed
> IndexSA in one search() call? If not is there a way of combining the
> result
> from multiple search() call.
>
>
>
> Thanks in advance.
>
>
>
> d i n o    k o r a h
> Tel: +44 795 66 65 283
> --------------------------------
> 51°21'52"N  0°5' 14.16"
>
>
>
>


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: mixing analyzer

Posted by Erick Erickson <er...@gmail.com>.
Sure, but there's a time/space tradeoff. Isn't there always <G>....

PerFieldAnalyzerWrapper is your friend. It would require that your
index be built on a per-language basis. Say indexing
text from French documents in a field "french_text", Chinese
documents in a field chinese_text. You'd construct your query
something like "chinese_text:blah OR french_text:blah" and
your PerFieldAnalyzer would just handle things for you.

But this may not fit your problem space ideally....

Best
Erick


On 10/1/07, Dino Korah <dc...@gmail.com> wrote:
>
> Hi,
>
>
>
> I am working on a lucene email indexing system which potentially can get
> documents in various languages. Currently I am using StandardAnalyzer,
> which
> works for English but not for many of the other languages. One of the
> requirements for the search interface is that they have to search without
> selecting languages.
>
>
>
> Is it possible to search across multiple indexes that has been created
> with
> different analyzers; ie, when some one search for say "earth" lucene will
> search for "earth" over CJK Analyzed IndexCJK and StandardAnalyzer
> analyzed
> IndexSA in one search() call? If not is there a way of combining the
> result
> from multiple search() call.
>
>
>
> Thanks in advance.
>
>
>
> d i n o    k o r a h
> Tel: +44 795 66 65 283
> --------------------------------
> 51°21'52"N  0°5' 14.16"
>
>
>
>