Posted to java-user@lucene.apache.org by "Kostas V." <kv...@gmail.com> on 2006/04/02 16:38:58 UTC

Use two Analyzers in Lucene

Hello,
I'm new to Java and to Lucene as well, and I have a small problem.
I have to index and search with Lucene some papers that are written in both
English and Greek. By both I mean that the same text file contains both
Greek and English words.

I have Analyzers for both languages (they do stemming as well), but I
don't know how to use them together. I imagine that I have to do two passes
for each paper, or is that not correct?
The following line is how I use my English Analyzer:

IndexWriter writer = new IndexWriter(indexPath, new PorterStemAnalyzer(), true);


And this is how I use the Greek one:

IndexWriter writer = new IndexWriter(indexPath, new GreekAnalyzer(), true);


Is this possible?
And when I search, how can the program use both Analyzers as well?
I was told to make a mixed Analyzer, but I don't know if this is possible.

 
Thanks in advance everyone for your help.

Kostas

Re: Use two Analyzers in Lucene

Posted by Daniel Noll <da...@nuix.com.au>.

Kostas V. wrote:
> I have the Analyzers for both languages (they do stemming as well) but I
> don't know how to use them together. I imagine that I have to do two passes
> for each paper  ?? or this is not correct?
> The following line is how I use my English Analyzer
> 
> IndexWriter writer = new IndexWriter(indexPath,new PorterStemAnalyzer() ,
> true);
> 
> And this about the Greek
> 
> IndexWriter writer = new IndexWriter(indexPath,new GreekAnalyzer() , true);
> 
> Is it possible?
> And when I make the search, how the program can use both Analyzers as well?
> They told me to make a mixed Analyzer but I don't know if this is possible.

The general idea would be to make an analyser which chooses which 
analyser to pass the text to.  In general this would be rather 
difficult, but in your particular situation, Greek and English use 
different alphabets so it may not be too hard.

From a quick look at the GreekAnalyzer, it still uses the 
StandardTokenizer.  And it looks like the filters it applies and the 
ones used by the English analyser wouldn't interfere with each other 
either.  So you could probably make an analyser which performs both, 
something like this:

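   import java.io.Reader;
   import org.apache.lucene.analysis.Analyzer;
   import org.apache.lucene.analysis.LowerCaseFilter;
   import org.apache.lucene.analysis.PorterStemFilter;
   import org.apache.lucene.analysis.StopAnalyzer;
   import org.apache.lucene.analysis.StopFilter;
   import org.apache.lucene.analysis.TokenStream;
   import org.apache.lucene.analysis.el.GreekAnalyzer;
   import org.apache.lucene.analysis.standard.StandardFilter;

   public class CombinedAnalyser extends Analyzer {
     private GreekAnalyzer greek = new GreekAnalyzer();
     public TokenStream tokenStream(String fieldName, Reader reader) {
       // Tokenises the text and applies the Greek filters
       TokenStream tokens = greek.tokenStream(fieldName, reader);

       // Applies the English filters on top of the same stream
       tokens = new StandardFilter(tokens);
       tokens = new LowerCaseFilter(tokens);
       // StopFilter needs an explicit stop word list; the stock English
       // list is assumed here
       tokens = new StopFilter(tokens, StopAnalyzer.ENGLISH_STOP_WORDS);
       tokens = new PorterStemFilter(tokens);

       return tokens;
     }
   }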

Another way to go about it would be to detect the Greek fragments of the 
text up-front and pass those fragments through the Greek analyser, and 
anything else through the other analyser.
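
A rough sketch of that up-front detection, using only the JDK and no Lucene 
classes. The names ScriptSplitter, Fragment and split are made up for this 
example, and treating "Greek" as "any letter in the GREEK or GREEK_EXTENDED 
Unicode blocks" is an assumption that may need tuning for real papers. 
Spaces, digits and punctuation simply stay with whichever run they follow.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits mixed text into Greek and non-Greek runs.
final class ScriptSplitter {

    // One contiguous run of text, flagged as Greek or not.
    static final class Fragment {
        final boolean greek;
        final String text;

        Fragment(boolean greek, String text) {
            this.greek = greek;
            this.text = text;
        }
    }

    // Assumption: a character counts as Greek if it falls in one of
    // these two Unicode blocks.
    static boolean isGreek(char c) {
        Character.UnicodeBlock block = Character.UnicodeBlock.of(c);
        return block == Character.UnicodeBlock.GREEK
            || block == Character.UnicodeBlock.GREEK_EXTENDED;
    }

    static List<Fragment> split(String text) {
        List<Fragment> fragments = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean currentGreek = false;
        boolean seenLetter = false;

        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            // Only letters decide the script of a run.
            if (Character.isLetter(c)) {
                boolean greek = isGreek(c);
                if (seenLetter && greek != currentGreek) {
                    // Script changed: close the current run, start a new one.
                    fragments.add(new Fragment(currentGreek, current.toString()));
                    current.setLength(0);
                }
                currentGreek = greek;
                seenLetter = true;
            }
            current.append(c);
        }
        if (current.length() > 0) {
            fragments.add(new Fragment(currentGreek, current.toString()));
        }
        return fragments;
    }
}
```

Each Fragment could then be wrapped in a StringReader and handed to the 
Greek or the English analyser depending on its greek flag. The same split 
would have to be applied to query text at search time so that both sides 
are stemmed the same way.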

Daniel

-- 
Daniel Noll

Nuix Pty Ltd
Suite 79, 89 Jones St, Ultimo NSW 2007, Australia    Ph: +61 2 9280 0699
Web: http://www.nuix.com.au/                        Fax: +61 2 9212 6902

