Posted to user@lucenenet.apache.org by BALA KARTHIK <ba...@gmail.com> on 2021/04/13 12:18:01 UTC

Reg Analyzers in Lucene .Net

Hi Team,

I have implemented Lucene.NET 3.0.3 for search operations in my
application, and the application supports many languages. So far I have
used the standard analyzer with stop words for each language.

I see that a new release of Lucene.NET (4.8) is due in the near
future and that it includes analyzers for many languages. I would like to
know whether analyzers for the following languages will be part of the
final release.

1. Korean
2. Hebrew
3. Slovak
4. Slovenian


Regards
Balakarthik

RE: Reg Analyzers in Lucene .Net

Posted by Shad Storhaug <sh...@shadstorhaug.com>.

Hello Balakarthik,

Lucene.NET is (for the most part) a line-by-line port of Java Lucene. For 4.8.0, all of the Java modules have been ported (including all of the analyzers) so basically what you see is what you get, but we are still tracking down test failures, usability issues, and performance issues as well as improving the documentation.

While we don't keep a list of the analyzers we have ported and which languages they support, there is some support for the languages you have called out that I am aware of.


Korean

- CJKAnalyzer in Lucene.Net.Analysis.Common
- ICUTokenizer in Lucene.Net.ICU

https://lucenenet.apache.org/docs/4.8.0-beta00014/api/analysis-common/Lucene.Net.Analysis.Cjk.html
https://lucenenet.apache.org/docs/4.8.0-beta00014/api/icu/Lucene.Net.Analysis.Icu.html#text-segmentation
(See links to the tests below)

If neither of those meets your needs, there is also a Nori analysis package in the latest version of Lucene that you could port to .NET: https://github.com/apache/lucene-solr/tree/releases/lucene-solr/8.8.2/lucene/analysis/nori
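
To judge whether the CJKAnalyzer option fits your Korean content, it can help to print the tokens it produces. The following is only a minimal sketch, assuming the Lucene.Net.Analysis.Common 4.8.0 beta package is referenced; the field name "content" and the sample text are placeholders, not anything from your application.

```csharp
using System;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Cjk;
using Lucene.Net.Analysis.TokenAttributes;
using Lucene.Net.Util;

public static class CjkTokenDump
{
    public static void Main()
    {
        // CJKAnalyzer ships in Lucene.Net.Analysis.Common.
        using (Analyzer analyzer = new CJKAnalyzer(LuceneVersion.LUCENE_48))
        // "content" is a placeholder field name; the text is sample Korean input.
        using (TokenStream stream = analyzer.GetTokenStream("content", "한국어 검색 엔진"))
        {
            ICharTermAttribute term = stream.AddAttribute<ICharTermAttribute>();

            // The token stream contract: Reset, IncrementToken until false, then End.
            stream.Reset();
            while (stream.IncrementToken())
            {
                Console.WriteLine(term.ToString());
            }
            stream.End();
        }
    }
}
```

Running the same loop against an analyzer built on ICUTokenizer lets you compare the two tokenizations side by side before choosing one.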


Hebrew

- ICUTokenizer in Lucene.Net.ICU

(See the link above and the links to the tests below)


Slovak

I don't believe any support exists out of the box. But do note there is a mention of Slovak in the documentation for Lucene.Net.Analysis.Stempel, which uses the Egothor stemmer. I haven't done the research, but I suspect if you could provide the stemming tables, you could leverage the stemmer to make a Slovak analyzer.

https://lucenenet.apache.org/docs/4.8.0-beta00014/api/analysis-stempel/Lucene.Net.Analysis.Stempel.html
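
As a rough, untested sketch of that idea: if you had a Slovak stemming table in the binary format Stempel reads, wiring it into an analyzer might look something like the code below. The table file name is hypothetical (no such table ships with Lucene.NET as far as I know), and the choice of StandardTokenizer plus LowerCaseFilter in front of the stemmer is my assumption, not a recommendation from the docs.

```csharp
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Core;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Analysis.Stempel;
using Lucene.Net.Util;

public static class SlovakAnalyzerSketch
{
    // tablePath points to a HYPOTHETICAL stemming table you would have to
    // build yourself, for example "stemmer_sk.tbl".
    public static Analyzer Create(string tablePath)
    {
        using (Stream tableStream = File.OpenRead(tablePath))
        {
            // Load the table once; the lambda below captures and reuses it.
            var table = StempelStemmer.Load(tableStream);

            return Analyzer.NewAnonymous(createComponents: (fieldName, reader) =>
            {
                Tokenizer tokenizer = new StandardTokenizer(LuceneVersion.LUCENE_48, reader);
                TokenStream filter = new LowerCaseFilter(LuceneVersion.LUCENE_48, tokenizer);
                // A fresh StempelStemmer per component set, all sharing one table.
                filter = new StempelFilter(filter, new StempelStemmer(table));
                return new TokenStreamComponents(tokenizer, filter);
            });
        }
    }
}
```

The stemmer is created inside the lambda so that each set of components gets its own instance while the loaded table is shared.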


Slovenian

No support I am aware of, but you might be able to use the Egothor stemmer for this as well.


Setting up an analyzer with the ICUTokenizer can be done as follows:

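using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Icu;
using Lucene.Net.Analysis.Icu.Segmentation;

Analyzer analyzer = Analyzer.NewAnonymous(createComponents: (fieldName, reader) =>
{
    // cjkAsWords: false segments CJK text by the UAX #29 defaults rather than by dictionary.
    Tokenizer tokenizer = new ICUTokenizer(reader, new DefaultICUTokenizerConfig(cjkAsWords: false, myanmarAsWords: true));
    // Normalizes and case-folds each token (NFKC_Casefold by default).
    TokenFilter filter = new ICUNormalizer2Filter(tokenizer);
    return new TokenStreamComponents(tokenizer, filter);
});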
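
Once built, the analyzer plugs into indexing the same way as any other. Here is a small sketch under a few assumptions: it uses an in-memory RAMDirectory purely for demonstration, and the field name "body" and the sample Korean text are placeholders.

```csharp
using System;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Icu;
using Lucene.Net.Analysis.Icu.Segmentation;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Store;
using Lucene.Net.Util;

public static class IcuIndexingSketch
{
    public static void Main()
    {
        // Same analyzer setup as shown above, repeated so this sample is self-contained.
        Analyzer analyzer = Analyzer.NewAnonymous(createComponents: (fieldName, reader) =>
        {
            Tokenizer tokenizer = new ICUTokenizer(reader, new DefaultICUTokenizerConfig(cjkAsWords: false, myanmarAsWords: true));
            TokenFilter filter = new ICUNormalizer2Filter(tokenizer);
            return new TokenStreamComponents(tokenizer, filter);
        });

        // RAMDirectory keeps the index in memory; swap in FSDirectory for real use.
        using (var directory = new RAMDirectory())
        {
            var config = new IndexWriterConfig(LuceneVersion.LUCENE_48, analyzer);
            using (var writer = new IndexWriter(directory, config))
            {
                var doc = new Document();
                // "body" is a placeholder field name.
                doc.Add(new TextField("body", "안녕하세요 세계", Field.Store.YES));
                writer.AddDocument(doc);
                writer.Commit();
            }

            using (DirectoryReader indexReader = DirectoryReader.Open(directory))
            {
                Console.WriteLine("Documents indexed: " + indexReader.NumDocs);
            }
        }
    }
}
```

Remember to use the same analyzer at query time so that query terms are tokenized and normalized the same way as the indexed text.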

See the tests for ICUTokenizer for some more examples including tests specifically for Korean and Hebrew.

https://github.com/apache/lucenenet/blob/Lucene.Net_4_8_0_beta00014/src/Lucene.Net.Tests.Analysis.ICU/Analysis/Icu/Segmentation/TestICUTokenizer.cs
https://github.com/apache/lucenenet/blob/Lucene.Net_4_8_0_beta00014/src/Lucene.Net.Tests.Analysis.ICU/Analysis/Icu/Segmentation/TestICUTokenizerCJK.cs

Also note that since this is a port from Java, you can search for Lucene examples and they are usually pretty easy to translate to .NET. Chances are someone has already built analyzers for Lucene in the languages you need and has blogged about it.


Thanks,
Shad Storhaug (NightOwl888)
Project Chairperson - Apache Lucene.NET

