Posted to commits@lucene.apache.org by tf...@apache.org on 2019/02/08 06:23:12 UTC
[lucene-solr] branch branch_8x updated: Update language-analysis.adoc
This is an automated email from the ASF dual-hosted git repository.
tflobbe pushed a commit to branch branch_8x
in repository https://gitbox.apache.org/repos/asf/lucene-solr.git
The following commit(s) were added to refs/heads/branch_8x by this push:
new cfceff8 Update language-analysis.adoc
cfceff8 is described below
commit cfceff87c4825fbddb00367f7c20e00c93be410b
Author: Konstantin Perikov <My...@users.noreply.github.com>
AuthorDate: Wed Jan 30 10:28:27 2019 +0000
Update language-analysis.adoc
---
solr/solr-ref-guide/src/language-analysis.adoc | 26 ++++++++++++++++++++++++++
1 file changed, 26 insertions(+)
diff --git a/solr/solr-ref-guide/src/language-analysis.adoc b/solr/solr-ref-guide/src/language-analysis.adoc
index d31d295..cca0387 100644
--- a/solr/solr-ref-guide/src/language-analysis.adoc
+++ b/solr/solr-ref-guide/src/language-analysis.adoc
@@ -577,6 +577,7 @@ Perform model-based lemmatization only, preserving the original token and emitti
These factories are each designed to work with specific languages. The languages covered here are:
* <<Arabic>>
+* <<Bengali>>
* <<Brazilian Portuguese>>
* <<Bulgarian>>
* <<Catalan>>
@@ -633,6 +634,31 @@ This algorithm defines both character normalization and stemming, so these are s
</analyzer>
----
+=== Bengali
+
+There are two filters written specifically for dealing with the Bengali language. They use the Lucene classes `org.apache.lucene.analysis.bn.BengaliNormalizationFilter` and `org.apache.lucene.analysis.bn.BengaliStemFilter`.
+
+*Factory classes:* `solr.BengaliStemFilterFactory`, `solr.BengaliNormalizationFilterFactory`
+
+*Arguments:* None
+
+*Example:*
+
+[source,xml]
+----
+<analyzer>
+ <tokenizer class="solr.StandardTokenizerFactory"/>
+ <filter class="solr.BengaliNormalizationFilterFactory"/>
+ <filter class="solr.BengaliStemFilterFactory"/>
+</analyzer>
+
+----
+
+*Normalisation* - `মানুষ` -> `মানুস`
+
+*Stemming* - `সমস্ত` -> `সমস্`
+
+
=== Brazilian Portuguese
This is a Java filter written specifically for stemming the Brazilian dialect of the Portuguese language. It uses the Lucene class `org.apache.lucene.analysis.br.BrazilianStemmer`. Although that stemmer can be configured to use a list of protected words (which should not be stemmed), this factory does not accept any arguments to specify such a list.