Posted to dev@lucenenet.apache.org by GitBox <gi...@apache.org> on 2022/02/12 07:08:09 UTC

[GitHub] [lucenenet] NightOwl888 edited a comment on issue #618: One character is missing in class ASCIIFoldingFilter

NightOwl888 edited a comment on issue #618:
URL: https://github.com/apache/lucenenet/issues/618#issuecomment-1035939483


   Thanks for the report.
   
   As this is a line-by-line port from Java Lucene 4.8.0 (for the most part), we have faithfully reproduced the [ASCIIFoldingFilter](https://github.com/apache/lucene/blob/releases/lucene-solr/4.8.1/lucene/analysis/common/src/java/org/apache/lucene/analysis/miscellaneous/ASCIIFoldingFilter.java) in its entirety. While we have admittedly included some patches from later versions of Lucene where they affect usability (for example, `Lucene.Net.Analysis.Common` all came from 4.8.1), the change you are suggesting isn't even reflected in the [ASCIIFoldingFilter in the latest commit](https://github.com/apache/lucene/blob/8ac26737913d0c1555019e93bc6bf7db1ab9047e/lucene/analysis/common/src/java/org/apache/lucene/analysis/miscellaneous/ASCIIFoldingFilter.java#L1154-L1173).
   
   If you wish to pursue adding more characters to `ASCIIFoldingFilter`, I suggest you take it up with the Lucene design team on their [dev mailing list](https://lucene.apache.org/core/discussion.html).
   
   However, note that this isn't the only filter included in the box that is capable of removing diacritics from characters. Some alternatives:
   
   1. [ICUNormalizer2Filter](https://lucenenet.apache.org/docs/4.8.0-beta00015/api/icu/Lucene.Net.Analysis.Icu.ICUNormalizer2Filter.html)
   2. [ICUFoldingFilter](https://lucenenet.apache.org/docs/4.8.0-beta00015/api/icu/Lucene.Net.Analysis.Icu.ICUFoldingFilter.html)
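   
   A minimal sketch of wiring `ICUFoldingFilter` into an analyzer, assuming the 4.8.0 API and that the `Lucene.Net.ICU` package is referenced. The `IcuFoldingExample` class, its method names, and the `"body"` field name are made up for illustration:
   
   ```csharp
   using System.Collections.Generic;
   using System.IO;
   using Lucene.Net.Analysis;
   using Lucene.Net.Analysis.Icu;
   using Lucene.Net.Analysis.Standard;
   using Lucene.Net.Analysis.TokenAttributes;
   using Lucene.Net.Util;
   
   // Hypothetical helper class, for illustration only.
   public static class IcuFoldingExample
   {
       // Builds an analyzer that tokenizes with StandardTokenizer and then
       // applies ICU folding instead of ASCIIFoldingFilter.
       public static Analyzer CreateAnalyzer()
       {
           return Analyzer.NewAnonymous((fieldName, reader) =>
           {
               Tokenizer source = new StandardTokenizer(LuceneVersion.LUCENE_48, reader);
               TokenStream result = new ICUFoldingFilter(source);
               return new TokenStreamComponents(source, result);
           });
       }
   
       // Runs the analyzer over a string and collects the emitted terms.
       public static IList<string> Analyze(Analyzer analyzer, string text)
       {
           var terms = new List<string>();
           using (TokenStream stream = analyzer.GetTokenStream("body", new StringReader(text)))
           {
               ICharTermAttribute termAtt = stream.AddAttribute<ICharTermAttribute>();
               stream.Reset();
               while (stream.IncrementToken())
               {
                   terms.Add(termAtt.ToString());
               }
               stream.End();
           }
           return terms;
       }
   }
   ```
   
   Calling `IcuFoldingExample.Analyze(IcuFoldingExample.CreateAnalyzer(), someText)` lets you check how a given character is folded before deciding whether a custom filter is needed at all.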
   
   Note that you can also create a custom folding filter by using a similar approach in the [ICUFoldingFilter implementation](https://github.com/apache/lucenenet/blob/docs/4.8.0-beta00015/src/Lucene.Net.Analysis.ICU/Analysis/Icu/ICUFoldingFilter.cs/#L66) (ported from Lucene 7.1.0). There is a [tool you can port](https://github.com/apache/lucene/blob/releases/lucene-solr/7.1.0/lucene/analysis/icu/src/tools/java/org/apache/lucene/analysis/icu/GenerateUTR30DataFiles.java) to generate a `.nrm` binary file from modified versions of [these text files](https://github.com/apache/lucene/tree/releases/lucene-solr/7.1.0/lucene/analysis/icu/src/data/utr30). The `.nrm` file can then be provided to the constructor of `ICU4N.Text.Normalizer2` - more about the data format can be found in the [ICU normalization docs](https://unicode-org.github.io/icu/userguide/transforms/normalization/). Note that the `.nrm` file is the same binary format used in C++ and Java.
   
   Alternatively, if you wish to extend the `ASCIIFoldingFilter` with your own custom brew of characters, you can simply chain your own filter to `ASCIIFoldingFilter` as pointed out in [this article](https://www.extutorial.com/en/share/1404275).
   
   ```c#
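   // Adapted to the Lucene.NET 4.8.0 API: override this in your Analyzer subclass.
   protected override TokenStreamComponents CreateComponents(string fieldName, TextReader reader)
   {
       Tokenizer source = new StandardTokenizer(LuceneVersion.LUCENE_48, reader);
       TokenStream result = new StandardFilter(LuceneVersion.LUCENE_48, source);
       result = new LowerCaseFilter(LuceneVersion.LUCENE_48, result);
       // etc etc ...
       result = new StopFilter(LuceneVersion.LUCENE_48, result, yourSetOfStopWords);
       // Your own filter runs first and handles the characters ASCIIFoldingFilter leaves alone.
       result = new MyCustomFoldingFilter(result);
       result = new ASCIIFoldingFilter(result);
       return new TokenStreamComponents(source, result);
   }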
   ```
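   
   For reference, a rough sketch of what the `MyCustomFoldingFilter` in the chain above could look like against the 4.8.0 API. The replacement table holds placeholder entries chosen only for illustration (they are not the character from the report), and the static `Fold` helper is a made-up name:
   
   ```csharp
   using System.Collections.Generic;
   using Lucene.Net.Analysis;
   using Lucene.Net.Analysis.TokenAttributes;
   
   // Hypothetical filter: replaces single characters in each term in place.
   public sealed class MyCustomFoldingFilter : TokenFilter
   {
       // Placeholder mappings for illustration; substitute the characters you need.
       private static readonly IDictionary<char, char> Replacements = new Dictionary<char, char>
       {
           ['\u0430'] = 'a', // CYRILLIC SMALL LETTER A
           ['\u0435'] = 'e', // CYRILLIC SMALL LETTER IE
       };
   
       private readonly ICharTermAttribute termAtt;
   
       public MyCustomFoldingFilter(TokenStream input)
           : base(input)
       {
           termAtt = AddAttribute<ICharTermAttribute>();
       }
   
       public override bool IncrementToken()
       {
           // Pull the next token from the upstream filter; stop when exhausted.
           if (!m_input.IncrementToken())
           {
               return false;
           }
   
           Fold(termAtt.Buffer, termAtt.Length);
           return true;
       }
   
       // Rewrites mapped characters in the first `length` positions of the buffer.
       // One-to-one replacements only, so the term length never changes.
       public static void Fold(char[] buffer, int length)
       {
           for (int i = 0; i < length; i++)
           {
               if (Replacements.TryGetValue(buffer[i], out char replacement))
               {
                   buffer[i] = replacement;
               }
           }
       }
   }
   ```
   
   Because the sketch only swaps one character for another, it never has to resize the term buffer. A mapping that expands one character into several would need to grow the buffer and update the term length, much as `ASCIIFoldingFilter` does internally.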
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@lucenenet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org